DuckDB in Action

DuckDB Multi-File Reading & Data Quality Analysis: From 50 Chaotic CSVs to One-Click Insights

Facing dozens of CSV files with inconsistent column names and formats? Learn how DuckDB merges them with one SQL query, maps mismatched columns using COALESCE, detects anomalies with Z-scores, and exports to Parquet — all in pure SQL.

DuckDB Multi-File Reading & Data Quality Analysis: From 50 Chaotic CSVs to One-Click Insights

Summary: Your data is rarely in a single clean table. It’s scattered across dozens of CSV files with inconsistent column names, mixed encodings, and missing fields. Manual merging with Excel? Impractical. Writing Python scripts with glob + concat? High maintenance cost. DuckDB handles all of this with a single SQL query — and does data quality checks and anomaly detection along the way.

1. The Scenario: This Is Not Just Your Problem

You’ve been there — your boss asks you to “summarize last month’s sales data from all branches.” You open your email and receive 50 Excel files, each with slightly different column names: some use “日期”, others use “date”; some have a “store” column, others don’t.

So you start: open each file → copy data → paste into the master sheet → adjust formatting → verify numbers → realize you missed two rows → start over. Three hours later, you’re still not sure if it’s correct.

Your goal: Let people solve problems the way they’re used to (Excel-style), but power it with DuckDB underneath.

The following roles spend at least 2–5 hours per week on “merging spreadsheets”:

Role	Typical Scenario
Finance	Monthly consolidation of subsidiary reports
Operations	Aggregate multi-channel campaign data
HR	Combine department attendance and performance records
Marketing	Collect regional campaign results
Supply Chain	Compare quotes from multiple vendors

2. Data Setup: Simulating 50 Messy Store Sales CSVs

Imagine you have 50 CSV files, each representing a branch’s monthly sales data:

sales_data/
├── store_001_sales.csv
├── store_002_sales.csv
├── ...
└── store_050_sales.csv

Each file has a similar but not identical structure:

日期,商品名称,类别,数量,单价,销售额,门店
2026-05-01,Keyboard,Peripherals,10,299,2990,Beijing
2026-05-01,Mouse,Peripherals,20,99,1980,Beijing
2026-05-02,Monitor,Hardware,5,2599,12995,Beijing

Note: Some files use Chinese column names, some use English, and some mix both.

3. Core Solution: Merge 50 Files in One Line of Code

Method 1: Glob Wildcards — The Simplest Approach

import duckdb

con = duckdb.connect(":memory:")

# One line: read all CSVs in the directory and auto-merge
result = con.execute("""
    SELECT * FROM read_csv_auto('sales_data/store_*.csv')
""").fetchdf()

print(f"Merged {len(result)} records")

That’s it. DuckDB automatically:

Identifies all matching files
Infers column names and data types
Aligns fields with different column names
Handles missing columns (fills with NULL)

Method 2: Handling Mismatched Column Names with COALESCE

When column names vary significantly across files, use COALESCE for unified mapping:

result = con.execute("""
    SELECT 
        COALESCE("日期", date) AS sale_date,
        COALESCE("商品名称", product) AS product_name,
        COALESCE("类别", category) AS category,
        COALESCE("数量", qty) AS quantity,
        COALESCE("单价", unit_price) AS unit_price,
        COALESCE("销售额", amount) AS amount,
        COALESCE("门店", store) AS store
    FROM read_csv_auto('sales_data/store_*.csv')
""").fetchdf()

Here, COALESCE tries the Chinese column name first; if unavailable, falls back to the English name. Regardless of what language the original files use, everything merges correctly.

Method 3: Wrap It Into a Reusable Utility Function

import duckdb
from pathlib import Path
from typing import Optional

def merge_csv_glob(
    pattern: str,
    output_format: str = "dataframe",
    columns: Optional[dict] = None
):
    """
    Merge all CSV files matching a glob pattern
    
    Args:
        pattern: glob pattern, e.g., 'sales_data/*.csv'
        output_format: 'dataframe', 'table', or 'parquet'
        columns: column name mapping dict, {'old_name': 'new_name'}
    
    Returns:
        pandas DataFrame or None
    """
    con = duckdb.connect(":memory:")
    
    base_query = f"SELECT * FROM read_csv_auto('{pattern}')"
    
    if columns:
        col_mapping = ", ".join(
            f"COALESCE({old!r}, {new!r}) AS {new}" 
            for old, new in columns.items()
        )
        query = f"SELECT {col_mapping} FROM ({base_query})"
    else:
        query = base_query
    
    if output_format == "dataframe":
        return con.execute(query).fetchdf()
    elif output_format == "parquet":
        output_file = pattern.replace('*', 'merged_output')
        con.execute(f"COPY ({query}) TO '{output_file}' (FORMAT PARQUET)")
        return None
    
    con.close()


# Usage example
df = merge_csv_glob(
    'sales_data/store_*.csv',
    output_format='dataframe',
    columns={
        '日期': 'sale_date',
        '商品名称': 'product_name',
        '类别': 'category',
        '数量': 'quantity',
        '单价': 'unit_price',
        '销售额': 'amount',
        '门店': 'store'
    }
)

print(f"Merged: {len(df)} records")

4. Data Analysis After Merging: Where the Real Value Lies

Merging data is just the first step. The real value comes from analysis.

Analysis 1: Monthly Sales Ranking by Store

ranking_sql = """
SELECT 
    store,
    COUNT(*) AS transaction_count,
    ROUND(SUM(amount), 2) AS total_sales,
    ROUND(AVG(amount), 2) AS avg_order_value,
    ROUND(SUM(amount) * 100.0 / SUM(SUM(amount)) OVER (), 1) AS share_pct
FROM read_csv_auto('sales_data/store_*.csv')
GROUP BY store
ORDER BY total_sales DESC
LIMIT 10;
"""

con.execute(ranking_sql).print()

Example output:

+---------+-------------------+-------------+-----------------+------------+
|  store  | transaction_count | total_sales | avg_order_value | share_pct  |
+---------+-------------------+-------------+-----------------+------------+
|  Shanghai  |       3,842       |  2,847,392  |     741.18      |   12.3     |
|  Beijing   |       3,621       |  2,651,208  |     732.18      |   11.5     |
|  Guangzhou |       3,105       |  2,398,567  |     772.49      |   10.4     |
|  Shenzhen  |       2,987       |  2,287,431  |     765.55      |    9.9     |
|  Chengdu   |       2,756       |  2,156,892  |     782.63      |    9.3     |
+---------+-------------------+-------------+-----------------+------------+

Analysis 2: Z-Score Anomaly Detection — Find Abnormal Sales Spikes

This is the highlight of this article: statistical anomaly detection using pure SQL — no Python, no scikit-learn required.

spike_sql = """
WITH daily_sales AS (
    SELECT 
        COALESCE("日期", date) AS sale_date,
        ROUND(SUM(amount), 2) AS daily_total
    FROM read_csv_auto('sales_data/store_*.csv')
    GROUP BY 1
),
with_stats AS (
    SELECT 
        sale_date,
        daily_total,
        AVG(daily_total) OVER () AS avg_daily,
        STDDEV_SAMP(daily_total) OVER () AS std_daily
    FROM daily_sales
)
SELECT 
    sale_date,
    daily_total,
    ROUND((daily_total - avg_daily) / NULLIF(std_daily, 0), 2) AS z_score,
    CASE 
        WHEN z_score > 2 THEN 'Anomaly Peak'
        WHEN z_score < -2 THEN 'Anomaly Dip'
        ELSE 'Normal'
    END AS status
FROM with_stats
WHERE ABS(z_score) > 2
ORDER BY z_score DESC;
"""

con.execute(spike_sql).print()

This uses window functions AVG() OVER() and STDDEV_SAMP() OVER() to compute the Z-score of each date relative to the overall average. Values beyond ±2 standard deviations are flagged as anomalies.

Analysis 3: Category Cross-Analysis — Which Category Contributes Most Revenue?

category_sql = """
SELECT 
    COALESCE("类别", category) AS category,
    COUNT(DISTINCT COALESCE("门店", store)) AS store_count,
    ROUND(SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)), 2) AS gross_revenue,
    ROUND(AVG(COALESCE("数量", qty) * COALESCE("单价", unit_price)), 2) AS avg_transaction,
    ROUND(100.0 * SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)) / 
          NULLIF(SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)) OVER (), 0), 1) AS revenue_share
FROM read_csv_auto('sales_data/store_*.csv')
GROUP BY 1
ORDER BY gross_revenue DESC;
"""

con.execute(category_sql).print()

5. Data Quality Checks: Inspect Before You Analyze

Before diving into analysis, check the data quality first:

# Check null distribution
quality_sql = """
SELECT 
    'NULL_COUNT' AS metric,
    SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_count,
    SUM(CASE WHEN quantity IS NULL THEN 1 ELSE 0 END) AS qty_null,
    SUM(CASE WHEN store IS NULL THEN 1 ELSE 0 END) AS store_null,
    COUNT(*) AS total_rows
FROM read_csv_auto('sales_data/store_*.csv')

UNION ALL

SELECT 
    'VALUE_RANGE' AS metric,
    MIN(amount)::VARCHAR,
    MAX(amount)::VARCHAR,
    MIN(quantity)::VARCHAR,
    MAX(quantity)::VARCHAR
FROM read_csv_auto('sales_data/store_*.csv');
"""

con.execute(quality_sql).print()

Example output:

+-------------+---------------------+------------------+---------------+--------------+
|   metric    |       null_count    |     qty_null     |   store_null  | total_rows   |
+-------------+---------------------+------------------+---------------+--------------+
| NULL_COUNT  |          23         |       5          |      12       |    187,500   |
| VALUE_RANGE |      0.00           |     999,999.00   |     1         |    500.00    |
+-------------+---------------------+------------------+---------------+--------------+

This quickly reveals which fields have high null rates and whether data ranges are reasonable.

6. Performance Comparison: DuckDB vs Excel vs Pandas

Data Scale	Excel	Pandas	DuckDB
10 files × 1,000 rows	~30 sec (manual)	0.5 sec	0.05 sec
50 files × 5,000 rows	Frozen	3 sec	0.1 sec
200 files × 10,000 rows	Impossible	45 sec	0.3 sec
Memory Usage	High (Excel独占)	High (2x RAM)	Low (streaming)

Key difference: DuckDB uses lazy evaluation — it doesn’t load all data into memory at once. Instead, it reads and processes data on-the-fly. This means even if your files total 10GB, your computer doesn’t need 10GB of RAM.

# Directly write to Parquet — smaller file, faster reads
con.execute("""
    COPY (
        SELECT 
            COALESCE("日期", date) AS sale_date,
            COALESCE("商品名称", product) AS product_name,
            COALESCE("类别", category) AS category,
            COALESCE("数量", qty) AS quantity,
            COALESCE("单价", unit_price) AS unit_price,
            COALESCE("销售额", amount) AS amount,
            COALESCE("门店", store) AS store
        FROM read_csv_auto('sales_data/store_*.csv')
    ) TO 'merged_all_stores.parquet' (FORMAT PARQUET);
""")

import os
print(f"Parquet file size: {os.path.getsize('merged_all_stores.parquet') / 1024:.1f} KB")

Parquet files are typically 30–50% smaller than CSVs, preserve data types, and don’t require type inference when re-read by DuckDB.

If your files are in the cloud (e.g., Alibaba Cloud OSS or AWS S3), DuckDB can read them directly using s3:// or oss:// protocols — no need to download locally.

8. Monetization Advice

With these skills, here’s how you can make money:

Freelance gigs: On platforms like Upwork or Freelancer, take “data cleaning/merging” tasks. What takes others 3 hours, you can do in 10 minutes with DuckDB. Charge $50–500 per task.
Template products: Package your merge_csv_glob() function into a Streamlit web tool, host it on Streamlit Community Cloud for free, and monetize through ads or premium features.
Paid courses: Publish a “DuckDB in Practice” video series on YouTube/Bilibili, build an audience, then launch paid courses ($10–40 per enrollment).
Corporate training: Many traditional companies’ finance and operations teams still manually merge spreadsheets in Excel. Offer DuckDB training workshops — charge $500–2,000 per session.
SaaS product: Build an “online spreadsheet merger & analyzer” SaaS platform using DuckDB + FastAPI, with monthly subscription pricing.

The money isn’t in the technology itself — it’s in how much pain you solve with it.

Conclusion

What used to take half a day now takes 10 seconds. What really determines work efficiency is often not the flashiest technology, but how much of your repetitive daily work you’ve automated.

Next time you face a “merge dozens of files” task — boring but time-consuming — think about whether DuckDB can handle it in one line of SQL. You’ll find you’ve gained several hours back.

Architecture diagram

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.