Featured image of post DuckDB Multi-File Reading & Data Quality Analysis: From 50 Chaotic CSVs to One-Click Insights

DuckDB Multi-File Reading & Data Quality Analysis: From 50 Chaotic CSVs to One-Click Insights

Facing dozens of CSV files with inconsistent column names and formats? Learn how DuckDB merges them with one SQL query, maps mismatched columns using COALESCE, detects anomalies with Z-scores, and exports to Parquet — all in pure SQL.

DuckDB Multi-File Reading & Data Quality Analysis: From 50 Chaotic CSVs to One-Click Insights

Summary: Your data is rarely in a single clean table. It’s scattered across dozens of CSV files with inconsistent column names, mixed encodings, and missing fields. Manual merging with Excel? Impractical. Writing Python scripts with glob + concat? High maintenance cost. DuckDB handles all of this with a single SQL query — and does data quality checks and anomaly detection along the way.

1. The Scenario: This Is Not Just Your Problem

You’ve been there — your boss asks you to “summarize last month’s sales data from all branches.” You open your email and receive 50 Excel files, each with slightly different column names: some use “日期”, others use “date”; some have a “store” column, others don’t.

So you start: open each file → copy data → paste into the master sheet → adjust formatting → verify numbers → realize you missed two rows → start over. Three hours later, you’re still not sure if it’s correct.

Your goal: Let people solve problems the way they’re used to (Excel-style), but power it with DuckDB underneath.

The following roles spend at least 2–5 hours per week on “merging spreadsheets”:

RoleTypical Scenario
FinanceMonthly consolidation of subsidiary reports
OperationsAggregate multi-channel campaign data
HRCombine department attendance and performance records
MarketingCollect regional campaign results
Supply ChainCompare quotes from multiple vendors

2. Data Setup: Simulating 50 Messy Store Sales CSVs

Imagine you have 50 CSV files, each representing a branch’s monthly sales data:

sales_data/
├── store_001_sales.csv
├── store_002_sales.csv
├── ...
└── store_050_sales.csv

Each file has a similar but not identical structure:

日期,商品名称,类别,数量,单价,销售额,门店
2026-05-01,Keyboard,Peripherals,10,299,2990,Beijing
2026-05-01,Mouse,Peripherals,20,99,1980,Beijing
2026-05-02,Monitor,Hardware,5,2599,12995,Beijing

Note: Some files use Chinese column names, some use English, and some mix both.

3. Core Solution: Merge 50 Files in One Line of Code

Method 1: Glob Wildcards — The Simplest Approach

import duckdb

con = duckdb.connect(":memory:")

# One line: read all CSVs in the directory and auto-merge
result = con.execute("""
    SELECT * FROM read_csv_auto('sales_data/store_*.csv')
""").fetchdf()

print(f"Merged {len(result)} records")

That’s it. DuckDB automatically:

  • Identifies all matching files
  • Infers column names and data types
  • Aligns fields with different column names
  • Handles missing columns (fills with NULL)

Method 2: Handling Mismatched Column Names with COALESCE

When column names vary significantly across files, use COALESCE for unified mapping:

result = con.execute("""
    SELECT 
        COALESCE("日期", date) AS sale_date,
        COALESCE("商品名称", product) AS product_name,
        COALESCE("类别", category) AS category,
        COALESCE("数量", qty) AS quantity,
        COALESCE("单价", unit_price) AS unit_price,
        COALESCE("销售额", amount) AS amount,
        COALESCE("门店", store) AS store
    FROM read_csv_auto('sales_data/store_*.csv')
""").fetchdf()

Here, COALESCE tries the Chinese column name first; if unavailable, falls back to the English name. Regardless of what language the original files use, everything merges correctly.

Method 3: Wrap It Into a Reusable Utility Function

import duckdb
from pathlib import Path
from typing import Optional

def merge_csv_glob(
    pattern: str,
    output_format: str = "dataframe",
    columns: Optional[dict] = None
):
    """
    Merge all CSV files matching a glob pattern
    
    Args:
        pattern: glob pattern, e.g., 'sales_data/*.csv'
        output_format: 'dataframe', 'table', or 'parquet'
        columns: column name mapping dict, {'old_name': 'new_name'}
    
    Returns:
        pandas DataFrame or None
    """
    con = duckdb.connect(":memory:")
    
    base_query = f"SELECT * FROM read_csv_auto('{pattern}')"
    
    if columns:
        col_mapping = ", ".join(
            f"COALESCE({old!r}, {new!r}) AS {new}" 
            for old, new in columns.items()
        )
        query = f"SELECT {col_mapping} FROM ({base_query})"
    else:
        query = base_query
    
    if output_format == "dataframe":
        return con.execute(query).fetchdf()
    elif output_format == "parquet":
        output_file = pattern.replace('*', 'merged_output')
        con.execute(f"COPY ({query}) TO '{output_file}' (FORMAT PARQUET)")
        return None
    
    con.close()


# Usage example
df = merge_csv_glob(
    'sales_data/store_*.csv',
    output_format='dataframe',
    columns={
        '日期': 'sale_date',
        '商品名称': 'product_name',
        '类别': 'category',
        '数量': 'quantity',
        '单价': 'unit_price',
        '销售额': 'amount',
        '门店': 'store'
    }
)

print(f"Merged: {len(df)} records")

4. Data Analysis After Merging: Where the Real Value Lies

Merging data is just the first step. The real value comes from analysis.

Analysis 1: Monthly Sales Ranking by Store

ranking_sql = """
SELECT 
    store,
    COUNT(*) AS transaction_count,
    ROUND(SUM(amount), 2) AS total_sales,
    ROUND(AVG(amount), 2) AS avg_order_value,
    ROUND(SUM(amount) * 100.0 / SUM(SUM(amount)) OVER (), 1) AS share_pct
FROM read_csv_auto('sales_data/store_*.csv')
GROUP BY store
ORDER BY total_sales DESC
LIMIT 10;
"""

con.execute(ranking_sql).print()

Example output:

+---------+-------------------+-------------+-----------------+------------+
|  store  | transaction_count | total_sales | avg_order_value | share_pct  |
+---------+-------------------+-------------+-----------------+------------+
|  Shanghai  |       3,842       |  2,847,392  |     741.18      |   12.3     |
|  Beijing   |       3,621       |  2,651,208  |     732.18      |   11.5     |
|  Guangzhou |       3,105       |  2,398,567  |     772.49      |   10.4     |
|  Shenzhen  |       2,987       |  2,287,431  |     765.55      |    9.9     |
|  Chengdu   |       2,756       |  2,156,892  |     782.63      |    9.3     |
+---------+-------------------+-------------+-----------------+------------+

Analysis 2: Z-Score Anomaly Detection — Find Abnormal Sales Spikes

This is the highlight of this article: statistical anomaly detection using pure SQL — no Python, no scikit-learn required.

spike_sql = """
WITH daily_sales AS (
    SELECT 
        COALESCE("日期", date) AS sale_date,
        ROUND(SUM(amount), 2) AS daily_total
    FROM read_csv_auto('sales_data/store_*.csv')
    GROUP BY 1
),
with_stats AS (
    SELECT 
        sale_date,
        daily_total,
        AVG(daily_total) OVER () AS avg_daily,
        STDDEV_SAMP(daily_total) OVER () AS std_daily
    FROM daily_sales
)
SELECT 
    sale_date,
    daily_total,
    ROUND((daily_total - avg_daily) / NULLIF(std_daily, 0), 2) AS z_score,
    CASE 
        WHEN z_score > 2 THEN 'Anomaly Peak'
        WHEN z_score < -2 THEN 'Anomaly Dip'
        ELSE 'Normal'
    END AS status
FROM with_stats
WHERE ABS(z_score) > 2
ORDER BY z_score DESC;
"""

con.execute(spike_sql).print()

This uses window functions AVG() OVER() and STDDEV_SAMP() OVER() to compute the Z-score of each date relative to the overall average. Values beyond ±2 standard deviations are flagged as anomalies.

Analysis 3: Category Cross-Analysis — Which Category Contributes Most Revenue?

category_sql = """
SELECT 
    COALESCE("类别", category) AS category,
    COUNT(DISTINCT COALESCE("门店", store)) AS store_count,
    ROUND(SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)), 2) AS gross_revenue,
    ROUND(AVG(COALESCE("数量", qty) * COALESCE("单价", unit_price)), 2) AS avg_transaction,
    ROUND(100.0 * SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)) / 
          NULLIF(SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)) OVER (), 0), 1) AS revenue_share
FROM read_csv_auto('sales_data/store_*.csv')
GROUP BY 1
ORDER BY gross_revenue DESC;
"""

con.execute(category_sql).print()

5. Data Quality Checks: Inspect Before You Analyze

Before diving into analysis, check the data quality first:

# Check null distribution
quality_sql = """
SELECT 
    'NULL_COUNT' AS metric,
    SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_count,
    SUM(CASE WHEN quantity IS NULL THEN 1 ELSE 0 END) AS qty_null,
    SUM(CASE WHEN store IS NULL THEN 1 ELSE 0 END) AS store_null,
    COUNT(*) AS total_rows
FROM read_csv_auto('sales_data/store_*.csv')

UNION ALL

SELECT 
    'VALUE_RANGE' AS metric,
    MIN(amount)::VARCHAR,
    MAX(amount)::VARCHAR,
    MIN(quantity)::VARCHAR,
    MAX(quantity)::VARCHAR
FROM read_csv_auto('sales_data/store_*.csv');
"""

con.execute(quality_sql).print()

Example output:

+-------------+---------------------+------------------+---------------+--------------+
|   metric    |       null_count    |     qty_null     |   store_null  | total_rows   |
+-------------+---------------------+------------------+---------------+--------------+
| NULL_COUNT  |          23         |       5          |      12       |    187,500   |
| VALUE_RANGE |      0.00           |     999,999.00   |     1         |    500.00    |
+-------------+---------------------+------------------+---------------+--------------+

This quickly reveals which fields have high null rates and whether data ranges are reasonable.

6. Performance Comparison: DuckDB vs Excel vs Pandas

Data ScaleExcelPandasDuckDB
10 files × 1,000 rows~30 sec (manual)0.5 sec0.05 sec
50 files × 5,000 rowsFrozen3 sec0.1 sec
200 files × 10,000 rowsImpossible45 sec0.3 sec
Memory UsageHigh (Excel独占)High (2x RAM)Low (streaming)

Key difference: DuckDB uses lazy evaluation — it doesn’t load all data into memory at once. Instead, it reads and processes data on-the-fly. This means even if your files total 10GB, your computer doesn’t need 10GB of RAM.

7. Export to Parquet: More Efficient Sharing

# Directly write to Parquet — smaller file, faster reads
con.execute("""
    COPY (
        SELECT 
            COALESCE("日期", date) AS sale_date,
            COALESCE("商品名称", product) AS product_name,
            COALESCE("类别", category) AS category,
            COALESCE("数量", qty) AS quantity,
            COALESCE("单价", unit_price) AS unit_price,
            COALESCE("销售额", amount) AS amount,
            COALESCE("门店", store) AS store
        FROM read_csv_auto('sales_data/store_*.csv')
    ) TO 'merged_all_stores.parquet' (FORMAT PARQUET);
""")

import os
print(f"Parquet file size: {os.path.getsize('merged_all_stores.parquet') / 1024:.1f} KB")

Parquet files are typically 30–50% smaller than CSVs, preserve data types, and don’t require type inference when re-read by DuckDB.

If your files are in the cloud (e.g., Alibaba Cloud OSS or AWS S3), DuckDB can read them directly using s3:// or oss:// protocols — no need to download locally.

8. Monetization Advice

With these skills, here’s how you can make money:

  1. Freelance gigs: On platforms like Upwork or Freelancer, take “data cleaning/merging” tasks. What takes others 3 hours, you can do in 10 minutes with DuckDB. Charge $50–500 per task.

  2. Template products: Package your merge_csv_glob() function into a Streamlit web tool, host it on Streamlit Community Cloud for free, and monetize through ads or premium features.

  3. Paid courses: Publish a “DuckDB in Practice” video series on YouTube/Bilibili, build an audience, then launch paid courses ($10–40 per enrollment).

  4. Corporate training: Many traditional companies’ finance and operations teams still manually merge spreadsheets in Excel. Offer DuckDB training workshops — charge $500–2,000 per session.

  5. SaaS product: Build an “online spreadsheet merger & analyzer” SaaS platform using DuckDB + FastAPI, with monthly subscription pricing.

The money isn’t in the technology itself — it’s in how much pain you solve with it.

Conclusion

What used to take half a day now takes 10 seconds. What really determines work efficiency is often not the flashiest technology, but how much of your repetitive daily work you’ve automated.

Next time you face a “merge dozens of files” task — boring but time-consuming — think about whether DuckDB can handle it in one line of SQL. You’ll find you’ve gained several hours back.

Architecture diagram

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy