DuckDB Multi-File Reading & Data Quality Analysis: From 50 Chaotic CSVs to One-Click Insights
Summary: Your data is rarely in a single clean table. It’s scattered across dozens of CSV files with inconsistent column names, mixed encodings, and missing fields. Manual merging with Excel? Impractical. Writing Python scripts with
glob+concat? High maintenance cost. DuckDB handles all of this with a single SQL query — and does data quality checks and anomaly detection along the way.
1. The Scenario: This Is Not Just Your Problem
You’ve been there — your boss asks you to “summarize last month’s sales data from all branches.” You open your email and receive 50 Excel files, each with slightly different column names: some use “日期”, others use “date”; some have a “store” column, others don’t.
So you start: open each file → copy data → paste into the master sheet → adjust formatting → verify numbers → realize you missed two rows → start over. Three hours later, you’re still not sure if it’s correct.
Your goal: Let people solve problems the way they’re used to (Excel-style), but power it with DuckDB underneath.
The following roles spend at least 2–5 hours per week on “merging spreadsheets”:
| Role | Typical Scenario |
|---|---|
| Finance | Monthly consolidation of subsidiary reports |
| Operations | Aggregate multi-channel campaign data |
| HR | Combine department attendance and performance records |
| Marketing | Collect regional campaign results |
| Supply Chain | Compare quotes from multiple vendors |
2. Data Setup: Simulating 50 Messy Store Sales CSVs
Imagine you have 50 CSV files, each representing a branch’s monthly sales data:
sales_data/
├── store_001_sales.csv
├── store_002_sales.csv
├── ...
└── store_050_sales.csv
Each file has a similar but not identical structure:
日期,商品名称,类别,数量,单价,销售额,门店
2026-05-01,Keyboard,Peripherals,10,299,2990,Beijing
2026-05-01,Mouse,Peripherals,20,99,1980,Beijing
2026-05-02,Monitor,Hardware,5,2599,12995,Beijing
Note: Some files use Chinese column names, some use English, and some mix both.
3. Core Solution: Merge 50 Files in One Line of Code
Method 1: Glob Wildcards — The Simplest Approach
import duckdb
con = duckdb.connect(":memory:")
# One line: read all CSVs in the directory and auto-merge
result = con.execute("""
SELECT * FROM read_csv_auto('sales_data/store_*.csv')
""").fetchdf()
print(f"Merged {len(result)} records")
That’s it. DuckDB automatically:
- Identifies all matching files
- Infers column names and data types
- Aligns fields with different column names
- Handles missing columns (fills with NULL)
Method 2: Handling Mismatched Column Names with COALESCE
When column names vary significantly across files, use COALESCE for unified mapping:
result = con.execute("""
SELECT
COALESCE("日期", date) AS sale_date,
COALESCE("商品名称", product) AS product_name,
COALESCE("类别", category) AS category,
COALESCE("数量", qty) AS quantity,
COALESCE("单价", unit_price) AS unit_price,
COALESCE("销售额", amount) AS amount,
COALESCE("门店", store) AS store
FROM read_csv_auto('sales_data/store_*.csv')
""").fetchdf()
Here, COALESCE tries the Chinese column name first; if unavailable, falls back to the English name. Regardless of what language the original files use, everything merges correctly.
Method 3: Wrap It Into a Reusable Utility Function
import duckdb
from pathlib import Path
from typing import Optional
def merge_csv_glob(
pattern: str,
output_format: str = "dataframe",
columns: Optional[dict] = None
):
"""
Merge all CSV files matching a glob pattern
Args:
pattern: glob pattern, e.g., 'sales_data/*.csv'
output_format: 'dataframe', 'table', or 'parquet'
columns: column name mapping dict, {'old_name': 'new_name'}
Returns:
pandas DataFrame or None
"""
con = duckdb.connect(":memory:")
base_query = f"SELECT * FROM read_csv_auto('{pattern}')"
if columns:
col_mapping = ", ".join(
f"COALESCE({old!r}, {new!r}) AS {new}"
for old, new in columns.items()
)
query = f"SELECT {col_mapping} FROM ({base_query})"
else:
query = base_query
if output_format == "dataframe":
return con.execute(query).fetchdf()
elif output_format == "parquet":
output_file = pattern.replace('*', 'merged_output')
con.execute(f"COPY ({query}) TO '{output_file}' (FORMAT PARQUET)")
return None
con.close()
# Usage example
df = merge_csv_glob(
'sales_data/store_*.csv',
output_format='dataframe',
columns={
'日期': 'sale_date',
'商品名称': 'product_name',
'类别': 'category',
'数量': 'quantity',
'单价': 'unit_price',
'销售额': 'amount',
'门店': 'store'
}
)
print(f"Merged: {len(df)} records")
4. Data Analysis After Merging: Where the Real Value Lies
Merging data is just the first step. The real value comes from analysis.
Analysis 1: Monthly Sales Ranking by Store
ranking_sql = """
SELECT
store,
COUNT(*) AS transaction_count,
ROUND(SUM(amount), 2) AS total_sales,
ROUND(AVG(amount), 2) AS avg_order_value,
ROUND(SUM(amount) * 100.0 / SUM(SUM(amount)) OVER (), 1) AS share_pct
FROM read_csv_auto('sales_data/store_*.csv')
GROUP BY store
ORDER BY total_sales DESC
LIMIT 10;
"""
con.execute(ranking_sql).print()
Example output:
+---------+-------------------+-------------+-----------------+------------+
| store | transaction_count | total_sales | avg_order_value | share_pct |
+---------+-------------------+-------------+-----------------+------------+
| Shanghai | 3,842 | 2,847,392 | 741.18 | 12.3 |
| Beijing | 3,621 | 2,651,208 | 732.18 | 11.5 |
| Guangzhou | 3,105 | 2,398,567 | 772.49 | 10.4 |
| Shenzhen | 2,987 | 2,287,431 | 765.55 | 9.9 |
| Chengdu | 2,756 | 2,156,892 | 782.63 | 9.3 |
+---------+-------------------+-------------+-----------------+------------+
Analysis 2: Z-Score Anomaly Detection — Find Abnormal Sales Spikes
This is the highlight of this article: statistical anomaly detection using pure SQL — no Python, no scikit-learn required.
spike_sql = """
WITH daily_sales AS (
SELECT
COALESCE("日期", date) AS sale_date,
ROUND(SUM(amount), 2) AS daily_total
FROM read_csv_auto('sales_data/store_*.csv')
GROUP BY 1
),
with_stats AS (
SELECT
sale_date,
daily_total,
AVG(daily_total) OVER () AS avg_daily,
STDDEV_SAMP(daily_total) OVER () AS std_daily
FROM daily_sales
)
SELECT
sale_date,
daily_total,
ROUND((daily_total - avg_daily) / NULLIF(std_daily, 0), 2) AS z_score,
CASE
WHEN z_score > 2 THEN 'Anomaly Peak'
WHEN z_score < -2 THEN 'Anomaly Dip'
ELSE 'Normal'
END AS status
FROM with_stats
WHERE ABS(z_score) > 2
ORDER BY z_score DESC;
"""
con.execute(spike_sql).print()
This uses window functions AVG() OVER() and STDDEV_SAMP() OVER() to compute the Z-score of each date relative to the overall average. Values beyond ±2 standard deviations are flagged as anomalies.
Analysis 3: Category Cross-Analysis — Which Category Contributes Most Revenue?
category_sql = """
SELECT
COALESCE("类别", category) AS category,
COUNT(DISTINCT COALESCE("门店", store)) AS store_count,
ROUND(SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)), 2) AS gross_revenue,
ROUND(AVG(COALESCE("数量", qty) * COALESCE("单价", unit_price)), 2) AS avg_transaction,
ROUND(100.0 * SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)) /
NULLIF(SUM(COALESCE("数量", qty) * COALESCE("单价", unit_price)) OVER (), 0), 1) AS revenue_share
FROM read_csv_auto('sales_data/store_*.csv')
GROUP BY 1
ORDER BY gross_revenue DESC;
"""
con.execute(category_sql).print()
5. Data Quality Checks: Inspect Before You Analyze
Before diving into analysis, check the data quality first:
# Check null distribution
quality_sql = """
SELECT
'NULL_COUNT' AS metric,
SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_count,
SUM(CASE WHEN quantity IS NULL THEN 1 ELSE 0 END) AS qty_null,
SUM(CASE WHEN store IS NULL THEN 1 ELSE 0 END) AS store_null,
COUNT(*) AS total_rows
FROM read_csv_auto('sales_data/store_*.csv')
UNION ALL
SELECT
'VALUE_RANGE' AS metric,
MIN(amount)::VARCHAR,
MAX(amount)::VARCHAR,
MIN(quantity)::VARCHAR,
MAX(quantity)::VARCHAR
FROM read_csv_auto('sales_data/store_*.csv');
"""
con.execute(quality_sql).print()
Example output:
+-------------+---------------------+------------------+---------------+--------------+
| metric | null_count | qty_null | store_null | total_rows |
+-------------+---------------------+------------------+---------------+--------------+
| NULL_COUNT | 23 | 5 | 12 | 187,500 |
| VALUE_RANGE | 0.00 | 999,999.00 | 1 | 500.00 |
+-------------+---------------------+------------------+---------------+--------------+
This quickly reveals which fields have high null rates and whether data ranges are reasonable.
6. Performance Comparison: DuckDB vs Excel vs Pandas
| Data Scale | Excel | Pandas | DuckDB |
|---|---|---|---|
| 10 files × 1,000 rows | ~30 sec (manual) | 0.5 sec | 0.05 sec |
| 50 files × 5,000 rows | Frozen | 3 sec | 0.1 sec |
| 200 files × 10,000 rows | Impossible | 45 sec | 0.3 sec |
| Memory Usage | High (Excel独占) | High (2x RAM) | Low (streaming) |
Key difference: DuckDB uses lazy evaluation — it doesn’t load all data into memory at once. Instead, it reads and processes data on-the-fly. This means even if your files total 10GB, your computer doesn’t need 10GB of RAM.
7. Export to Parquet: More Efficient Sharing
# Directly write to Parquet — smaller file, faster reads
con.execute("""
COPY (
SELECT
COALESCE("日期", date) AS sale_date,
COALESCE("商品名称", product) AS product_name,
COALESCE("类别", category) AS category,
COALESCE("数量", qty) AS quantity,
COALESCE("单价", unit_price) AS unit_price,
COALESCE("销售额", amount) AS amount,
COALESCE("门店", store) AS store
FROM read_csv_auto('sales_data/store_*.csv')
) TO 'merged_all_stores.parquet' (FORMAT PARQUET);
""")
import os
print(f"Parquet file size: {os.path.getsize('merged_all_stores.parquet') / 1024:.1f} KB")
Parquet files are typically 30–50% smaller than CSVs, preserve data types, and don’t require type inference when re-read by DuckDB.
If your files are in the cloud (e.g., Alibaba Cloud OSS or AWS S3), DuckDB can read them directly using s3:// or oss:// protocols — no need to download locally.
8. Monetization Advice
With these skills, here’s how you can make money:
Freelance gigs: On platforms like Upwork or Freelancer, take “data cleaning/merging” tasks. What takes others 3 hours, you can do in 10 minutes with DuckDB. Charge $50–500 per task.
Template products: Package your
merge_csv_glob()function into a Streamlit web tool, host it on Streamlit Community Cloud for free, and monetize through ads or premium features.Paid courses: Publish a “DuckDB in Practice” video series on YouTube/Bilibili, build an audience, then launch paid courses ($10–40 per enrollment).
Corporate training: Many traditional companies’ finance and operations teams still manually merge spreadsheets in Excel. Offer DuckDB training workshops — charge $500–2,000 per session.
SaaS product: Build an “online spreadsheet merger & analyzer” SaaS platform using DuckDB + FastAPI, with monthly subscription pricing.
The money isn’t in the technology itself — it’s in how much pain you solve with it.
Conclusion
What used to take half a day now takes 10 seconds. What really determines work efficiency is often not the flashiest technology, but how much of your repetitive daily work you’ve automated.
Next time you face a “merge dozens of files” task — boring but time-consuming — think about whether DuckDB can handle it in one line of SQL. You’ll find you’ve gained several hours back.
