Multi-CSV File Merging: Real-World Multi-Store Sales Analysis with DuckDB

Introduction

“Can you analyze the sales data from all our stores?”

Behind this innocent request lies a mess: each store has its own CSV file, generated daily, with inconsistent filenames, mixed column names, and varying encodings. The traditional approach — writing a Python script to iterate through directories, read each file, concatenate DataFrames, handle encoding issues — is slow, fragile, and error-prone.

This is where DuckDB shines. In this tutorial, we’ll walk through a multi-store sales data consolidation project, from scattered CSV files to a multi-dimensional analytical report — using nothing but SQL.

The Scenario

A retail chain has 5 stores. Each store generates a daily sales CSV. The file structure looks like this:

data/
├── store_001_daily_20260501.csv
├── store_001_daily_20260502.csv
├── store_002_daily_20260501.csv
├── store_002_daily_20260502.csv
├── store_003_daily_20260501.csv
├── store_003_daily_20260502.csv
├── store_004_daily_20260501.csv
├── store_004_daily_20260502.csv
├── store_005_daily_20260501.csv
├── store_005_daily_20260502.csv

Each CSV has the same structure with Chinese column names (a common real-world complication):

订单号,商品名称,单价,数量,金额,销售日期,收银员
ORD001,Latte,32.00,2,64.00,2026-05-01,Zhang San
ORD002,American Coffee,25.00,1,25.00,2026-05-01,Li Si

Step 1: Read All CSVs with a Single Wildcard

Traditional approach: write a Python script to walk directories, read each file one by one, concatenate DataFrames, and handle type inference. DuckDB approach: one line of SQL.

-- Read all CSV files using a glob pattern — DuckDB auto-infers the schema
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('data/*.csv');

-- Check the merged result
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT 订单号) AS total_orders FROM raw_sales;

read_csv_auto is DuckDB’s killer feature for CSV wrangling:

The * wildcard matches all CSV files in the directory
Automatically infers column names, data types, and delimiters
Supports recursive glob patterns: **/*.csv matches subdirectories
Use union_by_name=true for files with slightly different column structures

-- More robust: auto-merge by column name
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('data/*.csv', union_by_name=true);

-- Inspect the auto-inferred schema
DESCRIBE raw_sales;

Step 2: Extract Store Info and Dates from Filenames

The store ID and date are encoded in the filenames. DuckDB’s regex and file-path functions make extraction trivial:

-- Extract metadata from filenames
CREATE TABLE sales_with_meta AS
SELECT
  filename,
  regexp_extract(filename, 'store_(\d+)', 1) AS store_id,
  regexp_extract(filename, '(\d{8})\.csv', 1) AS date_string,
  *
FROM read_csv_auto('data/*.csv', filename=true, union_by_name=true);

-- Convert date strings to proper DATE type
CREATE TABLE sales_clean AS
SELECT
  store_id,
  strptime(date_string, '%Y%m%d')::DATE AS sale_date,
  订单号 AS order_id,
  商品名称 AS product_name,
  单价 AS unit_price,
  数量 AS quantity,
  金额 AS amount,
  收银员 AS cashier
FROM sales_with_meta;

The filename=true parameter adds a filename column that records which file each row came from — invaluable for debugging multi-file merges.

Step 3: Analysis with Chinese Column Names

DuckDB supports Chinese (and any Unicode) column names natively. No configuration, no quoting tricks needed:

-- Sales ranking by store
SELECT
  store_id,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT order_id) AS order_count,
  SUM(quantity) AS total_units,
  ROUND(AVG(amount), 2) AS avg_order_value
FROM sales_clean
GROUP BY store_id
ORDER BY total_revenue DESC;

-- Top 10 best-selling products
SELECT
  product_name,
  SUM(quantity) AS total_units_sold,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT store_id) AS stores_stocked
FROM sales_clean
GROUP BY product_name
ORDER BY total_revenue DESC
LIMIT 10;

-- Cashier performance ranking
SELECT
  cashier,
  store_id,
  COUNT(*) AS transactions_processed,
  SUM(amount) AS total_handled
FROM sales_clean
GROUP BY cashier, store_id
ORDER BY total_handled DESC;

Step 4: Time-Based Aggregation with strftime

Time-based aggregation is the backbone of sales analysis. DuckDB’s strftime function provides Python-style date formatting:

-- Daily aggregation
SELECT
  strftime(sale_date, '%Y-%m-%d') AS day,
  SUM(amount) AS daily_revenue
FROM sales_clean
GROUP BY day
ORDER BY day;

-- Weekly aggregation
SELECT
  strftime(sale_date, '%Y-W%W') AS week,
  SUM(amount) AS weekly_revenue,
  COUNT(DISTINCT sale_date) AS operating_days
FROM sales_clean
GROUP BY week
ORDER BY week;

-- Monthly aggregation
SELECT
  strftime(sale_date, '%Y-%m') AS month,
  SUM(amount) AS monthly_revenue,
  SUM(quantity) AS monthly_units,
  ROUND(AVG(amount), 2) AS avg_daily_revenue
FROM sales_clean
GROUP BY month
ORDER BY month;

-- Time-of-day analysis (sales peak hours)
SELECT
  CASE
    WHEN strftime(sale_date, '%H') BETWEEN '06' AND '09' THEN 'Breakfast'
    WHEN strftime(sale_date, '%H') BETWEEN '10' AND '13' THEN 'Lunch'
    WHEN strftime(sale_date, '%H') BETWEEN '14' AND '17' THEN 'Afternoon Tea'
    ELSE 'Dinner'
  END AS time_period,
  SUM(amount) AS revenue
FROM sales_clean
GROUP BY time_period
ORDER BY revenue DESC;

Common strftime format specifiers:

Format	Meaning	Example
`%Y`	4-digit year	2026
`%m`	2-digit month	05
`%d`	2-digit day	11
`%W`	Week of year	19
`%w`	Day of week (0-6)	1
`%H`	Hour (00-23)	14

Step 5: Export to Parquet

Once your analysis is complete, export the results as Parquet — 10x faster reads, 5x smaller files, and native columnar storage:

-- Export cleaned data to Parquet
COPY sales_clean TO 'output/sales_clean.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);

-- Export an analytical report
COPY (
  SELECT
    store_id,
    strftime(sale_date, '%Y-%m') AS month,
    strftime(sale_date, '%W') AS week_number,
    product_name,
    SUM(quantity) AS total_units,
    SUM(amount) AS total_revenue
  FROM sales_clean
  GROUP BY store_id, month, week_number, product_name
  ORDER BY store_id, month, week_number, total_revenue DESC
) TO 'output/daily_report.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);

-- Query Parquet files directly
SELECT store_id, SUM(total_revenue) AS total
FROM read_parquet('output/*.parquet')
GROUP BY store_id
ORDER BY store_id;

Why Parquet matters:

Amazing compression: ZSTD-compressed Parquet is typically only 20% of the CSV size
Columnar storage: Only reads the columns you query — dramatically faster I/O
Self-describing schema: Type information is embedded in the file, no DDL needed
DuckDB native optimization: Projection pushdown, predicate pushdown, late materialization

Complete Workflow Script

Here’s the entire pipeline as a single repeatable SQL script:

-- merge_analysis.sql
-- Multi-store sales data consolidation & analysis — from CSV to Parquet

-- 1. Import — read all CSVs with wildcards
CREATE TABLE raw AS
SELECT * FROM read_csv_auto('data/*.csv', filename=true, union_by_name=true);

-- 2. Clean & transform — extract metadata from filenames
CREATE TABLE clean AS
SELECT
  regexp_extract(filename, 'store_(\d+)', 1) AS store_id,
  strptime(regexp_extract(filename, '(\d{8})\.csv', 1), '%Y%m%d') AS sale_date,
  订单号 AS order_id,
  商品名称 AS product_name,
  单价 AS unit_price,
  数量 AS quantity,
  金额 AS amount,
  收银员 AS cashier
FROM raw;

-- 3. Aggregate — monthly store performance
CREATE TABLE monthly_summary AS
SELECT
  store_id,
  strftime(sale_date, '%Y-%m') AS month,
  COUNT(DISTINCT order_id) AS order_count,
  SUM(amount) AS revenue,
  SUM(quantity) AS units_sold,
  ROUND(AVG(amount), 2) AS avg_order_value
FROM clean
GROUP BY store_id, month
ORDER BY store_id, month;

-- 4. Export — save as compressed Parquet
COPY clean TO 'output/clean_data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
COPY monthly_summary TO 'output/monthly_summary.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- 5. Quick validation
SELECT 'Total Rows' AS metric, COUNT(*)::VARCHAR AS value FROM clean
UNION ALL
SELECT 'Total Stores', COUNT(DISTINCT store_id)::VARCHAR FROM clean
UNION ALL
SELECT 'Date Range', MIN(sale_date)::VARCHAR || ' ~ ' || MAX(sale_date)::VARCHAR FROM clean;

Run it with:

duckdb < merge_analysis.sql
# Or interactively:
duckdb -c ".read merge_analysis.sql"

Monetization SOP: From Technical Skill to Revenue

Being able to write SQL is one thing. Being able to package data integration as a deliverable service is where the real value lies.

Pricing Strategy

Tier	Deliverable	Price	Target Client
Basic	One-time CSV merge + 3 summary tables	$100–200	Small businesses (1–5 stores)
Standard	Multi-source merge + weekly/monthly report templates + Parquet export	$300–500	Mid-size chains (5–20 stores)
Premium	Fully automated pipeline + custom dashboard + ongoing maintenance	$800–2,000/month	Large chains (20+ stores)

Customer Acquisition Channels

Targeted outreach
- Write industry-specific case studies (“How a coffee chain saved 20 hours/week on reporting”)
- Open-source a basic version on GitHub with your contact info in the README
- Post on Hacker News / Lobsters when you hit interesting performance numbers
Partner channels
- Partner with POS system vendors and ERP implementation firms
- Collaborate with accounting/financial services firms (they have the client base)
- Join retail/restaurant industry communities and forums
Content marketing
- Write niche blog posts (e.g., “Data Stack for a 10-Store Retail Chain: Under $100/mo”)
- Create short demo videos showing before/after
- Offer free 30-minute data health check consultations

Delivery Checklist

## Delivery Checklist
1. ✅ SQL automation script (merge_analysis.sql)
2. ✅ Data dictionary (PDF/Excel)
3. ✅ Cleaned dataset (Parquet format)
4. ✅ Monthly/weekly report templates
5. ✅ README with operation guide
6. ✅ 1-week free remote support

Include a Data Health Report with every delivery:

Completeness: null values, missing dates, outlier detection
Consistency: duplicate records, order ID conflicts, referential integrity
Performance: current pipeline runtime, optimization recommendations

Upsell Opportunities

Service	Description	Price Range
Real-time Dashboard	Streamlit/Grafana live dashboard for store managers	$1,000–3,000
AI-generated Reports	Weekly narrative summaries via LLM integration	$500–1,000/mo
Anomaly Alerts	Automated alerts for unusual sales drops or spikes	$200–500/mo
Data API	Standardized API for POS system integration	$300–800/mo

Handling Common Objections

“We can just use Excel.” Response: Excel chokes around 100K rows and can’t auto-merge daily files. This solution reads and aggregates 1M+ rows in under 5 seconds, with automatic daily updates via cron.

“It’s too expensive.” Response: Let’s quantify your current time cost. If you spend 30 minutes/day manually merging spreadsheets, that’s 15 hours/month. Even at minimum wage, you’re spending more on manual work in 3 months than this solution costs.

“We don’t really need this.” Response: Let me do a free 30-minute data health check. I’ll merge your files and show you insights you can’t get from isolated spreadsheets — which store is most profitable, which products are cannibalizing each other, and where you’re leaving money on the table.

FAQ

Q1: What about file encoding issues?

-- Specify UTF-8 or other encodings
SELECT * FROM read_csv_auto('data/*.csv', encoding='utf-8');
SELECT * FROM read_csv_auto('data/*.csv', encoding='latin1');

Q2: Column names have inconsistent whitespace?

-- Normalize column names automatically
SELECT * FROM read_csv_auto('data/*.csv', normalize_names=true);

Q3: Too many files, not enough memory?

-- Read in batches, append to table
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('data/2026-01/*.csv');
INSERT INTO sales SELECT * FROM read_csv_auto('data/2026-02/*.csv');
-- Repeat per month...

Q4: How to automate daily CSV ingestion?

Set up a cron job:

# crontab -e
# Run every day at 2 AM
0 2 * * * cd /path/to/project && duckdb < merge_analysis.sql

Conclusion

From scattered multi-store CSV files to queryable, compressed Parquet datasets — DuckDB turns what used to be a tedious Python scripting task into a handful of clean SQL statements. The core takeaways:

read_csv_auto with glob + filename=true: One-shot reads, provenance tracking
Regex extraction + strptime: Reverse-engineer metadata from filenames
Native Unicode column name support: No friction for international teams
strftime time aggregation: Flexible time-based analytics
Parquet export: 10x faster downstream analysis

This workflow isn’t limited to retail sales. Any scenario involving multiple files, multiple sources, and recurring consolidation — multi-warehouse inventory, multi-server logs, multi-location foot traffic — benefits from the same pattern.

Try it yourself: point DuckDB at a directory of CSVs, write a single read_csv_auto('*.csv') query, and see how far SQL alone can take you.