Featured image of post Multi-CSV File Merging: Real-World Multi-Store Sales Analysis with DuckDB

Multi-CSV File Merging: Real-World Multi-Store Sales Analysis with DuckDB

In real-world data analysis, your data is rarely in a single clean table. It's scattered across dozens, sometimes hundreds of CSV files. This tutorial walks through a multi-store sales data consolidation with DuckDB — using read_csv_auto glob patterns, Chinese column names, strftime time aggregation, and Parquet export for a complete ETL pipeline.

Introduction

“Can you analyze the sales data from all our stores?”

Behind this innocent request lies a mess: each store has its own CSV file, generated daily, with inconsistent filenames, mixed column names, and varying encodings. The traditional approach — writing a Python script to iterate through directories, read each file, concatenate DataFrames, handle encoding issues — is slow, fragile, and error-prone.

This is where DuckDB shines. In this tutorial, we’ll walk through a multi-store sales data consolidation project, from scattered CSV files to a multi-dimensional analytical report — using nothing but SQL.

The Scenario

A retail chain has 5 stores. Each store generates a daily sales CSV. The file structure looks like this:

data/
├── store_001_daily_20260501.csv
├── store_001_daily_20260502.csv
├── store_002_daily_20260501.csv
├── store_002_daily_20260502.csv
├── store_003_daily_20260501.csv
├── store_003_daily_20260502.csv
├── store_004_daily_20260501.csv
├── store_004_daily_20260502.csv
├── store_005_daily_20260501.csv
├── store_005_daily_20260502.csv

Each CSV has the same structure with Chinese column names (a common real-world complication):

订单号,商品名称,单价,数量,金额,销售日期,收银员
ORD001,Latte,32.00,2,64.00,2026-05-01,Zhang San
ORD002,American Coffee,25.00,1,25.00,2026-05-01,Li Si

Step 1: Read All CSVs with a Single Wildcard

Traditional approach: write a Python script to walk directories, read each file one by one, concatenate DataFrames, and handle type inference. DuckDB approach: one line of SQL.

-- Read all CSV files using a glob pattern — DuckDB auto-infers the schema
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('data/*.csv');

-- Check the merged result
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT 订单号) AS total_orders FROM raw_sales;

read_csv_auto is DuckDB’s killer feature for CSV wrangling:

  • The * wildcard matches all CSV files in the directory
  • Automatically infers column names, data types, and delimiters
  • Supports recursive glob patterns: **/*.csv matches subdirectories
  • Use union_by_name=true for files with slightly different column structures
-- More robust: auto-merge by column name
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('data/*.csv', union_by_name=true);

-- Inspect the auto-inferred schema
DESCRIBE raw_sales;

Step 2: Extract Store Info and Dates from Filenames

The store ID and date are encoded in the filenames. DuckDB’s regex and file-path functions make extraction trivial:

-- Extract metadata from filenames
CREATE TABLE sales_with_meta AS
SELECT
  filename,
  regexp_extract(filename, 'store_(\d+)', 1) AS store_id,
  regexp_extract(filename, '(\d{8})\.csv', 1) AS date_string,
  *
FROM read_csv_auto('data/*.csv', filename=true, union_by_name=true);

-- Convert date strings to proper DATE type
CREATE TABLE sales_clean AS
SELECT
  store_id,
  strptime(date_string, '%Y%m%d')::DATE AS sale_date,
  订单号 AS order_id,
  商品名称 AS product_name,
  单价 AS unit_price,
  数量 AS quantity,
  金额 AS amount,
  收银员 AS cashier
FROM sales_with_meta;

The filename=true parameter adds a filename column that records which file each row came from — invaluable for debugging multi-file merges.

Step 3: Analysis with Chinese Column Names

DuckDB supports Chinese (and any Unicode) column names natively. No configuration, no quoting tricks needed:

-- Sales ranking by store
SELECT
  store_id,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT order_id) AS order_count,
  SUM(quantity) AS total_units,
  ROUND(AVG(amount), 2) AS avg_order_value
FROM sales_clean
GROUP BY store_id
ORDER BY total_revenue DESC;

-- Top 10 best-selling products
SELECT
  product_name,
  SUM(quantity) AS total_units_sold,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT store_id) AS stores_stocked
FROM sales_clean
GROUP BY product_name
ORDER BY total_revenue DESC
LIMIT 10;

-- Cashier performance ranking
SELECT
  cashier,
  store_id,
  COUNT(*) AS transactions_processed,
  SUM(amount) AS total_handled
FROM sales_clean
GROUP BY cashier, store_id
ORDER BY total_handled DESC;

Step 4: Time-Based Aggregation with strftime

Time-based aggregation is the backbone of sales analysis. DuckDB’s strftime function provides Python-style date formatting:

-- Daily aggregation
SELECT
  strftime(sale_date, '%Y-%m-%d') AS day,
  SUM(amount) AS daily_revenue
FROM sales_clean
GROUP BY day
ORDER BY day;

-- Weekly aggregation
SELECT
  strftime(sale_date, '%Y-W%W') AS week,
  SUM(amount) AS weekly_revenue,
  COUNT(DISTINCT sale_date) AS operating_days
FROM sales_clean
GROUP BY week
ORDER BY week;

-- Monthly aggregation
SELECT
  strftime(sale_date, '%Y-%m') AS month,
  SUM(amount) AS monthly_revenue,
  SUM(quantity) AS monthly_units,
  ROUND(AVG(amount), 2) AS avg_daily_revenue
FROM sales_clean
GROUP BY month
ORDER BY month;

-- Time-of-day analysis (sales peak hours)
SELECT
  CASE
    WHEN strftime(sale_date, '%H') BETWEEN '06' AND '09' THEN 'Breakfast'
    WHEN strftime(sale_date, '%H') BETWEEN '10' AND '13' THEN 'Lunch'
    WHEN strftime(sale_date, '%H') BETWEEN '14' AND '17' THEN 'Afternoon Tea'
    ELSE 'Dinner'
  END AS time_period,
  SUM(amount) AS revenue
FROM sales_clean
GROUP BY time_period
ORDER BY revenue DESC;

Common strftime format specifiers:

FormatMeaningExample
%Y4-digit year2026
%m2-digit month05
%d2-digit day11
%WWeek of year19
%wDay of week (0-6)1
%HHour (00-23)14

Step 5: Export to Parquet

Once your analysis is complete, export the results as Parquet — 10x faster reads, 5x smaller files, and native columnar storage:

-- Export cleaned data to Parquet
COPY sales_clean TO 'output/sales_clean.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);

-- Export an analytical report
COPY (
  SELECT
    store_id,
    strftime(sale_date, '%Y-%m') AS month,
    strftime(sale_date, '%W') AS week_number,
    product_name,
    SUM(quantity) AS total_units,
    SUM(amount) AS total_revenue
  FROM sales_clean
  GROUP BY store_id, month, week_number, product_name
  ORDER BY store_id, month, week_number, total_revenue DESC
) TO 'output/daily_report.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);

-- Query Parquet files directly
SELECT store_id, SUM(total_revenue) AS total
FROM read_parquet('output/*.parquet')
GROUP BY store_id
ORDER BY store_id;

Why Parquet matters:

  • Amazing compression: ZSTD-compressed Parquet is typically only 20% of the CSV size
  • Columnar storage: Only reads the columns you query — dramatically faster I/O
  • Self-describing schema: Type information is embedded in the file, no DDL needed
  • DuckDB native optimization: Projection pushdown, predicate pushdown, late materialization

Complete Workflow Script

Here’s the entire pipeline as a single repeatable SQL script:

-- merge_analysis.sql
-- Multi-store sales data consolidation & analysis — from CSV to Parquet

-- 1. Import — read all CSVs with wildcards
CREATE TABLE raw AS
SELECT * FROM read_csv_auto('data/*.csv', filename=true, union_by_name=true);

-- 2. Clean & transform — extract metadata from filenames
CREATE TABLE clean AS
SELECT
  regexp_extract(filename, 'store_(\d+)', 1) AS store_id,
  strptime(regexp_extract(filename, '(\d{8})\.csv', 1), '%Y%m%d') AS sale_date,
  订单号 AS order_id,
  商品名称 AS product_name,
  单价 AS unit_price,
  数量 AS quantity,
  金额 AS amount,
  收银员 AS cashier
FROM raw;

-- 3. Aggregate — monthly store performance
CREATE TABLE monthly_summary AS
SELECT
  store_id,
  strftime(sale_date, '%Y-%m') AS month,
  COUNT(DISTINCT order_id) AS order_count,
  SUM(amount) AS revenue,
  SUM(quantity) AS units_sold,
  ROUND(AVG(amount), 2) AS avg_order_value
FROM clean
GROUP BY store_id, month
ORDER BY store_id, month;

-- 4. Export — save as compressed Parquet
COPY clean TO 'output/clean_data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
COPY monthly_summary TO 'output/monthly_summary.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- 5. Quick validation
SELECT 'Total Rows' AS metric, COUNT(*)::VARCHAR AS value FROM clean
UNION ALL
SELECT 'Total Stores', COUNT(DISTINCT store_id)::VARCHAR FROM clean
UNION ALL
SELECT 'Date Range', MIN(sale_date)::VARCHAR || ' ~ ' || MAX(sale_date)::VARCHAR FROM clean;

Run it with:

duckdb < merge_analysis.sql
# Or interactively:
duckdb -c ".read merge_analysis.sql"

Monetization SOP: From Technical Skill to Revenue

Being able to write SQL is one thing. Being able to package data integration as a deliverable service is where the real value lies.

Pricing Strategy

TierDeliverablePriceTarget Client
BasicOne-time CSV merge + 3 summary tables$100–200Small businesses (1–5 stores)
StandardMulti-source merge + weekly/monthly report templates + Parquet export$300–500Mid-size chains (5–20 stores)
PremiumFully automated pipeline + custom dashboard + ongoing maintenance$800–2,000/monthLarge chains (20+ stores)

Customer Acquisition Channels

  1. Targeted outreach

    • Write industry-specific case studies (“How a coffee chain saved 20 hours/week on reporting”)
    • Open-source a basic version on GitHub with your contact info in the README
    • Post on Hacker News / Lobsters when you hit interesting performance numbers
  2. Partner channels

    • Partner with POS system vendors and ERP implementation firms
    • Collaborate with accounting/financial services firms (they have the client base)
    • Join retail/restaurant industry communities and forums
  3. Content marketing

    • Write niche blog posts (e.g., “Data Stack for a 10-Store Retail Chain: Under $100/mo”)
    • Create short demo videos showing before/after
    • Offer free 30-minute data health check consultations

Delivery Checklist

## Delivery Checklist
1. ✅ SQL automation script (merge_analysis.sql)
2. ✅ Data dictionary (PDF/Excel)
3. ✅ Cleaned dataset (Parquet format)
4. ✅ Monthly/weekly report templates
5. ✅ README with operation guide
6. ✅ 1-week free remote support

Include a Data Health Report with every delivery:

  • Completeness: null values, missing dates, outlier detection
  • Consistency: duplicate records, order ID conflicts, referential integrity
  • Performance: current pipeline runtime, optimization recommendations

Upsell Opportunities

ServiceDescriptionPrice Range
Real-time DashboardStreamlit/Grafana live dashboard for store managers$1,000–3,000
AI-generated ReportsWeekly narrative summaries via LLM integration$500–1,000/mo
Anomaly AlertsAutomated alerts for unusual sales drops or spikes$200–500/mo
Data APIStandardized API for POS system integration$300–800/mo

Handling Common Objections

“We can just use Excel.” Response: Excel chokes around 100K rows and can’t auto-merge daily files. This solution reads and aggregates 1M+ rows in under 5 seconds, with automatic daily updates via cron.

“It’s too expensive.” Response: Let’s quantify your current time cost. If you spend 30 minutes/day manually merging spreadsheets, that’s 15 hours/month. Even at minimum wage, you’re spending more on manual work in 3 months than this solution costs.

“We don’t really need this.” Response: Let me do a free 30-minute data health check. I’ll merge your files and show you insights you can’t get from isolated spreadsheets — which store is most profitable, which products are cannibalizing each other, and where you’re leaving money on the table.

FAQ

Q1: What about file encoding issues?

-- Specify UTF-8 or other encodings
SELECT * FROM read_csv_auto('data/*.csv', encoding='utf-8');
SELECT * FROM read_csv_auto('data/*.csv', encoding='latin1');

Q2: Column names have inconsistent whitespace?

-- Normalize column names automatically
SELECT * FROM read_csv_auto('data/*.csv', normalize_names=true);

Q3: Too many files, not enough memory?

-- Read in batches, append to table
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('data/2026-01/*.csv');
INSERT INTO sales SELECT * FROM read_csv_auto('data/2026-02/*.csv');
-- Repeat per month...

Q4: How to automate daily CSV ingestion?

Set up a cron job:

# crontab -e
# Run every day at 2 AM
0 2 * * * cd /path/to/project && duckdb < merge_analysis.sql

Architecture Overview

Conclusion

From scattered multi-store CSV files to queryable, compressed Parquet datasets — DuckDB turns what used to be a tedious Python scripting task into a handful of clean SQL statements. The core takeaways:

  1. read_csv_auto with glob + filename=true: One-shot reads, provenance tracking
  2. Regex extraction + strptime: Reverse-engineer metadata from filenames
  3. Native Unicode column name support: No friction for international teams
  4. strftime time aggregation: Flexible time-based analytics
  5. Parquet export: 10x faster downstream analysis

This workflow isn’t limited to retail sales. Any scenario involving multiple files, multiple sources, and recurring consolidation — multi-warehouse inventory, multi-server logs, multi-location foot traffic — benefits from the same pattern.

Try it yourself: point DuckDB at a directory of CSVs, write a single read_csv_auto('*.csv') query, and see how far SQL alone can take you.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.