Introduction
“Can you analyze the sales data from all our stores?”
Behind this innocent request lies a mess: each store has its own CSV file, generated daily, with inconsistent filenames, mixed column names, and varying encodings. The traditional approach — writing a Python script to iterate through directories, read each file, concatenate DataFrames, handle encoding issues — is slow, fragile, and error-prone.
This is where DuckDB shines. In this tutorial, we’ll walk through a multi-store sales data consolidation project, from scattered CSV files to a multi-dimensional analytical report — using nothing but SQL.
The Scenario
A retail chain has 5 stores. Each store generates a daily sales CSV. The file structure looks like this:
data/
├── store_001_daily_20260501.csv
├── store_001_daily_20260502.csv
├── store_002_daily_20260501.csv
├── store_002_daily_20260502.csv
├── store_003_daily_20260501.csv
├── store_003_daily_20260502.csv
├── store_004_daily_20260501.csv
├── store_004_daily_20260502.csv
├── store_005_daily_20260501.csv
├── store_005_daily_20260502.csv
Each CSV has the same structure with Chinese column names (a common real-world complication):
订单号,商品名称,单价,数量,金额,销售日期,收银员
ORD001,Latte,32.00,2,64.00,2026-05-01,Zhang San
ORD002,American Coffee,25.00,1,25.00,2026-05-01,Li Si
Step 1: Read All CSVs with a Single Wildcard
Instead of a script that walks directories, reads each file one by one, concatenates DataFrames, and patches up types, the DuckDB approach is one line of SQL:
-- Read all CSV files using a glob pattern — DuckDB auto-infers the schema
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('data/*.csv');
-- Check the merged result
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT 订单号) AS total_orders FROM raw_sales;
read_csv_auto is DuckDB’s killer feature for CSV wrangling:
- The * wildcard matches all CSV files in the directory
- Automatically infers column names, data types, and delimiters
- Supports recursive glob patterns: **/*.csv matches files in subdirectories (see the sketch below)
- Use union_by_name=true for files with slightly different column structures
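If each store drops its files into its own subdirectory instead of one flat folder, a recursive glob covers that too. A minimal sketch, assuming a hypothetical nested layout such as data/store_001/daily_20260501.csv:
-- Hypothetical nested layout: one subdirectory per store
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('data/**/*.csv', union_by_name=true);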
-- More robust: auto-merge by column name
CREATE TABLE raw_sales AS
SELECT * FROM read_csv_auto('data/*.csv', union_by_name=true);
-- Inspect the auto-inferred schema
DESCRIBE raw_sales;
Step 2: Extract Store Info and Dates from Filenames
The store ID and date are encoded in the filenames. DuckDB’s regex and file-path functions make extraction trivial:
-- Extract metadata from filenames
CREATE TABLE sales_with_meta AS
SELECT
filename,
regexp_extract(filename, 'store_(\d+)', 1) AS store_id,
regexp_extract(filename, '(\d{8})\.csv', 1) AS date_string,
*
FROM read_csv_auto('data/*.csv', filename=true, union_by_name=true);
-- Convert date strings to proper DATE type
CREATE TABLE sales_clean AS
SELECT
store_id,
strptime(date_string, '%Y%m%d')::DATE AS sale_date,
订单号 AS order_id,
商品名称 AS product_name,
单价 AS unit_price,
数量 AS quantity,
金额 AS amount,
收银员 AS cashier
FROM sales_with_meta;
The filename=true parameter adds a filename column that records which file each row came from — invaluable for debugging multi-file merges.
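Before moving on, it's worth confirming that the filename patterns actually matched everywhere. A quick sanity-check sketch against the table built above:
-- Files where the store ID or date could not be extracted (should return zero rows)
SELECT DISTINCT filename
FROM sales_with_meta
WHERE store_id IS NULL OR store_id = ''
   OR date_string IS NULL OR date_string = '';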
Step 3: Analysis with Chinese Column Names
DuckDB supports Chinese (and any Unicode) column names natively. No configuration, no quoting tricks needed.
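For example, you can aggregate on the raw table using the original Chinese headers directly. A quick sketch against raw_sales from Step 1:
-- Group by the original Chinese column names; no renaming or quoting required
SELECT 商品名称, SUM(金额) AS revenue
FROM raw_sales
GROUP BY 商品名称
ORDER BY revenue DESC;
The rest of this step's queries work from sales_clean and its English aliases: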
-- Sales ranking by store
SELECT
store_id,
SUM(amount) AS total_revenue,
COUNT(DISTINCT order_id) AS order_count,
SUM(quantity) AS total_units,
ROUND(AVG(amount), 2) AS avg_order_value
FROM sales_clean
GROUP BY store_id
ORDER BY total_revenue DESC;
-- Top 10 best-selling products
SELECT
product_name,
SUM(quantity) AS total_units_sold,
SUM(amount) AS total_revenue,
COUNT(DISTINCT store_id) AS stores_stocked
FROM sales_clean
GROUP BY product_name
ORDER BY total_revenue DESC
LIMIT 10;
-- Cashier performance ranking
SELECT
cashier,
store_id,
COUNT(*) AS transactions_processed,
SUM(amount) AS total_handled
FROM sales_clean
GROUP BY cashier, store_id
ORDER BY total_handled DESC;
Step 4: Time-Based Aggregation with strftime
Time-based aggregation is the backbone of sales analysis. DuckDB’s strftime function provides Python-style date formatting:
-- Daily aggregation
SELECT
strftime(sale_date, '%Y-%m-%d') AS day,
SUM(amount) AS daily_revenue
FROM sales_clean
GROUP BY day
ORDER BY day;
-- Weekly aggregation
SELECT
strftime(sale_date, '%Y-W%W') AS week,
SUM(amount) AS weekly_revenue,
COUNT(DISTINCT sale_date) AS operating_days
FROM sales_clean
GROUP BY week
ORDER BY week;
-- Monthly aggregation
SELECT
strftime(sale_date, '%Y-%m') AS month,
SUM(amount) AS monthly_revenue,
SUM(quantity) AS monthly_units,
ROUND(SUM(amount) / COUNT(DISTINCT sale_date), 2) AS avg_daily_revenue
FROM sales_clean
GROUP BY month
ORDER BY month;
-- Time-of-day analysis (sales peak hours)
-- Note: sale_date is a DATE, so strftime(sale_date, '%H') would always return '00'.
-- This query assumes the source data also carries an order timestamp, kept in the
-- cleaned table as a hypothetical sale_ts column.
SELECT
CASE
WHEN strftime(sale_ts, '%H') BETWEEN '06' AND '09' THEN 'Breakfast'
WHEN strftime(sale_ts, '%H') BETWEEN '10' AND '13' THEN 'Lunch'
WHEN strftime(sale_ts, '%H') BETWEEN '14' AND '17' THEN 'Afternoon Tea'
ELSE 'Dinner'
END AS time_period,
SUM(amount) AS revenue
FROM sales_clean
GROUP BY time_period
ORDER BY revenue DESC;
Common strftime format specifiers:
| Format | Meaning | Example |
|---|---|---|
| %Y | 4-digit year | 2026 |
| %m | 2-digit month | 05 |
| %d | 2-digit day | 11 |
| %W | Week of year (Monday as the first day) | 19 |
| %w | Day of week (0 = Sunday to 6 = Saturday) | 1 |
| %H | Hour (00-23) | 14 |
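As a quick illustration of the day-of-week specifier, here is a weekday-versus-weekend split. A sketch using the sales_clean table from Step 2:
-- Weekend vs. weekday revenue using %w (0 = Sunday, 6 = Saturday)
SELECT
CASE WHEN strftime(sale_date, '%w') IN ('0', '6') THEN 'Weekend' ELSE 'Weekday' END AS day_type,
SUM(amount) AS revenue,
ROUND(AVG(amount), 2) AS avg_order_value
FROM sales_clean
GROUP BY day_type
ORDER BY revenue DESC;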
Step 5: Export to Parquet
Once your analysis is complete, export the results as Parquet, which is natively columnar, typically far faster to read, and a fraction of the size of the source CSVs:
-- Export cleaned data to Parquet
COPY sales_clean TO 'output/sales_clean.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
-- Export an analytical report
COPY (
SELECT
store_id,
strftime(sale_date, '%Y-%m') AS month,
strftime(sale_date, '%W') AS week_number,
product_name,
SUM(quantity) AS total_units,
SUM(amount) AS total_revenue
FROM sales_clean
GROUP BY store_id, month, week_number, product_name
ORDER BY store_id, month, week_number, total_revenue DESC
) TO 'output/daily_report.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);
-- Query Parquet files directly
SELECT store_id, SUM(total_revenue) AS total
FROM read_parquet('output/*.parquet')
GROUP BY store_id
ORDER BY store_id;
Why Parquet matters:
- Amazing compression: ZSTD-compressed Parquet is typically only 20% of the CSV size
- Columnar storage: Only reads the columns you query — dramatically faster I/O
- Self-describing schema: Type information is embedded in the file, no DDL needed
- DuckDB native optimization: Projection pushdown, predicate pushdown, late materialization
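If you want to confirm the compression codec and row-group layout of an exported file, DuckDB can introspect Parquet metadata directly. A minimal sketch (the path matches the export above):
-- Inspect row groups, per-column layout, and compression of the exported file
SELECT row_group_id, row_group_num_rows, path_in_schema, compression
FROM parquet_metadata('output/sales_clean.parquet')
LIMIT 10;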
Complete Workflow Script
Here’s the entire pipeline as a single repeatable SQL script:
-- merge_analysis.sql
-- Multi-store sales data consolidation & analysis — from CSV to Parquet
-- 1. Import — read all CSVs with wildcards
CREATE TABLE raw AS
SELECT * FROM read_csv_auto('data/*.csv', filename=true, union_by_name=true);
-- 2. Clean & transform — extract metadata from filenames
CREATE TABLE clean AS
SELECT
regexp_extract(filename, 'store_(\d+)', 1) AS store_id,
strptime(regexp_extract(filename, '(\d{8})\.csv', 1), '%Y%m%d')::DATE AS sale_date,
订单号 AS order_id,
商品名称 AS product_name,
单价 AS unit_price,
数量 AS quantity,
金额 AS amount,
收银员 AS cashier
FROM raw;
-- 3. Aggregate — monthly store performance
CREATE TABLE monthly_summary AS
SELECT
store_id,
strftime(sale_date, '%Y-%m') AS month,
COUNT(DISTINCT order_id) AS order_count,
SUM(amount) AS revenue,
SUM(quantity) AS units_sold,
ROUND(AVG(amount), 2) AS avg_order_value
FROM clean
GROUP BY store_id, month
ORDER BY store_id, month;
-- 4. Export — save as compressed Parquet
COPY clean TO 'output/clean_data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
COPY monthly_summary TO 'output/monthly_summary.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
-- 5. Quick validation
SELECT 'Total Rows' AS metric, COUNT(*)::VARCHAR AS value FROM clean
UNION ALL
SELECT 'Total Stores', COUNT(DISTINCT store_id)::VARCHAR FROM clean
UNION ALL
SELECT 'Date Range', MIN(sale_date)::VARCHAR || ' ~ ' || MAX(sale_date)::VARCHAR FROM clean;
Run it with:
duckdb < merge_analysis.sql
# Or via the DuckDB CLI's .read command:
duckdb -c ".read merge_analysis.sql"
Monetization SOP: From Technical Skill to Revenue
Being able to write SQL is one thing. Being able to package data integration as a deliverable service is where the real value lies.
Pricing Strategy
| Tier | Deliverable | Price | Target Client |
|---|---|---|---|
| Basic | One-time CSV merge + 3 summary tables | $100–200 | Small businesses (1–5 stores) |
| Standard | Multi-source merge + weekly/monthly report templates + Parquet export | $300–500 | Mid-size chains (5–20 stores) |
| Premium | Fully automated pipeline + custom dashboard + ongoing maintenance | $800–2,000/month | Large chains (20+ stores) |
Customer Acquisition Channels
Targeted outreach
- Write industry-specific case studies (“How a coffee chain saved 20 hours/week on reporting”)
- Open-source a basic version on GitHub with your contact info in the README
- Post on Hacker News / Lobsters when you hit interesting performance numbers
Partner channels
- Partner with POS system vendors and ERP implementation firms
- Collaborate with accounting/financial services firms (they have the client base)
- Join retail/restaurant industry communities and forums
Content marketing
- Write niche blog posts (e.g., “Data Stack for a 10-Store Retail Chain: Under $100/mo”)
- Create short demo videos showing before/after
- Offer free 30-minute data health check consultations
Delivery Checklist
1. ✅ SQL automation script (merge_analysis.sql)
2. ✅ Data dictionary (PDF/Excel)
3. ✅ Cleaned dataset (Parquet format)
4. ✅ Monthly/weekly report templates
5. ✅ README with operation guide
6. ✅ 1-week free remote support
Include a Data Health Report with every delivery (a spot-check sketch follows the list):
- Completeness: null values, missing dates, outlier detection
- Consistency: duplicate records, order ID conflicts, referential integrity
- Performance: current pipeline runtime, optimization recommendations
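A minimal sketch of the kind of spot checks that feed the report, run against the clean table from the workflow script:
-- Data health spot checks: null amounts, amount consistency, and date coverage
SELECT
COUNT(*) FILTER (WHERE amount IS NULL) AS null_amounts,
COUNT(*) FILTER (WHERE amount <> unit_price * quantity) AS amount_mismatches,
COUNT(DISTINCT sale_date) AS distinct_days_covered
FROM clean;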
Upsell Opportunities
| Service | Description | Price Range |
|---|---|---|
| Real-time Dashboard | Streamlit/Grafana live dashboard for store managers | $1,000–3,000 |
| AI-generated Reports | Weekly narrative summaries via LLM integration | $500–1,000/mo |
| Anomaly Alerts | Automated alerts for unusual sales drops or spikes | $200–500/mo |
| Data API | Standardized API for POS system integration | $300–800/mo |
Handling Common Objections
“We can just use Excel.” Response: Excel slows to a crawl long before its roughly one-million-row sheet limit and can’t automatically merge new daily files. This solution reads and aggregates 1M+ rows in under 5 seconds, with automatic daily updates via cron.
“It’s too expensive.” Response: Let’s quantify your current time cost. If you spend 30 minutes/day manually merging spreadsheets, that’s 15 hours/month. Even at minimum wage, you’re spending more on manual work in 3 months than this solution costs.
“We don’t really need this.” Response: Let me do a free 30-minute data health check. I’ll merge your files and show you insights you can’t get from isolated spreadsheets — which store is most profitable, which products are cannibalizing each other, and where you’re leaving money on the table.
FAQ
Q1: What about file encoding issues?
-- Specify UTF-8 or other encodings
SELECT * FROM read_csv_auto('data/*.csv', encoding='utf-8');
SELECT * FROM read_csv_auto('data/*.csv', encoding='latin-1');
Q2: Column names have inconsistent whitespace?
-- Normalize column names automatically
SELECT * FROM read_csv_auto('data/*.csv', normalize_names=true);
Q3: Too many files, not enough memory?
-- Read in batches, append to table
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('data/2026-01/*.csv');
INSERT INTO sales SELECT * FROM read_csv_auto('data/2026-02/*.csv');
-- Repeat per month...
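If memory is the real constraint, you can also skip the intermediate table entirely and stream the CSVs straight into Parquet. A sketch (the output path is illustrative):
-- Stream from CSVs to Parquet without creating an intermediate table
COPY (
SELECT * FROM read_csv_auto('data/**/*.csv', union_by_name=true)
) TO 'output/all_sales.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);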
Q4: How to automate daily CSV ingestion?
Set up a cron job:
# crontab -e
# Run every day at 2 AM
0 2 * * * cd /path/to/project && duckdb < merge_analysis.sql
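One caveat if the job writes into a persistent database file (e.g. duckdb sales.db < merge_analysis.sql) rather than an in-memory session: the plain CREATE TABLE statements will fail on the second run. Swapping them for CREATE OR REPLACE TABLE makes the script safe to re-run. A sketch of the change for the first step:
-- Idempotent variant: safe to re-run against a persistent database file
CREATE OR REPLACE TABLE raw AS
SELECT * FROM read_csv_auto('data/*.csv', filename=true, union_by_name=true);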
Conclusion
From scattered multi-store CSV files to queryable, compressed Parquet datasets — DuckDB turns what used to be a tedious Python scripting task into a handful of clean SQL statements. The core takeaways:
- read_csv_auto with glob + filename=true: One-shot reads, provenance tracking
- Regex extraction + strptime: Reverse-engineer metadata from filenames
- Native Unicode column name support: No friction for international teams
- strftime time aggregation: Flexible time-based analytics
- Parquet export: dramatically faster downstream analysis
This workflow isn’t limited to retail sales. Any scenario involving multiple files, multiple sources, and recurring consolidation — multi-warehouse inventory, multi-server logs, multi-location foot traffic — benefits from the same pattern.
Try it yourself: point DuckDB at a directory of CSVs, write a single read_csv_auto('*.csv') query, and see how far SQL alone can take you.