
From Messy CSVs to Clean Reports: DuckDB’s One-Stop Solution
Have you ever taken on a data project where the client dumped a dozen CSV and Excel files on you? Some use Chinese column names, others use English. Dates come in 2026-07-01, 01/07/2026, or July 1, 2026 formats. Amount columns contain ¥29.90, $49.00, €78.90, or even 29.90元.
The traditional approach is to write a bunch of Python pandas cleaning scripts or manually replace values in Excel. But DuckDB can handle all of this in a single SQL query.
Today we’ll walk through a real side-hustle direction: providing data cleaning and profit analysis services for multi-platform e-commerce sellers. One DuckDB script can serve dozens of clients with near-zero marginal cost.
1. Data Standardization: Unifying Three Platforms’ Orders in One SQL
Imagine you receive a data request from a multi-platform e-commerce seller who needs to merge order data from Pinduoduo, Taobao, and Shopify. Each platform exports data in a completely different format:
Pinduoduo export format:
订单号,商品名称,实付金额,订单状态,下单时间
PDD123456,iPhone 15 Case,¥29.90,Completed,2026-07-01 14:30:00
Taobao export format:
order_id,product_title,pay_amount,status,create_time
TB789012,MagSafe Charger,$49.00,TRADE_SUCCESS,2026-07-01 15:20:00
Shopify export format:
order_number,item_name,total_price,fulfilled_status,created_at
#1001,Case + Charger Bundle,€78.90,fulfilled,2026-07-01T16:00:00Z
Step 1: Read and Standardize Each Platform’s Data Separately
-- Read Pinduoduo data, normalize fields
CREATE OR REPLACE TABLE pdd_raw AS
SELECT
'pdd' AS platform,
TRIM(订单号) AS order_id,
TRIM(商品名称) AS product_name,
CAST(REGEXP_REPLACE(实付金额, '[¥¥,]', '') AS DOUBLE) AS amount,
CASE
WHEN 订单状态 LIKE '%已完成%' THEN 'completed'
WHEN 订单状态 LIKE '%退款%' THEN 'refunded'
WHEN 订单状态 LIKE '%取消%' THEN 'cancelled'
ELSE 'pending'
END AS status,
CAST(下单时间 AS TIMESTAMP) AS order_time
FROM read_csv_auto('pinduoduo_orders.csv');
-- Read Taobao data, normalize fields
CREATE OR REPLACE TABLE taobao_raw AS
SELECT
'taobao' AS platform,
TRIM(order_id) AS order_id,
TRIM(product_title) AS product_name,
CAST(REGEXP_REPLACE(pay_amount, '[¥¥$,]', '') AS DOUBLE) AS amount,
CASE
WHEN status LIKE '%已完成%' THEN 'completed'
WHEN status LIKE '%退款%' THEN 'refunded'
WHEN status LIKE '%取消%' THEN 'cancelled'
ELSE 'pending'
END AS status,
CAST(create_time AS TIMESTAMP) AS order_time
FROM read_csv_auto('taobao_orders.csv');
-- Read Shopify data, normalize fields
CREATE OR REPLACE TABLE shopify_raw AS
SELECT
'shopify' AS platform,
TRIM(order_number) AS order_id,
TRIM(item_name) AS product_name,
CAST(REGEXP_REPLACE(total_price, '[€£$,¥¥,]', '') AS DOUBLE) AS amount,
CASE
WHEN fulfilled_status IN ('fulfilled', 'shipped') THEN 'completed'
WHEN fulfilled_status IN ('refunded', 'cancelled') THEN 'refunded'
ELSE 'pending'
END AS status,
CAST(created_at AS TIMESTAMP) AS order_time
FROM read_csv_auto('shopify_orders.csv');
Key technical points here:
REGEXP_REPLACE: Clears all currency symbols and thousand separators in one shot — no Python regex neededCAST(... AS TIMESTAMP): DuckDB automatically recognizes multiple date formats, including ISO 8601, US format, and Chinese formatCASE WHEN: Maps each platform’s different status enumerations to standardized valuesread_csv_auto(): Automatically infers column types without needing a pre-defined schema
Step 2: Merge Into a Unified Table
CREATE OR REPLACE TABLE all_orders AS
SELECT * FROM pdd_raw
UNION ALL SELECT * FROM taobao_raw
UNION ALL SELECT * FROM shopify_raw;
Now all data lives in one table, and subsequent analysis only requires querying this single table.
2. Core Profit Analysis: Calculating Every Penny
With clean data, we answer the seller’s most important question: how much money did I actually make?
-- Multi-dimensional profit analysis
WITH platform_summary AS (
SELECT
platform,
COUNT(*) AS total_orders,
COUNT(*) FILTER (WHERE status = 'completed') AS completed_orders,
COUNT(*) FILTER (WHERE status = 'refunded') AS refunded_orders,
ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2) AS gross_revenue,
ROUND(SUM(amount) FILTER (WHERE status = 'refunded'), 2) AS refund_amount,
ROUND(AVG(amount) FILTER (WHERE status = 'completed'), 2) AS avg_order_value,
ROUND(
1.0 * COUNT(*) FILTER (WHERE status = 'refunded')
/ NULLIF(COUNT(*), 0) * 100, 2
) AS refund_rate_pct
FROM all_orders
GROUP BY platform
)
SELECT * FROM platform_summary
ORDER BY gross_revenue DESC;
This query output tells the seller:
- Actual revenue per platform (after deducting refunds)
- Refund rate (a key metric directly impacting profit)
- Average order value (the foundation for marketing strategy)
Advanced: Time-Series Revenue Trends
-- Weekly revenue trends per platform with WoW change
WITH weekly_revenue AS (
SELECT
platform,
DATE_TRUNC('week', order_time) AS trade_week,
ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2) AS revenue,
COUNT(*) FILTER (WHERE status = 'completed') AS orders,
-- Week-over-week comparison
LAG(ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2))
OVER (PARTITION BY platform ORDER BY DATE_TRUNC('week', order_time)) AS prev_week_revenue,
-- WoW growth rate
ROUND(
(SUM(amount) FILTER (WHERE status = 'completed')
- LAG(SUM(amount) FILTER (WHERE status = 'completed'))
OVER (PARTITION BY platform ORDER BY DATE_TRUNC('week', order_time)))
/ NULLIF(LAG(SUM(amount) FILTER (WHERE status = 'completed'))
OVER (PARTITION BY platform ORDER BY DATE_TRUNC('week', order_time)), 0)
* 100, 2
) AS week_over_week_change
FROM all_orders
GROUP BY platform, DATE_TRUNC('week', order_time)
)
SELECT * FROM weekly_revenue
ORDER BY platform, trade_week;
Two powerful window functions are used here:
LAG(): Access the previous row’s data to calculate period-over-period changesFILTER (WHERE ...): Conditional filtering within aggregation, far cleaner than traditionalSUM(CASE WHEN ...)
3. ABC Classification: Finding Cash Cows and Dead Weight
Knowing total revenue isn’t enough. Sellers need to know which SKUs are making money and which are losing it. This is the classic ABC analysis method.
-- ABC analysis: Tier SKUs by revenue contribution
WITH sku_stats AS (
SELECT
product_name,
platform,
COUNT(*) AS order_count,
ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2) AS total_revenue,
ROUND(AVG(amount) FILTER (WHERE status = 'completed'), 2) AS avg_price,
COUNT(*) FILTER (WHERE status = 'refunded') AS refund_count,
-- Cumulative revenue share for ABC tiering
SUM(SUM(amount) FILTER (WHERE status = 'completed'))
OVER (ORDER BY SUM(amount) FILTER (WHERE status = 'completed') DESC)
/ NULLIF(SUM(SUM(amount) FILTER (WHERE status = 'completed')))
OVER () AS cumulative_revenue_share
FROM all_orders
GROUP BY product_name, platform
)
SELECT
product_name,
platform,
order_count,
total_revenue,
avg_price,
refund_count,
ROUND(cumulative_revenue_share * 100, 2) AS cum_rev_pct,
CASE
WHEN cumulative_revenue_share <= 0.7 THEN 'Tier A (Cash Cow)'
WHEN cumulative_revenue_share <= 0.9 THEN 'Tier B (Stable)'
ELSE 'Tier C (Long Tail)'
END AS sku_tier,
-- Refund rate alert
CASE
WHEN ROUND(
1.0 * refund_count / NULLIF(order_count, 0) * 100, 2
) > 15 THEN '⚠️ High refund rate'
ELSE '✅ Normal'
END AS risk_flag
FROM sku_stats
ORDER BY total_revenue DESC;
The output of this query can be handed directly to the seller:
- Tier A SKUs: Core products accounting for 70% of total sales — should increase promotion
- Tier B SKUs: Stable products at 70%-90% — maintain current strategy
- Tier C SKUs: Long-tail products — consider delisting
- Refund rate alerts: Products with refund rates above 15% are flagged in red
4. Comparison With Traditional Tools
| Dimension | Excel | Pandas | DuckDB |
|---|---|---|---|
| Multi-format CSV merge | Manual copy-paste | Python loop required | One SQL query |
| Currency symbol cleanup | VLOOKUP + SUBSTITUTE | Regex + loop | REGEXP_REPLACE one-liner |
| Date format unification | Split columns + text conversion | Multiple pd.to_datetime calls | CAST AS TIMESTAMP auto-detect |
| Million-row aggregation | Crashes | Works but slow | Milliseconds |
| Deployment cost | File sharing | Requires Python env | Runs standalone |
| Reusability | Redo every time | Scripts need maintenance | SQL templates copy-paste-ready |
5. Monetization Strategy: How Much Is This Skill Worth?
Freelance Model
- Pricing: $70-280 USD per client per month
- Client sources: Upwork, Fiverr, e-commerce communities, Reddit r/smallbusiness
- Deliverables: A DuckDB script + weekly auto-generated analysis reports
- Marginal cost: One script serves dozens of clients — just swap the data file paths
SaaS Model
- Embed DuckDB in a FastAPI backend with a web interface
- Seller uploads CSV → auto-analysis → HTML report output
- Pricing: $14-40 USD/month per store
- Reference case:
duckdb-fastapi-analytics-saashas a complete implementation
Data Product Model
- Create universal e-commerce analysis templates and sell on Gumroad or Knowledge Planet
- Pricing: $28-70 USD per license
- Includes video tutorial + template scripts + customer support
Core advantage: People who know SQL don’t understand e-commerce business, and people who understand e-commerce don’t know SQL. Being able to solve both groups’ pain points is your moat.
6. Summary
Starting from scratch, we walked through a complete e-commerce data cleaning and profit analysis workflow using DuckDB:
- Data standardization: Clean multi-platform dirty data with
REGEXP_REPLACE,CAST, andCASE WHEN - Unified merge: Combine heterogeneous data into one standard table with
UNION ALL - Profit analysis: Calculate per-platform revenue and trends using
FILTERand window functions - ABC classification: Perform cumulative share analysis with
SUM() OVER (ORDER BY ...) - Risk alerts: Automatically flag high-refund-rate SKUs
The entire process requires pure SQL — not a single line of Python. This is DuckDB’s core competitive advantage: unifying data processing and data analysis in a single engine.
The full version of this article is published at duckdblab.org, with more detailed steps and additional cases.