Featured image of post DuckDB Data Cleaning & Profit Analysis: From Messy CSVs to Precise Profit Reports

DuckDB Data Cleaning & Profit Analysis: From Messy CSVs to Precise Profit Reports

Use DuckDB's read_csv_auto, REGEXP_REPLACE, CAST, and window functions to standardize messy multi-platform CSV data and perform profit margin analysis with ABC classification.

Data Standardization & Profit Analysis Pipeline

From Messy CSVs to Clean Reports: DuckDB’s One-Stop Solution

Have you ever taken on a data project where the client dumped a dozen CSV and Excel files on you? Some use Chinese column names, others use English. Dates come in 2026-07-01, 01/07/2026, or July 1, 2026 formats. Amount columns contain ¥29.90, $49.00, €78.90, or even 29.90元.

The traditional approach is to write a bunch of Python pandas cleaning scripts or manually replace values in Excel. But DuckDB can handle all of this in a single SQL query.

Today we’ll walk through a real side-hustle direction: providing data cleaning and profit analysis services for multi-platform e-commerce sellers. One DuckDB script can serve dozens of clients with near-zero marginal cost.


1. Data Standardization: Unifying Three Platforms’ Orders in One SQL

Imagine you receive a data request from a multi-platform e-commerce seller who needs to merge order data from Pinduoduo, Taobao, and Shopify. Each platform exports data in a completely different format:

Pinduoduo export format:

订单号,商品名称,实付金额,订单状态,下单时间
PDD123456,iPhone 15 Case,¥29.90,Completed,2026-07-01 14:30:00

Taobao export format:

order_id,product_title,pay_amount,status,create_time
TB789012,MagSafe Charger,$49.00,TRADE_SUCCESS,2026-07-01 15:20:00

Shopify export format:

order_number,item_name,total_price,fulfilled_status,created_at
#1001,Case + Charger Bundle,€78.90,fulfilled,2026-07-01T16:00:00Z

Step 1: Read and Standardize Each Platform’s Data Separately

-- Read Pinduoduo data, normalize fields
CREATE OR REPLACE TABLE pdd_raw AS
SELECT
    'pdd' AS platform,
    TRIM(订单号) AS order_id,
    TRIM(商品名称) AS product_name,
    CAST(REGEXP_REPLACE(实付金额, '[¥¥,]', '') AS DOUBLE) AS amount,
    CASE
        WHEN 订单状态 LIKE '%已完成%' THEN 'completed'
        WHEN 订单状态 LIKE '%退款%' THEN 'refunded'
        WHEN 订单状态 LIKE '%取消%' THEN 'cancelled'
        ELSE 'pending'
    END AS status,
    CAST(下单时间 AS TIMESTAMP) AS order_time
FROM read_csv_auto('pinduoduo_orders.csv');

-- Read Taobao data, normalize fields
CREATE OR REPLACE TABLE taobao_raw AS
SELECT
    'taobao' AS platform,
    TRIM(order_id) AS order_id,
    TRIM(product_title) AS product_name,
    CAST(REGEXP_REPLACE(pay_amount, '[¥¥$,]', '') AS DOUBLE) AS amount,
    CASE
        WHEN status LIKE '%已完成%' THEN 'completed'
        WHEN status LIKE '%退款%' THEN 'refunded'
        WHEN status LIKE '%取消%' THEN 'cancelled'
        ELSE 'pending'
    END AS status,
    CAST(create_time AS TIMESTAMP) AS order_time
FROM read_csv_auto('taobao_orders.csv');

-- Read Shopify data, normalize fields
CREATE OR REPLACE TABLE shopify_raw AS
SELECT
    'shopify' AS platform,
    TRIM(order_number) AS order_id,
    TRIM(item_name) AS product_name,
    CAST(REGEXP_REPLACE(total_price, '[€£$,¥¥,]', '') AS DOUBLE) AS amount,
    CASE
        WHEN fulfilled_status IN ('fulfilled', 'shipped') THEN 'completed'
        WHEN fulfilled_status IN ('refunded', 'cancelled') THEN 'refunded'
        ELSE 'pending'
    END AS status,
    CAST(created_at AS TIMESTAMP) AS order_time
FROM read_csv_auto('shopify_orders.csv');

Key technical points here:

  • REGEXP_REPLACE: Clears all currency symbols and thousand separators in one shot — no Python regex needed
  • CAST(... AS TIMESTAMP): DuckDB automatically recognizes multiple date formats, including ISO 8601, US format, and Chinese format
  • CASE WHEN: Maps each platform’s different status enumerations to standardized values
  • read_csv_auto(): Automatically infers column types without needing a pre-defined schema

Step 2: Merge Into a Unified Table

CREATE OR REPLACE TABLE all_orders AS
SELECT * FROM pdd_raw
UNION ALL SELECT * FROM taobao_raw
UNION ALL SELECT * FROM shopify_raw;

Now all data lives in one table, and subsequent analysis only requires querying this single table.


2. Core Profit Analysis: Calculating Every Penny

With clean data, we answer the seller’s most important question: how much money did I actually make?

-- Multi-dimensional profit analysis
WITH platform_summary AS (
    SELECT
        platform,
        COUNT(*) AS total_orders,
        COUNT(*) FILTER (WHERE status = 'completed') AS completed_orders,
        COUNT(*) FILTER (WHERE status = 'refunded') AS refunded_orders,
        ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2) AS gross_revenue,
        ROUND(SUM(amount) FILTER (WHERE status = 'refunded'), 2) AS refund_amount,
        ROUND(AVG(amount) FILTER (WHERE status = 'completed'), 2) AS avg_order_value,
        ROUND(
            1.0 * COUNT(*) FILTER (WHERE status = 'refunded')
            / NULLIF(COUNT(*), 0) * 100, 2
        ) AS refund_rate_pct
    FROM all_orders
    GROUP BY platform
)
SELECT * FROM platform_summary
ORDER BY gross_revenue DESC;

This query output tells the seller:

  • Actual revenue per platform (after deducting refunds)
  • Refund rate (a key metric directly impacting profit)
  • Average order value (the foundation for marketing strategy)
-- Weekly revenue trends per platform with WoW change
WITH weekly_revenue AS (
    SELECT
        platform,
        DATE_TRUNC('week', order_time) AS trade_week,
        ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2) AS revenue,
        COUNT(*) FILTER (WHERE status = 'completed') AS orders,
        -- Week-over-week comparison
        LAG(ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2))
            OVER (PARTITION BY platform ORDER BY DATE_TRUNC('week', order_time)) AS prev_week_revenue,
        -- WoW growth rate
        ROUND(
            (SUM(amount) FILTER (WHERE status = 'completed')
            - LAG(SUM(amount) FILTER (WHERE status = 'completed'))
                OVER (PARTITION BY platform ORDER BY DATE_TRUNC('week', order_time)))
            / NULLIF(LAG(SUM(amount) FILTER (WHERE status = 'completed'))
                OVER (PARTITION BY platform ORDER BY DATE_TRUNC('week', order_time)), 0)
            * 100, 2
        ) AS week_over_week_change
    FROM all_orders
    GROUP BY platform, DATE_TRUNC('week', order_time)
)
SELECT * FROM weekly_revenue
ORDER BY platform, trade_week;

Two powerful window functions are used here:

  • LAG(): Access the previous row’s data to calculate period-over-period changes
  • FILTER (WHERE ...): Conditional filtering within aggregation, far cleaner than traditional SUM(CASE WHEN ...)

3. ABC Classification: Finding Cash Cows and Dead Weight

Knowing total revenue isn’t enough. Sellers need to know which SKUs are making money and which are losing it. This is the classic ABC analysis method.

-- ABC analysis: Tier SKUs by revenue contribution
WITH sku_stats AS (
    SELECT
        product_name,
        platform,
        COUNT(*) AS order_count,
        ROUND(SUM(amount) FILTER (WHERE status = 'completed'), 2) AS total_revenue,
        ROUND(AVG(amount) FILTER (WHERE status = 'completed'), 2) AS avg_price,
        COUNT(*) FILTER (WHERE status = 'refunded') AS refund_count,
        -- Cumulative revenue share for ABC tiering
        SUM(SUM(amount) FILTER (WHERE status = 'completed'))
            OVER (ORDER BY SUM(amount) FILTER (WHERE status = 'completed') DESC)
            / NULLIF(SUM(SUM(amount) FILTER (WHERE status = 'completed')))
            OVER () AS cumulative_revenue_share
    FROM all_orders
    GROUP BY product_name, platform
)
SELECT
    product_name,
    platform,
    order_count,
    total_revenue,
    avg_price,
    refund_count,
    ROUND(cumulative_revenue_share * 100, 2) AS cum_rev_pct,
    CASE
        WHEN cumulative_revenue_share <= 0.7 THEN 'Tier A (Cash Cow)'
        WHEN cumulative_revenue_share <= 0.9 THEN 'Tier B (Stable)'
        ELSE 'Tier C (Long Tail)'
    END AS sku_tier,
    -- Refund rate alert
    CASE
        WHEN ROUND(
            1.0 * refund_count / NULLIF(order_count, 0) * 100, 2
        ) > 15 THEN '⚠️ High refund rate'
        ELSE '✅ Normal'
    END AS risk_flag
FROM sku_stats
ORDER BY total_revenue DESC;

The output of this query can be handed directly to the seller:

  • Tier A SKUs: Core products accounting for 70% of total sales — should increase promotion
  • Tier B SKUs: Stable products at 70%-90% — maintain current strategy
  • Tier C SKUs: Long-tail products — consider delisting
  • Refund rate alerts: Products with refund rates above 15% are flagged in red

4. Comparison With Traditional Tools

DimensionExcelPandasDuckDB
Multi-format CSV mergeManual copy-pastePython loop requiredOne SQL query
Currency symbol cleanupVLOOKUP + SUBSTITUTERegex + loopREGEXP_REPLACE one-liner
Date format unificationSplit columns + text conversionMultiple pd.to_datetime callsCAST AS TIMESTAMP auto-detect
Million-row aggregationCrashesWorks but slowMilliseconds
Deployment costFile sharingRequires Python envRuns standalone
ReusabilityRedo every timeScripts need maintenanceSQL templates copy-paste-ready

5. Monetization Strategy: How Much Is This Skill Worth?

Freelance Model

  • Pricing: $70-280 USD per client per month
  • Client sources: Upwork, Fiverr, e-commerce communities, Reddit r/smallbusiness
  • Deliverables: A DuckDB script + weekly auto-generated analysis reports
  • Marginal cost: One script serves dozens of clients — just swap the data file paths

SaaS Model

  • Embed DuckDB in a FastAPI backend with a web interface
  • Seller uploads CSV → auto-analysis → HTML report output
  • Pricing: $14-40 USD/month per store
  • Reference case: duckdb-fastapi-analytics-saas has a complete implementation

Data Product Model

  • Create universal e-commerce analysis templates and sell on Gumroad or Knowledge Planet
  • Pricing: $28-70 USD per license
  • Includes video tutorial + template scripts + customer support

Core advantage: People who know SQL don’t understand e-commerce business, and people who understand e-commerce don’t know SQL. Being able to solve both groups’ pain points is your moat.


6. Summary

Starting from scratch, we walked through a complete e-commerce data cleaning and profit analysis workflow using DuckDB:

  1. Data standardization: Clean multi-platform dirty data with REGEXP_REPLACE, CAST, and CASE WHEN
  2. Unified merge: Combine heterogeneous data into one standard table with UNION ALL
  3. Profit analysis: Calculate per-platform revenue and trends using FILTER and window functions
  4. ABC classification: Perform cumulative share analysis with SUM() OVER (ORDER BY ...)
  5. Risk alerts: Automatically flag high-refund-rate SKUs

The entire process requires pure SQL — not a single line of Python. This is DuckDB’s core competitive advantage: unifying data processing and data analysis in a single engine.

The full version of this article is published at duckdblab.org, with more detailed steps and additional cases.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.