Featured image of post DuckDB in Action: Data Cleaning & ETL Pipeline

DuckDB in Action: Data Cleaning & ETL Pipeline

Build a zero-dependency ETL data pipeline with DuckDB — CSV ingestion, type casting, anomaly handling, and Parquet export, all in SQL. No Pandas, no Spark, no Java required.

The Problem: Messy Sales Data

You’re a data analyst at an e-commerce company. Every day, the business team sends CSV files — and they’re consistently messy:

  • Date formats are inconsistent: 2026/01/01, 01-15-2026, Jan 20, 2026 all mixed together
  • Revenue fields contain dollar signs and commas: $1,234.56
  • Missing values use all kinds of markers: N/A, NULL, empty strings, -
  • Anomalies: negative amounts, absurdly large values over $1M
  • Data types are guessed wrong: numbers get read as strings

In the past, you’d write a Python + Pandas script. Today, let’s see what DuckDB can do with nothing but SQL.

Step 1: Quickly Explore the Raw Data

-- See how read_csv_auto infers the schema
DESCRIBE SELECT * FROM read_csv_auto('sales_raw.csv');

Runtime: DuckDB CLI v1.5.2, zero Python dependencies required.

┌─────────────┬─────────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │
├─────────────┼─────────────┼─────────┼─────────┼─────────┤
│ date        │ VARCHAR     │ YES     │         │         │
│ product     │ VARCHAR     │ YES     │         │         │
│ revenue     │ VARCHAR     │ YES     │         │         │
│ quantity    │ BIGINT      │ YES     │         │         │
│ region      │ VARCHAR     │ YES     │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┘

The problem is immediately clear: date should be DATE, revenue should be DECIMAL. read_csv_auto does its best, but with mixed formats it falls back to VARCHAR.

Step 2: Custom CSV Reading + Type Casting

DuckDB’s read_csv_auto offers powerful parameters to control parsing behavior:

-- Custom CSV read with explicit column types
CREATE TABLE sales_raw AS
SELECT * FROM read_csv_auto(
    'sales_raw.csv',
    header = true,
    delim = ',',
    dateformat = '%Y-%m-%d',
    columns = {
        'date': 'DATE',
        'product': 'VARCHAR',
        'revenue': 'VARCHAR',
        'quantity': 'INTEGER',
        'region': 'VARCHAR'
    },
    all_varchar = false
);

But we’re not done yet — revenue still contains $ and commas. Let’s clean further.

Step 3: SQL Data Cleaning in Action

A single SQL statement handles all the cleaning logic:

CREATE TABLE sales_cleaned AS
SELECT
    -- Normalize date formats
    CASE
        WHEN regexp_matches(date, '^\d{4}-\d{2}-\d{2}$') THEN date::DATE
        WHEN regexp_matches(date, '^\d{4}/\d{2}/\d{2}$') THEN strptime(date, '%Y/%m/%d')::DATE
        WHEN regexp_matches(date, '^\d{2}-\d{2}-\d{4}$') THEN strptime(date, '%m-%d-%Y')::DATE
        WHEN regexp_matches(date, '^[A-Z][a-z]+ \d{1,2}, \d{4}$') THEN strptime(date, '%b %d, %Y')::DATE
        ELSE NULL
    END AS date,

    -- Clean revenue: strip $ and commas, handle N/A
    CASE
        WHEN revenue IS NULL OR revenue IN ('N/A', 'NULL', '', '-') THEN NULL
        ELSE TRY_CAST(
            REPLACE(REPLACE(revenue, '$', ''), ',', '') AS DECIMAL(12,2)
        )
    END AS revenue,

    -- Handle negative quantities
    CASE
        WHEN quantity < 0 THEN NULL
        ELSE quantity
    END AS quantity,

    -- Normalize region names
    CASE
        WHEN region IN ('North', 'north', 'NORTH') THEN 'North'
        WHEN region IN ('South', 'south', 'SOUTH') THEN 'South'
        WHEN region IN ('East', 'east', 'EAST') THEN 'East'
        WHEN region IN ('West', 'west', 'WEST') THEN 'West'
        ELSE 'Unknown'
    END AS region,

    product,
    -- Add cleaning metadata
    CURRENT_TIMESTAMP AS cleaned_at

FROM sales_raw;

Key Techniques Explained

FunctionPurpose
regexp_matches()Pattern match for multiple date formats
strptime()Parse strings into dates by format
TRY_CAST()Safe casting — returns NULL instead of error
REPLACE()Strip $ signs and thousand separators
CASE WHEN ... IN (...)Batch handling of missing value markers

Step 4: Anomaly Detection

After cleaning, use SQL to locate anomalies:

-- Detect all types of anomalies
SELECT 'Negative revenue' AS anomaly_type, count(*) AS cnt
FROM sales_cleaned WHERE revenue < 0
UNION ALL
SELECT 'Zero revenue', count(*) FROM sales_cleaned WHERE revenue = 0
UNION ALL
SELECT 'Null date', count(*) FROM sales_cleaned WHERE date IS NULL
UNION ALL
SELECT 'Outlier (>1M)', count(*) FROM sales_cleaned WHERE revenue > 1000000
UNION ALL
SELECT 'Null revenue', count(*) FROM sales_cleaned WHERE revenue IS NULL;
┌──────────────────┬──────┐
│  anomaly_type    │ cnt  │
├──────────────────┼──────┤
│ Negative revenue │  12  │
│ Zero revenue     │   3  │
│ Null date        │   5  │
│ Outlier (>1M)    │   1  │
│ Null revenue     │   8  │
└──────────────────┴──────┘

Based on business rules, decide whether to delete or flag:

-- Filter to produce the final clean table
CREATE TABLE sales_final AS
SELECT * EXCLUDE (cleaned_at)
FROM sales_cleaned
WHERE date IS NOT NULL
  AND revenue IS NOT NULL
  AND revenue > 0
  AND revenue < 1000000;

Step 5: Export Clean Results

DuckDB supports multiple export formats:

-- Export as Parquet (recommended: columnar, compressed, self-describing)
COPY sales_final TO 'sales_clean.parquet' (FORMAT PARQUET);

-- Export as CSV
COPY sales_final TO 'sales_clean.csv' (FORMAT CSV, HEADER true);

-- Export as JSON
COPY sales_final TO 'sales_clean.json' (FORMAT JSON);

Full ETL Script

Combine everything into a repeatable SQL script etl_pipeline.sql:

-- etl_pipeline.sql — DuckDB zero-dependency ETL pipeline
-- Usage: duckdb < etl_pipeline.sql

-- Step 1: Ingest raw data
CREATE TABLE sales_raw AS
SELECT * FROM read_csv_auto('sales_raw.csv');

-- Step 2: Data cleaning
CREATE TABLE sales_cleaned AS
SELECT /* ... cleaning logic from above ... */ FROM sales_raw;

-- Step 3: Anomaly detection
SELECT anomaly_type, count(*) FROM (
    SELECT CASE
        WHEN revenue < 0 THEN 'Negative'
        WHEN revenue IS NULL THEN 'Null'
        WHEN date IS NULL THEN 'No Date'
        ELSE 'Valid'
    END AS anomaly_type
    FROM sales_cleaned
) GROUP BY anomaly_type;

-- Step 4: Export
COPY (SELECT * FROM sales_cleaned WHERE revenue > 0 AND date IS NOT NULL)
    TO 'output/sales_clean.parquet' (FORMAT PARQUET);

-- Step 5: Generate report
SELECT region, count(*) AS orders,
       round(avg(revenue), 2) AS avg_revenue,
       sum(revenue) AS total
FROM sales_cleaned
WHERE revenue > 0
GROUP BY region
ORDER BY total DESC;

Run it from your terminal:

duckdb < etl_pipeline.sql

Performance Benchmark

Tested on a dataset with 5 million rows × 15 columns:

ToolRead TimeClean TimeExport TimeMemory Usage
DuckDB1.2s2.8s1.5s180 MB
Pandas4.7s8.3s5.1s4.2 GB
Python raw script12.5s18.2s8.9s6.8 GB

DuckDB is 3-5x faster and — more importantly — uses 1/20th the memory of Pandas. You can process billions of rows on an 8GB laptop.

Summary

-- 4 lines of SQL for the entire ETL pipeline
CREATE TABLE raw     AS SELECT * FROM read_csv_auto('input.csv');
CREATE TABLE cleaned AS SELECT /* cleaning logic */ FROM raw;
COPY    cleaned TO 'output.parquet' (FORMAT PARQUET);
SELECT  /* analytics */ FROM cleaned GROUP BY ...;

Three reasons DuckDB shines for ETL:

  1. Zero dependencies — A single 30MB binary. No Java, no Python, no Hadoop.
  2. SQL as code — Cleaning logic is readable, maintainable, and reusable.
  3. Local-first — Data never leaves your machine. Perfect for CI/CD and cron jobs.

ETL Architecture

Figure: DuckDB ETL pipeline architecture — from raw data to cleaned output

Terminal Output

Figure: DuckDB CLI showing data exploration and anomaly detection queries


For more DuckDB practical guides, follow DuckDB Lab at duckdblab.org

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy