Featured image of post Kill the ETL Upsert Nightmare with DuckDB MERGE INTO

Kill the ETL Upsert Nightmare with DuckDB MERGE INTO

One SQL line replaces 20 lines of Python code. Master DuckDB MERGE INTO for idempotent ETL writes and eliminate race conditions and dirty data forever.

Kill the ETL Upsert Nightmare with DuckDB MERGE INTO: One SQL Line Replaces 20 Lines of Python

Difficulty:⭐⭐⭐ | Estimated Time:15 minutes to get started, then goodbye to dirty data forever

DuckDB MERGE INTO Architecture

1. Is Your ETL Pipeline Writing Code Like This?

Here is the daily reality for many data engineers:

# Traditional approach: check existence, then decide INSERT or UPDATE
existing = con.execute("SELECT id FROM daily_sales WHERE date = '2026-06-30'").fetchall()
if existing:
    con.execute("UPDATE daily_sales SET ... WHERE date = '2026-06-30'")
else:
    con.execute("INSERT INTO daily_sales VALUES (...)")

What is wrong with this code?

  1. Not concurrency-safe: Two processes check simultaneously, both find nothing exists, and insert duplicate records.
  2. Terrible performance: Two round trips (SELECT + INSERT/UPDATE) per record.
  3. Verbosity: Every table requires a custom check-then-insert logic block.
  4. Transaction complexity: You must manually manage transaction rollback on failure.

DuckDB’s MERGE INTO solves all of this in a single SQL statement: atomically checks for existence, updates if found, inserts if not. Zero race conditions.

2. Core Principle of Idempotent Writes

DuckDB supports the SQL:2003 standard MERGE syntax (also known as UPSERT), fully available since v0.9. At its core, it is an atomic update-or-insert operation:

MERGE INTO target_table
USING source_data
ON target_table.key = source_data.key
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT VALUES ...

Key advantages:

  • Atomicity: The entire operation completes within one transaction, eliminating race conditions
  • Batch processing: Process tens of thousands of records in a single operation, far faster than looping
  • Readability: Intent is crystal clear — update if matched, insert if not

3. Complete Hands-on: Idempotent ETL for E-commerce Sales Reports

Imagine you consume sales data from a message queue every day and need to write it into DuckDB’s analytical tables. The data may contain duplicate messages, and you need to ensure the same order is never double-counted.

Step 1: Create Tables

import duckdb

con = duckdb.connect(":memory:")

# Main table: deduplicated by order ID
con.execute("""
    CREATE TABLE daily_sales (
        order_id VARCHAR PRIMARY KEY,
        product_id VARCHAR,
        quantity INTEGER,
        amount DOUBLE,
        sale_date DATE,
        region VARCHAR,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

# Temporary table: holds all raw data consumed today (may contain duplicates)
con.execute("""
    CREATE TEMP TABLE raw_ingest AS
    SELECT * FROM read_csv_auto('/data/kafka/sales_20260630.csv')
""")

Step 2: Idempotent Write with MERGE INTO

# The core: one SQL statement handles idempotent writes
con.execute("""
    MERGE INTO daily_sales ds
    USING raw_ingest ri
    ON ds.order_id = ri.order_id
    WHEN MATCHED THEN
        UPDATE SET
            quantity = ds.quantity + ri.quantity,
            amount = ds.amount + ri.amount,
            updated_at = CURRENT_TIMESTAMP
    WHEN NOT MATCHED THEN
        INSERT (order_id, product_id, quantity, amount, sale_date, region)
        VALUES (ri.order_id, ri.product_id, ri.quantity, ri.amount, ri.sale_date, ri.region)
""")

# Verify results
result = con.execute("""
    SELECT
        COUNT(*) as total_records,
        COUNT(DISTINCT order_id) as unique_orders,
        SUM(quantity) as total_qty,
        ROUND(SUM(amount), 2) as total_revenue
    FROM daily_sales
""").fetchall()

print(f'Total records: {result[0][0]} | Unique orders: {result[0][1]} | Total revenue: {result[0][3]}')

Key insight: Even if the CSV contains 100 duplicate order_id values, the idempotent write will INSERT on the first match and accumulate subsequent duplicates via UPDATE. This is idempotency — executing the same operation multiple times produces the same result.

4. Advanced Scenario: Conditional Update with MERGE

Not every match requires an update. Sometimes you only want to update existing records under specific conditions:

-- Only update records where the amount changes by more than 10%
MERGE INTO daily_sales ds
USING raw_ingest ri
ON ds.order_id = ri.order_id
WHEN MATCHED AND ABS(ds.amount - ri.amount) / NULLIF(ds.amount, 0) > 0.1 THEN
    UPDATE SET
        amount = ri.amount,
        quantity = ri.quantity,
        updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
    INSERT (order_id, product_id, quantity, amount, sale_date, region)
    VALUES (ri.order_id, ri.product_id, ri.quantity, ri.amount, ri.sale_date, ri.region);

This is extremely practical in production: when your data pipeline occasionally sends correction messages (fixing previously sent incorrect data), only records exceeding a threshold trigger a real update, avoiding unnecessary write amplification.

5. Performance Optimization: Best Practices for Large-Scale MERGE

When MERGE involves millions of records, pay attention to these best practices:

1. Deduplicate Source Data First

-- Deduplicate before MERGE to reduce matching workload
CREATE TEMP TABLE deduped_raw AS
SELECT 
    order_id,
    product_id,
    SUM(quantity) as quantity,
    SUM(amount) as amount,
    MIN(sale_date) as sale_date,
    FIRST_VALUE(region) OVER (PARTITION BY order_id) as region
FROM raw_ingest
GROUP BY order_id, product_id, sale_date

2. Leverage DuckDB’s Vectorized Execution

DuckDB’s MERGE executes in vectorized batches, 10-100x faster than row-by-row loops. Ensure multi-threading is enabled:

SET threads = 0;  -- Use all available CPU cores

3. Persist Results as Parquet

# After MERGE completes, export to Parquet for downstream analysis
con.execute("""
    COPY daily_sales TO '/data/daily/daily_sales_20260630.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)
""")

4. Monitor MERGE Statistics

-- DuckDB doesn't directly return MERGE row counts, but you can verify afterward
SELECT 
    (SELECT COUNT(*) FROM daily_sales) as current_total,
    (SELECT COUNT(*) FROM raw_ingest) as source_total,
    (SELECT COUNT(DISTINCT order_id) FROM raw_ingest) as source_unique;

6. Traditional Tools vs DuckDB MERGE INTO Comparison

FeatureTraditional Python UpsertDuckDB MERGE INTOSpark DataFrameWriterPostgreSQL ON CONFLICT
Code Lines15-201 SQL statement5-8 lines3-5 lines
Concurrency Safe❌ Requires manual locking✅ Atomic operation✅ Distributed transactions✅ Row-level locks
Batch Performance⭐⭐ Row-by-row loop⭐⭐⭐⭐⭐ Vectorized⭐⭐⭐⭐ Distributed⭐⭐⭐⭐ Index-accelerated
Conditional Update✅ With if/else✅ WHEN MATCHED AND❌ Not supported❌ Not supported
Learning CurveMediumLow (standard SQL)High (Spark API)Low
Best Use CaseSmall-scale dataMedium-large analyticsUltra-large datasetsOLTP production DBs

7. Monetization Advice

After mastering DuckDB MERGE INTO for idempotent ETL writes, you can monetize your skills through several avenues:

  1. ETL Consulting Services: Help enterprises optimize their ETL pipelines by replacing traditional Python upsert code with DuckDB MERGE INTO. Typically improves data processing speed by 10x+. Charge per project: $700-$7,000.

  2. SaaS Data Products: Build lightweight data management platforms powered by DuckDB, leveraging MERGE INTO for real-time data updates and deduplication. Offer subscription-based service to SMEs ($30-$300/month per user).

  3. Technical Training Courses: Create advanced DuckDB courses focusing on MERGE INTO in ETL scenarios. Sell on platforms like Udemy, GeekTime, or your own website at $15-$60 per course.

  4. Open Source Toolkits: Package idempotent write best practices into a Python library (e.g., duckdb-merge-utils) with one-click ETL templates. Monetize via GitHub Sponsors and commercial licensing.

  5. Automated Reporting Products: Combine DuckDB’s PARQUET export with MERGE INTO’s idempotent feature to build daily/weekly automated reporting systems. Charge annual fees per client ($1,500-$15,000/year).

💡 Tip: If you are building data analytics backends or need high-performance CSV/Parquet querying, visit duckdblab.org for more hands-on tutorials and tool recommendations.


This article is based on DuckDB hands-on实战 content. Follow our channel for the latest advanced tutorials.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.