Kill the ETL Upsert Nightmare with DuckDB MERGE INTO: One SQL Line Replaces 20 Lines of Python
Difficulty:⭐⭐⭐ | Estimated Time:15 minutes to get started, then goodbye to dirty data forever

1. Is Your ETL Pipeline Writing Code Like This?
Here is the daily reality for many data engineers:
# Traditional approach: check existence, then decide INSERT or UPDATE
existing = con.execute("SELECT id FROM daily_sales WHERE date = '2026-06-30'").fetchall()
if existing:
con.execute("UPDATE daily_sales SET ... WHERE date = '2026-06-30'")
else:
con.execute("INSERT INTO daily_sales VALUES (...)")
What is wrong with this code?
- Not concurrency-safe: Two processes check simultaneously, both find nothing exists, and insert duplicate records.
- Terrible performance: Two round trips (SELECT + INSERT/UPDATE) per record.
- Verbosity: Every table requires a custom check-then-insert logic block.
- Transaction complexity: You must manually manage transaction rollback on failure.
DuckDB’s MERGE INTO solves all of this in a single SQL statement: atomically checks for existence, updates if found, inserts if not. Zero race conditions.
2. Core Principle of Idempotent Writes
DuckDB supports the SQL:2003 standard MERGE syntax (also known as UPSERT), fully available since v0.9. At its core, it is an atomic update-or-insert operation:
MERGE INTO target_table
USING source_data
ON target_table.key = source_data.key
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT VALUES ...
Key advantages:
- Atomicity: The entire operation completes within one transaction, eliminating race conditions
- Batch processing: Process tens of thousands of records in a single operation, far faster than looping
- Readability: Intent is crystal clear — update if matched, insert if not
3. Complete Hands-on: Idempotent ETL for E-commerce Sales Reports
Imagine you consume sales data from a message queue every day and need to write it into DuckDB’s analytical tables. The data may contain duplicate messages, and you need to ensure the same order is never double-counted.
Step 1: Create Tables
import duckdb
con = duckdb.connect(":memory:")
# Main table: deduplicated by order ID
con.execute("""
CREATE TABLE daily_sales (
order_id VARCHAR PRIMARY KEY,
product_id VARCHAR,
quantity INTEGER,
amount DOUBLE,
sale_date DATE,
region VARCHAR,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
# Temporary table: holds all raw data consumed today (may contain duplicates)
con.execute("""
CREATE TEMP TABLE raw_ingest AS
SELECT * FROM read_csv_auto('/data/kafka/sales_20260630.csv')
""")
Step 2: Idempotent Write with MERGE INTO
# The core: one SQL statement handles idempotent writes
con.execute("""
MERGE INTO daily_sales ds
USING raw_ingest ri
ON ds.order_id = ri.order_id
WHEN MATCHED THEN
UPDATE SET
quantity = ds.quantity + ri.quantity,
amount = ds.amount + ri.amount,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (order_id, product_id, quantity, amount, sale_date, region)
VALUES (ri.order_id, ri.product_id, ri.quantity, ri.amount, ri.sale_date, ri.region)
""")
# Verify results
result = con.execute("""
SELECT
COUNT(*) as total_records,
COUNT(DISTINCT order_id) as unique_orders,
SUM(quantity) as total_qty,
ROUND(SUM(amount), 2) as total_revenue
FROM daily_sales
""").fetchall()
print(f'Total records: {result[0][0]} | Unique orders: {result[0][1]} | Total revenue: {result[0][3]}')
Key insight: Even if the CSV contains 100 duplicate order_id values, the idempotent write will INSERT on the first match and accumulate subsequent duplicates via UPDATE. This is idempotency — executing the same operation multiple times produces the same result.
4. Advanced Scenario: Conditional Update with MERGE
Not every match requires an update. Sometimes you only want to update existing records under specific conditions:
-- Only update records where the amount changes by more than 10%
MERGE INTO daily_sales ds
USING raw_ingest ri
ON ds.order_id = ri.order_id
WHEN MATCHED AND ABS(ds.amount - ri.amount) / NULLIF(ds.amount, 0) > 0.1 THEN
UPDATE SET
amount = ri.amount,
quantity = ri.quantity,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (order_id, product_id, quantity, amount, sale_date, region)
VALUES (ri.order_id, ri.product_id, ri.quantity, ri.amount, ri.sale_date, ri.region);
This is extremely practical in production: when your data pipeline occasionally sends correction messages (fixing previously sent incorrect data), only records exceeding a threshold trigger a real update, avoiding unnecessary write amplification.
5. Performance Optimization: Best Practices for Large-Scale MERGE
When MERGE involves millions of records, pay attention to these best practices:
1. Deduplicate Source Data First
-- Deduplicate before MERGE to reduce matching workload
CREATE TEMP TABLE deduped_raw AS
SELECT
order_id,
product_id,
SUM(quantity) as quantity,
SUM(amount) as amount,
MIN(sale_date) as sale_date,
FIRST_VALUE(region) OVER (PARTITION BY order_id) as region
FROM raw_ingest
GROUP BY order_id, product_id, sale_date
2. Leverage DuckDB’s Vectorized Execution
DuckDB’s MERGE executes in vectorized batches, 10-100x faster than row-by-row loops. Ensure multi-threading is enabled:
SET threads = 0; -- Use all available CPU cores
3. Persist Results as Parquet
# After MERGE completes, export to Parquet for downstream analysis
con.execute("""
COPY daily_sales TO '/data/daily/daily_sales_20260630.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)
""")
4. Monitor MERGE Statistics
-- DuckDB doesn't directly return MERGE row counts, but you can verify afterward
SELECT
(SELECT COUNT(*) FROM daily_sales) as current_total,
(SELECT COUNT(*) FROM raw_ingest) as source_total,
(SELECT COUNT(DISTINCT order_id) FROM raw_ingest) as source_unique;
6. Traditional Tools vs DuckDB MERGE INTO Comparison
| Feature | Traditional Python Upsert | DuckDB MERGE INTO | Spark DataFrameWriter | PostgreSQL ON CONFLICT |
|---|---|---|---|---|
| Code Lines | 15-20 | 1 SQL statement | 5-8 lines | 3-5 lines |
| Concurrency Safe | ❌ Requires manual locking | ✅ Atomic operation | ✅ Distributed transactions | ✅ Row-level locks |
| Batch Performance | ⭐⭐ Row-by-row loop | ⭐⭐⭐⭐⭐ Vectorized | ⭐⭐⭐⭐ Distributed | ⭐⭐⭐⭐ Index-accelerated |
| Conditional Update | ✅ With if/else | ✅ WHEN MATCHED AND | ❌ Not supported | ❌ Not supported |
| Learning Curve | Medium | Low (standard SQL) | High (Spark API) | Low |
| Best Use Case | Small-scale data | Medium-large analytics | Ultra-large datasets | OLTP production DBs |
7. Monetization Advice
After mastering DuckDB MERGE INTO for idempotent ETL writes, you can monetize your skills through several avenues:
ETL Consulting Services: Help enterprises optimize their ETL pipelines by replacing traditional Python upsert code with DuckDB MERGE INTO. Typically improves data processing speed by 10x+. Charge per project: $700-$7,000.
SaaS Data Products: Build lightweight data management platforms powered by DuckDB, leveraging MERGE INTO for real-time data updates and deduplication. Offer subscription-based service to SMEs ($30-$300/month per user).
Technical Training Courses: Create advanced DuckDB courses focusing on MERGE INTO in ETL scenarios. Sell on platforms like Udemy, GeekTime, or your own website at $15-$60 per course.
Open Source Toolkits: Package idempotent write best practices into a Python library (e.g.,
duckdb-merge-utils) with one-click ETL templates. Monetize via GitHub Sponsors and commercial licensing.Automated Reporting Products: Combine DuckDB’s PARQUET export with MERGE INTO’s idempotent feature to build daily/weekly automated reporting systems. Charge annual fees per client ($1,500-$15,000/year).
💡 Tip: If you are building data analytics backends or need high-performance CSV/Parquet querying, visit duckdblab.org for more hands-on tutorials and tool recommendations.
This article is based on DuckDB hands-on实战 content. Follow our channel for the latest advanced tutorials.