DuckDB MERGE INTO in Action: Mastering Incremental Sync with a Single SQL Statement

Learn how to use DuckDB's MERGE INTO statement for efficient incremental data synchronization, from e-commerce daily reports to ETL pipelines.

Introduction

In data processing, incremental sync is one of the most common and most frustrating problems. Every day you pull new data from APIs, then have to update existing records, insert new ones, and delete expired ones. The traditional approach is to write three separate SQL statements (INSERT, UPDATE, and DELETE), which makes the code verbose and raises the risk of errors.

DuckDB’s MERGE INTO statement, introduced in DuckDB 1.4, exists to solve this problem. A single statement handles inserts, updates, and deletes, and it runs atomically: either every change is applied or none is. Today, we’ll walk through a complete e-commerce daily report system and show how to build a sellable data product on top of it.
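As a rough illustration of the three branches, the sketch below models a merge in plain Python so that it runs without a database. The table and column names (`daily_sales`, `staging`, `is_expired`) are hypothetical, and the SQL in the docstring is an unexecuted illustration of the general MERGE shape; check it against the DuckDB documentation for your version before relying on it.

```python
"""Pure-Python model of MERGE semantics (illustrative, not DuckDB's implementation).

The SQL shape being modelled, with hypothetical table and column names:

    MERGE INTO daily_sales
    USING staging
    ON daily_sales.order_id = staging.order_id
    WHEN MATCHED AND staging.is_expired THEN DELETE
    WHEN MATCHED THEN UPDATE SET amount = staging.amount
    WHEN NOT MATCHED THEN INSERT (order_id, amount)
        VALUES (staging.order_id, staging.amount);
"""


def merge_rows(target, source):
    """Return a new {order_id: amount} dict after applying the source rows."""
    result = dict(target)  # work on a copy so the input is never half-updated
    for order_id, row in source.items():
        if order_id in result:
            if row["is_expired"]:
                del result[order_id]              # WHEN MATCHED AND expired -> DELETE
            else:
                result[order_id] = row["amount"]  # WHEN MATCHED -> UPDATE
        else:
            result[order_id] = row["amount"]      # WHEN NOT MATCHED -> INSERT
    return result


target = {1: 100.0, 2: 250.0, 3: 80.0}
source = {
    2: {"amount": 300.0, "is_expired": False},  # existing order, new amount
    3: {"amount": 80.0, "is_expired": True},    # existing order, now expired
    4: {"amount": 45.0, "is_expired": False},   # brand-new order
}
merged = merge_rows(target, source)
print(merged)
```

Order 1 is untouched, order 2 is updated, order 3 is deleted, and order 4 is inserted, all in one pass over the source rows.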

Data Flow Architecture

Step 1: Simulating Real E-commerce Data

First, we need a realistic e-commerce transaction dataset. The data covers three tables (orders, products, and users), which is enough to support the analysis that follows.

import duckdb
import pandas as pd

# Create connection and generate mock data
con = duckdb.connect(':memory:')

# Generate products table
con.execute("""
CREATE TABLE products AS
SELECT * FROM (VALUES
    (1, 'iPhone 15 Pro', 'Phone', 7999),
    (2, 'MacBook Air M3', 'Laptop', 8999),
    (3, 'AirPods Pro 2', 'Headphones', 1899),
    (4, 'iPad Mini 6', 'Tablet', 4299),
    (5, 'Apple Watch S9', 'Watch', 3199),
    (6, 'Samsung Galaxy S24', 'Phone', 5999),
    (7, 'Sony WH-1000XM5', 'Headphones', 2499),
    (8, 'Nintendo Switch OLED', 'Gaming', 2599),
    (9, 'Dell XPS 13', 'Laptop', 9499),
    (10, 'Logitech MX Master 3S', 'Accessories', 799)
) AS t(id, name, category, price)
""")

# Generate users table
con.execute("""
CREATE TABLE users AS
SELECT * FROM (VALUES
    (1001, 'Beijing', 'Male', '2023-01-15'),
    (1002, 'Shanghai', 'Female', '2023-03-22'),
    (1003, 'Guangzhou', 'Male', '2022-11-08'),
    (1004, 'Shenzhen', 'Female', '2023-06-01'),
    (1005, 'Hangzhou', 'Male', '2022-08-19'),
    (1006, 'Chengdu', 'Female', '2023-02-14'),
    (1007, 'Wuhan', 'Male', '2023-04-30'),
    (1008, 'Nanjing', 'Female', '2022-12-25'),
    (1009, 'Chongqing', 'Male', '2023-07-10'),
    (1010, 'Xi''an', 'Female', '2023-05-18')
) AS t(user_id, city, gender, reg_date)
""")

# Generate orders table (simulating 30 days of data)
con.execute("""
CREATE TABLE orders AS
SELECT 
    i AS order_id,
    -- FLOOR keeps ids inside 1..10 and 1001..1010; a bare ::INTEGER cast rounds and can overshoot
    1 + FLOOR(RANDOM() * 10)::INTEGER AS product_id,
    1001 + FLOOR(RANDOM() * 10)::INTEGER AS user_id,
    DATE '2024-06-01' + FLOOR(RANDOM() * 30)::INTEGER AS order_date,
    1 + FLOOR(RANDOM() * 5)::INTEGER AS quantity,
    RANDOM() * 0.3 + 0.7 AS discount
FROM generate_series(1, 500) AS t(i)
""")

This code generates 10 products, 10 users, and 500 order records spanning 30 days. While the data volume is small, it’s sufficient to demonstrate the complete analysis workflow.
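One detail worth checking when generating keys such as `product_id` is how a random float is turned into an integer. The sketch below is plain Python standing in for the SQL expressions. It assumes the float-to-integer cast rounds to the nearest integer, which is how DuckDB's `::INTEGER` cast is understood to behave, and the helper names are illustrative.

```python
import math


def key_by_rounding(r, n=10, first=1):
    """Model of (r * n + first)::INTEGER when the cast rounds to nearest."""
    return round(r * n + first)


def key_by_floor(r, n=10, first=1):
    """Model of first + FLOOR(r * n): always within first .. first + n - 1."""
    return first + math.floor(r * n)


# r stands for RANDOM(), a float in [0, 1)
samples = [0.0, 0.04, 0.5, 0.96, 0.999]
rounded = [key_by_rounding(r) for r in samples]
floored = [key_by_floor(r) for r in samples]
print("rounding:", rounded)
print("floor:   ", floored)
```

With ten products numbered 1 to 10, the rounding version can produce 11, an id with no matching product, and such orders silently drop out of any inner join.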

Step 2: Core Analysis Engine—One Query for All Metrics

This is where DuckDB shines. A traditional approach might need many separate SQL statements or a loop in pandas, whereas a single DuckDB query with two CTEs computes the daily metrics in one pass: order count, revenue, profit, average order value, unique buyers, and day-over-day revenue growth.

# Core analysis query: produces all key metrics needed for the daily report
daily_report = con.execute("""
WITH daily_sales AS (
    SELECT 
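        o.order_id,
        o.user_id,
        o.order_date::DATE AS date,
        p.name AS product_name,
        p.category,
        o.quantity * p.price * o.discount AS single_order_revenue,
        -- Placeholder profit: products has no cost column, so this subtracts one unit's list price
        o.quantity * p.price * o.discount - p.price AS profit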
    FROM orders o
    JOIN products p ON o.product_id = p.id
    JOIN users u ON o.user_id = u.user_id
),
summary AS (
    SELECT 
        date,
        COUNT(DISTINCT order_id) AS order_count,
        SUM(single_order_revenue) AS revenue,
        SUM(profit) AS total_profit,
        AVG(single_order_revenue) AS avg_order_value,
        COUNT(DISTINCT user_id) AS unique_users
    FROM daily_sales
    GROUP BY date
)
SELECT 
    date,
    order_count,
    ROUND(revenue, 2) AS revenue,
    ROUND(total_profit, 2) AS profit,
    ROUND(avg_order_value, 2) AS avg_order_value,
    unique_users,
    -- Calculate period-over-period growth
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY date)) / NULLIF(LAG(revenue) OVER (ORDER BY date), 0) * 100,
        2
    ) AS revenue_growth_pct
FROM summary
ORDER BY date
""").fetchdf()

print("=== 📊 Daily Sales Report ===")
for _, row in daily_report.iterrows():
    # The "+" format flag prints an explicit sign, so declines show as -27.58% instead of +-27.58%
    growth = f"{row['revenue_growth_pct']:+.2f}%" if pd.notna(row['revenue_growth_pct']) else "First Day"
    print(f"{row['date']:%Y-%m-%d} | Orders:{int(row['order_count'])} | Revenue:{row['revenue']:.2f} | "
          f"Profit:{row['profit']:.2f} | Avg Order:{row['avg_order_value']:.2f} | Users:{int(row['unique_users'])} | Growth:{growth}")

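Sample output (illustrative only; the orders are randomly generated, so your figures will differ, and with 10 users the Users column cannot exceed 10):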

=== 📊 Daily Sales Report ===
2024-06-01 | Orders:18 | Revenue:89234.50 | Profit:23456.80 | Avg Order:4957.47 | Users:15 | Growth:First Day
2024-06-02 | Orders:22 | Revenue:105678.30 | Profit:28934.20 | Avg Order:4803.56 | Users:19 | Growth:+18.42%
2024-06-03 | Orders:15 | Revenue:76543.20 | Profit:19876.50 | Avg Order:5102.88 | Users:13 | Growth:-27.58%
...
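The growth column can be sanity-checked outside the database. The following is a minimal plain-Python sketch of the same LAG and NULLIF logic; `growth_pct` and `format_growth` are illustrative helpers, and the revenue figures are made-up round numbers rather than output from the query.

```python
def growth_pct(revenues):
    """Day-over-day growth in percent; None where there is no usable previous day."""
    result = []
    previous = None  # plays the role of LAG(revenue)
    for revenue in revenues:
        if previous is None or previous == 0:  # first row, or NULLIF(previous, 0)
            result.append(None)
        else:
            result.append(round((revenue - previous) / previous * 100, 2))
        previous = revenue
    return result


def format_growth(value):
    """Signed percentage label, or 'n/a' when growth is undefined."""
    return "n/a" if value is None else f"{value:+.2f}%"


growth = growth_pct([100.0, 120.0, 90.0, 0.0, 50.0])
labels = [format_growth(g) for g in growth]
print(labels)
```

The day after a zero-revenue day has no defined growth, which is why the SQL wraps the denominator in NULLIF instead of dividing by zero.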

Step 3: Deep Insights by Category and Region

Looking at overall numbers isn’t enough. Merchants care most about “what sells best” and “where are buyers coming from.” These two analyses directly inform product selection and advertising strategies.

# Category performance ranking
category_ranking = con.execute("""
SELECT 
    category,
    COUNT(*) AS order_count,
    ROUND(SUM(quantity * price * discount), 2) AS total_revenue,
    ROUND(AVG(quantity * price * discount), 2) AS avg_revenue_per_order,
    -- Calculate category concentration (Pareto analysis)
    ROUND(
        SUM(quantity * price * discount) * 100.0 / 
        (SELECT SUM(quantity * price * discount) FROM orders o JOIN products p ON o.product_id = p.id),
        1
    ) AS revenue_share_pct
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY category
ORDER BY total_revenue DESC
""").fetchdf()

print("\n=== 🏆 Category Revenue Ranking ===")
for _, row in category_ranking.iterrows():
    bar = '█' * int(row['revenue_share_pct'] / 2)
    print(f"  {row['category']:>6s} | Revenue:{row['total_revenue']:>10,.2f} | "
          f"Share:{row['revenue_share_pct']:>5.1f}% | {bar}")

# City spending power Top 10
city_insight = con.execute("""
SELECT 
    u.city,
    COUNT(DISTINCT o.order_id) AS order_count,
    ROUND(SUM(o.quantity * p.price * o.discount), 2) AS total_spending,
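    -- Total spending divided by distinct buyers, so the value matches the per-user column name
    ROUND(SUM(o.quantity * p.price * o.discount) / COUNT(DISTINCT u.user_id), 2) AS avg_spending_per_user,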
    COUNT(DISTINCT u.user_id) AS buyer_count
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN products p ON o.product_id = p.id
GROUP BY u.city
ORDER BY total_spending DESC
LIMIT 10
""").fetchdf()

print("\n=== 🌍 City Spending Power TOP 10 ===")
for _, row in city_insight.iterrows():
    print(f"  {row['city']:>4s} | Buyers:{int(row['buyer_count'])} | "
          f"Total Spent:{row['total_spending']:>10,.2f} | Per User:{row['avg_spending_per_user']:>8,.2f}")
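To make the share and bar columns of the category ranking concrete, here is a small plain-Python sketch of the same arithmetic. The category totals are invented for illustration, and `revenue_shares` is a hypothetical helper, not part of the report code.

```python
def revenue_shares(totals):
    """Map each category to (share of total revenue in percent, text bar)."""
    grand_total = sum(totals.values())
    shares = {}
    # Highest revenue first, mirroring ORDER BY total_revenue DESC
    for category, revenue in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        pct = round(revenue * 100.0 / grand_total, 1)
        shares[category] = (pct, "█" * int(pct / 2))  # one block per 2 percentage points
    return shares


example = revenue_shares({"Phone": 300.0, "Laptop": 500.0, "Headphones": 200.0})
for category, (pct, bar) in example.items():
    print(f"{category:>12s} | Share:{pct:>5.1f}% | {bar}")
```

Dividing the percentage by two keeps the longest possible bar at 50 characters, which fits comfortably in a terminal or chat message.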

Step 4: One-Click Export—Turning Reports into Products

At this point, you have full data analysis capabilities. But to turn this into a “sellable product,” you need one final step: outputting results in formats that merchants can directly read and use.

# Option 1: Export as CSV for merchant systems
daily_report.to_csv('/tmp/daily_report.csv', index=False, encoding='utf-8-sig')
category_ranking.to_csv('/tmp/category_ranking.csv', index=False, encoding='utf-8-sig')

# Option 2: Generate Markdown daily report (suitable for WeChat groups/Feishu)
md_report = f"""
## 📈 E-commerce Daily Report | {daily_report['date'].iloc[-1]:%Y-%m-%d}

### Key Metrics
- Daily Orders: {int(daily_report['order_count'].iloc[-1])}
- Daily Revenue: ¥{daily_report['revenue'].iloc[-1]:,.2f}
- Daily Profit: ¥{daily_report['profit'].iloc[-1]:,.2f}
- Average Order Value: ¥{daily_report['avg_order_value'].iloc[-1]:,.2f}
- Active Users: {int(daily_report['unique_users'].iloc[-1])}

### Top 3 Categories
{chr(10).join(f'{i}. {r["category"]}: ¥{r["total_revenue"]:,.2f}'
              for i, (_, r) in enumerate(category_ranking.head(3).iterrows(), start=1))}

---
*Data generated by DuckDB automated analytics engine | Updated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}*
"""
with open('/tmp/daily_report.md', 'w', encoding='utf-8') as f:
    f.write(md_report)
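The `utf-8-sig` encoding used in the CSV export prepends a byte-order mark, which spreadsheet tools such as Excel commonly use to detect UTF-8 so that non-ASCII text (city names, the ¥ sign) displays correctly. A minimal standard-library sketch, with hypothetical file names and made-up rows, shows the difference:

```python
import csv
import os
import tempfile

rows = [("category", "total_revenue"), ("Phone", "¥1,000.00")]


def write_csv(path, encoding):
    """Write the sample rows with the given text encoding."""
    with open(path, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "report_bom.csv")
    without_bom = os.path.join(tmp, "report_plain.csv")
    write_csv(with_bom, "utf-8-sig")
    write_csv(without_bom, "utf-8")
    with open(with_bom, "rb") as f:
        bom_bytes = f.read()
    with open(without_bom, "rb") as f:
        plain_bytes = f.read()

# The only difference between the two files is the 3-byte marker at the start
print(bom_bytes[:3], len(bom_bytes) - len(plain_bytes))
```

If the consumer is another program rather than a spreadsheet, plain `utf-8` is often the safer choice, because some parsers treat the marker as part of the first column name.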

💰 How to Monetize This?

The commercial value of this system lies in its scalability and low maintenance cost:

Sell to Small E-commerce Sellers: You don’t need them to understand the technology; just deliver a clear daily report every morning. Price it at $40-140/month; once the pipeline is automated, maintenance costs stay low.
Embed in SaaS Backend: Use DuckDB as the analytics engine embedded in your SaaS product, providing real-time data analysis capabilities to your customers.
Automated Reporting Service: Provide automated financial reports, inventory reports, and other services to traditional enterprises on an annual subscription model.

Comparison: DuckDB vs Traditional Approaches

Feature	DuckDB MERGE INTO	PostgreSQL UPSERT	MySQL REPLACE	Manual INSERT+UPDATE
Atomicity	✅ Yes	✅ Yes	⚠️ Partial	⚠️ Only inside an explicit transaction
Delete Support	✅ Yes	❌ No	❌ No	❌ No
Multiple Conditions	✅ Yes	⚠️ Limited	❌ No	⚠️ Complex
Performance	✅ Optimized	✅ Good	✅ Good	❌ Slow
Learning Curve	✅ Easy	⚠️ Medium	✅ Easy	❌ High
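For contrast with the single-statement column, the manual approach can be sketched with Python's built-in `sqlite3` module standing in for the database. That substitution is an assumption made only so the example runs anywhere; SQLite has no MERGE statement. The `daily_sales` and `staging` tables are hypothetical. Wrapping the three statements in one transaction is what makes them succeed or fail together.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE daily_sales (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE staging (order_id INTEGER PRIMARY KEY, amount REAL, is_expired INTEGER);
    INSERT INTO daily_sales VALUES (1, 100.0), (2, 250.0), (3, 80.0);
    INSERT INTO staging VALUES (2, 300.0, 0), (3, 80.0, 1), (4, 45.0, 0);
""")

with con:  # commits on success, rolls back if any statement raises
    # 1. Delete rows the staging table marks as expired
    con.execute("""
        DELETE FROM daily_sales
        WHERE order_id IN (SELECT order_id FROM staging WHERE is_expired = 1)
    """)
    # 2. Update rows that exist in both tables
    con.execute("""
        UPDATE daily_sales
        SET amount = (SELECT s.amount FROM staging s WHERE s.order_id = daily_sales.order_id)
        WHERE order_id IN (SELECT order_id FROM staging WHERE is_expired = 0)
    """)
    # 3. Insert rows that only exist in staging
    con.execute("""
        INSERT INTO daily_sales (order_id, amount)
        SELECT order_id, amount FROM staging
        WHERE is_expired = 0
          AND order_id NOT IN (SELECT order_id FROM daily_sales)
    """)

final_rows = con.execute(
    "SELECT order_id, amount FROM daily_sales ORDER BY order_id"
).fetchall()
print(final_rows)
```

Each statement repeats the matching condition in its own way, which is where the verbosity and the risk of the three drifting apart come from.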

Conclusion

Through this project, we’ve demonstrated DuckDB’s capabilities in data processing and analysis. From data generation and core analysis to deep insights and report export, the whole workflow fits in one short Python script built around a handful of SQL queries. That kind of compact data processing is what you need to build data-driven products.

If you want to dive deeper into advanced DuckDB usage and more practical examples, visit duckdblab.org for our complete tutorial series.

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.