Featured image of post Replace ClickHouse with DuckDB: A Complete Real-Time GMV Dashboard Guide

Replace ClickHouse with DuckDB: A Complete Real-Time GMV Dashboard Guide

A cross-border e-commerce team replaced their 3-node ClickHouse cluster with a single DuckDB instance for real-time GMV monitoring. Monthly cost dropped from $280 to $35, query latency from 2.3s to 0.4s. Full architecture, incremental ingestion, pre-aggregation, Streamlit dashboard code, and tuning tips included.

1. Why Replace ClickHouse with DuckDB?

A cross-border e-commerce team was using ClickHouse for their real-time GMV dashboard. Their setup: 3 × 8C16G EC2 instances, monthly cost of $280. Data volume: 5 million order events per day, 150 million rows retained over 30 days, 200 columns per row.

Their pain points were straightforward:

  • Too expensive: $280/month for an internal dashboard was hard to justify
  • Complex operations: ZooKeeper + shard configuration, every scaling operation required data redistribution
  • Not actually fast: Network I/O became the bottleneck, dashboard loading averaged 2.3 seconds
  • Overkill: Fewer than 10 concurrent users — ClickHouse’s distributed capability was completely wasted

After replacing with DuckDB + Streamlit on a single 4C8G instance:

MetricOld (ClickHouse 3-node)New (DuckDB Single Node)
Monthly Cost$280$35
P50 Query Latency1.2s0.3s
P99 Query Latency4.5s0.9s
Operations ComplexityHigh (ZK + sharding)Low (one file)
Data Ingestion Latency20-30s (Kafka + flush)5-10s (direct append)

Core insight: At the 150M row scale, single-node DuckDB is 3-5× faster than distributed ClickHouse for analytical queries — the difference isn’t in the engine but in network I/O. When you don’t need 50+ concurrent users, DuckDB is the more rational choice.


2. System Architecture

System Architecture

The system has three layers:

[Order Events] → [Python Ingestion] → [DuckDB Memory Table] → [Parquet Archive]
                                            ↓
                                  [Pre-Aggregation Layer (gmv_hourly)]
                                            ↓
                                  [Streamlit Dashboard (read-only)]

Key design principles:

  1. Zero ETL: Data is analysis-ready the moment it lands — no need for Kafka → ClickHouse materialization pipelines
  2. Tiered storage: Hot data in DuckDB memory table (last 6 hours), warm data in Parquet (6-48 hours), cold data in compressed Parquet (48+ hours)
  3. Pre-aggregation + incremental updates: Use INSERT OR REPLACE to simulate ClickHouse’s AggregatingMergeTree

3. Incremental Data Ingestion (No Kafka Needed)

3.1 Table Schema

No Kafka. Use DuckDB’s in-memory table as a write buffer, flushing to Parquet every 30 seconds:

CREATE TABLE IF NOT EXISTS orders_raw (
    order_id VARCHAR,
    user_id VARCHAR,
    product_id VARCHAR,
    category VARCHAR,
    amount DECIMAL(12,2),
    status VARCHAR,        -- paid, refunded, pending, cancelled
    event_time TIMESTAMP,
    country VARCHAR,
    utm_source VARCHAR,
    -- ... ~200 fields total
    _loaded_at TIMESTAMP DEFAULT now()
);

3.2 Python Ingestion

import duckdb
import polars as pl
from pathlib import Path
from datetime import datetime, timedelta

DB_PATH = "/data/analytics.duckdb"
PARQUET_DIR = "/data/parquet/orders"

con = duckdb.connect(DB_PATH)

def ingest_batch(df: pl.DataFrame):
    """Write Polars DataFrame into DuckDB"""
    con.register("_batch", df.to_arrow())
    
    # Append only, no upserts needed
    con.execute("""
        INSERT INTO orders_raw 
        SELECT *, now() AS _loaded_at FROM _batch
    """)
    
    # Archive every 5M rows or every 24 hours
    row_count = con.execute(
        "SELECT count(*) FROM orders_raw "
        "WHERE event_time < now() - interval '6 hours'"
    ).fetchone()[0]
    
    if row_count > 5_000_000:
        partition_key = datetime.now().strftime("%Y%m%d_%H")
        con.execute(f"""
            COPY (
                SELECT * FROM orders_raw 
                WHERE event_time < now() - interval '6 hours'
            ) TO '{PARQUET_DIR}/{partition_key}.parquet'
            (FORMAT PARQUET, COMPRESSION ZSTD)
        """)
        # Clean up archived data
        con.execute("""
            DELETE FROM orders_raw 
            WHERE event_time < now() - interval '6 hours'
        """)

Performance: DuckDB COPY TO PARQUET writes 5M rows in ~8 seconds single-threaded. ClickHouse with the same data + network I/O takes 12-15 seconds. Local write I/O is a massive advantage.

3.3 Why No Kafka?

In this scenario, the data source is an internal API (order system push), not a high-throughput log stream. Peak throughput is ~800 events/second. Python writing directly to DuckDB handles this easily. Kafka would just add operational complexity.

Decision rule: Every additional component in your toolchain doubles the failure probability. If a file can solve your problem, don’t add a message queue.


4. Real-Time Aggregation: Pre-Aggregation Instead of Full Scans

Never say “let’s just count(*) on the fly” — that’s what amateurs do. A full scan of 150M rows, even with DuckDB’s speed, takes hundreds of milliseconds. Under concurrency, it breaks down.

The right approach: pre-aggregation + incremental updates.

-- Hourly pre-aggregation table
CREATE TABLE IF NOT EXISTS gmv_hourly AS
SELECT 
    date_trunc('hour', event_time) AS hour,
    category,
    country,
    status,
    count(*) AS order_count,
    sum(amount) AS gmv,
    count(DISTINCT user_id) AS unique_buyers,
    sum(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_gmv,
    sum(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refund_amount
FROM orders_raw
WHERE event_time >= date_trunc('hour', now()) - interval '48 hours'
GROUP BY ALL;

-- Unique constraint for upsert
CREATE UNIQUE INDEX idx_gmv_hourly ON gmv_hourly (hour, category, country, status);

4.1 Incremental Update (every 5 minutes)

DuckDB doesn’t have ClickHouse’s AggregatingMergeTree, but INSERT OR REPLACE + ON CONFLICT achieves the same effect:

INSERT OR REPLACE INTO gmv_hourly
SELECT 
    date_trunc('hour', event_time) AS hour,
    category,
    country,
    status,
    count(*) AS order_count,
    sum(amount) AS gmv,
    count(DISTINCT user_id) AS unique_buyers,
    sum(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_gmv,
    sum(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refund_amount
FROM orders_raw
WHERE event_time >= date_trunc('hour', now()) - interval '2 hours'
GROUP BY ALL
ON CONFLICT (hour, category, country, status) DO UPDATE SET
    order_count = excluded.order_count,
    gmv = excluded.gmv,
    unique_buyers = excluded.unique_buyers,
    paid_gmv = excluded.paid_gmv,
    refund_amount = excluded.refund_amount;

Why scan only the last 2 hours? Because data older than 2 hours rarely changes (order statuses stabilize quickly). This makes the incremental update an order of magnitude faster than a full table scan.

4.2 Query Performance Comparison

Query PatternFull Scan (150M rows)Pre-Aggregated (~200K rows)
Today’s GMV320ms12ms
48-hour Trend890ms35ms
Drill-down by Country + Category1.2s28ms
5 Concurrent Queries2.8s (avg)45ms (avg)

Pre-aggregation delivers 20-40× speedup — this is why the dashboard can support 5 concurrent users refreshing under 300ms.


5. Streamlit Dashboard Implementation

5.1 Complete Dashboard Code

import streamlit as st
import duckdb
import plotly.express as px
import pandas as pd
from datetime import datetime, timedelta

st.set_page_config(layout="wide", page_title="GMV Real-Time Monitor")

con = duckdb.connect("/data/analytics.duckdb", read_only=True)

@st.cache_data(ttl=60)
def load_realtime_metrics():
    """Load today vs yesterday comparison"""
    return con.execute("""
        WITH today AS (
            SELECT 
                count(*) AS orders,
                sum(amount) AS gmv,
                count(DISTINCT user_id) AS buyers
            FROM orders_raw
            WHERE date_trunc('day', event_time) = date_trunc('day', now())
        ),
        yesterday AS (
            SELECT 
                count(*) AS orders,
                sum(amount) AS gmv,
                count(DISTINCT user_id) AS buyers
            FROM orders_raw
            WHERE date_trunc('day', event_time) = date_trunc('day', now() - interval '1 day')
        )
        SELECT 
            t.orders, t.gmv, t.buyers,
            y.orders AS y_orders, y.gmv AS y_gmv, y.buyers AS y_buyers,
            CASE WHEN y.gmv > 0 
                THEN round((t.gmv - y.gmv) / y.gmv * 100, 1) 
                ELSE 0 END AS gmv_growth_pct
        FROM today t, yesterday y
    """).fetchdf()

@st.cache_data(ttl=300)
def load_hourly_trend():
    """Last 48 hours GMV trend"""
    return con.execute("""
        SELECT 
            hour,
            sum(gmv) AS total_gmv,
            sum(order_count) AS total_orders
        FROM gmv_hourly
        WHERE hour >= now() - interval '48 hours'
        GROUP BY hour
        ORDER BY hour
    """).fetchdf()

@st.cache_data(ttl=300)
def load_top_categories():
    """Today's category ranking"""
    return con.execute("""
        SELECT 
            category,
            count(*) AS orders,
            sum(amount) AS gmv,
            count(DISTINCT user_id) AS buyers
        FROM orders_raw
        WHERE date_trunc('day', event_time) = date_trunc('day', now())
        GROUP BY category
        ORDER BY gmv DESC
        LIMIT 10
    """).fetchdf()

# ── Top Metric Cards ──
metrics = load_realtime_metrics()
col1, col2, col3, col4 = st.columns(4)
col1.metric("Today GMV", f{metrics['gmv'][0]:,.0f}", 
            f"{metrics['gmv_growth_pct'][0]:+.1f}%")
col2.metric("Today Orders", f"{metrics['orders'][0]:,}",
            f"{metrics['orders'][0] - metrics['y_orders'][0]:+,}")
col3.metric("Buyers", f"{metrics['buyers'][0]:,}",
            f"{metrics['buyers'][0] - metrics['y_buyers'][0]:+,}")
col4.metric("Avg Order Value", f{metrics['gmv'][0]/max(metrics['orders'][0],1):,.0f}")

# ── Trend Chart ──
st.subheader("48-Hour GMV Trend")
df_trend = load_hourly_trend()
fig = px.line(df_trend, x='hour', y='total_gmv',
              title='Hourly GMV Trend')
st.plotly_chart(fig, use_container_width=True)

# ── Category Ranking ──
st.subheader("Today's Category Top 10")
df_cat = load_top_categories()
fig_bar = px.bar(df_cat, x='category', y='gmv',
                  title='GMV by Category')
st.plotly_chart(fig_bar, use_container_width=True)

5.2 Launch & Benchmark

streamlit run dashboard.py --server.port 8501 --server.maxUploadSize 10

Benchmark results (5 concurrent users, 30-second refresh):

  • Average page load: 280ms
  • Cold cache first load: 890ms
  • DuckDB peak memory: 1.8 GB (including OS cache)
  • Zero connection pool issues (single connection reused)

For comparison, the ClickHouse version averaged 1.8s with worst case 4.5s.


6. Performance Tuning: Avoiding WAL Blocking

The biggest trap: DuckDB’s CHECKPOINT runs automatically every 3 seconds by default. If you write heavily and query immediately, writes and queries compete for I/O.

6.1 The Problem

Heavy writes → query latency spikes from 20ms to 800ms

The default checkpoint_threshold = '16MB' triggers a CHECKPOINT every time 16MB of changes accumulate, blocking reads.

6.2 The Fix

-- Increase checkpoint threshold before batch writes
SET checkpoint_threshold = '500MB';

-- Or disable auto-checkpoint entirely (batch-only scenarios)
SET automatic_checkpoint = false;

-- Manual checkpoint after batch completes
CHECKPOINT;

-- Restore defaults
SET checkpoint_threshold = '16MB';

Result: Write throughput jumps from 500K rows/sec to 1.2M rows/sec, query latency stays under 30ms.

6.3 Additional Tuning Parameters

-- Increase memory limit (default is 80% of RAM)
SET memory_limit = '6GB';

-- Set temp directory to avoid filling /tmp
SET temp_directory = '/data/tmp';

-- Set parallelism to match CPU cores
SET threads = 4;

-- External merge sort to save memory
SET max_temp_directory_size = '10GB';

7. ClickHouse vs DuckDB: Deep Comparison

DimensionClickHouseDuckDB
ArchitectureDistributed, multi-node + ZKSingle process, embedded or standalone
DeploymentMinimum 3 servers1 low-spec server or embedded
Monthly Cost (this case)$280$35
P50 Query (150M rows)1.2s0.3s
P99 Query4.5s0.9s
Concurrency Limit50-100+5-20 (query-dependent)
Data IngestionRequires Kafka/3rd-party toolsDirect append, one line of code
Materialized ViewsAggregatingMergeTree (native)INSERT OR REPLACE (manual)
OperationsNeeds a DBAOne file, migrate via scp
Best ForLarge-scale OLAP, high concurrencySmall-to-medium analytics, embedded

When to choose what:

  • < 1B rows, < 20 concurrent users → DuckDB. Cheaper, simpler, faster in this range
  • > 1B rows, > 50 concurrent users → ClickHouse. Right tool for the job

8. Monetization Strategies

This isn’t just a dashboard — it can be packaged into several products:

8.1 E-Commerce Analytics SaaS ($99/month+)

Package this as a SaaS product for small-to-medium e-commerce sellers:

  • Lightweight: Each customer gets one DuckDB file — isolation is trivial, backup is a file copy
  • Multi-tenant: Use DuckDB’s ATTACH syntax for cross-database queries
  • White-label: Streamlit supports custom themes — resell under your own brand
  • Pricing: Basic $99/month (30-day data), Pro $299/month (90-day data + custom reports)

8.2 ClickHouse Migration Service ($2,000-5,000/project)

Many teams are overpaying for ClickHouse. Offer a migration assessment + implementation service:

  • Assessment: analyze data volume, query patterns, concurrency needs
  • Implementation: migrate data, rewrite queries, deploy dashboard
  • Optimization: pre-aggregation strategy, parameter tuning
  • Deliverables: migrated dashboard + DuckDB tuning guide

8.3 Report Automation Plugin ($49/one-time)

Build a plugin based on the SQL templates from this project:

  • Auto-generate daily/weekly reports from DuckDB
  • WeChat/DingTalk push notifications
  • Scheduled PDF report delivery

9. Production Considerations

  1. Backup: DuckDB files can’t be backed up online — stop the service and cp. Or use periodic COPY TO PARQUET for redundancy
  2. Monitoring: DuckDB has no built-in monitoring. Log queries and track slow queries yourself
  3. Upgrades: Version upgrades may change file format — always backup before upgrading
  4. Disk: DuckDB’s write amplification is higher than ClickHouse. Reserve 2× the data size

Bottom line: If your data is under 1 billion rows and you need fewer than 20 concurrent users, replacing ClickHouse with DuckDB saves $200+/month and eliminates the need for a DBA.

📺 More DuckDB tutorials on YouTube → youtube.com/@duckdblab

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy