1. Why Replace ClickHouse with DuckDB?
A cross-border e-commerce team was using ClickHouse for their real-time GMV dashboard. Their setup: 3 × 8C16G EC2 instances, monthly cost of $280. Data volume: 5 million order events per day, 150 million rows retained over 30 days, 200 columns per row.
Their pain points were straightforward:
- Too expensive: $280/month for an internal dashboard was hard to justify
- Complex operations: ZooKeeper + shard configuration, every scaling operation required data redistribution
- Not actually fast: Network I/O became the bottleneck, dashboard loading averaged 2.3 seconds
- Overkill: Fewer than 10 concurrent users — ClickHouse’s distributed capability was completely wasted
After replacing with DuckDB + Streamlit on a single 4C8G instance:
| Metric | Old (ClickHouse 3-node) | New (DuckDB Single Node) |
|---|---|---|
| Monthly Cost | $280 | $35 |
| P50 Query Latency | 1.2s | 0.3s |
| P99 Query Latency | 4.5s | 0.9s |
| Operations Complexity | High (ZK + sharding) | Low (one file) |
| Data Ingestion Latency | 20-30s (Kafka + flush) | 5-10s (direct append) |
Core insight: At the 150M row scale, single-node DuckDB is 3-5× faster than distributed ClickHouse for analytical queries — the difference isn’t in the engine but in network I/O. When you don’t need 50+ concurrent users, DuckDB is the more rational choice.
2. System Architecture

The system has three layers:
[Order Events] → [Python Ingestion] → [DuckDB Memory Table] → [Parquet Archive]
↓
[Pre-Aggregation Layer (gmv_hourly)]
↓
[Streamlit Dashboard (read-only)]
Key design principles:
- Zero ETL: Data is analysis-ready the moment it lands — no need for Kafka → ClickHouse materialization pipelines
- Tiered storage: Hot data in DuckDB memory table (last 6 hours), warm data in Parquet (6-48 hours), cold data in compressed Parquet (48+ hours)
- Pre-aggregation + incremental updates: Use
INSERT OR REPLACEto simulate ClickHouse’s AggregatingMergeTree
3. Incremental Data Ingestion (No Kafka Needed)
3.1 Table Schema
No Kafka. Use DuckDB’s in-memory table as a write buffer, flushing to Parquet every 30 seconds:
CREATE TABLE IF NOT EXISTS orders_raw (
order_id VARCHAR,
user_id VARCHAR,
product_id VARCHAR,
category VARCHAR,
amount DECIMAL(12,2),
status VARCHAR, -- paid, refunded, pending, cancelled
event_time TIMESTAMP,
country VARCHAR,
utm_source VARCHAR,
-- ... ~200 fields total
_loaded_at TIMESTAMP DEFAULT now()
);
3.2 Python Ingestion
import duckdb
import polars as pl
from pathlib import Path
from datetime import datetime, timedelta
DB_PATH = "/data/analytics.duckdb"
PARQUET_DIR = "/data/parquet/orders"
con = duckdb.connect(DB_PATH)
def ingest_batch(df: pl.DataFrame):
"""Write Polars DataFrame into DuckDB"""
con.register("_batch", df.to_arrow())
# Append only, no upserts needed
con.execute("""
INSERT INTO orders_raw
SELECT *, now() AS _loaded_at FROM _batch
""")
# Archive every 5M rows or every 24 hours
row_count = con.execute(
"SELECT count(*) FROM orders_raw "
"WHERE event_time < now() - interval '6 hours'"
).fetchone()[0]
if row_count > 5_000_000:
partition_key = datetime.now().strftime("%Y%m%d_%H")
con.execute(f"""
COPY (
SELECT * FROM orders_raw
WHERE event_time < now() - interval '6 hours'
) TO '{PARQUET_DIR}/{partition_key}.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD)
""")
# Clean up archived data
con.execute("""
DELETE FROM orders_raw
WHERE event_time < now() - interval '6 hours'
""")
Performance: DuckDB COPY TO PARQUET writes 5M rows in ~8 seconds single-threaded. ClickHouse with the same data + network I/O takes 12-15 seconds. Local write I/O is a massive advantage.
3.3 Why No Kafka?
In this scenario, the data source is an internal API (order system push), not a high-throughput log stream. Peak throughput is ~800 events/second. Python writing directly to DuckDB handles this easily. Kafka would just add operational complexity.
Decision rule: Every additional component in your toolchain doubles the failure probability. If a file can solve your problem, don’t add a message queue.
4. Real-Time Aggregation: Pre-Aggregation Instead of Full Scans
Never say “let’s just count(*) on the fly” — that’s what amateurs do. A full scan of 150M rows, even with DuckDB’s speed, takes hundreds of milliseconds. Under concurrency, it breaks down.
The right approach: pre-aggregation + incremental updates.
-- Hourly pre-aggregation table
CREATE TABLE IF NOT EXISTS gmv_hourly AS
SELECT
date_trunc('hour', event_time) AS hour,
category,
country,
status,
count(*) AS order_count,
sum(amount) AS gmv,
count(DISTINCT user_id) AS unique_buyers,
sum(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_gmv,
sum(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refund_amount
FROM orders_raw
WHERE event_time >= date_trunc('hour', now()) - interval '48 hours'
GROUP BY ALL;
-- Unique constraint for upsert
CREATE UNIQUE INDEX idx_gmv_hourly ON gmv_hourly (hour, category, country, status);
4.1 Incremental Update (every 5 minutes)
DuckDB doesn’t have ClickHouse’s AggregatingMergeTree, but INSERT OR REPLACE + ON CONFLICT achieves the same effect:
INSERT OR REPLACE INTO gmv_hourly
SELECT
date_trunc('hour', event_time) AS hour,
category,
country,
status,
count(*) AS order_count,
sum(amount) AS gmv,
count(DISTINCT user_id) AS unique_buyers,
sum(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_gmv,
sum(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refund_amount
FROM orders_raw
WHERE event_time >= date_trunc('hour', now()) - interval '2 hours'
GROUP BY ALL
ON CONFLICT (hour, category, country, status) DO UPDATE SET
order_count = excluded.order_count,
gmv = excluded.gmv,
unique_buyers = excluded.unique_buyers,
paid_gmv = excluded.paid_gmv,
refund_amount = excluded.refund_amount;
Why scan only the last 2 hours? Because data older than 2 hours rarely changes (order statuses stabilize quickly). This makes the incremental update an order of magnitude faster than a full table scan.
4.2 Query Performance Comparison
| Query Pattern | Full Scan (150M rows) | Pre-Aggregated (~200K rows) |
|---|---|---|
| Today’s GMV | 320ms | 12ms |
| 48-hour Trend | 890ms | 35ms |
| Drill-down by Country + Category | 1.2s | 28ms |
| 5 Concurrent Queries | 2.8s (avg) | 45ms (avg) |
Pre-aggregation delivers 20-40× speedup — this is why the dashboard can support 5 concurrent users refreshing under 300ms.
5. Streamlit Dashboard Implementation
5.1 Complete Dashboard Code
import streamlit as st
import duckdb
import plotly.express as px
import pandas as pd
from datetime import datetime, timedelta
st.set_page_config(layout="wide", page_title="GMV Real-Time Monitor")
con = duckdb.connect("/data/analytics.duckdb", read_only=True)
@st.cache_data(ttl=60)
def load_realtime_metrics():
"""Load today vs yesterday comparison"""
return con.execute("""
WITH today AS (
SELECT
count(*) AS orders,
sum(amount) AS gmv,
count(DISTINCT user_id) AS buyers
FROM orders_raw
WHERE date_trunc('day', event_time) = date_trunc('day', now())
),
yesterday AS (
SELECT
count(*) AS orders,
sum(amount) AS gmv,
count(DISTINCT user_id) AS buyers
FROM orders_raw
WHERE date_trunc('day', event_time) = date_trunc('day', now() - interval '1 day')
)
SELECT
t.orders, t.gmv, t.buyers,
y.orders AS y_orders, y.gmv AS y_gmv, y.buyers AS y_buyers,
CASE WHEN y.gmv > 0
THEN round((t.gmv - y.gmv) / y.gmv * 100, 1)
ELSE 0 END AS gmv_growth_pct
FROM today t, yesterday y
""").fetchdf()
@st.cache_data(ttl=300)
def load_hourly_trend():
"""Last 48 hours GMV trend"""
return con.execute("""
SELECT
hour,
sum(gmv) AS total_gmv,
sum(order_count) AS total_orders
FROM gmv_hourly
WHERE hour >= now() - interval '48 hours'
GROUP BY hour
ORDER BY hour
""").fetchdf()
@st.cache_data(ttl=300)
def load_top_categories():
"""Today's category ranking"""
return con.execute("""
SELECT
category,
count(*) AS orders,
sum(amount) AS gmv,
count(DISTINCT user_id) AS buyers
FROM orders_raw
WHERE date_trunc('day', event_time) = date_trunc('day', now())
GROUP BY category
ORDER BY gmv DESC
LIMIT 10
""").fetchdf()
# ── Top Metric Cards ──
metrics = load_realtime_metrics()
col1, col2, col3, col4 = st.columns(4)
col1.metric("Today GMV", f"¥{metrics['gmv'][0]:,.0f}",
f"{metrics['gmv_growth_pct'][0]:+.1f}%")
col2.metric("Today Orders", f"{metrics['orders'][0]:,}",
f"{metrics['orders'][0] - metrics['y_orders'][0]:+,}")
col3.metric("Buyers", f"{metrics['buyers'][0]:,}",
f"{metrics['buyers'][0] - metrics['y_buyers'][0]:+,}")
col4.metric("Avg Order Value", f"¥{metrics['gmv'][0]/max(metrics['orders'][0],1):,.0f}")
# ── Trend Chart ──
st.subheader("48-Hour GMV Trend")
df_trend = load_hourly_trend()
fig = px.line(df_trend, x='hour', y='total_gmv',
title='Hourly GMV Trend')
st.plotly_chart(fig, use_container_width=True)
# ── Category Ranking ──
st.subheader("Today's Category Top 10")
df_cat = load_top_categories()
fig_bar = px.bar(df_cat, x='category', y='gmv',
title='GMV by Category')
st.plotly_chart(fig_bar, use_container_width=True)
5.2 Launch & Benchmark
streamlit run dashboard.py --server.port 8501 --server.maxUploadSize 10
Benchmark results (5 concurrent users, 30-second refresh):
- Average page load: 280ms
- Cold cache first load: 890ms
- DuckDB peak memory: 1.8 GB (including OS cache)
- Zero connection pool issues (single connection reused)
For comparison, the ClickHouse version averaged 1.8s with worst case 4.5s.
6. Performance Tuning: Avoiding WAL Blocking
The biggest trap: DuckDB’s CHECKPOINT runs automatically every 3 seconds by default. If you write heavily and query immediately, writes and queries compete for I/O.
6.1 The Problem
Heavy writes → query latency spikes from 20ms to 800ms
The default checkpoint_threshold = '16MB' triggers a CHECKPOINT every time 16MB of changes accumulate, blocking reads.
6.2 The Fix
-- Increase checkpoint threshold before batch writes
SET checkpoint_threshold = '500MB';
-- Or disable auto-checkpoint entirely (batch-only scenarios)
SET automatic_checkpoint = false;
-- Manual checkpoint after batch completes
CHECKPOINT;
-- Restore defaults
SET checkpoint_threshold = '16MB';
Result: Write throughput jumps from 500K rows/sec to 1.2M rows/sec, query latency stays under 30ms.
6.3 Additional Tuning Parameters
-- Increase memory limit (default is 80% of RAM)
SET memory_limit = '6GB';
-- Set temp directory to avoid filling /tmp
SET temp_directory = '/data/tmp';
-- Set parallelism to match CPU cores
SET threads = 4;
-- External merge sort to save memory
SET max_temp_directory_size = '10GB';
7. ClickHouse vs DuckDB: Deep Comparison
| Dimension | ClickHouse | DuckDB |
|---|---|---|
| Architecture | Distributed, multi-node + ZK | Single process, embedded or standalone |
| Deployment | Minimum 3 servers | 1 low-spec server or embedded |
| Monthly Cost (this case) | $280 | $35 |
| P50 Query (150M rows) | 1.2s | 0.3s |
| P99 Query | 4.5s | 0.9s |
| Concurrency Limit | 50-100+ | 5-20 (query-dependent) |
| Data Ingestion | Requires Kafka/3rd-party tools | Direct append, one line of code |
| Materialized Views | AggregatingMergeTree (native) | INSERT OR REPLACE (manual) |
| Operations | Needs a DBA | One file, migrate via scp |
| Best For | Large-scale OLAP, high concurrency | Small-to-medium analytics, embedded |
When to choose what:
- < 1B rows, < 20 concurrent users → DuckDB. Cheaper, simpler, faster in this range
- > 1B rows, > 50 concurrent users → ClickHouse. Right tool for the job
8. Monetization Strategies
This isn’t just a dashboard — it can be packaged into several products:
8.1 E-Commerce Analytics SaaS ($99/month+)
Package this as a SaaS product for small-to-medium e-commerce sellers:
- Lightweight: Each customer gets one DuckDB file — isolation is trivial, backup is a file copy
- Multi-tenant: Use DuckDB’s
ATTACHsyntax for cross-database queries - White-label: Streamlit supports custom themes — resell under your own brand
- Pricing: Basic $99/month (30-day data), Pro $299/month (90-day data + custom reports)
8.2 ClickHouse Migration Service ($2,000-5,000/project)
Many teams are overpaying for ClickHouse. Offer a migration assessment + implementation service:
- Assessment: analyze data volume, query patterns, concurrency needs
- Implementation: migrate data, rewrite queries, deploy dashboard
- Optimization: pre-aggregation strategy, parameter tuning
- Deliverables: migrated dashboard + DuckDB tuning guide
8.3 Report Automation Plugin ($49/one-time)
Build a plugin based on the SQL templates from this project:
- Auto-generate daily/weekly reports from DuckDB
- WeChat/DingTalk push notifications
- Scheduled PDF report delivery
9. Production Considerations
- Backup: DuckDB files can’t be backed up online — stop the service and
cp. Or use periodicCOPY TO PARQUETfor redundancy - Monitoring: DuckDB has no built-in monitoring. Log queries and track slow queries yourself
- Upgrades: Version upgrades may change file format — always backup before upgrading
- Disk: DuckDB’s write amplification is higher than ClickHouse. Reserve 2× the data size
Bottom line: If your data is under 1 billion rows and you need fewer than 20 concurrent users, replacing ClickHouse with DuckDB saves $200+/month and eliminates the need for a DBA.
📺 More DuckDB tutorials on YouTube → youtube.com/@duckdblab