dbt + DuckDB: Build a Production-Ready Data Warehouse in Half a Day

dbt + DuckDB: Build a Production-Ready Data Warehouse in Half a Day

dbt + DuckDB is the ultimate data warehousing solution for SMBs. This hands-on guide walks through a complete e-commerce data modeling project — from CSV sources to dbt three-tier architecture (Staging → Marts → Dashboard), RFM customer segmentation, one-click Excel reporting, and a monetization framework worth ¥8,000 to ¥20,000.

The Problem: Does Every Company Need a ¥100K Data Warehouse?

“Can you build me a data warehouse? I need to see daily sales.”

When you take this gig, the available options seem to be:

Snowflake + dbt Cloud: Professional, but $2,000+/month starting
Alibaba Cloud MaxCompute: ¥3,000/month, 1-2 weeks to set up
Self-hosted Hadoop/Spark: 3 servers + a big data engineer (¥300K+/year)
Excel + Manual: Free, but 3 hours per report, repeated weekly

There’s a huge gap in the middle — and it’s where 90% of small and medium businesses live.

The reality is: Most SMBs (¥1M ~ ¥50M annual revenue) don’t need a distributed data warehouse. Their data is tens of thousands to a few million rows in CSV/Excel, perfectly runnable on a laptop.

What they actually need:

Zero software cost — no new monthly bills
Delivered in half a day — set up today, use tomorrow
Maintainable and scalable — not a one-off script, a real data engineering architecture
One-click reporting — management KPIs generated automatically

This is exactly where dbt + DuckDB dominates.

dbt + DuckDB: The Best Data Stack for SMBs

What is dbt?

dbt (data build tool) is the hottest data transformation tool in modern data engineering. Its core philosophy:

Define your data transformation logic in SQL. dbt handles dependency management, execution ordering, documentation, and testing — automatically.

You write SELECT statements (clean data, aggregate, compute business metrics). dbt takes care of:

Dependency resolution (run model A first, then model B)
Incremental vs full refresh strategies
Data lineage visualization (see which models depend on which)
Automated testing and documentation

Why DuckDB as the Engine?

DuckDB pairs perfectly with dbt for SMB workloads:

Feature	Snowflake	Spark	DuckDB + dbt
Annual Cost	¥170K+	¥100K+	¥0
Setup Time	2-4 weeks	4-8 weeks	Half a day
Server Required	✅ Cloud cluster	✅ Cluster	❌ A laptop
DBA Required	✅	✅	❌ You
Portability	❌ Vendor lock-in	❌ JVM-dependent	✅ One .duckdb file
Learning Curve	Medium	Steep	Low (just need SQL)

🔧 Full Project: E-Commerce Data Warehouse

Here’s a complete e-commerce data warehouse project — data generation, dbt modeling, and report export — fully executable.

📥 Prerequisites

pip install duckdb dbt-duckdb openpyxl pandas

# Verify dbt installation
dbt --version
# Core: 1.11.x, Plugin: duckdb 1.10.x

📁 Project Structure

day24_dbt_project/
├── dbt_project.yml          # dbt project config
├── profiles.yml             # DuckDB connection config
├── seeds/                   # Raw data (CSV)
│   ├── customers.csv        # 200 customers
│   ├── products.csv         # 50 products
│   ├── orders.csv           # 2,000 orders
│   └── reviews.csv          # 1,500 reviews
└── models/
    ├── staging/             # Data cleaning layer (VIEW)
    │   ├── stg_customers.sql
    │   ├── stg_products.sql
    │   ├── stg_orders.sql
    │   └── stg_reviews.sql
    └── marts/               # Business analytics layer (TABLE)
        ├── daily_sales_summary.sql
        ├── product_performance.sql
        ├── customer_analytics.sql
        └── kpi_dashboard.sql

Step 1: Generate Sample Data

Run this Python script to generate realistic e-commerce data:

#!/usr/bin/env python3
"""Generate sample e-commerce data for dbt + DuckDB demo."""
import csv, random, os
from datetime import datetime, timedelta

random.seed(42)
OUTPUT_DIR = os.path.dirname(os.path.abspath(__file__))

NUM_CUSTOMERS = 200
NUM_PRODUCTS = 50
NUM_ORDERS = 2000
NUM_REVIEWS = 1500
START_DATE = datetime(2025, 1, 1)
END_DATE = datetime(2026, 5, 1)

def gen_customers():
    cities = ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou",
              "Chengdu", "Wuhan", "Nanjing", "Chongqing", "Xi'an"]
    levels = ["Regular", "Silver", "Gold", "Diamond"]
    channels = ["Direct", "Search Engine", "Social Media", "Email", "Ads"]
    rows = []
    for i in range(1, NUM_CUSTOMERS + 1):
        reg_date = START_DATE + timedelta(days=random.randint(0, 400))
        rows.append({"customer_id": i, "name": f"User_{i:04d}",
            "city": random.choice(cities),
            "level": random.choices(levels, weights=[50,30,15,5])[0],
            "channel": random.choice(channels),
            "registration_date": reg_date.strftime("%Y-%m-%d"),
            "is_active": 1 if random.random() > 0.15 else 0})
    return rows

def gen_products():
    categories = ["Electronics", "Clothing", "Food", "Home", "Beauty", "Books"]
    suppliers = ["Supplier_A", "Supplier_B", "Supplier_C", "Supplier_D", "Supplier_E"]
    rows = []
    for i in range(1, NUM_PRODUCTS + 1):
        cost = round(random.uniform(10, 500), 2)
        price = round(cost * random.uniform(1.3, 3.0), 2)
        rows.append({"product_id": i, "product_name": f"Product_{i:04d}",
            "category": random.choice(categories), "supplier": random.choice(suppliers),
            "cost": cost, "price": price, "stock": random.randint(0, 1000),
            "shelf_date": (START_DATE + timedelta(days=random.randint(0, 480))).strftime("%Y-%m-%d")})
    return rows

def gen_orders():
    statuses = ["Completed", "Shipped", "Cancelled", "Refunding"]
    payments = ["WeChat Pay", "Alipay", "Bank Card", "COD"]
    rows = []
    for i in range(1, NUM_ORDERS + 1):
        order_date = START_DATE + timedelta(days=random.randint(0, (END_DATE - START_DATE).days - 1))
        quantity = random.randint(1, 5)
        unit_price = round(random.uniform(20, 800), 2)
        rows.append({"order_id": i, "customer_id": random.randint(1, NUM_CUSTOMERS),
            "product_id": random.randint(1, NUM_PRODUCTS),
            "order_date": order_date.strftime("%Y-%m-%d %H:%M:%S"),
            "quantity": quantity, "unit_price": unit_price,
            "total_amount": round(unit_price * quantity, 2),
            "status": random.choices(statuses, weights=[60,20,15,5])[0],
            "payment_method": random.choice(payments)})
    return rows

def gen_reviews():
    rows = []
    for i in range(1, NUM_REVIEWS + 1):
        review_date = START_DATE + timedelta(days=random.randint(0, (END_DATE - START_DATE).days - 1))
        rows.append({"review_id": i, "order_id": random.randint(1, NUM_ORDERS),
            "product_id": random.randint(1, NUM_PRODUCTS),
            "customer_id": random.randint(1, NUM_CUSTOMERS),
            "rating": random.choices([5,4,3,2,1], weights=[40,30,15,10,5])[0],
            "review_date": review_date.strftime("%Y-%m-%d"),
            "is_verified_purchase": 1 if random.random() > 0.3 else 0})
    return rows

# Write CSV files
os.makedirs(os.path.join(OUTPUT_DIR, "seeds"), exist_ok=True)
for name, gen_fn in [("customers", gen_customers), ("products", gen_products),
                     ("orders", gen_orders), ("reviews", gen_reviews)]:
    rows = gen_fn()
    path = os.path.join(OUTPUT_DIR, "seeds", f"{name}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader(); writer.writerows(rows)
    print(f"✅ Generated {path} ({len(rows)} rows)")

Step 2: Configure dbt Project

dbt_project.yml

name: 'duckdb_shop'
version: '1.0.0'
config-version: 2
profile: 'duckdb_shop'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]

models:
  duckdb_shop:
    staging:
      +materialized: view     # Cleaning layer — views save disk space
      +schema: staging
    marts:
      +materialized: table    # Analytics layer — tables for speed
      +schema: marts

seeds:
  duckdb_shop:
    +schema: raw

profiles.yml

duckdb_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: duckdb_shop.duckdb
      schema: main
      threads: 4

Step 3: Write dbt Models (Three-Tier Architecture)

Tier 1: Staging — Raw Data Cleaning

Staging models clean, type-cast, and standardize raw CSV data. Materialized as VIEWs (zero storage).

models/staging/stg_customers.sql

-- Clean customer data: standardize fields and cast types
with source as (
    select * from {{ ref('customers') }}
),
cleaned as (
    select
        customer_id, name as customer_name, city,
        case
            when level in ('Regular', 'Silver', 'Gold', 'Diamond') then level
            else 'Regular'
        end as customer_level,
        channel as acquisition_channel,
        registration_date::date as registration_date,
        is_active::boolean as is_active,
        current_timestamp as loaded_at
    from source
)
select * from cleaned

models/staging/stg_orders.sql

-- Clean orders: parse timestamps, add derived fields
with source as (
    select * from {{ ref('orders') }}
),
cleaned as (
    select
        order_id, customer_id, product_id,
        order_date::timestamp as order_timestamp,
        order_date::date as order_date,
        strftime(order_date::timestamp, '%Y') as order_year,
        strftime(order_date::timestamp, '%m') as order_month,
        strftime(order_date::timestamp, '%Y-%m') as order_year_month,
        strftime(order_date::timestamp, '%u') as order_week,
        quantity, unit_price, total_amount,
        status as order_status, payment_method,
        case
            when status in ('Completed', 'Shipped') then 'valid'
            else 'invalid'
        end as is_valid_order,
        current_timestamp as loaded_at
    from source
)
select * from cleaned

models/staging/stg_products.sql

-- Clean products: compute gross margin, stock status
with source as (
    select * from {{ ref('products') }}
),
cleaned as (
    select
        product_id, product_name, category, supplier,
        cost, price,
        round((price - cost) / nullif(price, 0) * 100, 2) as gross_margin_pct,
        stock, shelf_date::date as shelf_date,
        case
            when stock = 0 then 'Out of Stock'
            when stock < 50 then 'Low Stock'
            when stock < 200 then 'Normal'
            else 'Well Stocked'
        end as stock_status,
        current_timestamp as loaded_at
    from source
)
select * from cleaned

Tier 2: Marts — Business Analytics Models

Marts aggregate cleaned data into business-ready analytical tables. Materialized as TABLEs for fast queries.

models/marts/customer_analytics.sql — RFM Segmentation

-- RFM customer segmentation: find high-value customers
with orders as (
    select * from {{ ref('stg_orders') }}
    where is_valid_order = 'valid'
),
customers as (
    select * from {{ ref('stg_customers') }}
    where is_active = true
),
customer_metrics as (
    select
        c.customer_id, c.customer_name, c.city,
        c.customer_level, c.acquisition_channel, c.registration_date,
        count(distinct o.order_id) as total_orders,
        sum(o.total_amount) as total_spent,
        avg(o.total_amount) as avg_order_value,
        max(o.order_date) as last_order_date,
        min(o.order_date) as first_order_date,
        datediff('day', max(o.order_date), current_date) as days_since_last_order,
        count(distinct o.product_id) as unique_products_bought
    from customers c
    left join orders o on c.customer_id = o.customer_id
    group by 1, 2, 3, 4, 5, 6
),
rfm as (
    select *,
        -- Recency (1-5)
        case when days_since_last_order <= 30 then 5
             when days_since_last_order <= 90 then 4
             when days_since_last_order <= 180 then 3
             when days_since_last_order <= 365 then 2
             else 1 end as r_score,
        -- Frequency (1-5)
        case when total_orders >= 10 then 5
             when total_orders >= 6 then 4
             when total_orders >= 3 then 3
             when total_orders >= 1 then 2
             else 1 end as f_score,
        -- Monetary (1-5)
        case when total_spent >= 10000 then 5
             when total_spent >= 5000 then 4
             when total_spent >= 2000 then 3
             when total_spent >= 500 then 2
             else 1 end as m_score
    from customer_metrics
)
select *,
    r_score + f_score + m_score as rfm_total,
    case
        when (r_score >= 4 and f_score >= 4 and m_score >= 4) then '⭐ VIP Customer'
        when (r_score >= 4 and f_score >= 4 and m_score >= 2) then 'Growth Customer'
        when (r_score >= 3 and f_score >= 3) then 'Standard Customer'
        when (r_score >= 1 and total_orders > 0) then 'At-Risk Customer'
        else 'Silent Customer'
    end as customer_segment
from rfm order by rfm_total desc

models/marts/kpi_dashboard.sql — Executive Dashboard

-- Core KPIs: one query to generate the management dashboard
with daily_sales as (select * from {{ ref('daily_sales_summary') }}),
orders as (select * from {{ ref('stg_orders') }}),
customer_metrics as (select * from {{ ref('customer_analytics') }})

select 'Total Revenue' as metric_name,
    round(sum(total_revenue), 2) as metric_value, 'CNY' as unit
from daily_sales
union all
select 'Total Orders', count(*), 'orders'
from orders where is_valid_order = 'valid'
union all
select 'Avg Order Value', round(avg(total_amount), 2), 'CNY'
from orders where is_valid_order = 'valid'
union all
select 'Active Customers', count(*), 'people'
from customer_metrics where total_orders > 0
union all
select 'VIP Customers', count(*), 'people'
from customer_metrics where customer_segment = '⭐ VIP Customer'
union all
select 'Avg RFM Score', round(avg(rfm_total), 2), 'points'
from customer_metrics
union all
select 'Gross Margin',
    round(sum(total_profit) / nullif(sum(total_revenue), 0) * 100, 2), '%'
from {{ ref('product_performance') }}
order by metric_name

Step 4: One-Click Run + Excel Export

#!/usr/bin/env python3
"""day24_run_all.py — One-click data modeling + report export"""
import subprocess, duckdb, pandas as pd
from pathlib import Path

PROJECT_DIR = Path("day24_dbt_project")
DB_PATH = PROJECT_DIR / "duckdb_shop.duckdb"

# 1. Run dbt seed (import CSVs into DuckDB)
print("📥 Importing CSV data...")
subprocess.run(["dbt", "seed", "--profiles-dir", str(PROJECT_DIR)],
               cwd=PROJECT_DIR, check=True)

# 2. Run dbt run (execute all models)
print("🔨 Running dbt models...")
subprocess.run(["dbt", "run", "--profiles-dir", str(PROJECT_DIR)],
               cwd=PROJECT_DIR, check=True)

# 3. Connect to DuckDB and export reports
conn = duckdb.connect(str(DB_PATH))

# KPI Dashboard
kpi = conn.execute("""
    SELECT metric_name, metric_value, unit
    FROM main_marts.kpi_dashboard
    WHERE metric_name IN ('Total Revenue','Total Orders',
          'Avg Order Value','Active Customers','Gross Margin')
""").fetchdf()
print("\n📈 KPI Dashboard:")
print(kpi.to_string(index=False))

# Top 10 Products
top_products = conn.execute("""
    SELECT product_name, category, units_sold,
           total_revenue, gross_margin_pct
    FROM main_marts.product_performance
    WHERE units_sold > 0
    ORDER BY total_revenue DESC LIMIT 10
""").fetchdf()

# Customer Segments
segments = conn.execute("""
    SELECT customer_segment, count(*) as cnt,
           round(avg(total_spent),2) as avg_spent
    FROM main_marts.customer_analytics
    GROUP BY customer_segment
""").fetchdf()

# Export to Excel
with pd.ExcelWriter("day24_dbt_report.xlsx", engine='openpyxl') as writer:
    kpi.to_excel(writer, sheet_name='KPI Dashboard', index=False)
    top_products.to_excel(writer, sheet_name='Top Products', index=False)
    segments.to_excel(writer, sheet_name='Customer Segments', index=False)

print(f"\n✅ Report exported: day24_dbt_report.xlsx")
conn.close()

Sample Output

📋 Data Model Overview:
  ✅ main_raw.customers: 200 rows
  ✅ main_raw.orders: 2000 rows
  ✅ main_marts.customer_analytics: 174 rows
  ✅ main_marts.product_performance: 50 rows

📈 KPI Dashboard:
  Total Revenue     1,998,087 CNY
  Total Orders      1,593 orders
  Avg Order Value   1,254 CNY
  Gross Margin      40.73%

👥 Customer Segments:
  ⭐ VIP Customer      95 people  avg spend ¥12,135
  Standard Customer   64 people  avg spend ¥8,084
  At-Risk Customer    10 people  avg spend ¥5,555

💰 Monetization Framework

Target Clients

SMBs with ¥1M ~ ¥50M annual revenue, data scattered in Excel/ERP/POS systems
Business owners who need analytics but can’t hire a data engineer
Companies that want to upgrade but can’t afford Snowflake + Tableau

Pricing Structure

Service	Price	Description
Base Setup	¥8,000	One-time data modeling + report templates
Monthly Maintenance	¥500/mo	Monthly run + data quality checks
Custom Model	¥3,000 each	New analysis requirements add one dbt model
Training	¥2,000/session	Teach client’s staff to query data themselves
Annual Package	¥12,000/yr	Setup + 12 months maintenance + 2 custom models

Competitive Comparison

Solution	Cost	Setup Time	Best For
Snowflake + dbt Cloud	$2,000+/month	2-4 weeks	Enterprise
AWS Redshift	$1,000+/month	1-3 weeks	Mid-market
DuckDB + dbt	¥8K-20K one-time	Half a day	SMBs
Excel / Manual	¥0 software + ¥50K labor	Repeating weekly	Tiny businesses

Delivery Checklist

Client provides: Business data exports (CSV/Excel), data dictionary if available
You deliver: Complete dbt project code + DuckDB database file + Excel reports + deployment documentation
Acceptance criteria: One-click re-run generates latest reports, data is accurate

🔗 Scaling the Service

Combine with Previously Learned Skills

Skill	How to Combine with dbt
Cross-DB JOINs	ATTACH MySQL/PostgreSQL as dbt sources
Pandas Integration	Use Python dbt models for complex cleaning
FastAPI API	Build REST API on dbt output for browser-based queries
Cron Automation	Schedule daily dbt run + email delivery

Industry-Specific Variants

E-commerce: Multi-store dashboard + SKU analysis + competitor price tracking
Restaurant: Ingredient cost analysis + dish margin ranking + peak hour analysis
Logistics: Delivery time analytics + anomaly detection + driver performance
Manufacturing: Production capacity + yield rate tracking + supply chain management

The dbt Ecosystem Opportunity

dbt is the fastest-growing tool in data engineering. Adding dbt + DuckDB to your toolkit opens up:

Freelance dbt modeling gigs on Upwork/Fiverr — ¥3,000-5,000/project
Excel-to-dbt migration consulting — ¥5,000-10,000/project
dbt training courses on knowledge platforms
Corporate dbt + DuckDB training — ¥3,000-5,000/day

Summary

dbt + DuckDB is the optimal data warehousing solution for small and medium businesses. It enables any SQL-capable developer to build a production-grade data warehouse in half a day, for less than 1/10th the cost of traditional solutions.

Your new skill package includes:

✅ Three-tier data modeling architecture (Staging → Marts → Dashboard)
✅ dbt project configuration and model authoring
✅ RFM customer segmentation analytics
✅ One-click report export (Python + DuckDB + Excel)
✅ Complete monetization framework and delivery process

Next steps: Save this project template. Next time a client asks “Can you build a data warehouse?” — your answer is: “Yes, ¥8,000, and it’s running by the end of the day.”

All code verified on DuckDB v1.5.2, dbt-core v1.11.11, dbt-duckdb v1.10.1, Python 3.12

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.