Featured image of post dbt + DuckDB: Build a Production-Ready Data Warehouse in Half a Day

dbt + DuckDB: Build a Production-Ready Data Warehouse in Half a Day

dbt + DuckDB is the ultimate data warehousing solution for SMBs. This hands-on guide walks through a complete e-commerce data modeling project — from CSV sources to dbt three-tier architecture (Staging → Marts → Dashboard), RFM customer segmentation, one-click Excel reporting, and a monetization framework worth ¥8,000 to ¥20,000.

The Problem: Does Every Company Need a ¥100K Data Warehouse?

“Can you build me a data warehouse? I need to see daily sales.”

When you take this gig, the available options seem to be:

  • Snowflake + dbt Cloud: Professional, but $2,000+/month starting
  • Alibaba Cloud MaxCompute: ¥3,000/month, 1-2 weeks to set up
  • Self-hosted Hadoop/Spark: 3 servers + a big data engineer (¥300K+/year)
  • Excel + Manual: Free, but 3 hours per report, repeated weekly

There’s a huge gap in the middle — and it’s where 90% of small and medium businesses live.

The reality is: Most SMBs (¥1M ~ ¥50M annual revenue) don’t need a distributed data warehouse. Their data is tens of thousands to a few million rows in CSV/Excel, perfectly runnable on a laptop.

What they actually need:

  1. Zero software cost — no new monthly bills
  2. Delivered in half a day — set up today, use tomorrow
  3. Maintainable and scalable — not a one-off script, a real data engineering architecture
  4. One-click reporting — management KPIs generated automatically

This is exactly where dbt + DuckDB dominates.


dbt + DuckDB: The Best Data Stack for SMBs

What is dbt?

dbt (data build tool) is the hottest data transformation tool in modern data engineering. Its core philosophy:

Define your data transformation logic in SQL. dbt handles dependency management, execution ordering, documentation, and testing — automatically.

You write SELECT statements (clean data, aggregate, compute business metrics). dbt takes care of:

  • Dependency resolution (run model A first, then model B)
  • Incremental vs full refresh strategies
  • Data lineage visualization (see which models depend on which)
  • Automated testing and documentation

Why DuckDB as the Engine?

DuckDB pairs perfectly with dbt for SMB workloads:

FeatureSnowflakeSparkDuckDB + dbt
Annual Cost¥170K+¥100K+¥0
Setup Time2-4 weeks4-8 weeksHalf a day
Server Required✅ Cloud cluster✅ Cluster❌ A laptop
DBA Required❌ You
Portability❌ Vendor lock-in❌ JVM-dependent✅ One .duckdb file
Learning CurveMediumSteepLow (just need SQL)

🔧 Full Project: E-Commerce Data Warehouse

Here’s a complete e-commerce data warehouse project — data generation, dbt modeling, and report export — fully executable.

📥 Prerequisites

pip install duckdb dbt-duckdb openpyxl pandas

# Verify dbt installation
dbt --version
# Core: 1.11.x, Plugin: duckdb 1.10.x

📁 Project Structure

day24_dbt_project/
├── dbt_project.yml          # dbt project config
├── profiles.yml             # DuckDB connection config
├── seeds/                   # Raw data (CSV)
│   ├── customers.csv        # 200 customers
│   ├── products.csv         # 50 products
│   ├── orders.csv           # 2,000 orders
│   └── reviews.csv          # 1,500 reviews
└── models/
    ├── staging/             # Data cleaning layer (VIEW)
    │   ├── stg_customers.sql
    │   ├── stg_products.sql
    │   ├── stg_orders.sql
    │   └── stg_reviews.sql
    └── marts/               # Business analytics layer (TABLE)
        ├── daily_sales_summary.sql
        ├── product_performance.sql
        ├── customer_analytics.sql
        └── kpi_dashboard.sql

Step 1: Generate Sample Data

Run this Python script to generate realistic e-commerce data:

#!/usr/bin/env python3
"""Generate sample e-commerce data for dbt + DuckDB demo."""
import csv, random, os
from datetime import datetime, timedelta

random.seed(42)
OUTPUT_DIR = os.path.dirname(os.path.abspath(__file__))

NUM_CUSTOMERS = 200
NUM_PRODUCTS = 50
NUM_ORDERS = 2000
NUM_REVIEWS = 1500
START_DATE = datetime(2025, 1, 1)
END_DATE = datetime(2026, 5, 1)

def gen_customers():
    cities = ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou",
              "Chengdu", "Wuhan", "Nanjing", "Chongqing", "Xi'an"]
    levels = ["Regular", "Silver", "Gold", "Diamond"]
    channels = ["Direct", "Search Engine", "Social Media", "Email", "Ads"]
    rows = []
    for i in range(1, NUM_CUSTOMERS + 1):
        reg_date = START_DATE + timedelta(days=random.randint(0, 400))
        rows.append({"customer_id": i, "name": f"User_{i:04d}",
            "city": random.choice(cities),
            "level": random.choices(levels, weights=[50,30,15,5])[0],
            "channel": random.choice(channels),
            "registration_date": reg_date.strftime("%Y-%m-%d"),
            "is_active": 1 if random.random() > 0.15 else 0})
    return rows

def gen_products():
    categories = ["Electronics", "Clothing", "Food", "Home", "Beauty", "Books"]
    suppliers = ["Supplier_A", "Supplier_B", "Supplier_C", "Supplier_D", "Supplier_E"]
    rows = []
    for i in range(1, NUM_PRODUCTS + 1):
        cost = round(random.uniform(10, 500), 2)
        price = round(cost * random.uniform(1.3, 3.0), 2)
        rows.append({"product_id": i, "product_name": f"Product_{i:04d}",
            "category": random.choice(categories), "supplier": random.choice(suppliers),
            "cost": cost, "price": price, "stock": random.randint(0, 1000),
            "shelf_date": (START_DATE + timedelta(days=random.randint(0, 480))).strftime("%Y-%m-%d")})
    return rows

def gen_orders():
    statuses = ["Completed", "Shipped", "Cancelled", "Refunding"]
    payments = ["WeChat Pay", "Alipay", "Bank Card", "COD"]
    rows = []
    for i in range(1, NUM_ORDERS + 1):
        order_date = START_DATE + timedelta(days=random.randint(0, (END_DATE - START_DATE).days - 1))
        quantity = random.randint(1, 5)
        unit_price = round(random.uniform(20, 800), 2)
        rows.append({"order_id": i, "customer_id": random.randint(1, NUM_CUSTOMERS),
            "product_id": random.randint(1, NUM_PRODUCTS),
            "order_date": order_date.strftime("%Y-%m-%d %H:%M:%S"),
            "quantity": quantity, "unit_price": unit_price,
            "total_amount": round(unit_price * quantity, 2),
            "status": random.choices(statuses, weights=[60,20,15,5])[0],
            "payment_method": random.choice(payments)})
    return rows

def gen_reviews():
    rows = []
    for i in range(1, NUM_REVIEWS + 1):
        review_date = START_DATE + timedelta(days=random.randint(0, (END_DATE - START_DATE).days - 1))
        rows.append({"review_id": i, "order_id": random.randint(1, NUM_ORDERS),
            "product_id": random.randint(1, NUM_PRODUCTS),
            "customer_id": random.randint(1, NUM_CUSTOMERS),
            "rating": random.choices([5,4,3,2,1], weights=[40,30,15,10,5])[0],
            "review_date": review_date.strftime("%Y-%m-%d"),
            "is_verified_purchase": 1 if random.random() > 0.3 else 0})
    return rows

# Write CSV files
os.makedirs(os.path.join(OUTPUT_DIR, "seeds"), exist_ok=True)
for name, gen_fn in [("customers", gen_customers), ("products", gen_products),
                     ("orders", gen_orders), ("reviews", gen_reviews)]:
    rows = gen_fn()
    path = os.path.join(OUTPUT_DIR, "seeds", f"{name}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader(); writer.writerows(rows)
    print(f"✅ Generated {path} ({len(rows)} rows)")

Step 2: Configure dbt Project

dbt_project.yml

name: 'duckdb_shop'
version: '1.0.0'
config-version: 2
profile: 'duckdb_shop'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]

models:
  duckdb_shop:
    staging:
      +materialized: view     # Cleaning layer — views save disk space
      +schema: staging
    marts:
      +materialized: table    # Analytics layer — tables for speed
      +schema: marts

seeds:
  duckdb_shop:
    +schema: raw

profiles.yml

duckdb_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: duckdb_shop.duckdb
      schema: main
      threads: 4

Step 3: Write dbt Models (Three-Tier Architecture)

Tier 1: Staging — Raw Data Cleaning

Staging models clean, type-cast, and standardize raw CSV data. Materialized as VIEWs (zero storage).

models/staging/stg_customers.sql

-- Clean customer data: standardize fields and cast types
with source as (
    select * from {{ ref('customers') }}
),
cleaned as (
    select
        customer_id, name as customer_name, city,
        case
            when level in ('Regular', 'Silver', 'Gold', 'Diamond') then level
            else 'Regular'
        end as customer_level,
        channel as acquisition_channel,
        registration_date::date as registration_date,
        is_active::boolean as is_active,
        current_timestamp as loaded_at
    from source
)
select * from cleaned

models/staging/stg_orders.sql

-- Clean orders: parse timestamps, add derived fields
with source as (
    select * from {{ ref('orders') }}
),
cleaned as (
    select
        order_id, customer_id, product_id,
        order_date::timestamp as order_timestamp,
        order_date::date as order_date,
        strftime(order_date::timestamp, '%Y') as order_year,
        strftime(order_date::timestamp, '%m') as order_month,
        strftime(order_date::timestamp, '%Y-%m') as order_year_month,
        strftime(order_date::timestamp, '%u') as order_week,
        quantity, unit_price, total_amount,
        status as order_status, payment_method,
        case
            when status in ('Completed', 'Shipped') then 'valid'
            else 'invalid'
        end as is_valid_order,
        current_timestamp as loaded_at
    from source
)
select * from cleaned

models/staging/stg_products.sql

-- Clean products: compute gross margin, stock status
with source as (
    select * from {{ ref('products') }}
),
cleaned as (
    select
        product_id, product_name, category, supplier,
        cost, price,
        round((price - cost) / nullif(price, 0) * 100, 2) as gross_margin_pct,
        stock, shelf_date::date as shelf_date,
        case
            when stock = 0 then 'Out of Stock'
            when stock < 50 then 'Low Stock'
            when stock < 200 then 'Normal'
            else 'Well Stocked'
        end as stock_status,
        current_timestamp as loaded_at
    from source
)
select * from cleaned

Tier 2: Marts — Business Analytics Models

Marts aggregate cleaned data into business-ready analytical tables. Materialized as TABLEs for fast queries.

models/marts/customer_analytics.sql — RFM Segmentation

-- RFM customer segmentation: find high-value customers
with orders as (
    select * from {{ ref('stg_orders') }}
    where is_valid_order = 'valid'
),
customers as (
    select * from {{ ref('stg_customers') }}
    where is_active = true
),
customer_metrics as (
    select
        c.customer_id, c.customer_name, c.city,
        c.customer_level, c.acquisition_channel, c.registration_date,
        count(distinct o.order_id) as total_orders,
        sum(o.total_amount) as total_spent,
        avg(o.total_amount) as avg_order_value,
        max(o.order_date) as last_order_date,
        min(o.order_date) as first_order_date,
        datediff('day', max(o.order_date), current_date) as days_since_last_order,
        count(distinct o.product_id) as unique_products_bought
    from customers c
    left join orders o on c.customer_id = o.customer_id
    group by 1, 2, 3, 4, 5, 6
),
rfm as (
    select *,
        -- Recency (1-5)
        case when days_since_last_order <= 30 then 5
             when days_since_last_order <= 90 then 4
             when days_since_last_order <= 180 then 3
             when days_since_last_order <= 365 then 2
             else 1 end as r_score,
        -- Frequency (1-5)
        case when total_orders >= 10 then 5
             when total_orders >= 6 then 4
             when total_orders >= 3 then 3
             when total_orders >= 1 then 2
             else 1 end as f_score,
        -- Monetary (1-5)
        case when total_spent >= 10000 then 5
             when total_spent >= 5000 then 4
             when total_spent >= 2000 then 3
             when total_spent >= 500 then 2
             else 1 end as m_score
    from customer_metrics
)
select *,
    r_score + f_score + m_score as rfm_total,
    case
        when (r_score >= 4 and f_score >= 4 and m_score >= 4) then '⭐ VIP Customer'
        when (r_score >= 4 and f_score >= 4 and m_score >= 2) then 'Growth Customer'
        when (r_score >= 3 and f_score >= 3) then 'Standard Customer'
        when (r_score >= 1 and total_orders > 0) then 'At-Risk Customer'
        else 'Silent Customer'
    end as customer_segment
from rfm order by rfm_total desc

models/marts/kpi_dashboard.sql — Executive Dashboard

-- Core KPIs: one query to generate the management dashboard
with daily_sales as (select * from {{ ref('daily_sales_summary') }}),
orders as (select * from {{ ref('stg_orders') }}),
customer_metrics as (select * from {{ ref('customer_analytics') }})

select 'Total Revenue' as metric_name,
    round(sum(total_revenue), 2) as metric_value, 'CNY' as unit
from daily_sales
union all
select 'Total Orders', count(*), 'orders'
from orders where is_valid_order = 'valid'
union all
select 'Avg Order Value', round(avg(total_amount), 2), 'CNY'
from orders where is_valid_order = 'valid'
union all
select 'Active Customers', count(*), 'people'
from customer_metrics where total_orders > 0
union all
select 'VIP Customers', count(*), 'people'
from customer_metrics where customer_segment = '⭐ VIP Customer'
union all
select 'Avg RFM Score', round(avg(rfm_total), 2), 'points'
from customer_metrics
union all
select 'Gross Margin',
    round(sum(total_profit) / nullif(sum(total_revenue), 0) * 100, 2), '%'
from {{ ref('product_performance') }}
order by metric_name

Step 4: One-Click Run + Excel Export

#!/usr/bin/env python3
"""day24_run_all.py — One-click data modeling + report export"""
import subprocess, duckdb, pandas as pd
from pathlib import Path

PROJECT_DIR = Path("day24_dbt_project")
DB_PATH = PROJECT_DIR / "duckdb_shop.duckdb"

# 1. Run dbt seed (import CSVs into DuckDB)
print("📥 Importing CSV data...")
subprocess.run(["dbt", "seed", "--profiles-dir", str(PROJECT_DIR)],
               cwd=PROJECT_DIR, check=True)

# 2. Run dbt run (execute all models)
print("🔨 Running dbt models...")
subprocess.run(["dbt", "run", "--profiles-dir", str(PROJECT_DIR)],
               cwd=PROJECT_DIR, check=True)

# 3. Connect to DuckDB and export reports
conn = duckdb.connect(str(DB_PATH))

# KPI Dashboard
kpi = conn.execute("""
    SELECT metric_name, metric_value, unit
    FROM main_marts.kpi_dashboard
    WHERE metric_name IN ('Total Revenue','Total Orders',
          'Avg Order Value','Active Customers','Gross Margin')
""").fetchdf()
print("\n📈 KPI Dashboard:")
print(kpi.to_string(index=False))

# Top 10 Products
top_products = conn.execute("""
    SELECT product_name, category, units_sold,
           total_revenue, gross_margin_pct
    FROM main_marts.product_performance
    WHERE units_sold > 0
    ORDER BY total_revenue DESC LIMIT 10
""").fetchdf()

# Customer Segments
segments = conn.execute("""
    SELECT customer_segment, count(*) as cnt,
           round(avg(total_spent),2) as avg_spent
    FROM main_marts.customer_analytics
    GROUP BY customer_segment
""").fetchdf()

# Export to Excel
with pd.ExcelWriter("day24_dbt_report.xlsx", engine='openpyxl') as writer:
    kpi.to_excel(writer, sheet_name='KPI Dashboard', index=False)
    top_products.to_excel(writer, sheet_name='Top Products', index=False)
    segments.to_excel(writer, sheet_name='Customer Segments', index=False)

print(f"\n✅ Report exported: day24_dbt_report.xlsx")
conn.close()

Sample Output

📋 Data Model Overview:
  ✅ main_raw.customers: 200 rows
  ✅ main_raw.orders: 2000 rows
  ✅ main_marts.customer_analytics: 174 rows
  ✅ main_marts.product_performance: 50 rows

📈 KPI Dashboard:
  Total Revenue     1,998,087 CNY
  Total Orders      1,593 orders
  Avg Order Value   1,254 CNY
  Gross Margin      40.73%

👥 Customer Segments:
  ⭐ VIP Customer      95 people  avg spend ¥12,135
  Standard Customer   64 people  avg spend ¥8,084
  At-Risk Customer    10 people  avg spend ¥5,555

💰 Monetization Framework

Target Clients

  • SMBs with ¥1M ~ ¥50M annual revenue, data scattered in Excel/ERP/POS systems
  • Business owners who need analytics but can’t hire a data engineer
  • Companies that want to upgrade but can’t afford Snowflake + Tableau

Pricing Structure

ServicePriceDescription
Base Setup¥8,000One-time data modeling + report templates
Monthly Maintenance¥500/moMonthly run + data quality checks
Custom Model¥3,000 eachNew analysis requirements add one dbt model
Training¥2,000/sessionTeach client’s staff to query data themselves
Annual Package¥12,000/yrSetup + 12 months maintenance + 2 custom models

Competitive Comparison

SolutionCostSetup TimeBest For
Snowflake + dbt Cloud$2,000+/month2-4 weeksEnterprise
AWS Redshift$1,000+/month1-3 weeksMid-market
DuckDB + dbt¥8K-20K one-timeHalf a daySMBs
Excel / Manual¥0 software + ¥50K laborRepeating weeklyTiny businesses

Delivery Checklist

  • Client provides: Business data exports (CSV/Excel), data dictionary if available
  • You deliver: Complete dbt project code + DuckDB database file + Excel reports + deployment documentation
  • Acceptance criteria: One-click re-run generates latest reports, data is accurate

🔗 Scaling the Service

Combine with Previously Learned Skills

SkillHow to Combine with dbt
Cross-DB JOINsATTACH MySQL/PostgreSQL as dbt sources
Pandas IntegrationUse Python dbt models for complex cleaning
FastAPI APIBuild REST API on dbt output for browser-based queries
Cron AutomationSchedule daily dbt run + email delivery

Industry-Specific Variants

  • E-commerce: Multi-store dashboard + SKU analysis + competitor price tracking
  • Restaurant: Ingredient cost analysis + dish margin ranking + peak hour analysis
  • Logistics: Delivery time analytics + anomaly detection + driver performance
  • Manufacturing: Production capacity + yield rate tracking + supply chain management

The dbt Ecosystem Opportunity

dbt is the fastest-growing tool in data engineering. Adding dbt + DuckDB to your toolkit opens up:

  1. Freelance dbt modeling gigs on Upwork/Fiverr — ¥3,000-5,000/project
  2. Excel-to-dbt migration consulting — ¥5,000-10,000/project
  3. dbt training courses on knowledge platforms
  4. Corporate dbt + DuckDB training — ¥3,000-5,000/day

Summary

dbt + DuckDB is the optimal data warehousing solution for small and medium businesses. It enables any SQL-capable developer to build a production-grade data warehouse in half a day, for less than 1/10th the cost of traditional solutions.

Your new skill package includes:

  • ✅ Three-tier data modeling architecture (Staging → Marts → Dashboard)
  • ✅ dbt project configuration and model authoring
  • ✅ RFM customer segmentation analytics
  • ✅ One-click report export (Python + DuckDB + Excel)
  • ✅ Complete monetization framework and delivery process

Next steps: Save this project template. Next time a client asks “Can you build a data warehouse?” — your answer is: “Yes, ¥8,000, and it’s running by the end of the day.”

All code verified on DuckDB v1.5.2, dbt-core v1.11.11, dbt-duckdb v1.10.1, Python 3.12

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy