Featured image of post DuckDB + dbt Automated Reporting Pipeline: The Last Mile from Data to Sellable Products

DuckDB + dbt Automated Reporting Pipeline: The Last Mile from Data to Sellable Products

Build an end-to-end automated reporting pipeline with DuckDB + dbt + Python: from data ingestion and dbt model construction, to data quality testing, and professional report generation. Includes executable code and monetization strategies.

Why Does Your Data Analysis Product Always Miss the “Last Mile”?

Many freelance data analysts face an awkward situation:

A client gives you a bunch of CSVs and Excel files. You say, “Give me two days to build an analysis pipeline.” You spend two days building models and producing beautiful aggregated data—and then what?

Then you open Excel, manually copy-paste, adjust formatting, and send it to the client. The client says, “What’s the difference between this and what I could do myself?”

The problem isn’t technical—it’s in the delivery format. Clients don’t buy SQL query results; they buy a professional, explainable data product they can directly present in meetings.

Today, I’ll walk you through the complete pipeline from raw data to a sellable report using DuckDB + dbt + Python.


Architecture Overview: A Three-Step Strategy

The core idea: use dbt for data quality, DuckDB for analysis, and Python for delivery.

Raw Data (CSV/Excel) → dbt Model Layer → DuckDB Database → Python Report Generation → Client Delivery
  • dbt Model Layer: Define four-layer architecture (staging → fact → dimension → mart) with built-in tests and documentation
  • DuckDB: Acts as a lightweight data warehouse, storing dbt-compiled result tables
  • Python Report Generation: Reads DuckDB results, formats them into professional reports

Comparison with traditional approaches:

ApproachCostSetup TimeMaintainabilityBest For
Manual Excel ReportsFree2-3 hrs each timePoor, no version controlOne-off analysis
Tableau/PowerBI$15/mo/user1-2 weeksMediumInteractive BI
Airflow + PostgreSQLServer costs2-4 weeksGoodEnterprise ETL
DuckDB + dbt + PythonFreeHalf a dayGoodSME client delivery

Step 1: Environment Setup (5 Minutes)

pip install duckdb dbt-duckdb pandas openpyxl

Initialize the dbt project:

mkdir duckdb-report-pipeline && cd duckdb-report-pipeline
dbt init sales_report
cd sales_report

Configure profiles.yml (place at ~/.dbt/profiles.yml):

sales_report:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "./data/sales.db"
      schemas:
        - public
      settings:
        max_memory: '8GB'
        threads: 4

Configure dbt_project.yml:

name: 'sales_report'
version: '1.0.0'
config-version: 2
profile: 'sales_report'

model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
target-path: "target"

models:
  sales_report:
    +materialized: view

Step 2: Data Ingestion and Model Building

2.1 Load Seed Data

Place raw data in the seeds/ directory. Using e-commerce sales as an example:

seeds/orders.csv

order_id,customer_id,product_id,order_date,amount,status,shop_name,category
1001,C001,P001,2026-01-15,299.50,completed,FlagshipA,Electronics
1002,C002,P003,2026-01-15,89.00,completed,SpecialtyB,Clothing
1003,C001,P005,2026-01-16,1299.00,completed,FlagshipA,Appliances
1004,C003,P002,2026-01-16,45.00,refunded,SpecialtyB,Food
1005,C004,P007,2026-01-17,599.00,completed,FlagshipC,Electronics
1006,C002,P001,2026-01-17,320.00,completed,FlagshipA,Electronics
1007,C005,P004,2026-01-18,150.00,completed,SpecialtyB,Clothing
1008,C001,P006,2026-01-18,899.00,completed,FlagshipC,Appliances

seeds/products.csv

product_id,product_name,category,shop_name,cost_price,supplier
P001,iPhone Case,Electronics,FlagshipA,50,Shenzhen
P002,Nut Gift Box,Food,SpecialtyB,12,Hangzhou
P003,Cotton T-Shirt,Clothing,SpecialtyB,15,Guangzhou
P004,Casual Jeans,Clothing,SpecialtyB,35,Guangzhou
P005,Air Fryer,Appliances,FlagshipA,300,Ningbo
P006,Robot Vacuum,Appliances,FlagshipC,400,Dongguan
P007,Bluetooth Earphones,Electronics,FlagshipC,200,Dongguan

2.2 Write dbt Models

models/stg_orders.sql — Staging layer, standardized format:

{{ config(materialized='table') }}

SELECT 
    order_id,
    customer_id,
    product_id,
    CAST(order_date AS DATE) AS order_date,
    amount,
    LOWER(TRIM(status)) AS status,
    TRIM(shop_name) AS shop_name,
    TRIM(category) AS category
FROM {{ source('seed', 'orders') }}
WHERE order_id IS NOT NULL
  AND amount > 0

models/fct_sales.sql — Fact layer, calculate profit:

{{ config(materialized='table') }}

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
products AS (
    SELECT * FROM {{ source('seed', 'products') }}
)

SELECT 
    o.order_id,
    o.customer_id,
    o.order_date,
    o.amount,
    o.status,
    o.shop_name,
    o.category,
    p.product_name,
    p.cost_price,
    p.supplier,
    ROUND(o.amount - p.cost_price, 2) AS gross_profit,
    ROUND(100.0 * (o.amount - p.cost_price) / NULLIF(o.amount, 0), 1) AS profit_margin_pct
FROM orders o
LEFT JOIN products p ON o.product_id = p.product_id
WHERE o.status = 'completed'

models/dm_shop_performance.sql — Aggregation layer, monthly shop performance:

{{ config(materialized='table') }}

SELECT 
    shop_name,
    category,
    DATE_TRUNC('month', order_date) AS sale_month,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    SUM(gross_profit) AS total_profit,
    AVG(profit_margin_pct) AS avg_margin,
    COUNT(DISTINCT customer_id) AS unique_customers,
    ROUND(
        100.0 * (
            SUM(amount) - LAG(SUM(amount)) OVER (
                PARTITION BY shop_name, category 
                ORDER BY DATE_TRUNC('month', order_date)
            )
        ) / NULLIF(LAG(SUM(amount)) OVER (
            PARTITION BY shop_name, category 
            ORDER BY DATE_TRUNC('month', order_date)
        ), 0),
        1
    ) AS mom_growth_pct
FROM {{ ref('fct_sales') }}
GROUP BY shop_name, category, DATE_TRUNC('month', order_date)
ORDER BY sale_month DESC, total_revenue DESC

2.3 Add Data Quality Tests

Define tests directly in the model file:

# models/schema.yml
version: 2

models:
  - name: fct_sales
    description: "Core sales fact table, completed orders only"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
      - name: gross_profit
        tests:
          - dbt_utils.not_negative

Run the pipeline:

dbt run        # Compile and execute all models
dbt test       # Run data quality tests
dbt docs generate  # Generate interactive documentation

Step 3: Generate Professional Reports with Python

This is the most critical step—transforming dbt-processed data into a report the client can directly present.

import duckdb
import pandas as pd
from datetime import datetime

con = duckdb.connect("data/sales.db")

# Get latest month's shop performance
latest_month = con.execute("""
    SELECT MAX(sale_month) FROM dm_shop_performance
""").fetchone()[0]

performance = con.execute(f"""
    SELECT * FROM dm_shop_performance 
    WHERE sale_month = '{latest_month}'
    ORDER BY total_revenue DESC
""").fetchdf()

# Calculate summary metrics
summary = con.execute(f"""
    SELECT 
        COUNT(DISTINCT shop_name) AS shop_count,
        SUM(order_count) AS total_orders,
        ROUND(SUM(total_revenue), 2) AS total_revenue,
        ROUND(SUM(total_profit), 2) AS total_profit,
        ROUND(AVG(avg_margin), 1) AS avg_margin,
        ROUND(
            100.0 * SUM(CASE WHEN mom_growth_pct > 0 THEN 1 ELSE 0 END) 
            / NULLIF(COUNT(*), 0), 1
        ) AS growth_ratio
    FROM dm_shop_performance
    WHERE sale_month = '{latest_month}'
""").fetchone()

# Generate text report
report_date = datetime.now().strftime('%Y-%m-%d')
month_label = latest_month.strftime('%Y-%m')

print("=" * 60)
print(f"📊 Monthly Sales Report — {month_label}")
print(f"📅 Generated: {report_date}")
print("=" * 60)
print(f"🏪 Active Shops: {summary[0]}")
print(f"🛒 Total Orders: {summary[1]:,}")
print(f"💰 Total Revenue: ${summary[2]:,.2f}")
print(f"📈 Total Profit: ${summary[3]:,.2f}")
print(f"🎯 Avg Margin: {summary[4]}%")
print(f"🚀 Growth Shops: {summary[5]}%")
print()

for _, row in performance.iterrows():
    growth = f"+{row['mom_growth_pct']}%" if pd.notna(row['mom_growth_pct']) and row['mom_growth_pct'] > 0 else "—"
    print(f"  {row['shop_name']} | {row['category']} | "
          f"Rev ${row['total_revenue']:,.0f} | "
          f"Profit ${row['total_profit']:,.0f} | "
          f"Margin {row['avg_margin']:.1f}% | "
          f"MoM {growth}")

# Optional: Export to Excel
performance.to_excel(f"Sales_Report_{month_label}.xlsx", index=False)
print(f"\n✅ Excel report saved: Sales_Report_{month_label}.xlsx")

Monetization Paths: How Much Can This Pipeline Earn?

Option 1: Monthly Managed Reports ($200–500/month)

Find a small e-commerce client, build this pipeline for them, and auto-generate reports monthly.

  • Your cost: 2–3 hours initial setup, then click a Python script each month
  • Client perceived value: Professional monthly reports, replacing a data analyst salary ($15K+/month)
  • Profit margin: $200–500/month, $2,400–6,000/year

Option 2: Data Product Package ($500–2,000/project)

Package this pipeline as a “Smart Sales Analysis System” and sell to multiple clients.

  • Each client only needs to swap the seed data (CSV format)
  • Model logic is generic; only field mapping needs tweaking
  • Deliverables: database file + Python report script + usage documentation

Option 3: SaaSification (Advanced)

Wrap the Python report generation with FastAPI and build a Streamlit frontend:

# app.py — 30 lines of API code
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
import duckdb, pandas as pd

app = FastAPI()

@app.get("/report")
def get_report():
    con = duckdb.connect("data/sales.db")
    df = con.execute("SELECT * FROM dm_shop_performance").fetchdf()
    return df.to_html()

Summary

The core value of this pipeline: transforming data analysis from “one-off queries” into “repeatable, deliverable products.”

  • dbt ensures data quality and maintainability
  • DuckDB provides a zero-cost local data warehouse
  • Python handles the final formatted delivery

When you can reliably deliver professional reports, you transition from “someone who does data” to “someone who sells data.”


📖 Full executable code and additional industry templates (restaurant, retail, cross-border e-commerce) are available at duckdblab.org, including detailed deployment guides and monetization case studies. 💡 More DuckDB practical tips → duckdblab.org

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy