Building an Automated Financial Reporting System with DuckDB — A Profitable Business Project

Project Tutorial

Building an Automated Financial Reporting System with DuckDB — A Profitable Business Project

Step-by-step guide to building an automated financial reporting system with DuckDB: multi-source CSV/JSON/Excel aggregation, P&L generation, month-over-month analysis, and Streamlit dashboards. Learn how to sell this system to SMBs for $2K-$5K.

Introduction: From Excel to DuckDB — A 100x Efficiency Leap

Do you also experience this every month — your boss asks you to prepare a financial report, you open Excel, open databases, open several CSV files, copy-paste repeatedly, and spend a full day piecing together a spreadsheet. Then the boss says, “Can you adjust the colors?” and you have to start over.

If you automate this workflow with DuckDB, the same workload shrinks from one day to 10 minutes, and it’s always accurate. More importantly — this system is sellable.

Today I’ll walk through a real paid project: I built an automated financial reporting system for a small e-commerce business and charged $2,800. This methodology can be replicated across retail inventory analysis, SaaS subscription metrics, advertising ROI tracking, and more.

Project Requirements: A Financial Report System for an E-commerce Business

The client was an e-commerce company that needed to aggregate financial data from the following sources every month:

Data Source	Format	Volume
Payment platform transactions	CSV	~500K rows/month
Warehouse & logistics costs	CSV	~200K rows/month
Advertising spend	JSON (API export)	~50K entries/month
Returns/refunds	Excel (.xlsx)	~100K rows/month
Product catalog	SQLite database	~10K entries

Goal: Auto-generate a P&L (Profit & Loss) statement every month, including revenue, cost, gross margin, net profit margin, and output a PDF report.

Why DuckDB?

You might ask: Why DuckDB when we have SQL Server and PostgreSQL? Three reasons:

First, speed. Loading 500K rows of CSV and performing aggregation is 10x+ faster than pandas, and even faster than traditional SQL — because there’s no need to create tables beforehand. DuckDB’s columnar storage architecture is naturally optimized for analytical queries.

Second, zero operations. The client has no DBA. DuckDB is a single file — no database service installation needed, just copy and use.

Third, powerful analytics. Window functions, JSON parsing, direct Parquet reading, even native Excel support — all data sources handled in a single query.

DuckDB vs Traditional Approaches Comparison

Dimension	DuckDB	pandas	PostgreSQL	Excel
500K row CSV load	< 1 sec	2-3 sec	Requires COPY + CREATE	Crashes
Aggregation speed	Baseline 1x	10x slower	Requires pre-built tables	Can’t handle
Multi-source joins	✅ Native	Requires multiple merges	Requires ETL pipeline	❌
JSON parsing	✅ read_json_auto	✅	Requires jsonb type	❌
Excel reading	✅ read_excel	✅	❌	✅
Deployment/ops	Zero	Zero	Requires DBA	Zero
Scale	100GB+ on single machine	Memory limited	Unlimited	< 1M rows

Core Implementation

Step 1: Connect and Load Data

import duckdb
import pandas as pd
from pathlib import Path

# Connect to DuckDB in-memory database (zero configuration)
con = duckdb.connect(':memory:')

# Payment transactions (CSV) — read directly, no table creation needed
con.execute("""
    COPY transactions FROM 'data/transactions_202605.csv'
    (FORMAT CSV, HEADER, DELIMITER ',');
""")

# Logistics costs (CSV)
con.execute("""
    COPY logistics FROM 'data/logistics_costs.csv'
    (FORMAT CSV, HEADER, DELIMITER ',');
""")

# Ad spend (JSON from API export)
con.execute("""
    COPY ad_spend FROM 'data/ad_spend.json'
    (FORMAT JSON, AUTO_DETERMINE);
""")

# Refunds (Excel) — DuckDB reads xlsx natively!
con.execute("""
    COPY refunds FROM 'data/refunds.xlsx'
    (HEADER, SHEET 'Refunds');
""")

The most powerful aspect here: five different data formats, four lines of code to load everything into DuckDB in-memory tables. A traditional approach would require: pandas to read CSV → openpyxl for Excel → json module for JSON → sqlite3 for SQLite, then manual merges.

Step 2: Cross-source Query to Generate the P&L Statement

WITH 
    -- Revenue: aggregated by month
    revenue AS (
        SELECT 
            DATE_TRUNC('month', order_date) AS month,
            SUM(amount) AS gross_revenue,
            COUNT(DISTINCT order_id) AS order_count
        FROM transactions
        WHERE status = 'completed'
        GROUP BY 1
    ),
    
    -- Refunds: aggregated by month
    refund_amount AS (
        SELECT 
            DATE_TRUNC('month', refund_date) AS month,
            SUM(refund_amount) AS total_refund
        FROM refunds
        GROUP BY 1
    ),
    
    -- Product cost (join with product table for cost price)
    product_cost AS (
        SELECT 
            DATE_TRUNC('month', t.order_date) AS month,
            SUM(t.quantity * p.cost_price) AS total_cost
        FROM transactions t
        JOIN products p USING (product_id)
        WHERE t.status = 'completed'
        GROUP BY 1
    ),
    
    -- Logistics cost
    logistics_total AS (
        SELECT 
            DATE_TRUNC('month', shipped_date) AS month,
            SUM(cost) AS total_logistics
        FROM logistics
        GROUP BY 1
    ),
    
    -- Ad spend
    ad_total AS (
        SELECT 
            DATE_TRUNC('month', start_date) AS month,
            SUM(spend) AS total_ad_spend
        FROM ad_spend
        GROUP BY 1
    )

-- Assemble final P&L statement
SELECT 
    r.month,
    ROUND(r.gross_revenue, 2) AS gross_revenue,
    COALESCE(ref.total_refund, 0) AS refunds,
    ROUND(r.gross_revenue - COALESCE(ref.total_refund, 0), 2) AS net_revenue,
    ROUND(pc.total_cost, 2) AS product_cost,
    ROUND(lt.total_logistics, 2) AS logistics_cost,
    ROUND(at.total_ad_spend, 2) AS ad_spend,
    ROUND(r.gross_revenue - COALESCE(ref.total_refund, 0) 
           - pc.total_cost - lt.total_logistics - at.total_ad_spend, 2) AS net_profit,
    ROUND(
        (r.gross_revenue - COALESCE(ref.total_refund, 0) 
         - pc.total_cost - lt.total_logistics - at.total_ad_spend) 
        / NULLIF(r.gross_revenue, 0) * 100, 2
    ) AS profit_margin_pct
FROM revenue r
LEFT JOIN refund_amount ref USING (month)
LEFT JOIN product_cost pc USING (month)
LEFT JOIN logistics_total lt USING (month)
LEFT JOIN ad_total at USING (month)
ORDER BY r.month;

Step 3: Export Reports

# Execute query, get Pandas DataFrame
result = con.execute(pl_sql).fetchdf()
print(result)

# Export as CSV for BI tool integration
result.to_csv('reports/pl_statement.csv', index=False)

# Export as Parquet for 5x+ faster reads next time
result.to_parquet('reports/pl_statement.parquet')

con.close()

Advanced: Month-over-Month and Year-over-Year with Window Functions

The report above is already functional, but showing “trends” with growth rates is more convincing for executives.

WITH monthly_pl AS (
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM transactions
    WHERE status = 'completed'
    GROUP BY 1
    ORDER BY 1
)
SELECT 
    month,
    revenue,
    -- Previous month revenue (MoM)
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month)) 
        / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 2
    ) AS mom_growth_pct,
    -- Same month last year (YoY)
    LAG(revenue, 12) OVER (ORDER BY month) AS same_month_last_year,
    ROUND(
        (revenue - LAG(revenue, 12) OVER (ORDER BY month)) 
        / NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0) * 100, 2
    ) AS yoy_growth_pct
FROM monthly_pl;

These LAG() window functions solve the most painful month-over-month and year-over-year calculations for analysts — with pandas, you’d need two separate DataFrame merges. With DuckDB, one SQL query does it all.

Window Functions in Financial Analysis

Function	Purpose	Financial Scenario
`LAG(col, 1)`	Get previous row	Month-over-month growth
`LAG(col, 12)`	Get same month last year	Year-over-year growth
`LEAD(col, 1)`	Get next row	Next month prediction
`ROW_NUMBER()`	Row ranking	Top N product ranking
`SUM() OVER()`	Running total	YTD cumulative revenue
`AVG() OVER(ROWS 3 PRECEDING)`	Rolling average	Smooth volatility

Interactive Dashboard with Streamlit

After processing the data, you can build a Web dashboard with Streamlit so analysts can filter and explore:

import streamlit as st
import duckdb

con = duckdb.connect(':memory:')
con.execute("COPY transactions FROM 'data/transactions_202605.csv' (FORMAT CSV, HEADER)")
con.execute("COPY logistics FROM 'data/logistics_costs.csv' (FORMAT CSV, HEADER)")

# Interactive filtering — analysts change conditions themselves
filter_month = st.selectbox("Select Month", ['2026-05', '2026-04', '2026-03'])

query = f"""
    SELECT category, SUM(amount) as total
    FROM transactions
    WHERE DATE_TRUNC('month', order_date) = '{filter_month}-01'
    GROUP BY category
    ORDER BY total DESC
"""

st.dataframe(con.execute(query).fetchdf())
st.title(f"{filter_month} Transaction Analysis")
st.line_chart(st.session_state.chart_data)

What Is This Project Worth?

Here are some real-world pricing examples I’ve seen:

Client Scale	Data Volume	Features	Price
Small e-commerce	< 500K rows/month	Basic P&L report	$8K - $15K
Mid-size business	Multi-source + complex	Auto reports + dashboard	$20K - $50K
Large enterprise	Cross-system + real-time	Full analytics platform	$100K+

Annual maintenance fee: 20% of project cost per year.

And your cost? One template + two days of implementation + zero ongoing maintenance.

Monetization Strategy: How to Turn This Skill Into Income

1. Become a Data Outsourcing Consultant for SMBs

Many small businesses spend thousands monthly on outsourced reporting. With your DuckDB automation solution:

One-time implementation fee: $1K-$3K
Annual maintenance: 20% of project cost
Near-zero marginal cost per additional client

2. Build a Standardized SaaS Product

Package this template as “SMB Smart Reporting SaaS”:

Monthly subscription: $100-$300/month
Serve dozens of businesses simultaneously
DuckDB’s zero-Ops nature keeps server costs extremely low

3. Build Personal Brand in Technical Communities

Share DuckDB case studies on platforms like Medium, Dev.to, or YouTube:

Build followers → consulting/training opportunities
Open-source templates → attract paying clients
Technical credibility → higher consulting rates

4. Replicate Across Industries

The core methodology is universal:

Retail: Inventory turnover analysis + sales forecasting
SaaS: MRR/ARR calculation + Churn Rate tracking
Advertising agencies: ROI tracking + channel performance comparison
Cross-border e-commerce: Multi-platform reconciliation + currency conversion

Data Flow Architecture

Architecture Diagram

The diagram above shows how data flows from multiple sources (CSV, JSON, Excel, SQLite) into DuckDB, where CTEs transform and aggregate the data into a unified P&L statement, which is then exported to CSV/Parquet or displayed via Streamlit.

Conclusion

DuckDB is redefining the boundaries of data analysis. It empowers analysts to stop relying on IT departments for data warehouses and enables small businesses to access enterprise-grade analytics at minimal cost.

Learning DuckDB gives you not just a tool, but a directly monetizable business model. Start with automated financial reporting today, and explore the endless possibilities ahead.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.