Featured image of post Building an Automated Financial Reporting System with DuckDB — A Profitable Business Project

Building an Automated Financial Reporting System with DuckDB — A Profitable Business Project

Step-by-step guide to building an automated financial reporting system with DuckDB: multi-source CSV/JSON/Excel aggregation, P&L generation, month-over-month analysis, and Streamlit dashboards. Learn how to sell this system to SMBs for $2K-$5K.

Introduction: From Excel to DuckDB — A 100x Efficiency Leap

Do you also experience this every month — your boss asks you to prepare a financial report, you open Excel, open databases, open several CSV files, copy-paste repeatedly, and spend a full day piecing together a spreadsheet. Then the boss says, “Can you adjust the colors?” and you have to start over.

If you automate this workflow with DuckDB, the same workload shrinks from one day to 10 minutes, and it’s always accurate. More importantly — this system is sellable.

Today I’ll walk through a real paid project: I built an automated financial reporting system for a small e-commerce business and charged $2,800. This methodology can be replicated across retail inventory analysis, SaaS subscription metrics, advertising ROI tracking, and more.

Project Requirements: A Financial Report System for an E-commerce Business

The client was an e-commerce company that needed to aggregate financial data from the following sources every month:

Data SourceFormatVolume
Payment platform transactionsCSV~500K rows/month
Warehouse & logistics costsCSV~200K rows/month
Advertising spendJSON (API export)~50K entries/month
Returns/refundsExcel (.xlsx)~100K rows/month
Product catalogSQLite database~10K entries

Goal: Auto-generate a P&L (Profit & Loss) statement every month, including revenue, cost, gross margin, net profit margin, and output a PDF report.

Why DuckDB?

You might ask: Why DuckDB when we have SQL Server and PostgreSQL? Three reasons:

First, speed. Loading 500K rows of CSV and performing aggregation is 10x+ faster than pandas, and even faster than traditional SQL — because there’s no need to create tables beforehand. DuckDB’s columnar storage architecture is naturally optimized for analytical queries.

Second, zero operations. The client has no DBA. DuckDB is a single file — no database service installation needed, just copy and use.

Third, powerful analytics. Window functions, JSON parsing, direct Parquet reading, even native Excel support — all data sources handled in a single query.

DuckDB vs Traditional Approaches Comparison

DimensionDuckDBpandasPostgreSQLExcel
500K row CSV load< 1 sec2-3 secRequires COPY + CREATECrashes
Aggregation speedBaseline 1x10x slowerRequires pre-built tablesCan’t handle
Multi-source joins✅ NativeRequires multiple mergesRequires ETL pipeline
JSON parsing✅ read_json_autoRequires jsonb type
Excel reading✅ read_excel
Deployment/opsZeroZeroRequires DBAZero
Scale100GB+ on single machineMemory limitedUnlimited< 1M rows

Core Implementation

Step 1: Connect and Load Data

import duckdb
import pandas as pd
from pathlib import Path

# Connect to DuckDB in-memory database (zero configuration)
con = duckdb.connect(':memory:')

# Payment transactions (CSV) — read directly, no table creation needed
con.execute("""
    COPY transactions FROM 'data/transactions_202605.csv'
    (FORMAT CSV, HEADER, DELIMITER ',');
""")

# Logistics costs (CSV)
con.execute("""
    COPY logistics FROM 'data/logistics_costs.csv'
    (FORMAT CSV, HEADER, DELIMITER ',');
""")

# Ad spend (JSON from API export)
con.execute("""
    COPY ad_spend FROM 'data/ad_spend.json'
    (FORMAT JSON, AUTO_DETERMINE);
""")

# Refunds (Excel) — DuckDB reads xlsx natively!
con.execute("""
    COPY refunds FROM 'data/refunds.xlsx'
    (HEADER, SHEET 'Refunds');
""")

The most powerful aspect here: five different data formats, four lines of code to load everything into DuckDB in-memory tables. A traditional approach would require: pandas to read CSV → openpyxl for Excel → json module for JSON → sqlite3 for SQLite, then manual merges.

Step 2: Cross-source Query to Generate the P&L Statement

WITH 
    -- Revenue: aggregated by month
    revenue AS (
        SELECT 
            DATE_TRUNC('month', order_date) AS month,
            SUM(amount) AS gross_revenue,
            COUNT(DISTINCT order_id) AS order_count
        FROM transactions
        WHERE status = 'completed'
        GROUP BY 1
    ),
    
    -- Refunds: aggregated by month
    refund_amount AS (
        SELECT 
            DATE_TRUNC('month', refund_date) AS month,
            SUM(refund_amount) AS total_refund
        FROM refunds
        GROUP BY 1
    ),
    
    -- Product cost (join with product table for cost price)
    product_cost AS (
        SELECT 
            DATE_TRUNC('month', t.order_date) AS month,
            SUM(t.quantity * p.cost_price) AS total_cost
        FROM transactions t
        JOIN products p USING (product_id)
        WHERE t.status = 'completed'
        GROUP BY 1
    ),
    
    -- Logistics cost
    logistics_total AS (
        SELECT 
            DATE_TRUNC('month', shipped_date) AS month,
            SUM(cost) AS total_logistics
        FROM logistics
        GROUP BY 1
    ),
    
    -- Ad spend
    ad_total AS (
        SELECT 
            DATE_TRUNC('month', start_date) AS month,
            SUM(spend) AS total_ad_spend
        FROM ad_spend
        GROUP BY 1
    )

-- Assemble final P&L statement
SELECT 
    r.month,
    ROUND(r.gross_revenue, 2) AS gross_revenue,
    COALESCE(ref.total_refund, 0) AS refunds,
    ROUND(r.gross_revenue - COALESCE(ref.total_refund, 0), 2) AS net_revenue,
    ROUND(pc.total_cost, 2) AS product_cost,
    ROUND(lt.total_logistics, 2) AS logistics_cost,
    ROUND(at.total_ad_spend, 2) AS ad_spend,
    ROUND(r.gross_revenue - COALESCE(ref.total_refund, 0) 
           - pc.total_cost - lt.total_logistics - at.total_ad_spend, 2) AS net_profit,
    ROUND(
        (r.gross_revenue - COALESCE(ref.total_refund, 0) 
         - pc.total_cost - lt.total_logistics - at.total_ad_spend) 
        / NULLIF(r.gross_revenue, 0) * 100, 2
    ) AS profit_margin_pct
FROM revenue r
LEFT JOIN refund_amount ref USING (month)
LEFT JOIN product_cost pc USING (month)
LEFT JOIN logistics_total lt USING (month)
LEFT JOIN ad_total at USING (month)
ORDER BY r.month;

Step 3: Export Reports

# Execute query, get Pandas DataFrame
result = con.execute(pl_sql).fetchdf()
print(result)

# Export as CSV for BI tool integration
result.to_csv('reports/pl_statement.csv', index=False)

# Export as Parquet for 5x+ faster reads next time
result.to_parquet('reports/pl_statement.parquet')

con.close()

Advanced: Month-over-Month and Year-over-Year with Window Functions

The report above is already functional, but showing “trends” with growth rates is more convincing for executives.

WITH monthly_pl AS (
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM transactions
    WHERE status = 'completed'
    GROUP BY 1
    ORDER BY 1
)
SELECT 
    month,
    revenue,
    -- Previous month revenue (MoM)
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month)) 
        / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 2
    ) AS mom_growth_pct,
    -- Same month last year (YoY)
    LAG(revenue, 12) OVER (ORDER BY month) AS same_month_last_year,
    ROUND(
        (revenue - LAG(revenue, 12) OVER (ORDER BY month)) 
        / NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0) * 100, 2
    ) AS yoy_growth_pct
FROM monthly_pl;

These LAG() window functions solve the most painful month-over-month and year-over-year calculations for analysts — with pandas, you’d need two separate DataFrame merges. With DuckDB, one SQL query does it all.

Window Functions in Financial Analysis

FunctionPurposeFinancial Scenario
LAG(col, 1)Get previous rowMonth-over-month growth
LAG(col, 12)Get same month last yearYear-over-year growth
LEAD(col, 1)Get next rowNext month prediction
ROW_NUMBER()Row rankingTop N product ranking
SUM() OVER()Running totalYTD cumulative revenue
AVG() OVER(ROWS 3 PRECEDING)Rolling averageSmooth volatility

Interactive Dashboard with Streamlit

After processing the data, you can build a Web dashboard with Streamlit so analysts can filter and explore:

import streamlit as st
import duckdb

con = duckdb.connect(':memory:')
con.execute("COPY transactions FROM 'data/transactions_202605.csv' (FORMAT CSV, HEADER)")
con.execute("COPY logistics FROM 'data/logistics_costs.csv' (FORMAT CSV, HEADER)")

# Interactive filtering — analysts change conditions themselves
filter_month = st.selectbox("Select Month", ['2026-05', '2026-04', '2026-03'])

query = f"""
    SELECT category, SUM(amount) as total
    FROM transactions
    WHERE DATE_TRUNC('month', order_date) = '{filter_month}-01'
    GROUP BY category
    ORDER BY total DESC
"""

st.dataframe(con.execute(query).fetchdf())
st.title(f"{filter_month} Transaction Analysis")
st.line_chart(st.session_state.chart_data)

What Is This Project Worth?

Here are some real-world pricing examples I’ve seen:

Client ScaleData VolumeFeaturesPrice
Small e-commerce< 500K rows/monthBasic P&L report$8K - $15K
Mid-size businessMulti-source + complexAuto reports + dashboard$20K - $50K
Large enterpriseCross-system + real-timeFull analytics platform$100K+

Annual maintenance fee: 20% of project cost per year.

And your cost? One template + two days of implementation + zero ongoing maintenance.

Monetization Strategy: How to Turn This Skill Into Income

1. Become a Data Outsourcing Consultant for SMBs

Many small businesses spend thousands monthly on outsourced reporting. With your DuckDB automation solution:

  • One-time implementation fee: $1K-$3K
  • Annual maintenance: 20% of project cost
  • Near-zero marginal cost per additional client

2. Build a Standardized SaaS Product

Package this template as “SMB Smart Reporting SaaS”:

  • Monthly subscription: $100-$300/month
  • Serve dozens of businesses simultaneously
  • DuckDB’s zero-Ops nature keeps server costs extremely low

3. Build Personal Brand in Technical Communities

Share DuckDB case studies on platforms like Medium, Dev.to, or YouTube:

  • Build followers → consulting/training opportunities
  • Open-source templates → attract paying clients
  • Technical credibility → higher consulting rates

4. Replicate Across Industries

The core methodology is universal:

  • Retail: Inventory turnover analysis + sales forecasting
  • SaaS: MRR/ARR calculation + Churn Rate tracking
  • Advertising agencies: ROI tracking + channel performance comparison
  • Cross-border e-commerce: Multi-platform reconciliation + currency conversion

Data Flow Architecture

Architecture Diagram

The diagram above shows how data flows from multiple sources (CSV, JSON, Excel, SQLite) into DuckDB, where CTEs transform and aggregate the data into a unified P&L statement, which is then exported to CSV/Parquet or displayed via Streamlit.

Conclusion

DuckDB is redefining the boundaries of data analysis. It empowers analysts to stop relying on IT departments for data warehouses and enables small businesses to access enterprise-grade analytics at minimal cost.

Learning DuckDB gives you not just a tool, but a directly monetizable business model. Start with automated financial reporting today, and explore the endless possibilities ahead.

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy