The Data Analyst’s Ultimate Nightmare: Data Scattered Across Five Places
Picture this scenario:
- Your payment transaction data lives in
transactions_202605.csv, 500K rows - Logistics costs sit in another file
logistics_costs.csv, 200K rows - Ad spend is an API-exported JSON file
- Refund data is trapped in an Excel spreadsheet
- Product catalog is stored in a SQLite database
Your boss asks for a P&L (Profit & Loss) statement. You need to pull all five data sources together, join them, aggregate, and compute margins.
With traditional tools, you need to:
- Load each file with pandas using different functions
- Clean each DataFrame
merge()them all together (multiple times)- Perform aggregation calculations
If the data is large, pandas runs out of memory. If there are many sources, the merge logic becomes a nightmare.
DuckDB takes a completely different approach — it reads all data sources natively and joins them in a single SQL query.
DuckDB’s Native Multi-Source Support
DuckDB’s core advantage is that it doesn’t require you to load data into a unified database first. It reads various file formats and databases directly, and performs join queries in memory.
Supported Formats at a Glance
| Format | DuckDB Read Method |
|---|---|
| CSV | COPY ... FROM 'file.csv' (FORMAT CSV) |
| JSON | COPY ... FROM 'file.json' (FORMAT JSON) |
| Excel | COPY ... FROM 'file.xlsx' (HEADER, SHEET 'Sheet1') |
| Parquet | SELECT * FROM 'file.parquet' |
| SQLite | ATTACH 'db.sqlite' AS alias (TYPE sqlite) |
| Remote HTTP files | INSTALL httpfs; COPY ... FROM 'https://...' |
| Other DuckDB files | ATTACH 'other.db' AS alias |
Here’s the key: all of these operations can be combined in a single query.
In Action: Automated P&L Statement System
Below shows how to combine CSV, JSON, Excel, and SQLite into a complete P&L statement using DuckDB.
Project Structure
project/
├── data/
│ ├── transactions_202605.csv # Payment transactions (500K rows)
│ ├── logistics_costs.csv # Logistics costs (200K rows)
│ ├── ad_spend.json # Ad spend (JSON)
│ ├── refunds.xlsx # Refund data (Excel)
│ └── products.db # Product catalog (SQLite)
├── generate.py # Core code
└── reports/
├── pl_statement.csv # Output
└── pl_statement.parquet # Incremental read acceleration
Core Code
import duckdb
from pathlib import Path
# Connect to DuckDB in-memory database
con = duckdb.connect(':memory:')
# --- Step 1: Read all data sources ---
# 1. CSV - Payment transactions
con.execute("""
COPY transactions FROM 'data/transactions_202605.csv'
(FORMAT CSV, HEADER, DELIMITER ',');
""")
# 2. CSV - Logistics costs
con.execute("""
COPY logistics FROM 'data/logistics_costs.csv'
(FORMAT CSV, HEADER, DELIMITER ',');
""")
# 3. JSON - Ad spend
con.execute("""
COPY ad_spend FROM 'data/ad_spend.json'
(FORMAT JSON, AUTO_DETERMINE);
""")
# 4. Excel - Refund data
con.execute("""
COPY refunds FROM 'data/refunds.xlsx'
(HEADER, SHEET 'Refunds');
""")
# 5. SQLite - Product catalog
con.execute("""
ATTACH 'data/products.db' AS products_db (TYPE sqlite);
CREATE TABLE products AS
SELECT * FROM products_db.main.products;
""")
# --- Step 2: Cross-source joined query ---
pl_sql = """
WITH
-- Revenue: aggregate completed orders by month
revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS gross_revenue,
COUNT(DISTINCT order_id) AS order_count
FROM transactions
WHERE status = 'completed'
GROUP BY 1
),
-- Refunds: aggregate by month
refund_amount AS (
SELECT
DATE_TRUNC('month', refund_date) AS month,
SUM(refund_amount) AS total_refund
FROM refunds
GROUP BY 1
),
-- Product cost: join transactions with products
product_cost AS (
SELECT
DATE_TRUNC('month', t.order_date) AS month,
SUM(t.quantity * p.cost_price) AS total_cost
FROM transactions t
JOIN products p ON t.product_id = p.product_id
WHERE t.status = 'completed'
GROUP BY 1
),
-- Logistics cost: aggregate by month
logistics_total AS (
SELECT
DATE_TRUNC('month', shipped_date) AS month,
SUM(cost) AS total_logistics
FROM logistics
GROUP BY 1
),
-- Ad spend: aggregate by month
ad_total AS (
SELECT
DATE_TRUNC('month', start_date) AS month,
SUM(spend) AS total_ad_spend
FROM ad_spend
GROUP BY 1
)
-- Assemble final P&L statement
SELECT
r.month,
ROUND(r.gross_revenue, 2) AS gross_revenue,
COALESCE(ref.total_refund, 0) AS refunds,
ROUND(r.gross_revenue - COALESCE(ref.total_refund, 0), 2) AS net_revenue,
ROUND(pc.total_cost, 2) AS product_cost,
ROUND(lt.total_logistics, 2) AS logistics_cost,
ROUND(at.total_ad_spend, 2) AS ad_spend,
ROUND(
r.gross_revenue
- COALESCE(ref.total_refund, 0)
- pc.total_cost
- lt.total_logistics
- at.total_ad_spend,
2
) AS net_profit,
ROUND(
(r.gross_revenue
- COALESCE(ref.total_refund, 0)
- pc.total_cost
- lt.total_logistics
- at.total_ad_spend)
/ NULLIF(r.gross_revenue, 0) * 100,
2
) AS profit_margin_pct
FROM revenue r
LEFT JOIN refund_amount ref USING (month)
LEFT JOIN product_cost pc USING (month)
LEFT JOIN logistics_total lt USING (month)
LEFT JOIN ad_total at USING (month)
ORDER BY r.month;
"""
result = con.execute(pl_sql).fetchdf()
print(result)
# Export results
result.to_csv('reports/pl_statement.csv', index=False)
result.to_parquet('reports/pl_statement.parquet')
con.close()
The output looks like:
month gross_revenue refunds net_revenue product_cost \
0 2026-05-01 1250000.00 45000.00 1205000.00 520000.00
1 2026-06-01 1380000.00 38000.00 1342000.00 560000.00
logistics_cost ad_spend net_profit profit_margin_pct
0 85000.0 120000.0 440000.0 36.54
1 92000.0 135000.0 455000.0 33.91
Advanced: MoM & YoY Analysis with Window Functions
Your boss wants trends. Window functions handle this in a single query:
trend_sql = """
WITH monthly_pl AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM transactions
WHERE status = 'completed'
GROUP BY 1
ORDER BY 1
)
SELECT
month,
revenue,
-- Previous month revenue (MoM)
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100,
2
) AS mom_growth_pct,
-- Same month last year (YoY)
LAG(revenue, 12) OVER (ORDER BY month) AS same_month_last_year,
ROUND(
(revenue - LAG(revenue, 12) OVER (ORDER BY month))
/ NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0) * 100,
2
) AS yoy_growth_pct
FROM monthly_pl;
"""
In pandas, calculating MoM and YoY requires merging the DataFrame twice, leading to verbose and error-prone code. DuckDB handles it with LAG() window functions in one SQL statement.
Why DuckDB Beats Other Approaches for Multi-Source Integration
Comparison Table
| Solution | Best For | Data Source Compatibility | Ops Cost | Speed |
|---|---|---|---|---|
| DuckDB | Mid-size analytics, automated reports | CSV/JSON/Excel/Parquet/SQLite native | Zero | Extremely fast |
| pandas | Flexible data wrangling | Depends on various read_xxx functions | Zero | Slow with large data |
| PostgreSQL + views | Production-grade data platform | Requires external extensions | Needs DBA | Fast (but requires ETL) |
| Manual Excel | Quick one-off analysis | Can open common formats | Zero | Very slow |
The key difference: DuckDB folds “reading different formats” from the ETL phase into the query phase.
You don’t need to:
- Load CSVs into PostgreSQL first
- Write Python JSON parsing scripts
- Convert Excel to CSV
All format reads become SQL statements.
Next Level: Interactive Streamlit Dashboard
After computing the data, build a simple Streamlit dashboard so your client can explore by month:
import streamlit as st
import duckdb
con = duckdb.connect(':memory:')
con.execute("COPY transactions FROM 'data/transactions_202605.csv' (FORMAT CSV, HEADER)")
con.execute("COPY logistics FROM 'data/logistics_costs.csv' (FORMAT CSV, HEADER)")
filter_month = st.selectbox("Select Month", ['2026-05', '2026-04', '2026-03'])
query = f"""
SELECT category, SUM(amount) as total
FROM transactions
WHERE DATE_TRUNC('month', order_date) = '{filter_month}-01'
GROUP BY category
ORDER BY total DESC
"""
st.dataframe(con.execute(query).fetchdf())
Run streamlit run dashboard.py, and you have a complete interactive data product.
DuckDB vs pandas: Why DuckDB Wins for Multi-Source Joins
Many analysts are used to the pandas read_csv + merge workflow. But for multi-source scenarios, DuckDB has several crushing advantages:
1. Memory efficiency
pandas loads all data into memory, and merge operations create multiple intermediate DataFrames, using 3-5x the original data size in memory. DuckDB uses columnar storage + vectorized execution, using less than 1/3 of the memory for the same data volume.
2. Code conciseness
Merging 5 data sources in pandas typically requires 30-50 lines (including error handling). DuckDB condenses data reading and joins into a few SQL statements, typically 10-20 lines.
3. Performance
For datasets exceeding one million rows, DuckDB’s vectorized execution engine is 10-50x faster than pandas. These aren’t theoretical benchmarks — they’re real-world results.
How Much Is This Worth?
A multi-source automated reporting system can be priced at:
- Small business (< 500K rows/month, 2-3 data sources): $1,000 - $2,000
- Mid-size company (multi-source + automated reports + dashboard): $3,000 - $7,000
- Annual maintenance: 20% of project value/year
And your cost? A template script + two days of implementation + zero operational cost.
This pattern can be applied across industries:
- Retail: Inventory analysis, sales trends
- SaaS: Subscription analysis, churn tracking
- Advertising: ROI tracking, multi-channel attribution
- E-commerce: Multi-dimensional profit analysis, return rate tracking
The core idea never changes: DuckDB for multi-source aggregation → pandas/Streamlit for visualization → clients pay for your insights.
Next Steps
- Change file paths in the code to your local data files
- Run it with your company’s real data
- Export results as Parquet for 5x+ faster reads next time
Multi-source data integration is one of the most common data analysis scenarios. Master this with DuckDB, and you have your ticket into the data monetization arena.
📖 The full project template including data generation scripts, Streamlit dashboard template, and PDF export code is available on duckdblab.org — go grab it and start building.
