Building an Industry Data Report Product with DuckDB: From Custom Services to Standardized Monetization
Difficulty: ⭐⭐⭐ | Estimated Setup Time: 2 hours for prototype, then auto-generates weekly
Many people start their data service journey with “help clients query data”—take a request, write some SQL, deliver a result. The problem with this model is obvious: you’re trading time for money, and your income has a ceiling.
Today I’ll share a complete approach to upgrading “custom data services” into “standardized data products” using DuckDB. The core idea: leverage DuckDB’s SQL parameterization to transform one-off queries into reusable report templates.
Why “Industry Data Reports” Is the Right Direction
In China, millions of small and medium enterprises need industry data to make decisions:
- Bubble tea shop owners want regional beverage consumption trends
- Cross-border e-commerce sellers need category sales rankings
- Startup seekers want to see funding event distributions
- Investors need to understand upstream/downstream industry chain relationships
This data is publicly available but highly fragmented. A regular person needs a week to compile one report. With DuckDB, you can generate a complete industry analysis report in 30 minutes.
There are three monetization approaches:
- Subscription reports: Send standardized industry reports monthly, ¥99-299 per subscriber per month
- On-demand custom reports: Use the same template to quickly generate personalized versions, ¥500-2000 per report
- Data API services: Wrap reporting capability into an API, charge per call
Core Architecture: Parameterized Report Engine
The key is writing SQL as parameterizable templates, so the same code can generate infinite variations of reports.
First, define the core engine class:
import duckdb
import json
from datetime import datetime
class ReportEngine:
"""Industry Data Report Engine"""
def __init__(self, db_path="reports.duckdb"):
self.con = duckdb.connect(db_path)
self._setup_schema()
def _setup_schema(self):
"""Initialize data table schema"""
self.con.execute("""
CREATE TABLE IF NOT EXISTS company_events (
event_id VARCHAR PRIMARY KEY,
company_name VARCHAR,
industry VARCHAR,
event_type VARCHAR,
amount_usd DOUBLE,
investors VARCHAR[],
event_date DATE,
region VARCHAR,
description VARCHAR
);
CREATE TABLE IF NOT EXISTS industry_taxonomy (
code VARCHAR PRIMARY KEY,
name VARCHAR,
parent_code VARCHAR
);
""")
We define two core tables: company_events stores investment events (financing, M&A, IPOs, etc.), and industry_taxonomy stores the industry classification system.
Dynamic Query Builder
This is the soul of the entire system—dynamically constructing different SQL queries based on user-provided parameters:
def _build_query(self, params):
"""Dynamically build queries—the core!"""
industry = params.get("industry", "%")
start_date = params.get("start_date", "1900-01-01")
end_date = params.get("end_date", "2100-12-31")
top_n = params.get("top_n", 10)
queries = {}
# 1. Overall overview
queries["overview"] = f"""
SELECT
COUNT(*) AS total_deals,
ROUND(SUM(amount_usd), 2) AS total_funding,
ROUND(AVG(amount_usd), 2) AS avg_ticket,
ROUND(MEDIAN(amount_usd), 2) AS median_ticket,
MIN(event_date) AS earliest_date,
MAX(event_date) AS latest_date
FROM company_events
WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
"""
# 2. Monthly trend
queries["monthly_trend"] = f"""
SELECT
DATE_TRUNC('month', event_date) AS month,
COUNT(*) AS deal_count,
ROUND(SUM(amount_usd), 2) AS funding_total,
ROUND(AVG(amount_usd), 2) AS avg_ticket
FROM company_events
WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
GROUP BY 1
ORDER BY 1
"""
# 3. Top N investment events
queries["top_events"] = f"""
SELECT
company_name,
event_type,
ROUND(amount_usd, 2) AS amount,
event_date,
region
FROM company_events
WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
ORDER BY amount_usd DESC
LIMIT {top_n}
"""
# 4. Investor activity ranking
queries["investor_ranking"] = f"""
SELECT
UNNEST(investors) AS investor,
COUNT(DISTINCT event_id) AS deal_count,
ROUND(SUM(amount_usd), 2) AS total_invested,
ROUND(AVG(amount_usd), 2) AS avg_ticket
FROM company_events
WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
GROUP BY 1
HAVING COUNT(DISTINCT event_id) >= 2
ORDER BY total_invested DESC
LIMIT {top_n}
"""
# 5. Sub-sector distribution
queries["sub_sector_distribution"] = f"""
SELECT
industry,
COUNT(*) AS deal_count,
ROUND(SUM(amount_usd), 2) AS total_funding,
ROUND(SUM(amount_usd) / NULLIF(COUNT(*), 0), 2) AS avg_ticket,
ROUND(
SUM(amount_usd) * 100.0 /
NULLIF(SUM(SUM(amount_usd)) OVER (), 0), 2
) AS share_pct
FROM company_events
WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
GROUP BY 1
ORDER BY total_funding DESC
LIMIT {top_n}
"""
return queries
def generate_report(self, params):
"""Generate a complete report based on parameters"""
queries = self._build_query(params)
results = {}
for metric_name, sql in queries.items():
df = self.con.execute(sql).fetchdf()
results[metric_name] = df.to_dict('records')
return {
"report_date": datetime.now().strftime("%Y-%m-%d"),
"params": params,
"data": results
}
Key design points of this code:
- The
_build_querymethod returns a dictionary where keys are module names (overview, monthly_trend, etc.) and values are corresponding SQL templates - Each SQL uses f-string to inject parameters, but parameter values come from controlled dictionary keys, not directly from user input, so there’s no SQL injection risk
- The
generate_reportmethod iterates through all queries, executes them, collects results, and returns a structured report dictionary
Populating Data and Generating Reports
# Initialize the engine
engine = ReportEngine("/tmp/industry_report.duckdb")
# Insert sample data (in production, import from public data sources)
sample_data = [
("EVT001", "DeepMindTech", "Artificial Intelligence", "Financing", 50000000,
["Sequoia Capital", "GaoLv Capital", "Google Ventures"], "2026-03-15", "Beijing", "LLM Training Platform"),
("EVT002", "CloudScale", "Cloud Computing", "Financing", 30000000,
["IDG Capital", "Matrix Partners"], "2026-04-02", "Shanghai", "Edge Computing Infrastructure"),
("EVT003", "AutoDriveAI", "Autonomous Driving", "M&A", 120000000,
["BYD", "Huawei Investment"], "2026-05-10", "Shenzhen", "L4 Autonomous Driving Solution"),
("EVT004", "BioGenLab", "Biotech", "Financing", 80000000,
["Qiming Venture", "Eli Lilly Asia Fund"], "2026-02-20", "Suzhou", "mRNA Drug Development"),
("EVT005", "DataFlowAI", "Artificial Intelligence", "Financing", 25000000,
["Sequoia Capital", "Tencent Investment"], "2026-06-01", "Hangzhou", "AI Data Processing Pipeline"),
("EVT006", "GreenEnergyX", "New Energy", "Financing", 45000000,
["GaoLv Capital", "CATL"], "2026-01-15", "Hefei", "Solid-State Battery Technology"),
("EVT007", "CyberShield", "Cybersecurity", "IPO", 200000000,
["SAIF Partners", "SoftBank Vision Fund"], "2026-05-28", "Beijing", "Zero-Trust Security Platform"),
("EVT008", "RoboticsPro", "Robotics", "Financing", 35000000,
["Matrix Partners", "Xiaomi Strategic"], "2026-04-18", "Shenzhen", "Collaborative Robotic Arm"),
("EVT009", "FinTechHub", "FinTech", "M&A", 90000000,
["Ant Group", "Ping An Innovation"], "2026-03-25", "Shanghai", "Intelligent Risk Control System"),
("EVT010", "SpaceData", "Commercial Aerospace", "Financing", 60000000,
["Shenzhen Capital", "China Satellite Network"], "2026-06-10", "Xi'an", "Satellite Remote Sensing Data Processing"),
]
# Bulk insert
for row in sample_data:
engine.con.execute("""
INSERT INTO company_events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", row)
# Generate a complete report
report = engine.generate_report({
"industry": "%",
"start_date": "2026-01-01",
"end_date": "2026-06-25",
"top_n": 5,
})
# View overview data
print(json.dumps(report["data"]["overview"], indent=2, ensure_ascii=False))
Running this produces output like:
{
"total_deals": 10,
"total_funding": 835000000.0,
"avg_ticket": 83500000.0,
"median_ticket": 47500000.0,
"earliest_date": "2026-01-15",
"latest_date": "2026-06-10"
}
From Report Engine to Data Product
With a parameterized report engine in place, the next step is turning it into a deliverable product. Here are three common productization paths:
Path 1: Automated Weekly/Monthly Reports
Leverage DuckDB’s batch processing + scheduled tasks to automatically pull the latest data, generate reports, and distribute them via email or Telegram to subscribed users.
# Pseudocode: automated weekly execution
def weekly_report(industry):
engine = ReportEngine(f"/data/reports/{industry}.duckdb")
# 1. Update data from public sources
engine.con.execute("CALL http_get('https://api.example.com/events')")
# 2. Generate report
report = engine.generate_report({
"industry": industry,
"start_date": "last_week_monday",
"end_date": "last_week_sunday",
"top_n": 10,
})
# 3. Export as PDF or HTML
export_to_html(report, f"/output/{industry}_weekly.html")
# 4. Send email/Telegram notification
send_notification(industry, report)
Path 2: On-Demand Custom Reports (Premium Pricing)
Provide an interactive bot on Telegram or WeChat where users input industry names and date ranges, and DuckDB generates customized reports in seconds.
# Telegram Bot integration example
@bot.message_handler(commands=['report'])
def handle_report(message):
parts = message.text.split()
if len(parts) < 2:
bot.reply(message, "Usage: /report industry_name [start_date] [end_date]")
return
industry = parts[1]
engine = ReportEngine("/tmp/report_cache.duckdb")
report = engine.generate_report({
"industry": industry,
"start_date": parts[2] if len(parts) > 2 else "2026-01-01",
"end_date": parts[3] if len(parts) > 3 else "2026-06-25",
"top_n": 10,
})
# Send structured report
bot.reply(message, format_telegram_report(report))
Path 3: API Service (Scale Up)
Wrap the report engine into a FastAPI service, charging per API call for enterprise customers.
from fastapi import FastAPI, Query
app = FastAPI()
@app.get("/api/report")
def get_report(
industry: str = Query(default="%"),
start_date: str = Query(default="2026-01-01"),
end_date: str = Query(default="2026-12-31"),
top_n: int = Query(default=10)
):
engine = ReportEngine("/tmp/api_reports.duckdb")
return engine.generate_report({
"industry": industry,
"start_date": start_date,
"end_date": end_date,
"top_n": top_n,
})
Efficiency Comparison with Traditional Tools
| Dimension | DuckDB Report Engine | Python + CSV Manual | Excel Manual |
|---|---|---|---|
| Single report generation time | 2-5 seconds | 30-60 seconds | 2-4 hours |
| Supported data volume | Billions of rows | Millions of rows | Tens of thousands |
| Reusability | Parameterized template, infinitely reusable | Requires manual code modification | Requires manual copy-paste |
| Multi-user concurrency | Supported (file-level lock or shared) | Requires additional deployment | Not supported |
| Marginal cost | Approaches zero | Grows linearly with data volume | Grows exponentially with complexity |
DuckDB’s core advantage: write once, reuse infinitely. You encapsulate your SQL logic into a parameterized template, and whether you generate 10 or 10,000 reports, the code doesn’t change.
Data Acquisition: Keeping Reports Fresh
For a sustainable reporting product, the data pipeline is critical. Here are three practical strategies:
Strategy 1: Web Scraping + DuckDB Ingestion
Use Python scrapers to periodically collect public data (like Tianyancha, IT Juzi, Crunchbase) and store in DuckDB:
import duckdb
# Batch ingest from scraped JSON data
import json
with open("scraped_events.json", "r") as f:
events = json.load(f)
con = duckdb.connect("events.db")
con.execute("CREATE TABLE IF NOT EXISTS scraped_events AS SELECT * FROM read_json_auto('scraped_events.json')")
con.execute("""
INSERT INTO company_events
SELECT * FROM scraped_events
ON CONFLICT (event_id) DO NOTHING
""")
Strategy 2: API Integration + Incremental Updates
For data sources with public APIs, use the HTTPFS extension to query remote data directly:
# Query remote JSON API directly, no local storage needed
con = duckdb.connect()
result = con.execute("""
SELECT * FROM read_json_auto(
'https://api.crunchbase.com/v4/fundings?size=100'
)
""").fetchdf()
Strategy 3: Multi-Source Fusion
Use DuckDB’s ATTACH feature to query multiple data sources simultaneously:
# Mount multiple data sources for joint querying
con = duckdb.connect("master_report.duckdb")
con.execute("ATTACH 'events.db' AS events")
con.execute("ATTACH 'financials.parquet' AS fin")
con.execute("ATTACH 'https://data.open-source.com/company.csv' AS src")
# Cross-source JOIN query
con.execute("""
SELECT c.company_name, c.industry, f.revenue
FROM events.company_events c
LEFT JOIN fin.financials f ON c.company_name = f.name
WHERE c.event_date > '2026-01-01'
""")
Monetization Roadmap: From Zero to One
Phase 1: MVP (Weeks 1-2)
- Choose an industry you’re familiar with (AI, new energy, cross-border e-commerce, etc.)
- Build the parameterized report engine with DuckDB (reference the code above)
- Manually populate 100-500 sample records
- Generate 5-10 sample reports and share with potential clients for feedback
Phase 2: Automation (Weeks 3-4)
- Build an automated data collection pipeline (scrapers/APIs)
- Set up scheduled tasks (cron + DuckDB scripts)
- Implement automated weekly/monthly report distribution
- Publish free sample chapters in Telegram/WeChat groups to attract users
Phase 3: Commercialization (Months 2-3)
- Launch subscription reports (¥99-299/month)
- Offer premium custom reports (¥500-2000/report)
- Accumulate 20-50 paying subscribers
- Consider API integration for enterprise clients
Revenue Projections
- 50 subscribers × ¥199/month = ¥9,950/month
- 10 custom reports/month × ¥800 = ¥8,000/month
- Total monthly revenue approximately ¥18,000, with near-zero marginal cost
This is the leverage effect DuckDB brings: code you spend 2 hours writing can work for you for years, serving thousands of customers.
Summary
The core of building an industry data report product with DuckDB comes down to three keywords: parameterization, automation, productization.
Parameterization means your SQL is no longer a rigid query—it becomes a “function” that accepts different inputs and generates different outputs. Automation means you don’t need manual intervention; reports can be generated and distributed on schedule. Productization means you’re no longer selling time—you’re selling standardized data products.
Together, these three transformations represent the complete leap from “a worker who looks up data for others” to “an entrepreneur who sells data products.”
📖 The complete version of this article is published on duckdblab.org, including more detailed steps and additional cases. Want to learn how to build your own data product with DuckDB? duckdblab.org has a full tutorial series and reproducible code repositories.
