Featured image of post Building an Industry Data Report Product with DuckDB: From Custom Services to Standardized Monetization

Building an Industry Data Report Product with DuckDB: From Custom Services to Standardized Monetization

Learn how to build a parameterized industry report engine with DuckDB, transforming one-off data services into standardized, repeatable data products for passive income.

Building an Industry Data Report Product with DuckDB: From Custom Services to Standardized Monetization

Difficulty: ⭐⭐⭐ | Estimated Setup Time: 2 hours for prototype, then auto-generates weekly


Many people start their data service journey with “help clients query data”—take a request, write some SQL, deliver a result. The problem with this model is obvious: you’re trading time for money, and your income has a ceiling.

Today I’ll share a complete approach to upgrading “custom data services” into “standardized data products” using DuckDB. The core idea: leverage DuckDB’s SQL parameterization to transform one-off queries into reusable report templates.

Why “Industry Data Reports” Is the Right Direction

In China, millions of small and medium enterprises need industry data to make decisions:

  • Bubble tea shop owners want regional beverage consumption trends
  • Cross-border e-commerce sellers need category sales rankings
  • Startup seekers want to see funding event distributions
  • Investors need to understand upstream/downstream industry chain relationships

This data is publicly available but highly fragmented. A regular person needs a week to compile one report. With DuckDB, you can generate a complete industry analysis report in 30 minutes.

There are three monetization approaches:

  1. Subscription reports: Send standardized industry reports monthly, ¥99-299 per subscriber per month
  2. On-demand custom reports: Use the same template to quickly generate personalized versions, ¥500-2000 per report
  3. Data API services: Wrap reporting capability into an API, charge per call

Core Architecture: Parameterized Report Engine

The key is writing SQL as parameterizable templates, so the same code can generate infinite variations of reports.

First, define the core engine class:

import duckdb
import json
from datetime import datetime

class ReportEngine:
    """Industry Data Report Engine"""
    
    def __init__(self, db_path="reports.duckdb"):
        self.con = duckdb.connect(db_path)
        self._setup_schema()
    
    def _setup_schema(self):
        """Initialize data table schema"""
        self.con.execute("""
            CREATE TABLE IF NOT EXISTS company_events (
                event_id VARCHAR PRIMARY KEY,
                company_name VARCHAR,
                industry VARCHAR,
                event_type VARCHAR,
                amount_usd DOUBLE,
                investors VARCHAR[],
                event_date DATE,
                region VARCHAR,
                description VARCHAR
            );
            
            CREATE TABLE IF NOT EXISTS industry_taxonomy (
                code VARCHAR PRIMARY KEY,
                name VARCHAR,
                parent_code VARCHAR
            );
        """)

We define two core tables: company_events stores investment events (financing, M&A, IPOs, etc.), and industry_taxonomy stores the industry classification system.

Dynamic Query Builder

This is the soul of the entire system—dynamically constructing different SQL queries based on user-provided parameters:

    def _build_query(self, params):
        """Dynamically build queries—the core!"""
        industry = params.get("industry", "%")
        start_date = params.get("start_date", "1900-01-01")
        end_date = params.get("end_date", "2100-12-31")
        top_n = params.get("top_n", 10)
        
        queries = {}
        
        # 1. Overall overview
        queries["overview"] = f"""
            SELECT 
                COUNT(*) AS total_deals,
                ROUND(SUM(amount_usd), 2) AS total_funding,
                ROUND(AVG(amount_usd), 2) AS avg_ticket,
                ROUND(MEDIAN(amount_usd), 2) AS median_ticket,
                MIN(event_date) AS earliest_date,
                MAX(event_date) AS latest_date
            FROM company_events
            WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
        """
        
        # 2. Monthly trend
        queries["monthly_trend"] = f"""
            SELECT 
                DATE_TRUNC('month', event_date) AS month,
                COUNT(*) AS deal_count,
                ROUND(SUM(amount_usd), 2) AS funding_total,
                ROUND(AVG(amount_usd), 2) AS avg_ticket
            FROM company_events
            WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
            GROUP BY 1
            ORDER BY 1
        """
        
        # 3. Top N investment events
        queries["top_events"] = f"""
            SELECT 
                company_name,
                event_type,
                ROUND(amount_usd, 2) AS amount,
                event_date,
                region
            FROM company_events
            WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
            ORDER BY amount_usd DESC
            LIMIT {top_n}
        """
        
        # 4. Investor activity ranking
        queries["investor_ranking"] = f"""
            SELECT 
                UNNEST(investors) AS investor,
                COUNT(DISTINCT event_id) AS deal_count,
                ROUND(SUM(amount_usd), 2) AS total_invested,
                ROUND(AVG(amount_usd), 2) AS avg_ticket
            FROM company_events
            WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
            GROUP BY 1
            HAVING COUNT(DISTINCT event_id) >= 2
            ORDER BY total_invested DESC
            LIMIT {top_n}
        """
        
        # 5. Sub-sector distribution
        queries["sub_sector_distribution"] = f"""
            SELECT 
                industry,
                COUNT(*) AS deal_count,
                ROUND(SUM(amount_usd), 2) AS total_funding,
                ROUND(SUM(amount_usd) / NULLIF(COUNT(*), 0), 2) AS avg_ticket,
                ROUND(
                    SUM(amount_usd) * 100.0 / 
                    NULLIF(SUM(SUM(amount_usd)) OVER (), 0), 2
                ) AS share_pct
            FROM company_events
            WHERE event_date BETWEEN '{start_date}' AND '{end_date}'
            GROUP BY 1
            ORDER BY total_funding DESC
            LIMIT {top_n}
        """
        
        return queries
    
    def generate_report(self, params):
        """Generate a complete report based on parameters"""
        queries = self._build_query(params)
        results = {}
        
        for metric_name, sql in queries.items():
            df = self.con.execute(sql).fetchdf()
            results[metric_name] = df.to_dict('records')
        
        return {
            "report_date": datetime.now().strftime("%Y-%m-%d"),
            "params": params,
            "data": results
        }

Key design points of this code:

  • The _build_query method returns a dictionary where keys are module names (overview, monthly_trend, etc.) and values are corresponding SQL templates
  • Each SQL uses f-string to inject parameters, but parameter values come from controlled dictionary keys, not directly from user input, so there’s no SQL injection risk
  • The generate_report method iterates through all queries, executes them, collects results, and returns a structured report dictionary

Populating Data and Generating Reports

# Initialize the engine
engine = ReportEngine("/tmp/industry_report.duckdb")

# Insert sample data (in production, import from public data sources)
sample_data = [
    ("EVT001", "DeepMindTech", "Artificial Intelligence", "Financing", 50000000, 
     ["Sequoia Capital", "GaoLv Capital", "Google Ventures"], "2026-03-15", "Beijing", "LLM Training Platform"),
    ("EVT002", "CloudScale", "Cloud Computing", "Financing", 30000000,
     ["IDG Capital", "Matrix Partners"], "2026-04-02", "Shanghai", "Edge Computing Infrastructure"),
    ("EVT003", "AutoDriveAI", "Autonomous Driving", "M&A", 120000000,
     ["BYD", "Huawei Investment"], "2026-05-10", "Shenzhen", "L4 Autonomous Driving Solution"),
    ("EVT004", "BioGenLab", "Biotech", "Financing", 80000000,
     ["Qiming Venture", "Eli Lilly Asia Fund"], "2026-02-20", "Suzhou", "mRNA Drug Development"),
    ("EVT005", "DataFlowAI", "Artificial Intelligence", "Financing", 25000000,
     ["Sequoia Capital", "Tencent Investment"], "2026-06-01", "Hangzhou", "AI Data Processing Pipeline"),
    ("EVT006", "GreenEnergyX", "New Energy", "Financing", 45000000,
     ["GaoLv Capital", "CATL"], "2026-01-15", "Hefei", "Solid-State Battery Technology"),
    ("EVT007", "CyberShield", "Cybersecurity", "IPO", 200000000,
     ["SAIF Partners", "SoftBank Vision Fund"], "2026-05-28", "Beijing", "Zero-Trust Security Platform"),
    ("EVT008", "RoboticsPro", "Robotics", "Financing", 35000000,
     ["Matrix Partners", "Xiaomi Strategic"], "2026-04-18", "Shenzhen", "Collaborative Robotic Arm"),
    ("EVT009", "FinTechHub", "FinTech", "M&A", 90000000,
     ["Ant Group", "Ping An Innovation"], "2026-03-25", "Shanghai", "Intelligent Risk Control System"),
    ("EVT010", "SpaceData", "Commercial Aerospace", "Financing", 60000000,
     ["Shenzhen Capital", "China Satellite Network"], "2026-06-10", "Xi'an", "Satellite Remote Sensing Data Processing"),
]

# Bulk insert
for row in sample_data:
    engine.con.execute("""
        INSERT INTO company_events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, row)

# Generate a complete report
report = engine.generate_report({
    "industry": "%",
    "start_date": "2026-01-01",
    "end_date": "2026-06-25",
    "top_n": 5,
})

# View overview data
print(json.dumps(report["data"]["overview"], indent=2, ensure_ascii=False))

Running this produces output like:

{
  "total_deals": 10,
  "total_funding": 835000000.0,
  "avg_ticket": 83500000.0,
  "median_ticket": 47500000.0,
  "earliest_date": "2026-01-15",
  "latest_date": "2026-06-10"
}

From Report Engine to Data Product

With a parameterized report engine in place, the next step is turning it into a deliverable product. Here are three common productization paths:

Path 1: Automated Weekly/Monthly Reports

Leverage DuckDB’s batch processing + scheduled tasks to automatically pull the latest data, generate reports, and distribute them via email or Telegram to subscribed users.

# Pseudocode: automated weekly execution
def weekly_report(industry):
    engine = ReportEngine(f"/data/reports/{industry}.duckdb")
    
    # 1. Update data from public sources
    engine.con.execute("CALL http_get('https://api.example.com/events')")
    
    # 2. Generate report
    report = engine.generate_report({
        "industry": industry,
        "start_date": "last_week_monday",
        "end_date": "last_week_sunday",
        "top_n": 10,
    })
    
    # 3. Export as PDF or HTML
    export_to_html(report, f"/output/{industry}_weekly.html")
    
    # 4. Send email/Telegram notification
    send_notification(industry, report)

Path 2: On-Demand Custom Reports (Premium Pricing)

Provide an interactive bot on Telegram or WeChat where users input industry names and date ranges, and DuckDB generates customized reports in seconds.

# Telegram Bot integration example
@bot.message_handler(commands=['report'])
def handle_report(message):
    parts = message.text.split()
    if len(parts) < 2:
        bot.reply(message, "Usage: /report industry_name [start_date] [end_date]")
        return
    
    industry = parts[1]
    engine = ReportEngine("/tmp/report_cache.duckdb")
    
    report = engine.generate_report({
        "industry": industry,
        "start_date": parts[2] if len(parts) > 2 else "2026-01-01",
        "end_date": parts[3] if len(parts) > 3 else "2026-06-25",
        "top_n": 10,
    })
    
    # Send structured report
    bot.reply(message, format_telegram_report(report))

Path 3: API Service (Scale Up)

Wrap the report engine into a FastAPI service, charging per API call for enterprise customers.

from fastapi import FastAPI, Query
app = FastAPI()

@app.get("/api/report")
def get_report(
    industry: str = Query(default="%"),
    start_date: str = Query(default="2026-01-01"),
    end_date: str = Query(default="2026-12-31"),
    top_n: int = Query(default=10)
):
    engine = ReportEngine("/tmp/api_reports.duckdb")
    return engine.generate_report({
        "industry": industry,
        "start_date": start_date,
        "end_date": end_date,
        "top_n": top_n,
    })

Efficiency Comparison with Traditional Tools

DimensionDuckDB Report EnginePython + CSV ManualExcel Manual
Single report generation time2-5 seconds30-60 seconds2-4 hours
Supported data volumeBillions of rowsMillions of rowsTens of thousands
ReusabilityParameterized template, infinitely reusableRequires manual code modificationRequires manual copy-paste
Multi-user concurrencySupported (file-level lock or shared)Requires additional deploymentNot supported
Marginal costApproaches zeroGrows linearly with data volumeGrows exponentially with complexity

DuckDB’s core advantage: write once, reuse infinitely. You encapsulate your SQL logic into a parameterized template, and whether you generate 10 or 10,000 reports, the code doesn’t change.

Data Acquisition: Keeping Reports Fresh

For a sustainable reporting product, the data pipeline is critical. Here are three practical strategies:

Strategy 1: Web Scraping + DuckDB Ingestion

Use Python scrapers to periodically collect public data (like Tianyancha, IT Juzi, Crunchbase) and store in DuckDB:

import duckdb

# Batch ingest from scraped JSON data
import json
with open("scraped_events.json", "r") as f:
    events = json.load(f)

con = duckdb.connect("events.db")
con.execute("CREATE TABLE IF NOT EXISTS scraped_events AS SELECT * FROM read_json_auto('scraped_events.json')")
con.execute("""
    INSERT INTO company_events
    SELECT * FROM scraped_events
    ON CONFLICT (event_id) DO NOTHING
""")

Strategy 2: API Integration + Incremental Updates

For data sources with public APIs, use the HTTPFS extension to query remote data directly:

# Query remote JSON API directly, no local storage needed
con = duckdb.connect()
result = con.execute("""
    SELECT * FROM read_json_auto(
        'https://api.crunchbase.com/v4/fundings?size=100'
    )
""").fetchdf()

Strategy 3: Multi-Source Fusion

Use DuckDB’s ATTACH feature to query multiple data sources simultaneously:

# Mount multiple data sources for joint querying
con = duckdb.connect("master_report.duckdb")
con.execute("ATTACH 'events.db' AS events")
con.execute("ATTACH 'financials.parquet' AS fin")
con.execute("ATTACH 'https://data.open-source.com/company.csv' AS src")

# Cross-source JOIN query
con.execute("""
    SELECT c.company_name, c.industry, f.revenue
    FROM events.company_events c
    LEFT JOIN fin.financials f ON c.company_name = f.name
    WHERE c.event_date > '2026-01-01'
""")

Monetization Roadmap: From Zero to One

Phase 1: MVP (Weeks 1-2)

  1. Choose an industry you’re familiar with (AI, new energy, cross-border e-commerce, etc.)
  2. Build the parameterized report engine with DuckDB (reference the code above)
  3. Manually populate 100-500 sample records
  4. Generate 5-10 sample reports and share with potential clients for feedback

Phase 2: Automation (Weeks 3-4)

  1. Build an automated data collection pipeline (scrapers/APIs)
  2. Set up scheduled tasks (cron + DuckDB scripts)
  3. Implement automated weekly/monthly report distribution
  4. Publish free sample chapters in Telegram/WeChat groups to attract users

Phase 3: Commercialization (Months 2-3)

  1. Launch subscription reports (¥99-299/month)
  2. Offer premium custom reports (¥500-2000/report)
  3. Accumulate 20-50 paying subscribers
  4. Consider API integration for enterprise clients

Revenue Projections

  • 50 subscribers × ¥199/month = ¥9,950/month
  • 10 custom reports/month × ¥800 = ¥8,000/month
  • Total monthly revenue approximately ¥18,000, with near-zero marginal cost

This is the leverage effect DuckDB brings: code you spend 2 hours writing can work for you for years, serving thousands of customers.

Summary

The core of building an industry data report product with DuckDB comes down to three keywords: parameterization, automation, productization.

Parameterization means your SQL is no longer a rigid query—it becomes a “function” that accepts different inputs and generates different outputs. Automation means you don’t need manual intervention; reports can be generated and distributed on schedule. Productization means you’re no longer selling time—you’re selling standardized data products.

Together, these three transformations represent the complete leap from “a worker who looks up data for others” to “an entrepreneur who sells data products.”


📖 The complete version of this article is published on duckdblab.org, including more detailed steps and additional cases. Want to learn how to build your own data product with DuckDB? duckdblab.org has a full tutorial series and reproducible code repositories.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.