Featured image of post Building a 'Data Lake as a Service' with DuckDB: Query All Data Sources with One SQL, Earn Per Query

Building a 'Data Lake as a Service' with DuckDB: Query All Data Sources with One SQL, Earn Per Query

Use DuckDB's ATTACH capability to build a zero-ETL unified query layer. Let customers join PostgreSQL, S3 Parquet, HTTP APIs in one SQL. Monetize via query volume or SaaS subscriptions with near-zero marginal cost.

Data Lake as a Service Architecture

The Business Logic: Why This Works

Most small and medium enterprises face a core dilemma: their data is scattered everywhere and can’t be analyzed together.

Orders live in PostgreSQL, logs are in S3 Parquet files, Excel reports sit on shared drives, and real-time API data sits in caches. The traditional approach requires a data engineer to build ETL pipelines, set up a data warehouse, and write scheduling scripts — costing tens of thousands of dollars and taking weeks.

What if you offered a “Data Lake as a Service”? Customers connect all their data sources to a single DuckDB instance, and a single SQL query joins everything together. You charge per query or by subscription.

The moat: DuckDB’s ATTACH + httpfs + native Parquet reading lets you achieve “data stays put, queries move” — zero ETL required. Customers don’t need to migrate data, and your operational costs approach zero.

Technical Architecture: Zero-ETL Unified Query Layer

Customer Data Sources              Your Service
┌──────────────┐
│ PostgreSQL   │ ──ATTACH postgres──┐
└──────────────┘                   │
┌──────────────┐                   │
│ S3/OSS       │ ──ATTACH s3───────┼── DuckDB Engine
│ Parquet Files │   read_parquet()  │   (Serverless)
└──────────────┘                   │
┌──────────────┐                   │
│ HTTP API     │ ──ATTACH httpfs───┘
│ JSON/CSV     │   read_json_auto()
└──────────────┘

No data is moved at any layer. DuckDB performs predicate pushdown directly on the data sources, reading only the columns and rows needed.

Complete Code Implementation

Step 1: Building the Unified Query Engine

import duckdb

# Serverless mode: in-memory database, destroyed after use
con = duckdb.connect(":memory:")

# Register PostgreSQL data source (customer's production DB)
con.execute("""
    ATTACH 'postgresql://user:***@host:5432/orders' AS orders (TYPE POSTGRES);
""")

# Register S3/OSS Parquet data lake
con.execute("""
    ATTACH 's3://my-data-lake/' AS datalake (TYPE duckdb_io_s3, REGION us-east-1);
""")

# Register HTTP data source (external API)
con.execute("""
    ATTACH 'http://api.example.com/data/' AS api_data (TYPE httpfs);
""")

Step 2: Cross-Source Join Query — The Core Selling Point

SELECT 
    o.order_id,
    o.customer_id,
    o.amount,
    p.category,
    p.unit_cost,
    o.order_date,
    ROUND(o.amount - p.unit_cost * o.quantity, 2) AS profit,
    weather.temperature,
    weather.rainfall
FROM orders.main.orders o
JOIN datalake.products.parquet p 
    ON o.product_id = p.product_id
LEFT JOIN api_data.weather_data weather
    ON DATE(o.order_date) = weather.date
WHERE o.order_date >= '2026-01-01'
ORDER BY profit DESC
LIMIT 100;

How this SQL executes:

  1. DuckDB parses the SQL and generates a physical execution plan
  2. The order_date predicate is pushed down to PostgreSQL, fetching only 2026 data
  3. datalake/products.parquet uses Parquet zone maps for partition pruning
  4. An HTTP request is made to api_data for weather data
  5. JOINs and calculations happen in memory
  6. Results are returned

Zero full data transfers occur during the entire process.

Step 3: Encapsulating Reusable Report Templates

def generate_profit_report(start_date, end_date, region=None):
    sql = """
    SELECT 
        p.region,
        p.category,
        COUNT(*) AS order_count,
        SUM(o.amount) AS total_revenue,
        SUM(o.amount - p.unit_cost * o.quantity) AS total_profit,
        AVG(o.amount - p.unit_cost * o.quantity) AS avg_margin,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY o.amount) AS median_order_value
    FROM orders.main.orders o
    JOIN datalake.products.parquet p 
        ON o.product_id = p.product_id
    WHERE o.order_date BETWEEN '{start}' AND '{end}'
    """.format(start=start_date, end=end_date)
    
    if region:
        sql += f" AND p.region = '{region}'"
    
    sql += " GROUP BY ALL ORDER BY total_profit DESC"
    
    result = con.execute(sql).fetchdf()
    return result

Step 4: Automated Scheduled Reports

import duckdb
import smtplib
from email.mime.text import MIMEText
from datetime import datetime, timedelta

def daily_report():
    con = duckdb.connect(":memory:")
    con.execute("ATTACH 'postgresql://user:***@host:5432/orders' AS orders (TYPE POSTGRES)")
    con.execute("ATTACH 's3://my-data-lake/' AS datalake (TYPE duckdb_io_s3, REGION us-east-1)")
    
    yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
    
    sql = f"""
    SELECT 
        DATE(order_date) AS sale_date,
        COUNT(*) AS orders,
        SUM(amount) AS revenue,
        SUM(amount) - SUM(unit_cost * quantity) AS profit
    FROM orders.main.orders o
    JOIN datalake.products.parquet p ON o.product_id = p.product_id
    WHERE DATE(order_date) = '{yesterday}'
    GROUP BY ALL
    """
    
    result = con.execute(sql).fetchdf()
    
    html = f"""
    <h2>📊 Yesterday's Business Report | {yesterday}</h2>
    <p>Orders: <b>{result['orders'].sum()}</b></p>
    <p>Revenue: <b>${result['revenue'].sum():,.2f}</b></p>
    <p>Profit: <b>${result['profit'].sum():,.2f}</b></p>
    <p>Margin: <b>{result['profit'].sum()/result['revenue'].sum()*100:.1f}%</b></p>
    """
    
    msg = MIMEText(html, 'html')
    msg['Subject'] = f'Daily Report {yesterday}'
    msg['From'] = '[email protected]'
    msg['To'] = '[email protected]'
    
    with smtplib.SMTP('smtp.company.com', 587) as server:
        server.starttls()
        server.login('[email protected]', 'password')
        server.send_message(msg)

Comparison with Traditional Approaches

ApproachData MigrationDev CycleOps CostScalability
Traditional ETL + DWFull transfer needed2-4 weeksHigh (scheduling, monitoring)Poor
Flink/Kafka Real-time PipelineRequires system changes1-3 monthsVery highModerate
DuckDB ATTACH Zero-ETLNone needed1-2 daysMinimalGood

The DuckDB advantage: customers’ existing data infrastructure stays untouched. You’re adding a SQL query layer on top. For customers, the risk is near zero — no sunk cost loss. For you, customer acquisition friction is minimal.

Detailed Monetization Paths

Path 1: Charge Per Query (Lightweight)

Ideal for initial market validation:

  • Serve 10 SMEs with unified data query service
  • 500 queries/month × ¥0.5/query = ¥250/customer
  • Monthly revenue: ¥2,500 with near-zero marginal cost (DuckDB is serverless)

Customer acquisition: Share “one SQL query across all data sources” tips in communities to attract early adopters.

Path 2: SaaSification (Scale Up)

Wrap the code into a web application:

  1. FastAPI provides SQL query API
  2. Streamlit provides visual frontend
  3. Customer uploads data source config → you host DuckDB instance → customer writes SQL via web UI
  4. Pricing: ¥999/month/enterprise, targeting 50 enterprises = ¥50,000/month

Tech stack example:

# FastAPI example
from fastapi import FastAPI, HTTPException
import duckdb

app = FastAPI()

@app.post("/query")
def run_query(config: dict, sql: str):
    con = duckdb.connect(":memory:")
    for source in config.get("sources", []):
        con.execute(f"ATTACH ...")
    try:
        result = con.execute(sql).fetchdf()
        return {"columns": result.columns.tolist(), "data": result.values.tolist()}
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

Path 3: Data Products (High Value-Add)

Package query results as industry reports:

  • Deep industry profit analysis report priced at ¥299
  • 100 sales/month = ¥29,900
  • Zero marginal cost — develop once, sell repeatedly

Key insight: Target industries with high profit margins and low data transparency (e.g., cross-border e-commerce, local life services).

Key Considerations

  1. Security: ATTACH credentials must be encrypted — never hardcode. Use environment variables or a secrets manager.
  2. Performance: For large datasets, use EXPLAIN ANALYZE to verify predicate pushdown is working.
  3. Cost: DuckDB is CPU-intensive. Choose appropriately sized instances (8+ cores offer the best price-performance ratio).
  4. Concurrency: DuckDB natively supports multi-threading, but watch concurrent connection limits when ATTACHing remote sources. Implement connection pooling at the API layer.
  5. Timeout Control: HTTP API data sources may respond slowly. Set reasonable timeout parameters to avoid blocking.

Why Now Is the Right Time

DuckDB in 2026 has matured to production-grade data lake querying:

  • PostgreSQL extension: duckdb_attach_postgres() connects directly to production databases
  • Native S3/OSS support: duckdb_io_s3 type requires no extra configuration
  • httpfs extension: Query remote JSON/CSV/Parquet directly
  • Parquet predicate pushdown: Automatically prunes unnecessary columns and data blocks
  • GROUP BY ALL: Simplifies aggregation queries, reducing code by 80%

The combination of these capabilities turns “Data Lake as a Service” from a concept into a product you can launch today.

Monetization Roadmap Summary

PhaseActionExpected RevenueTime Investment
Week 1Build MVP, test 3 data sources¥0 (validation)1-2 days
Weeks 2-4Onboard 3-5 seed customers for free trial¥0 (exchange for feedback)5 hrs/week
Month 2Start charging, target 10 customers¥2,500-5,000/month10 hrs/week
Months 3-6SaaSify, expand to 50 customers¥10,000-50,000/month15 hrs/week
Month 6+Launch industry report products¥30,000-80,000/monthNear-zero marginal cost

The biggest advantage of using DuckDB for data lake querying is no data migration needed. Customers’ existing data infrastructure remains untouched — you’re simply adding a SQL query interface on top. For customers, the risk is extremely low with no sunk cost loss. For you, customer acquisition resistance is minimal.

📖 The complete code templates and deployment scripts from this article, including S3 authentication configuration, secure PostgreSQL connections, and Streamlit frontend wrapping, are compiled into a full tutorial series at duckdblab.org.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.