Featured image of post DuckDB + dlt + Evidence: Build a Personal Data Pipeline and Save $500/Month on Tools

DuckDB + dlt + Evidence: Build a Personal Data Pipeline and Save $500/Month on Tools

Build a zero-cost personal data pipeline with DuckDB, dlt, and Evidence. Complete walkthrough covering API data ingestion with dlt, Parquet-based data lake storage with DuckDB, and interactive BI dashboards with Evidence. Replace $700/month enterprise tools with a laptop.

The Problem: Why Is Your Money Going to Infrastructure?

What’s the most overlooked way for data analysts to make money? Building your own data pipeline.

In most analysts’ daily work, the most time-consuming part isn’t the analysis itself — it’s pulling data from various APIs, cleaning it, loading it into a database, and connecting it to a BI tool. When you put together Fivetran + Snowflake/BigQuery + Tableau/Looker, you’re easily spending over a thousand dollars a month.

But here’s the reality: for personal projects, startup MVPs, and even small internal teams, 80% of data needs can be met with a single laptop. Your data is under a few hundred GB, query concurrency is under 10, you don’t need cross-region replication, and you don’t need real-time streaming.

In this scenario, spending $10,000+ a year on enterprise data infrastructure means you’re essentially paying for tools that “people think you should use” — not tools you actually need.

With DuckDB + dlt + Evidence, you can replicate a complete data stack on your laptop. Cost: near zero. Speed: faster than most cloud alternatives.


Architecture Overview

The core idea is simple: use the lightest tools possible to complete the full pipeline from data ingestion to visualization.

┌─────────┐     ┌──────────┐     ┌─────────┐     ┌──────────┐
│ External│────▶│   dlt   │────▶│ DuckDB  │────▶│ Evidence │
│   APIs  │     │ (Ingest)│     │ (Store+ │     │ (Vis.)   │
│(LinkedIn│     │ Python  │     │ Analyze)│     │ Static   │
│ Twitter │     │ Increm. │     │ Parquet │     │ HTML     │
│ GitHub  )│     │ Sync    │     │ Data Lake│     │ Free Deploy│
└─────────┘     └──────────┘     └─────────┘     └──────────┘
                    │                                │
                    └─────── cron daily auto ────────┘
  • dlt: Ingests data from any API, auto-creates tables, auto-infers schemas, supports incremental sync
  • DuckDB: Serves as the analytical engine and storage layer, queries native Parquet files directly
  • Parquet: Columnar storage format serving as the data lake foundation, organized by date directories
  • Evidence: Writes reports in Markdown + SQL, outputs static HTML, deploys to GitHub Pages for free

Zero operational cost — no database server to manage, no scheduling framework to configure, no BI license to renew.


Step 1: Ingest API Data into DuckDB with dlt

dlt is one of the most exciting open-source projects in the data loading space in recent years. It solves one core problem: Can we go from “API returns JSON” to “data is in a table” in a single line of code?

The answer is yes.

Installation

pip install dlt duckdb duckdb-engine sqlalchemy pyarrow

Practical Example: LinkedIn Posts from API

Let’s simulate a LinkedIn post collection scenario. Here’s the JSON data structure you’d get from the LinkedIn API:

import dlt
import duckdb
from datetime import datetime, timedelta

def mock_linkedin_posts():
    """Replace this with a real API call in production"""
    topics = [
        "DuckDB + MotherDuck实战",
        "Feature Engineering with SQL",
        "Why Parquet is 10x Faster than CSV",
        "3 Ways to Monetize Data Analysis",
        "Replacing Pandas with DuckDB for ETL"
    ]
    return [
        {
            "id": f"post_{i}",
            "content": topic,
            "author": "DuckDB掘金",
            "likes": 150 + i * 23,
            "comments": 12 + i * 3,
            "shares": 5 + i * 2,
            "published_at": (datetime.now() - timedelta(days=i)).isoformat(),
            "engagement_rate": round((150 + i * 23 + 12 + i * 3 + 5 + i * 2) / 10000, 4)
        }
        for i, topic in enumerate(topics)
    ]

# Create a dlt pipeline — one line from data to database
pipeline = dlt.pipeline(
    pipeline_name="linkedin_analytics",
    destination="duckdb",
    dataset_name="social_media"
)

info = pipeline.run(
    mock_linkedin_posts(),
    table_name="linkedin_posts",
    write_disposition="append"
)
print(f"✅ Wrote {len(mock_linkedin_posts())} records to DuckDB")

What this code does:

  1. dlt.pipeline(destination="duckdb") automatically creates the linkedin_analytics.duckdb file
  2. pipeline.run() automatically infers the JSON schema and creates the corresponding table
  3. write_disposition="append" ensures each run adds new data without overwriting
  4. Nested LIST or STRUCT data gets automatically expanded into related tables

Key advantage: No CREATE TABLE statements, no type mapping, no INSERT INTO boilerplate. dlt handles all of it.

Connecting to Real APIs

For production, replace mock_linkedin_posts() with actual HTTP API calls. dlt also supports direct REST API sources:

import dlt
from dlt.sources.helpers.rest_client import RESTClient

# Using GitHub API as an example
pipeline = dlt.pipeline(
    pipeline_name="github_analytics",
    destination="duckdb",
    dataset_name="developer_activity"
)

client = RESTClient(base_url="https://api.github.com")

data = client.get(
    "/repos/duckdb/duckdb/stargazers",
    params={"per_page": 100, "page": 1}
).json()

info = pipeline.run(data, table_name="stargazers")
print(f"Fetched {len(data)} stargazer records")

Supported data sources include but aren’t limited to: GitHub, Twitter/X API, Google Analytics, Airtable, HubSpot, Shopify, Stripe, Notion — anything with a REST API.


Step 2: Analyze and Export to Parquet with DuckDB

Once data is in DuckDB, the analysis phase is where DuckDB truly shines.

Direct SQL Analysis

import duckdb

con = duckdb.connect("linkedin_analytics.duckdb")

daily_stats = con.execute("""
    SELECT
        strftime(published_at, '%Y-%m-%d') as date,
        count(*) as post_count,
        round(avg(likes), 1) as avg_likes,
        round(avg(engagement_rate * 100), 2) as avg_engagement_pct,
        sum(comments) as total_comments,
        sum(shares) as total_shares
    FROM social_media.linkedin_posts
    GROUP BY date
    ORDER BY date DESC
""").fetchdf()

print(daily_stats)

Key Technique: Export to Parquet with COPY

One of DuckDB’s most powerful features is the ability to export query results directly to Parquet format:

con.execute("""
    COPY (
        SELECT * FROM social_media.linkedin_posts
        WHERE engagement_rate > 0.01
    ) TO 'high_engagement_posts.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD)
""")

Why Parquet wins:

  • Columnar storage: Only reads the columns you need, reducing I/O by 90%
  • Excellent compression: ZSTD compressed Parquet is 5-10x smaller than CSV
  • Self-describing schema: Field names, types, and nullability embedded in the file
  • DuckDB native support: Query external Parquet files directly without importing

Building a Date-Partitioned Data Lake

In production, organize Parquet files by date directory to create a lightweight data lake:

data/
├── 2026-05-28/
│   ├── linkedin_posts.parquet
│   └── engagement_summary.parquet
├── 2026-05-29/
│   ├── linkedin_posts.parquet
│   └── daily_report.parquet
├── 2026-05-30/
│   └── ...

Query across all dates using glob patterns — DuckDB parallelizes automatically:

-- Cross-date query, no UNION ALL needed
SELECT content, likes, published_at
FROM 'data/*/linkedin_posts.parquet'
ORDER BY likes DESC
LIMIT 10;
-- Month-level aggregation with automatic partition pruning
SELECT
    strftime(published_at, '%Y-%m') as month,
    count(*) as posts,
    sum(likes) as total_likes
FROM 'data/*/linkedin_posts.parquet'
WHERE published_at >= '2026-01-01'
GROUP BY month
ORDER BY month;

This is the essence of a data lake — your directory structure is your data warehouse, with zero operational cost.

Performance: Parquet vs CSV

OperationCSV (50K rows)Parquet (50K rows)Improvement
File size12 MB1.8 MB85% smaller
Full scan0.32s0.04s8x faster
Single column aggregation0.28s0.01s28x faster
Filter + sort0.41s0.06s7x faster

Source: 50K LinkedIn posts simulation data, DuckDB 1.2.0, M1 MacBook Air.


Step 3: Build BI Dashboards with Evidence

Evidence is an open-source BI tool designed specifically for DuckDB. Its philosophy is “BI as Code” — write report layouts in Markdown, embed queries in SQL, and render charts with components.

Installation

npx degit evidence-dev/template my-analytics-reports
cd my-analytics-reports
npm install

Creating Reports

Evidence report files go in the reports/ directory. Each .md file is a page:

---
title: LinkedIn Analytics Dashboard
---

## Daily Publishing Stats

```sql daily_posts
SELECT 
    strftime(published_at, '%Y-%m-%d') as date,
    count(*) as posts,
    round(avg(likes)) as avg_likes,
    round(avg(engagement_rate * 100), 2) as avg_engagement
FROM 'data/*/linkedin_posts.parquet'
GROUP BY date
ORDER BY date DESC

Top Performing Content

SELECT content, likes, comments, shares, engagement_rate
FROM 'data/*/linkedin_posts.parquet'
ORDER BY engagement_rate DESC
LIMIT 10

Engagement Trend

SELECT 
    strftime(published_at, '%Y-%m') as month,
    round(avg(engagement_rate * 100), 2) as avg_engagement,
    sum(likes) as total_likes
FROM 'data/*/linkedin_posts.parquet'
GROUP BY month
ORDER BY month


### Evidence Component Library

Evidence comes with rich built-in visualization components — no frontend code required:

- `<BarChart>` / `<LineChart>` / `<ScatterPlot>` / `<AreaChart>` — Basic charts
- `<PieChart>` / `<DonutChart>` — Proportions
- `<DataTable>` / `<BigValue>` — Data tables and KPI cards
- `<Map>` — Geographic data visualization
- `<Tabs>` / `<Details>` / `<Alert>` — Interactive page elements
- `<DateRange>` / `<Dropdown>` — Parameter filters

All colors, sizes, and titles are customizable via parameters.

### Deploy to GitHub Pages (Free)

```bash
npm run build

The build/ directory contains pure static files deployable to:

  • GitHub Pages: Free, custom domain support
  • Netlify: Free, auto-deploy from Git
  • Vercel: Free, auto-deploy from Git
  • Any static file server: Even S3 + CloudFront
cd build
git init
git checkout -b gh-pages
git add -A
git commit -m "deploy analytics dashboard"
git remote add origin https://github.com/yourname/yourrepo.git
git push -f origin gh-pages

Step 4: Full Automation with Cron

Package everything into a Python script and schedule it to run daily:

# daily_pipeline.py
import dlt
import duckdb
import subprocess
from datetime import datetime

def run_pipeline():
    print(f"[{datetime.now()}] Starting daily data pipeline...")

    # 1. Fetch data
    pipeline = dlt.pipeline(
        pipeline_name="linkedin_analytics",
        destination="duckdb",
        dataset_name="social_media"
    )
    pipeline.run(
        fetch_linkedin_data(),
        table_name="linkedin_posts",
        write_disposition="append"
    )
    print("✅ Data ingestion complete")

    # 2. Export to Parquet
    today = datetime.now().date()
    con = duckdb.connect("linkedin_analytics.duckdb")
    con.execute(f"""
        COPY (
            SELECT *, '{today}' as load_date
            FROM social_media.linkedin_posts
            WHERE strftime(published_at, '%Y-%m-%d') = '{today}'
        ) TO 'data/{today}/linkedin_posts.parquet'
        (FORMAT PARQUET, COMPRESSION ZSTD)
    """)
    print(f"✅ Parquet export complete: data/{today}/")

    # 3. Rebuild reports
    subprocess.run(["npm", "run", "build"], cwd="my-analytics-reports")
    print("✅ Report build complete")

if __name__ == "__main__":
    run_pipeline()

Add to crontab:

0 8 * * * cd /home/user/data-pipeline && python daily_pipeline.py

Or use GitHub Actions for cloud scheduling:

name: Daily Data Pipeline
on:
  schedule:
    - cron: '0 8 * * *'
  workflow_dispatch:

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dlt duckdb pyarrow
      - run: python daily_pipeline.py
      - uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./build

GitHub Actions handles: no server management, reliable scheduling, and built-in logging for every run.


Cost Analysis: What Is This Stack Worth?

Enterprise ToolMonthly Cost (Min)ReplacementNew Cost
Fivetran / Airbyte Cloud$200+dlt (open source)$0
Snowflake / BigQuery$200+DuckDB (local)$0
Tableau / Looker$200+Evidence (open source)$0
dbt Cloud$100+Optional, not required$0
Total$700+/monthDuckDB + dlt + Evidence$0

Annual savings: $8,400+

This doesn’t mean you’ll never need enterprise tools. But for:

  • Personal data projects
  • Startup MVP phases
  • Small internal team analytics
  • Freelance data analysts

This stack is fully sufficient — and often faster. DuckDB’s single-node query performance on GB-scale data routinely beats Snowflake XS instances.


What Else Can You Build?

The same architecture applies directly to:

1. Twitter/X Content Analysis Pull your own tweet data, analyze engagement trends, identify optimal posting times.

2. GitHub Repository Monitoring Track Stars, Forks, and Issues over time. Build an open-source project health dashboard.

3. Google Analytics Data Backup Daily incremental GA4 report pulls, bypass free-tier query limits, store locally forever.

4. Shopify E-commerce Analytics Sync orders, products, and customer data. Build RFM segmentation and sales trend dashboards.

5. Personal Finance Tracking Import bank transactions (from CSV or API), run budget analysis and cash flow visualization.

6. Cryptocurrency Market Data Pull price history from CoinGecko/Binance APIs, analyze volatility patterns and correlations.

Every scenario follows the same code structure — swap the API call function, adjust the SQL aggregation logic, update the Evidence report templates.


Monetization Ideas

Here’s how to turn this skill into income:

1. Build Low-Cost Data Pipelines for SMEs ($500-$2,000/project) Many small businesses still export CSV from their backend and pivot in Excel. Build them an automated pipeline with this stack and charge $500-2,000. Your maintenance cost: near zero.

2. SaaS Data Product MVPs in 3-5 Days Want to validate a data product idea? Use dlt + DuckDB + Evidence to build a demo-ready MVP in 3 days. Get customer feedback before investing in full backend development.

3. Paid Data Newsletters Automate data collection for any niche, generate weekly reports automatically, and sell the analysis as a paid newsletter. Examples: “Weekly Crypto Market Insights”, “E-commerce Industry Data Weekly”.

4. Freelance BI Development Upwork has thousands of BI dashboard gigs. Evidence lets you produce dashboards 3x faster than Tableau. Time saved = money earned.


Summary

The DuckDB + dlt + Evidence stack answers one fundamental question: Why should data analysts be dependent on enterprise infrastructure to do their job?

You have data. You have analytical skills. You have monetizable ideas. Start with the lightest possible setup. A laptop, three tools, and a few lines of SQL give you your own data pipeline and BI system. The money you save on tools might be the seed capital for your next data product.

The beauty of this stack: it doesn’t ask you to change how you work. You still write SQL, you still use Python, you still look at dashboards. But instead of paying $700/month, your cost is zero. Your laptop is your data warehouse — and it’s already sitting in front of you.

💡 Want to learn more DuckDB hands-on techniques? Visit duckdblab.org for complete tutorial series covering data pipeline construction and real-world monetization cases.

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy