Featured image of post Replace Web Scrapers with One SQL Statement: DuckDB Data Acquisition

Replace Web Scrapers with One SQL Statement: DuckDB Data Acquisition

Stop writing Python web scrapers for data acquisition. DuckDB's httpfs extension lets you query remote CSV, Parquet, and JSON files directly from HTTP URLs — no requests library, no BeautifulSoup, no local downloads. This post compares traditional scraping with DuckDB's zero-ETL approach, demonstrates URL glob patterns for batch file ingestion, S3 data lake queries, and provides full executable Python scripts with monetization strategies.

1. The Problem: You’re Still Writing Web Scrapers?

What’s the first step of any data analysis project?

It’s not writing SQL. It’s not tuning models. It’s getting the data.

The traditional workflow for acquiring public data:

  1. Find the data URL (GitHub CSV, government open data, S3 Parquet files)
  2. Download it in browser, or write a Python script:
    import requests
    import pandas as pd
    from io import StringIO
    
    resp = requests.get('https://example.com/data.csv')
    df = pd.read_csv(StringIO(resp.text))
    
  3. Open in Pandas/Excel → OOM crash (file too large)
  4. Finally switch to DuckDB and start analyzing

20 minutes gone, and you haven’t written a single line of analysis code.

Even worse:

  • 100 CSV files on a webpage — you need to write a loop to download and merge them
  • Data updates daily — you need a cron job to fetch it every time
  • File is 2GB+ — Pandas read_csv immediately runs out of memory

What if you could query the data directly at its URL — no download, no scraper scripts, no intermediate files?

That’s exactly what DuckDB’s zero-ETL data acquisition does.


2. The Solution: DuckDB Replaces Your Scraper Scripts

DuckDB’s built-in httpfs extension allows you to read CSV, Parquet, and JSON files directly from HTTP/HTTPS URLs in SQL.

Core idea: Query data where it lives. Don’t move it first.

2.1 Three Steps to Remote Queries

INSTALL httpfs;   -- Install once (built-in since DuckDB 1.0)
LOAD httpfs;      -- Load per session

-- Now query any URL directly
SELECT * FROM read_csv_auto('https://example.com/data.csv');

That’s it. No requests.get(), no wget, no temp directory cleanup.

2.2 Quick Comparison

Traditional ApproachDuckDB Approach
requests.get(url)pandas.read_csv()read_csv_auto('url') — direct query
Loop over 100 CSVs, concat DataFramesread_csv_auto('https://.../*.csv') — glob pattern
Daily cron: wget → unzip → analyzecron → duckdb -c "SELECT ..." — one command
Download entire file to see if it’s usefulParquet remote: fetch metadata only (5-50KB)

3. Hands-On Examples

3.1 Example 1: Query a Public GitHub CSV/Parquet

GitHub is full of public datasets. Traditional approach: git clone the entire repo. DuckDB: one SQL statement.

-- NYC taxi sample data on GitHub
INSTALL httpfs;
LOAD httpfs;

SELECT 
    VendorID,
    COUNT(*) AS trips,
    ROUND(AVG(total_amount), 2) AS avg_amount,
    ROUND(SUM(total_amount), 2) AS total_revenue
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
WHERE total_amount > 0
GROUP BY VendorID
ORDER BY total_revenue DESC;

Execution time: 3-5 seconds (only pulls Parquet metadata and needed columns).

Traditional approach:

  1. Download 42MB Parquet → 10 seconds
  2. Load into memory → 5 seconds
  3. Execute query → 2 seconds Total: 17 seconds

DuckDB remote query: 5 seconds, zero temp files.

3.2 Example 2: URL Glob Patterns for Batch Ingestion

This is DuckDB’s most underrated capability — URL glob patterns work on remote files too.

Suppose a government open data site organizes files by date:

https://data.gov.example/traffic/2026/01/traffic_20260101.csv
https://data.gov.example/traffic/2026/01/traffic_20260102.csv
...
https://data.gov.example/traffic/2026/05/traffic_20260524.csv

Traditional approach: Python loop + requests.get() + DataFrame concat.

DuckDB approach:

-- Read all CSVs for a given month
SELECT 
    strftime(date, '%Y-%m-%d') AS day,
    COUNT(*) AS records,
    AVG(speed) AS avg_speed
FROM read_csv_auto(
    'https://data.gov.example/traffic/2026/05/*.csv'
)
GROUP BY day
ORDER BY day;

* wildcard: matches all CSV files in that directory.

-- Recursive pattern — all directories, all CSVs
SELECT *
FROM read_csv_auto(
    'https://data.gov.example/traffic/**/*.csv'
);

** recursive wildcard: matches nested directories — perfect for multi-level data warehouses.

-- More precise: only May 2026 data
SELECT *
FROM read_csv_auto(
    'https://data.gov.example/traffic/2026/05/traffic_*.csv'
);

3.3 Example 3: S3 / Object Storage + Parquet Column Pruning

When data lives on AWS S3 or compatible object storage, Parquet remote queries become a superpower.

-- Query sales data on S3 with column pruning
SELECT 
    region,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS customers
FROM read_parquet(
    's3://my-bucket/sales/2026/*/*.parquet'
)
WHERE revenue > 0
GROUP BY region
ORDER BY total_revenue DESC;

How column pruning works: DuckDB uses HTTP Range Requests to fetch only the columns it needs (region, revenue, customer_id). If the original file has 30 columns at 1GB, actual transfer might be only 30-80MB — a 90%+ reduction.

DuckDB also uses Parquet’s row group statistics for predicate pushdown (WHERE revenue > 0), skipping row groups that don’t match — further reducing transfer.

3.4 Example 4: Complete Python Script (Copy-Paste Ready)

#!/usr/bin/env python3
"""
DuckDB Data Acquisition + Analysis Demo
Fetches remote Parquet/CSV via HTTP, runs SQL analysis, exports HTML report
Prerequisites: pip install duckdb pandas
"""

import duckdb
import time
import os

def main():
    # Connect to in-memory database
    con = duckdb.connect()
    
    # Enable httpfs extension
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    
    # Configuration (optional)
    con.execute("SET httpfs_timeout = 30")
    con.execute("SET httpfs_retry_count = 3")
    
    # ========== Example 1: GitHub Public Data ==========
    print("=" * 60)
    print("📦 Example 1: GitHub Public Dataset Query")
    print("=" * 60)
    
    start = time.time()
    
    result = con.execute("""
        SELECT 
            VendorID,
            payment_type,
            COUNT(*) AS trips,
            ROUND(AVG(total_amount), 2) AS avg_amount,
            ROUND(SUM(total_amount), 2) AS total_revenue
        FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
        WHERE total_amount > 0
          AND total_amount < 500
        GROUP BY VendorID, payment_type
        ORDER BY total_revenue DESC
        LIMIT 15
    """).fetchdf()
    
    elapsed = time.time() - start
    print(f"✅ Query completed in {elapsed:.2f}s")
    print(f"📊 {len(result)} rows returned\n")
    print(result.to_string(index=False))
    print()
    
    # ========== Example 2: Remote CSV Analysis ==========
    print("=" * 60)
    print("📄 Example 2: Remote CSV Streaming Analysis")
    print("=" * 60)
    
    start = time.time()
    
    result2 = con.execute("""
        SELECT 
            column0 AS year,
            COUNT(*) AS records
        FROM read_csv_auto(
            'https://raw.githubusercontent.com/plotly/datasets/master/gapminder_unfiltered.csv'
        )
        WHERE column0 > 2000
        GROUP BY year
        ORDER BY year
    """).fetchdf()
    
    elapsed2 = time.time() - start
    print(f"✅ CSV remote query completed in {elapsed2:.2f}s")
    print(f"📊 {len(result2)} rows returned\n")
    print(result2.to_string(index=False))
    print()
    
    # ========== Generate HTML Report ==========
    print("=" * 60)
    print("📝 Generating HTML Report")
    print("=" * 60)
    
    html_report = f"""
    <!DOCTYPE html>
    <html>
    <head><meta charset="utf-8">
    <title>DuckDB Data Acquisition Report</title>
    <style>
        body {{ font-family: -apple-system, sans-serif; max-width: 900px; margin: 40px auto; padding: 0 20px; }}
        h1 {{ color: #0d9488; }}
        table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
        th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
        th {{ background: #0d9488; color: white; }}
        tr:nth-child(even) {{ background: #f5f5f5; }}
        .summary {{ background: #f0fdf4; padding: 15px; border-radius: 8px; margin: 20px 0; }}
    </style>
    </head>
    <body>
    <h1>🦆 DuckDB Data Acquisition Report</h1>
    <p>Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}</p>
    
    <div class="summary">
        <h2>Example 1: NYC Taxi Data Analysis</h2>
        <p>Query time: {elapsed:.2f}s | Rows: {len(result)}</p>
    </div>
    
    {result.to_html(index=False)}
    
    <div class="summary">
        <h2>Example 2: Remote CSV Analysis</h2>
        <p>Query time: {elapsed2:.2f}s | Rows: {len(result2)}</p>
    </div>
    
    {result2.to_html(index=False)}
    
    <hr>
    <p><em>Powered by DuckDB httpfs — query remote data without downloading</em></p>
    </body>
    </html>
    """
    
    output_path = "duckdb_remote_report.html"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(html_report)
    
    print(f"✅ Report saved: {os.path.abspath(output_path)}")
    
    con.close()
    print("\n🎉 All done!")

if __name__ == "__main__":
    main()

Run it:

pip install duckdb pandas
python3 duckdb_remote_data.py

3.5 One-Liner with DuckDB CLI

No Python needed — DuckDB CLI can run SQL and output results in one command:

# Query remote Parquet, output as CSV
duckdb -c "INSTALL httpfs; LOAD httpfs; COPY (
    SELECT VendorID, COUNT(*) AS cnt
    FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
    GROUP BY VendorID
) TO '/tmp/results.csv' (HEADER, DELIMITER ',');"

# Or just print to terminal
duckdb -c "
INSTALL httpfs; LOAD httpfs;
SELECT VendorID, COUNT(*) AS cnt
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
GROUP BY VendorID;
"

This can replace daily cron scraper scripts entirely:

# crontab entry: fetch latest data at 8 AM daily
0 8 * * * duckdb -c "INSTALL httpfs; LOAD httpfs; COPY (SELECT * FROM read_parquet('https://data-bucket.s3.amazonaws.com/daily/*.parquet') WHERE date = current_date) TO '/tmp/daily_report.csv' (HEADER);"

4. Comparison with Traditional Web Scraping

DimensionTraditional Python ScraperDuckDB Direct Query
Lines of code30-100 (requests + pandas + error handling)1 line of SQL
Learning curveNeed requests, BeautifulSoup, anti-scrapingJust SQL
Disk usageDownload consumes disk spaceZero temp files
Memory usagePandas OOM on large filesStreaming, memory-friendly
Transfer efficiencyFull downloadParquet column pruning (10-20% of data)
Batch processingLoop + merge logicURL glob pattern, one-shot
Scheduled executioncron + Python script (Python env required)cron + duckdb CLI, zero dependencies
Format supportManual CSV/JSON/Parquet parsingAuto-detect format and types

Bottom line: For public data acquisition + analysis, DuckDB is an end-to-end solution. No Python environment, no third-party libraries, no intermediate files.


5. Limitations and Caveats

Not every scenario is suitable for DuckDB’s remote queries.

5.1 CSV/JSON Require Full Transfer

CSV and JSON are not columnar formats — DuckDB must download the complete file before parsing. For large files (500MB+ CSV), transfer time is comparable to downloading.

Workaround: For frequently-queried large files, convert to Parquet before uploading to your server/S3.

5.2 Server Must Support HTTP Range Requests

Parquet’s column pruning depends on HTTP Range Requests. Most CDNs and object stores (AWS S3, Cloudflare R2, MinIO) support this. Simple HTTP servers may not.

Verify with:

curl -I -H "Range: bytes=0-100" https://your-data-url

If it returns 206 Partial Content, you’re good.

5.3 Network Latency

Each HTTP Range Request has round-trip overhead. For small files (<1MB), local files are faster.

Guideline: Files under 1MB → download locally. Files 100MB+ Parquet → remote query.

5.4 Authentication

Private data sources need credentials:

-- S3 authentication
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'your_key';
SET s3_secret_access_key = 'your_secret';

-- Bearer Token (for some APIs)
SET httpfs_bearer_token = 'your_token';

6. Monetization Strategies

1. Public Data Collection Service ($30-100/session)

Target clients: Small business owners, market analysts who need industry data but can’t code

Scenario: “Pull all real estate prices from this government website” / “Analyze all AI project trends on GitHub”

Delivery: One DuckDB SQL statement, output as Excel/CSV. No scraper to maintain, no scripts to break.

Pricing: $30-100/session per data source; bulk monthly $200-500/month

2. Automated Data Integration + Reporting ($50-250/month/client)

Target clients: E-commerce sellers, SaaS companies with data spread across platforms

Scenario: Client’s sales data on Shopify (CSV export), ad data on Google Ads (API→CSV), inventory in local Excel

Solution: DuckDB directly reads these remote CSV URLs, cron job generates daily reports

Delivery: cron + DuckDB CLI, daily auto-fetch → analyze → email/WeChat/DingTalk

Pricing: $50-250/month/client, near-zero maintenance cost

3. Data Lake Optimization Consulting ($200-1,000/project)

Target clients: Small-medium companies with data on S3/MinIO doing daily ETL to local

Solution: Switch to DuckDB querying S3 Parquet directly, eliminating ETL steps and intermediate storage

Delivery: Configure DuckDB httpfs + S3 credentials + write remote query SQL

Pricing: $200-1,000/project (depends on data volume and complexity)

4. Technical Training ($200-600/session)

Target clients: Company data teams, IT departments

Content: Teach teams how to replace traditional ETL and scraper workflows with DuckDB

Pricing: $200-600/session (2-3 hour online/offline workshop)


ServiceTarget ClientPrice RangeMonthly Revenue Potential
Public Data CollectionSmall biz owners, analysts$30-100/session$300-1,000
Automated Report SubscriptionE-commerce, SaaS$50-250/month$500-2,500
Data Lake ConsultingSMEs$200-1,000/project$800-2,000
Technical TrainingData teams$200-600/session$400-1,800

7. Summary

DuckDB’s remote file query capability has an underrated superpower that’s not about query performance — it’s about zero-cost data acquisition.

  • Public data that previously required a scraper → one SQL statement
  • Remote CSVs that needed local download → read_csv_auto('url') — direct query
  • S3 data that required ETL pipelines → read_parquet('s3://...')
  • Scheduled collection scripts needing constant maintenance → cron + duckdb -c "SELECT ..."

Core principle: Query data where it lives. Don’t move it first.

Next time someone sends you a data link, don’t wget. Don’t write requests.get(). Try DuckDB’s read_csv_auto('URL'). Five seconds and the data is right in front of you.


DuckDB version: 1.0+ (httpfs built-in)
Python dependencies: pip install duckdb pandas
CLI version: duckdb -c "SELECT ..." — no Python needed
License: MIT (fully open-source, commercial use OK)

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy