Replace Web Scrapers with One SQL Statement: DuckDB Data Acquisition

Replace Web Scrapers with One SQL Statement: DuckDB Data Acquisition

Stop writing Python web scrapers for data acquisition. DuckDB's httpfs extension lets you query remote CSV, Parquet, and JSON files directly from HTTP URLs — no requests library, no BeautifulSoup, no local downloads. This post compares traditional scraping with DuckDB's zero-ETL approach, demonstrates URL glob patterns for batch file ingestion, S3 data lake queries, and provides full executable Python scripts with monetization strategies.

1. The Problem: You’re Still Writing Web Scrapers?

What’s the first step of any data analysis project?

It’s not writing SQL. It’s not tuning models. It’s getting the data.

The traditional workflow for acquiring public data:

Find the data URL (GitHub CSV, government open data, S3 Parquet files)

Download it in browser, or write a Python script:

import requests
import pandas as pd
from io import StringIO

resp = requests.get('https://example.com/data.csv')
df = pd.read_csv(StringIO(resp.text))

Open in Pandas/Excel → OOM crash (file too large)
Finally switch to DuckDB and start analyzing

20 minutes gone, and you haven’t written a single line of analysis code.

Even worse:

100 CSV files on a webpage — you need to write a loop to download and merge them
Data updates daily — you need a cron job to fetch it every time
File is 2GB+ — Pandas read_csv immediately runs out of memory

What if you could query the data directly at its URL — no download, no scraper scripts, no intermediate files?

That’s exactly what DuckDB’s zero-ETL data acquisition does.

2. The Solution: DuckDB Replaces Your Scraper Scripts

DuckDB’s built-in httpfs extension allows you to read CSV, Parquet, and JSON files directly from HTTP/HTTPS URLs in SQL.

Core idea: Query data where it lives. Don’t move it first.

2.1 Three Steps to Remote Queries

INSTALL httpfs;   -- Install once (built-in since DuckDB 1.0)
LOAD httpfs;      -- Load per session

-- Now query any URL directly
SELECT * FROM read_csv_auto('https://example.com/data.csv');

That’s it. No requests.get(), no wget, no temp directory cleanup.

2.2 Quick Comparison

Traditional Approach	DuckDB Approach
`requests.get(url)` → `pandas.read_csv()`	`read_csv_auto('url')` — direct query
Loop over 100 CSVs, concat DataFrames	`read_csv_auto('https://.../*.csv')` — glob pattern
Daily cron: `wget` → unzip → analyze	`cron → duckdb -c "SELECT ..."` — one command
Download entire file to see if it’s useful	Parquet remote: fetch metadata only (5-50KB)

3. Hands-On Examples

3.1 Example 1: Query a Public GitHub CSV/Parquet

GitHub is full of public datasets. Traditional approach: git clone the entire repo. DuckDB: one SQL statement.

-- NYC taxi sample data on GitHub
INSTALL httpfs;
LOAD httpfs;

SELECT 
    VendorID,
    COUNT(*) AS trips,
    ROUND(AVG(total_amount), 2) AS avg_amount,
    ROUND(SUM(total_amount), 2) AS total_revenue
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
WHERE total_amount > 0
GROUP BY VendorID
ORDER BY total_revenue DESC;

Execution time: 3-5 seconds (only pulls Parquet metadata and needed columns).

Traditional approach:

Download 42MB Parquet → 10 seconds
Load into memory → 5 seconds
Execute query → 2 seconds Total: 17 seconds

DuckDB remote query: 5 seconds, zero temp files.

3.2 Example 2: URL Glob Patterns for Batch Ingestion

This is DuckDB’s most underrated capability — URL glob patterns work on remote files too.

Suppose a government open data site organizes files by date:

https://data.gov.example/traffic/2026/01/traffic_20260101.csv
https://data.gov.example/traffic/2026/01/traffic_20260102.csv
...
https://data.gov.example/traffic/2026/05/traffic_20260524.csv

Traditional approach: Python loop + requests.get() + DataFrame concat.

DuckDB approach:

-- Read all CSVs for a given month
SELECT 
    strftime(date, '%Y-%m-%d') AS day,
    COUNT(*) AS records,
    AVG(speed) AS avg_speed
FROM read_csv_auto(
    'https://data.gov.example/traffic/2026/05/*.csv'
)
GROUP BY day
ORDER BY day;

* wildcard: matches all CSV files in that directory.

-- Recursive pattern — all directories, all CSVs
SELECT *
FROM read_csv_auto(
    'https://data.gov.example/traffic/**/*.csv'
);

** recursive wildcard: matches nested directories — perfect for multi-level data warehouses.

-- More precise: only May 2026 data
SELECT *
FROM read_csv_auto(
    'https://data.gov.example/traffic/2026/05/traffic_*.csv'
);

3.3 Example 3: S3 / Object Storage + Parquet Column Pruning

When data lives on AWS S3 or compatible object storage, Parquet remote queries become a superpower.

-- Query sales data on S3 with column pruning
SELECT 
    region,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS customers
FROM read_parquet(
    's3://my-bucket/sales/2026/*/*.parquet'
)
WHERE revenue > 0
GROUP BY region
ORDER BY total_revenue DESC;

How column pruning works: DuckDB uses HTTP Range Requests to fetch only the columns it needs (region, revenue, customer_id). If the original file has 30 columns at 1GB, actual transfer might be only 30-80MB — a 90%+ reduction.

DuckDB also uses Parquet’s row group statistics for predicate pushdown (WHERE revenue > 0), skipping row groups that don’t match — further reducing transfer.

3.4 Example 4: Complete Python Script (Copy-Paste Ready)

#!/usr/bin/env python3
"""
DuckDB Data Acquisition + Analysis Demo
Fetches remote Parquet/CSV via HTTP, runs SQL analysis, exports HTML report
Prerequisites: pip install duckdb pandas
"""

import duckdb
import time
import os

def main():
    # Connect to in-memory database
    con = duckdb.connect()
    
    # Enable httpfs extension
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    
    # Configuration (optional)
    con.execute("SET httpfs_timeout = 30")
    con.execute("SET httpfs_retry_count = 3")
    
    # ========== Example 1: GitHub Public Data ==========
    print("=" * 60)
    print("📦 Example 1: GitHub Public Dataset Query")
    print("=" * 60)
    
    start = time.time()
    
    result = con.execute("""
        SELECT 
            VendorID,
            payment_type,
            COUNT(*) AS trips,
            ROUND(AVG(total_amount), 2) AS avg_amount,
            ROUND(SUM(total_amount), 2) AS total_revenue
        FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
        WHERE total_amount > 0
          AND total_amount < 500
        GROUP BY VendorID, payment_type
        ORDER BY total_revenue DESC
        LIMIT 15
    """).fetchdf()
    
    elapsed = time.time() - start
    print(f"✅ Query completed in {elapsed:.2f}s")
    print(f"📊 {len(result)} rows returned\n")
    print(result.to_string(index=False))
    print()
    
    # ========== Example 2: Remote CSV Analysis ==========
    print("=" * 60)
    print("📄 Example 2: Remote CSV Streaming Analysis")
    print("=" * 60)
    
    start = time.time()
    
    result2 = con.execute("""
        SELECT 
            column0 AS year,
            COUNT(*) AS records
        FROM read_csv_auto(
            'https://raw.githubusercontent.com/plotly/datasets/master/gapminder_unfiltered.csv'
        )
        WHERE column0 > 2000
        GROUP BY year
        ORDER BY year
    """).fetchdf()
    
    elapsed2 = time.time() - start
    print(f"✅ CSV remote query completed in {elapsed2:.2f}s")
    print(f"📊 {len(result2)} rows returned\n")
    print(result2.to_string(index=False))
    print()
    
    # ========== Generate HTML Report ==========
    print("=" * 60)
    print("📝 Generating HTML Report")
    print("=" * 60)
    
    html_report = f"""
    <!DOCTYPE html>
    <html>
    <head><meta charset="utf-8">
    <title>DuckDB Data Acquisition Report</title>
    <style>
        body {{ font-family: -apple-system, sans-serif; max-width: 900px; margin: 40px auto; padding: 0 20px; }}
        h1 {{ color: #0d9488; }}
        table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
        th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
        th {{ background: #0d9488; color: white; }}
        tr:nth-child(even) {{ background: #f5f5f5; }}
        .summary {{ background: #f0fdf4; padding: 15px; border-radius: 8px; margin: 20px 0; }}
    </style>
    </head>
    <body>
    <h1>🦆 DuckDB Data Acquisition Report</h1>
    <p>Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}</p>
    
    <div class="summary">
        <h2>Example 1: NYC Taxi Data Analysis</h2>
        <p>Query time: {elapsed:.2f}s | Rows: {len(result)}</p>
    </div>
    
    {result.to_html(index=False)}
    
    <div class="summary">
        <h2>Example 2: Remote CSV Analysis</h2>
        <p>Query time: {elapsed2:.2f}s | Rows: {len(result2)}</p>
    </div>
    
    {result2.to_html(index=False)}
    
    <hr>
    <p><em>Powered by DuckDB httpfs — query remote data without downloading</em></p>
    </body>
    </html>
    """
    
    output_path = "duckdb_remote_report.html"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(html_report)
    
    print(f"✅ Report saved: {os.path.abspath(output_path)}")
    
    con.close()
    print("\n🎉 All done!")

if __name__ == "__main__":
    main()

Run it:

pip install duckdb pandas
python3 duckdb_remote_data.py

3.5 One-Liner with DuckDB CLI

No Python needed — DuckDB CLI can run SQL and output results in one command:

# Query remote Parquet, output as CSV
duckdb -c "INSTALL httpfs; LOAD httpfs; COPY (
    SELECT VendorID, COUNT(*) AS cnt
    FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
    GROUP BY VendorID
) TO '/tmp/results.csv' (HEADER, DELIMITER ',');"

# Or just print to terminal
duckdb -c "
INSTALL httpfs; LOAD httpfs;
SELECT VendorID, COUNT(*) AS cnt
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
GROUP BY VendorID;
"

This can replace daily cron scraper scripts entirely:

# crontab entry: fetch latest data at 8 AM daily
0 8 * * * duckdb -c "INSTALL httpfs; LOAD httpfs; COPY (SELECT * FROM read_parquet('https://data-bucket.s3.amazonaws.com/daily/*.parquet') WHERE date = current_date) TO '/tmp/daily_report.csv' (HEADER);"

4. Comparison with Traditional Web Scraping

Dimension	Traditional Python Scraper	DuckDB Direct Query
Lines of code	30-100 (requests + pandas + error handling)	1 line of SQL
Learning curve	Need requests, BeautifulSoup, anti-scraping	Just SQL
Disk usage	Download consumes disk space	Zero temp files
Memory usage	Pandas OOM on large files	Streaming, memory-friendly
Transfer efficiency	Full download	Parquet column pruning (10-20% of data)
Batch processing	Loop + merge logic	URL glob pattern, one-shot
Scheduled execution	cron + Python script (Python env required)	cron + duckdb CLI, zero dependencies
Format support	Manual CSV/JSON/Parquet parsing	Auto-detect format and types

Bottom line: For public data acquisition + analysis, DuckDB is an end-to-end solution. No Python environment, no third-party libraries, no intermediate files.

5. Limitations and Caveats

Not every scenario is suitable for DuckDB’s remote queries.

5.1 CSV/JSON Require Full Transfer

CSV and JSON are not columnar formats — DuckDB must download the complete file before parsing. For large files (500MB+ CSV), transfer time is comparable to downloading.

Workaround: For frequently-queried large files, convert to Parquet before uploading to your server/S3.

5.2 Server Must Support HTTP Range Requests

Parquet’s column pruning depends on HTTP Range Requests. Most CDNs and object stores (AWS S3, Cloudflare R2, MinIO) support this. Simple HTTP servers may not.

Verify with:

curl -I -H "Range: bytes=0-100" https://your-data-url

If it returns 206 Partial Content, you’re good.

5.3 Network Latency

Each HTTP Range Request has round-trip overhead. For small files (<1MB), local files are faster.

Guideline: Files under 1MB → download locally. Files 100MB+ Parquet → remote query.

5.4 Authentication

Private data sources need credentials:

-- S3 authentication
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'your_key';
SET s3_secret_access_key = 'your_secret';

-- Bearer Token (for some APIs)
SET httpfs_bearer_token = 'your_token';

6. Monetization Strategies

1. Public Data Collection Service ($30-100/session)

Target clients: Small business owners, market analysts who need industry data but can’t code

Scenario: “Pull all real estate prices from this government website” / “Analyze all AI project trends on GitHub”

Delivery: One DuckDB SQL statement, output as Excel/CSV. No scraper to maintain, no scripts to break.

Pricing: $30-100/session per data source; bulk monthly $200-500/month

2. Automated Data Integration + Reporting ($50-250/month/client)

Target clients: E-commerce sellers, SaaS companies with data spread across platforms

Scenario: Client’s sales data on Shopify (CSV export), ad data on Google Ads (API→CSV), inventory in local Excel

Solution: DuckDB directly reads these remote CSV URLs, cron job generates daily reports

Delivery: cron + DuckDB CLI, daily auto-fetch → analyze → email/WeChat/DingTalk

Pricing: $50-250/month/client, near-zero maintenance cost

3. Data Lake Optimization Consulting ($200-1,000/project)

Target clients: Small-medium companies with data on S3/MinIO doing daily ETL to local

Solution: Switch to DuckDB querying S3 Parquet directly, eliminating ETL steps and intermediate storage

Delivery: Configure DuckDB httpfs + S3 credentials + write remote query SQL

Pricing: $200-1,000/project (depends on data volume and complexity)

4. Technical Training ($200-600/session)

Target clients: Company data teams, IT departments

Content: Teach teams how to replace traditional ETL and scraper workflows with DuckDB

Pricing: $200-600/session (2-3 hour online/offline workshop)

Service	Target Client	Price Range	Monthly Revenue Potential
Public Data Collection	Small biz owners, analysts	$30-100/session	$300-1,000
Automated Report Subscription	E-commerce, SaaS	$50-250/month	$500-2,500
Data Lake Consulting	SMEs	$200-1,000/project	$800-2,000
Technical Training	Data teams	$200-600/session	$400-1,800

7. Summary

DuckDB’s remote file query capability has an underrated superpower that’s not about query performance — it’s about zero-cost data acquisition.

Public data that previously required a scraper → one SQL statement
Remote CSVs that needed local download → read_csv_auto('url') — direct query
S3 data that required ETL pipelines → read_parquet('s3://...')
Scheduled collection scripts needing constant maintenance → cron + duckdb -c "SELECT ..."

Core principle: Query data where it lives. Don’t move it first.

Next time someone sends you a data link, don’t wget. Don’t write requests.get(). Try DuckDB’s read_csv_auto('URL'). Five seconds and the data is right in front of you.

DuckDB version: 1.0+ (httpfs built-in)
Python dependencies: pip install duckdb pandas
CLI version: duckdb -c "SELECT ..." — no Python needed
License: MIT (fully open-source, commercial use OK)

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.