1. The Problem: You’re Still Writing Web Scrapers?
What’s the first step of any data analysis project?
It’s not writing SQL. It’s not tuning models. It’s getting the data.
The traditional workflow for acquiring public data:
- Find the data URL (GitHub CSV, government open data, S3 Parquet files)
- Download it in browser, or write a Python script:
import requests import pandas as pd from io import StringIO resp = requests.get('https://example.com/data.csv') df = pd.read_csv(StringIO(resp.text)) - Open in Pandas/Excel → OOM crash (file too large)
- Finally switch to DuckDB and start analyzing
20 minutes gone, and you haven’t written a single line of analysis code.
Even worse:
- 100 CSV files on a webpage — you need to write a loop to download and merge them
- Data updates daily — you need a cron job to fetch it every time
- File is 2GB+ — Pandas
read_csvimmediately runs out of memory
What if you could query the data directly at its URL — no download, no scraper scripts, no intermediate files?
That’s exactly what DuckDB’s zero-ETL data acquisition does.
2. The Solution: DuckDB Replaces Your Scraper Scripts
DuckDB’s built-in httpfs extension allows you to read CSV, Parquet, and JSON files directly from HTTP/HTTPS URLs in SQL.
Core idea: Query data where it lives. Don’t move it first.
2.1 Three Steps to Remote Queries
INSTALL httpfs; -- Install once (built-in since DuckDB 1.0)
LOAD httpfs; -- Load per session
-- Now query any URL directly
SELECT * FROM read_csv_auto('https://example.com/data.csv');
That’s it. No requests.get(), no wget, no temp directory cleanup.
2.2 Quick Comparison
| Traditional Approach | DuckDB Approach |
|---|---|
requests.get(url) → pandas.read_csv() | read_csv_auto('url') — direct query |
| Loop over 100 CSVs, concat DataFrames | read_csv_auto('https://.../*.csv') — glob pattern |
Daily cron: wget → unzip → analyze | cron → duckdb -c "SELECT ..." — one command |
| Download entire file to see if it’s useful | Parquet remote: fetch metadata only (5-50KB) |
3. Hands-On Examples
3.1 Example 1: Query a Public GitHub CSV/Parquet
GitHub is full of public datasets. Traditional approach: git clone the entire repo. DuckDB: one SQL statement.
-- NYC taxi sample data on GitHub
INSTALL httpfs;
LOAD httpfs;
SELECT
VendorID,
COUNT(*) AS trips,
ROUND(AVG(total_amount), 2) AS avg_amount,
ROUND(SUM(total_amount), 2) AS total_revenue
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
WHERE total_amount > 0
GROUP BY VendorID
ORDER BY total_revenue DESC;
Execution time: 3-5 seconds (only pulls Parquet metadata and needed columns).
Traditional approach:
- Download 42MB Parquet → 10 seconds
- Load into memory → 5 seconds
- Execute query → 2 seconds Total: 17 seconds
DuckDB remote query: 5 seconds, zero temp files.
3.2 Example 2: URL Glob Patterns for Batch Ingestion
This is DuckDB’s most underrated capability — URL glob patterns work on remote files too.
Suppose a government open data site organizes files by date:
https://data.gov.example/traffic/2026/01/traffic_20260101.csv
https://data.gov.example/traffic/2026/01/traffic_20260102.csv
...
https://data.gov.example/traffic/2026/05/traffic_20260524.csv
Traditional approach: Python loop + requests.get() + DataFrame concat.
DuckDB approach:
-- Read all CSVs for a given month
SELECT
strftime(date, '%Y-%m-%d') AS day,
COUNT(*) AS records,
AVG(speed) AS avg_speed
FROM read_csv_auto(
'https://data.gov.example/traffic/2026/05/*.csv'
)
GROUP BY day
ORDER BY day;
* wildcard: matches all CSV files in that directory.
-- Recursive pattern — all directories, all CSVs
SELECT *
FROM read_csv_auto(
'https://data.gov.example/traffic/**/*.csv'
);
** recursive wildcard: matches nested directories — perfect for multi-level data warehouses.
-- More precise: only May 2026 data
SELECT *
FROM read_csv_auto(
'https://data.gov.example/traffic/2026/05/traffic_*.csv'
);
3.3 Example 3: S3 / Object Storage + Parquet Column Pruning
When data lives on AWS S3 or compatible object storage, Parquet remote queries become a superpower.
-- Query sales data on S3 with column pruning
SELECT
region,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT customer_id) AS customers
FROM read_parquet(
's3://my-bucket/sales/2026/*/*.parquet'
)
WHERE revenue > 0
GROUP BY region
ORDER BY total_revenue DESC;
How column pruning works: DuckDB uses HTTP Range Requests to fetch only the columns it needs (region, revenue, customer_id). If the original file has 30 columns at 1GB, actual transfer might be only 30-80MB — a 90%+ reduction.
DuckDB also uses Parquet’s row group statistics for predicate pushdown (WHERE revenue > 0), skipping row groups that don’t match — further reducing transfer.
3.4 Example 4: Complete Python Script (Copy-Paste Ready)
#!/usr/bin/env python3
"""
DuckDB Data Acquisition + Analysis Demo
Fetches remote Parquet/CSV via HTTP, runs SQL analysis, exports HTML report
Prerequisites: pip install duckdb pandas
"""
import duckdb
import time
import os
def main():
# Connect to in-memory database
con = duckdb.connect()
# Enable httpfs extension
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Configuration (optional)
con.execute("SET httpfs_timeout = 30")
con.execute("SET httpfs_retry_count = 3")
# ========== Example 1: GitHub Public Data ==========
print("=" * 60)
print("📦 Example 1: GitHub Public Dataset Query")
print("=" * 60)
start = time.time()
result = con.execute("""
SELECT
VendorID,
payment_type,
COUNT(*) AS trips,
ROUND(AVG(total_amount), 2) AS avg_amount,
ROUND(SUM(total_amount), 2) AS total_revenue
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
WHERE total_amount > 0
AND total_amount < 500
GROUP BY VendorID, payment_type
ORDER BY total_revenue DESC
LIMIT 15
""").fetchdf()
elapsed = time.time() - start
print(f"✅ Query completed in {elapsed:.2f}s")
print(f"📊 {len(result)} rows returned\n")
print(result.to_string(index=False))
print()
# ========== Example 2: Remote CSV Analysis ==========
print("=" * 60)
print("📄 Example 2: Remote CSV Streaming Analysis")
print("=" * 60)
start = time.time()
result2 = con.execute("""
SELECT
column0 AS year,
COUNT(*) AS records
FROM read_csv_auto(
'https://raw.githubusercontent.com/plotly/datasets/master/gapminder_unfiltered.csv'
)
WHERE column0 > 2000
GROUP BY year
ORDER BY year
""").fetchdf()
elapsed2 = time.time() - start
print(f"✅ CSV remote query completed in {elapsed2:.2f}s")
print(f"📊 {len(result2)} rows returned\n")
print(result2.to_string(index=False))
print()
# ========== Generate HTML Report ==========
print("=" * 60)
print("📝 Generating HTML Report")
print("=" * 60)
html_report = f"""
<!DOCTYPE html>
<html>
<head><meta charset="utf-8">
<title>DuckDB Data Acquisition Report</title>
<style>
body {{ font-family: -apple-system, sans-serif; max-width: 900px; margin: 40px auto; padding: 0 20px; }}
h1 {{ color: #0d9488; }}
table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
th {{ background: #0d9488; color: white; }}
tr:nth-child(even) {{ background: #f5f5f5; }}
.summary {{ background: #f0fdf4; padding: 15px; border-radius: 8px; margin: 20px 0; }}
</style>
</head>
<body>
<h1>🦆 DuckDB Data Acquisition Report</h1>
<p>Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}</p>
<div class="summary">
<h2>Example 1: NYC Taxi Data Analysis</h2>
<p>Query time: {elapsed:.2f}s | Rows: {len(result)}</p>
</div>
{result.to_html(index=False)}
<div class="summary">
<h2>Example 2: Remote CSV Analysis</h2>
<p>Query time: {elapsed2:.2f}s | Rows: {len(result2)}</p>
</div>
{result2.to_html(index=False)}
<hr>
<p><em>Powered by DuckDB httpfs — query remote data without downloading</em></p>
</body>
</html>
"""
output_path = "duckdb_remote_report.html"
with open(output_path, "w", encoding="utf-8") as f:
f.write(html_report)
print(f"✅ Report saved: {os.path.abspath(output_path)}")
con.close()
print("\n🎉 All done!")
if __name__ == "__main__":
main()
Run it:
pip install duckdb pandas
python3 duckdb_remote_data.py
3.5 One-Liner with DuckDB CLI
No Python needed — DuckDB CLI can run SQL and output results in one command:
# Query remote Parquet, output as CSV
duckdb -c "INSTALL httpfs; LOAD httpfs; COPY (
SELECT VendorID, COUNT(*) AS cnt
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
GROUP BY VendorID
) TO '/tmp/results.csv' (HEADER, DELIMITER ',');"
# Or just print to terminal
duckdb -c "
INSTALL httpfs; LOAD httpfs;
SELECT VendorID, COUNT(*) AS cnt
FROM 'https://github.com/duckdb/duckdb-data/raw/main/nyc-taxi-data.parquet'
GROUP BY VendorID;
"
This can replace daily cron scraper scripts entirely:
# crontab entry: fetch latest data at 8 AM daily
0 8 * * * duckdb -c "INSTALL httpfs; LOAD httpfs; COPY (SELECT * FROM read_parquet('https://data-bucket.s3.amazonaws.com/daily/*.parquet') WHERE date = current_date) TO '/tmp/daily_report.csv' (HEADER);"
4. Comparison with Traditional Web Scraping
| Dimension | Traditional Python Scraper | DuckDB Direct Query |
|---|---|---|
| Lines of code | 30-100 (requests + pandas + error handling) | 1 line of SQL |
| Learning curve | Need requests, BeautifulSoup, anti-scraping | Just SQL |
| Disk usage | Download consumes disk space | Zero temp files |
| Memory usage | Pandas OOM on large files | Streaming, memory-friendly |
| Transfer efficiency | Full download | Parquet column pruning (10-20% of data) |
| Batch processing | Loop + merge logic | URL glob pattern, one-shot |
| Scheduled execution | cron + Python script (Python env required) | cron + duckdb CLI, zero dependencies |
| Format support | Manual CSV/JSON/Parquet parsing | Auto-detect format and types |
Bottom line: For public data acquisition + analysis, DuckDB is an end-to-end solution. No Python environment, no third-party libraries, no intermediate files.
5. Limitations and Caveats
Not every scenario is suitable for DuckDB’s remote queries.
5.1 CSV/JSON Require Full Transfer
CSV and JSON are not columnar formats — DuckDB must download the complete file before parsing. For large files (500MB+ CSV), transfer time is comparable to downloading.
Workaround: For frequently-queried large files, convert to Parquet before uploading to your server/S3.
5.2 Server Must Support HTTP Range Requests
Parquet’s column pruning depends on HTTP Range Requests. Most CDNs and object stores (AWS S3, Cloudflare R2, MinIO) support this. Simple HTTP servers may not.
Verify with:
curl -I -H "Range: bytes=0-100" https://your-data-url
If it returns 206 Partial Content, you’re good.
5.3 Network Latency
Each HTTP Range Request has round-trip overhead. For small files (<1MB), local files are faster.
Guideline: Files under 1MB → download locally. Files 100MB+ Parquet → remote query.
5.4 Authentication
Private data sources need credentials:
-- S3 authentication
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'your_key';
SET s3_secret_access_key = 'your_secret';
-- Bearer Token (for some APIs)
SET httpfs_bearer_token = 'your_token';
6. Monetization Strategies
1. Public Data Collection Service ($30-100/session)
Target clients: Small business owners, market analysts who need industry data but can’t code
Scenario: “Pull all real estate prices from this government website” / “Analyze all AI project trends on GitHub”
Delivery: One DuckDB SQL statement, output as Excel/CSV. No scraper to maintain, no scripts to break.
Pricing: $30-100/session per data source; bulk monthly $200-500/month
2. Automated Data Integration + Reporting ($50-250/month/client)
Target clients: E-commerce sellers, SaaS companies with data spread across platforms
Scenario: Client’s sales data on Shopify (CSV export), ad data on Google Ads (API→CSV), inventory in local Excel
Solution: DuckDB directly reads these remote CSV URLs, cron job generates daily reports
Delivery: cron + DuckDB CLI, daily auto-fetch → analyze → email/WeChat/DingTalk
Pricing: $50-250/month/client, near-zero maintenance cost
3. Data Lake Optimization Consulting ($200-1,000/project)
Target clients: Small-medium companies with data on S3/MinIO doing daily ETL to local
Solution: Switch to DuckDB querying S3 Parquet directly, eliminating ETL steps and intermediate storage
Delivery: Configure DuckDB httpfs + S3 credentials + write remote query SQL
Pricing: $200-1,000/project (depends on data volume and complexity)
4. Technical Training ($200-600/session)
Target clients: Company data teams, IT departments
Content: Teach teams how to replace traditional ETL and scraper workflows with DuckDB
Pricing: $200-600/session (2-3 hour online/offline workshop)
| Service | Target Client | Price Range | Monthly Revenue Potential |
|---|---|---|---|
| Public Data Collection | Small biz owners, analysts | $30-100/session | $300-1,000 |
| Automated Report Subscription | E-commerce, SaaS | $50-250/month | $500-2,500 |
| Data Lake Consulting | SMEs | $200-1,000/project | $800-2,000 |
| Technical Training | Data teams | $200-600/session | $400-1,800 |
7. Summary
DuckDB’s remote file query capability has an underrated superpower that’s not about query performance — it’s about zero-cost data acquisition.
- Public data that previously required a scraper → one SQL statement
- Remote CSVs that needed local download →
read_csv_auto('url')— direct query - S3 data that required ETL pipelines →
read_parquet('s3://...') - Scheduled collection scripts needing constant maintenance →
cron + duckdb -c "SELECT ..."
Core principle: Query data where it lives. Don’t move it first.
Next time someone sends you a data link, don’t wget. Don’t write requests.get(). Try DuckDB’s read_csv_auto('URL'). Five seconds and the data is right in front of you.
DuckDB version: 1.0+ (httpfs built-in)
Python dependencies:pip install duckdb pandas
CLI version:duckdb -c "SELECT ..."— no Python needed
License: MIT (fully open-source, commercial use OK)
