Query Remote Files with DuckDB httpfs: Zero-Download Analytics on Cloud CSV, Parquet, and JSON

Query Remote Files with DuckDB httpfs: Zero-Download Analytics on Cloud CSV, Parquet, and JSON

Stop downloading files before analyzing them. DuckDB's httpfs extension lets you query remote CSV, Parquet, and JSON files over HTTP directly in SQL — zero download, zero copy, column-pruned transfers. This post covers Range Request optimization, HuggingFace dataset queries, S3 data lake patterns, full executable Python scripts, and performance benchmarks against traditional download-and-analyze workflows.

1. The Problem: Data Download Is Your Biggest Bottleneck

What’s the most time-consuming part of data analysis?

It’s not writing SQL. It’s not tuning parameters. It’s waiting for data to download.

Here’s the typical workflow:

A colleague shares a link: “Here’s the data, can you analyze it?”
You wget or browser-download a 500MB CSV — wait 5 minutes
Unzip it (if gz format) — another 1 minute
Open in Excel or Pandas — OOM crash, the file is too large
Switch to DuckDB — finally start querying

15 minutes gone, and you haven’t written a single line of analysis code.

More painful scenarios:

Exploratory analysis: You just want to preview a dataset, but must download the whole thing first
Data lake queries: Your company has thousands of Parquet files on S3, but you have to pull everything to query a week’s data
HuggingFace datasets: You want to evaluate a dataset for your ML project, but first you need to git clone dozens of gigabytes

What if there was a way to query without downloading?

DuckDB’s httpfs extension is exactly the answer.

2. The Solution: DuckDB httpfs Remote Query Capabilities

2.1 What Is httpfs?

httpfs is a core DuckDB extension (built-in since v1.0) that enables DuckDB to read and write remote files over HTTP/HTTPS. But it’s far more sophisticated than simply supporting URL paths — it leverages two key technologies:

1. HTTP Range Requests

When you run read_parquet('https://.../data.parquet'), DuckDB does NOT download the entire file. Instead, it sends a Range: bytes=0-1023 HTTP header to request only the metadata portion (the Parquet footer). After parsing column locations and statistics, it fetches data blocks on demand, column by column.

This means:

If your query references only 3 columns, DuckDB downloads only those columns’ data blocks
If you have WHERE clause filters, DuckDB first reads min/max statistics per row group and skips irrelevant ones
Actual network transfer can be as low as 5%-20% of the original file size

2. Columnar File Format (Parquet)

Parquet’s columnar storage is naturally suited for remote query patterns. Each column’s data is organized in row groups, with independent statistics (min/max/null count) per row group. DuckDB can:

Read only the columns referenced in your query
Skip row groups that don’t satisfy WHERE conditions
Answer aggregate queries (COUNT/SUM/AVG) directly from metadata

2.2 Supported File Formats

Format	Function	Remote Support	Column Pruning	Predicate Pushdown
Parquet	`read_parquet()`	✅ Efficient	✅	✅
CSV	`read_csv_auto()`	✅ Full download	❌	❌
JSON	`read_json_auto()`	✅ Full download	❌	❌
CSV (gz)	`read_csv_auto()`	✅ Full download	❌	❌

Key principle: Only remote Parquet queries benefit from column pruning. CSV/JSON must be fully downloaded before parsing — great for small files or fast networks, but for large datasets, convert to Parquet first.

2.3 One-Line Summary

For cloud Parquet data: zero download, query directly, fetch only needed columns — 10-50x faster.
For cloud CSV/JSON data: skip manual download, one SQL query — perfect for small-to-medium files.

3. Hands-On Examples

3.1 Enable httpfs Extension

INSTALL httpfs;   -- One-time installation
LOAD httpfs;      -- Required per session

3.2 Example 1: Querying HuggingFace Movie Dataset

HuggingFace hosts vast public datasets in Parquet format. Query directly without downloading:

-- Query rating distribution from TMDB movie dataset
SELECT 
    genre,
    ROUND(AVG(vote_average), 2) AS avg_rating,
    ROUND(AVG(vote_count), 0) AS avg_votes,
    COUNT(*) AS movie_count
FROM read_parquet(
    'https://huggingface.co/datasets/TMDB/tmdb-movie-metadata/resolve/main/data/movies.parquet'
)
WHERE vote_count > 50
GROUP BY genre
ORDER BY avg_rating DESC
LIMIT 10;

This query downloads only the genre, vote_average, and vote_count columns from the Parquet file. If the original file has 20 columns and is 500MB, actual transfer may be as low as 30-50MB.

3.3 Example 2: Remote CSV Analysis (GitHub Public Data)

CSV files can’t use column pruning, but the convenience of zero-download is still enormous:

-- Analyze public event data directly from GitHub
SELECT 
    strftime(date, '%Y-%m') AS month,
    COUNT(*) AS event_count,
    COUNT(DISTINCT repo_name) AS repos
FROM read_csv_auto(
    'https://raw.githubusercontent.com/example/public-data/main/events.csv'
)
WHERE date >= '2026-01-01'
GROUP BY month
ORDER BY month;

3.4 Example 3: Multi-File Remote Query (Glob Pattern)

Remote files support glob wildcards — extremely useful for data lake scenarios:

-- Query a date range of all Parquet files on S3
SELECT 
    region,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS customers
FROM read_parquet(
    'https://data-bucket.s3.amazonaws.com/sales/*/2026/05/*/*.parquet'
)
WHERE amount > 0
GROUP BY region
ORDER BY total_revenue DESC;

3.5 Complete Executable Python Script

#!/usr/bin/env python3
"""
DuckDB Remote File Query Demo
Query a HuggingFace remote Parquet dataset and export results
Prerequisites: pip install duckdb
"""

import duckdb
import time

def main():
    # Connect to in-memory database
    con = duckdb.connect()
    
    # Enable httpfs extension
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    
    # Optional: configure httpfs parameters
    con.execute("SET httpfs_retry_count = 3")
    con.execute("SET httpfs_timeout = 30")
    
    # Remote Parquet URL (HuggingFace TMDB movie data)
    remote_url = (
        "https://huggingface.co/datasets/TMDB/"
        "tmdb-movie-metadata/resolve/main/data/movies.parquet"
    )
    
    print(f"🔍 Querying remote: {remote_url}")
    print(f"⏳ Transferring only needed columns (not the whole file)...\n")
    
    start = time.time()
    
    # Query: DuckDB fetches only requested columns via Range Requests
    result = con.execute(f"""
        SELECT 
            title,
            vote_average,
            vote_count,
            release_date,
            genres
        FROM read_parquet('{remote_url}')
        WHERE vote_count > 100
          AND vote_average > 7.0
        ORDER BY vote_average DESC
        LIMIT 20
    """).fetchdf()
    
    elapsed = time.time() - start
    
    print(f"✅ Query completed in {elapsed:.2f}s")
    print(f"📊 {len(result)} rows returned\n")
    
    # Display results
    print("=" * 80)
    print(f"{'Rank':<4} {'Title':<40} {'Rating':<6} {'Votes':<8} {'Genre'}")
    print("-" * 80)
    for i, row in result.iterrows():
        title = str(row['title'])[:38] + '..' if len(str(row['title'])) > 38 else row['title']
        genres = str(row['genres'])[:30] if row['genres'] else 'N/A'
        print(f"{i+1:<4} {title:<40} {row['vote_average']:<6.1f} {row['vote_count']:<8} {genres}")
    
    # Export to local CSV
    output_path = "top_movies.csv"
    con.execute(f"""
        COPY (
            SELECT * FROM read_parquet('{remote_url}')
            WHERE vote_count > 100 AND vote_average > 7.0
            ORDER BY vote_average DESC
            LIMIT 20
        ) TO '{output_path}' (HEADER, DELIMITER ',')
    """)
    print(f"\n💾 Results exported: {output_path}")
    
    # Query dataset statistics (uses Parquet metadata, near-zero transfer)
    stats = con.execute(f"""
        SELECT 
            COUNT(*) AS total_movies,
            ROUND(AVG(vote_average), 2) AS avg_rating,
            ROUND(AVG(vote_count), 0) AS avg_vote_count,
            MIN(release_date) AS earliest,
            MAX(release_date) AS latest
        FROM read_parquet('{remote_url}')
    """).fetchone()
    
    print(f"\n📈 Dataset Overview (metadata query)")
    print(f"   Total movies: {stats[0]:,}")
    print(f"   Average rating: {stats[1]}")
    print(f"   Average votes: {stats[2]:,.0f}")
    print(f"   Date range: {stats[3]} ~ {stats[4]}")
    
    con.close()

if __name__ == "__main__":
    main()

Run it:

pip install duckdb pandas
python3 duckdb_remote_query.py

3.6 DuckDB CLI Example (Copy-Paste Ready)

# Launch DuckDB CLI
duckdb

# In the CLI
INSTALL httpfs;
LOAD httpfs;

SELECT title, vote_average, vote_count
FROM read_parquet('https://huggingface.co/datasets/TMDB/tmdb-movie-metadata/resolve/main/data/movies.parquet')
WHERE vote_count > 1000
ORDER BY vote_average DESC
LIMIT 10;

4. Performance Comparison

Scenario	Traditional Approach	DuckDB httpfs	Time Saved
100MB Parquet (5 columns)	Download 100MB + load + query ≈ 30s	Range Request pulls 20MB ≈ 5s	83%
500MB Parquet (3 cols + aggregate)	Download 500MB + load + aggregate ≈ 2min	Metadata-only query ≈ 2s	98%
1GB CSV (full analysis)	Download 1GB + Pandas load + analyze ≈ 5min	DuckDB streaming ≈ 30s	90%
10 remote Parquets (exploration)	Download all 10GB + inspect ≈ 10min	Per-column per-file fetch ≈ 15s	97%
API JSON data (daily analysis)	Python script + parse + clean ≈ 30min	One SQL query ≈ 1min	97%

5. Advanced Techniques

5.1 HTTP Configuration

-- Set retry count (for unstable networks)
SET httpfs_retry_count = 5;

-- Set request timeout (seconds)
SET httpfs_timeout = 60;

-- Configure S3-compatible storage (MinIO, Alibaba OSS, etc.)
SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';
SET s3_secret_access_key = '...';
SET s3_endpoint = 'https://my-minio-server.com';

5.2 Remote Query + Local Materialization

Sometimes you want to pull a filtered copy locally:

-- Materialize remote data into a local table
CREATE TABLE local_movies AS
SELECT * FROM read_parquet('https://.../movies.parquet')
WHERE year >= 2020;

-- Now query locally at lightning speed
SELECT genre, COUNT(*) FROM local_movies GROUP BY genre;

5.3 Multi-Source JOIN Queries

DuckDB excels at combining remote and local data in a single query:

-- Remote Parquet JOIN local CSV
SELECT 
    r.region,
    r.revenue,
    l.store_name
FROM read_parquet('https://s3-bucket/revenue/*.parquet') r
JOIN read_csv_auto('local_stores.csv') l
    ON r.store_id = l.store_id
WHERE r.date >= '2026-01-01';

5.4 S3-Compatible Object Storage

Beyond public HTTP, httpfs supports AWS S3 and compatible storage:

-- AWS S3 (requires credentials)
SELECT * FROM read_parquet('s3://my-bucket/sales/*.parquet');

-- MinIO / Alibaba OSS / Tencent COS
SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet');

6. Limitations & Caveats

CSV/JSON requires full transfer: These formats aren’t columnar, so DuckDB must download the entire file before parsing. For frequent queries on large files, convert to Parquet first.
Network latency sensitive: Each Range Request incurs round-trip overhead. For very small files (<1MB), local files are faster.
Authentication required for private data: S3/MinIO needs access keys configured. Public URLs (e.g., HuggingFace datasets) need no configuration.
Concurrency limits: Multiple concurrent queries on the same remote file may hit server rate limits.
Limited writes: httpfs is primarily for reads. Writing remote files (COPY TO) works only on some S3-compatible stores.

7. Monetization Ideas

This skill solves real problems that clients will pay for:

1. Data Exploration Consulting (¥300-800/session)

Scenario: A client has cloud data but doesn’t know if it’s worth downloading. You use DuckDB remote queries to preview datasets — fields, quality, size — and deliver a report in 5 minutes.

2. Data Lake Query Optimization (¥2,000-5,000/project)

Scenario: A company’s data sits on S3, and they traditionally ETL everything locally before analysis. You migrate them to DuckDB direct S3 Parquet queries, eliminating ETL and storage costs.

3. Automated Remote Data Reports (¥500-2,000/month/client)

Scenario: A client’s business data updates daily on object storage. You set up a DuckDB cron job that queries remote data and outputs PDF/Excel reports — monthly subscription.

4. HuggingFace Dataset Evaluation Service (¥200-500/session)

Scenario: AI/ML teams need to evaluate public datasets. You remotely query dataset distribution and statistics, delivering a quick evaluation report.

5. Corporate Training (¥2,000-5,000/session)

Scenario: Train a company’s data team on “How to efficiently query cloud data with DuckDB” — covering httpfs configuration, S3 integration, and performance optimization.

Service	Target Clients	Price Range	Monthly Revenue Potential
Data Exploration	SMBs, startups	¥300-800/session	¥3,000-8,000
Data Lake Optimization	Companies with S3/cloud storage	¥2,000-5,000/project	¥10,000-30,000
Remote Report Subscription	E-com, SaaS companies	¥500-2,000/month	¥5,000-20,000
Dataset Evaluation	AI/ML teams	¥200-500/session	¥2,000-5,000
Corporate Training	Enterprise data teams	¥2,000-5,000/session	¥4,000-10,000

Architecture Overview

8. Conclusion

DuckDB’s remote file query capability transforms “download → analyze” into “analyze directly.” Key takeaways:

Remote Parquet is the real killer feature — column pruning and predicate pushdown reduce transfer to 5%-20%
CSV/JSON works well for small files or one-off analysis — eliminates manual download hassle
S3 / object storage + DuckDB = lightweight data lake query engine
Perfect for exploratory analysis, automated reporting, and data previews

Next time someone shares a data link, don’t wget. Try read_parquet('https://...') with DuckDB instead.

DuckDB version: 1.0+ (httpfs built-in) Python dependency: pip install duckdb License: MIT (fully open source, commercial-friendly)

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.