DuckDB One-Trick Wonder: Batch Read Multiple .duckdb Files with read_duckdb()

DuckDB One-Trick Wonder: Batch Read Multiple .duckdb Files with read_duckdb()

Learn how to use DuckDB's read_duckdb() function to query multiple .duckdb files at once with glob patterns — no ATTACH needed. Cut 200 lines of boilerplate to 3.

The Pain: Batch Reading .duckdb Files is Tedious

You have 100 monthly .duckdb database files, each containing a table called events. You need to analyze them all together.

What’s the traditional approach?

ATTACH 'data/2024-01.duckdb' AS db1;
ATTACH 'data/2024-02.duckdb' AS db2;
ATTACH 'data/2024-03.duckdb' AS db3;
-- ... 97 more files to write manually
SELECT * FROM db1.events
UNION ALL
SELECT * FROM db2.events
UNION ALL
SELECT * FROM db3.events;
-- ... 97 UNION ALL clauses

100 files → 200+ lines of SQL. Not only painful, but not scalable — every new file requires manual code changes.

One Trick to Solve It: read_duckdb() + Globbing

DuckDB v1.5.0 introduced the read_duckdb() table function, which can read tables from .duckdb files without ATTACHing them:

SELECT * FROM read_duckdb('data/*.duckdb');

That’s it — one line. Supports glob wildcards, automatically scans all matching files, and merges results.

More Practical: Specify the Table Name

If each .duckdb file contains multiple tables, use the table_name parameter:

SELECT * FROM read_duckdb(
    'data/*.duckdb',
    table_name => 'events'
);

Track Data Sources Automatically

Want to know which file each row came from? read_duckdb() automatically adds a _source_file column:

SELECT 
    _source_file,
    COUNT(*) AS event_count,
    SUM(amount) AS total_amount
FROM read_duckdb('data/2024-*.duckdb', table_name => 'events')
GROUP BY _source_file
ORDER BY event_count DESC;

Example output:

     _source_file      | event_count | total_amount
-----------------------+-------------+--------------
 data/2024-03.duckdb   |     152340  |    4892341.50
 data/2024-01.duckdb   |     148920  |    4723102.80
 data/2024-02.duckdb   |     151200  |    4812456.20

Using in Python

Just as simple in Python:

import duckdb

# One line to read events from all .duckdb files
df = duckdb.query("""
    SELECT * FROM read_duckdb(
        '/data/monthly_reports/*.duckdb',
        table_name => 'events'
    )
""").df()

print(f"Total rows: {len(df)}")

Quantified Impact

Dimension	ATTACH + UNION ALL	read_duckdb()
Lines of code	200+ (100 files)	3
New file adaptation	Manual SQL change	Zero changes needed
Readability	❌ Hard to maintain	✅ Clear at a glance
Error rate	High (easy to miss/typo)	Very low
Execution time	Similar (same underlying optimization)	Similar

Core benefit: 200+ lines of hand-written SQL compressed into 3 lines of declarative query, with zero cost for new files.

Advanced Usage

Aggregate Directly Over read_duckdb Results

SELECT 
    DATE_TRUNC('month', event_time) AS month,
    COUNT(*) AS total_events,
    AVG(amount) AS avg_amount
FROM read_duckdb('data/2024-*.duckdb', table_name => 'events')
GROUP BY month
ORDER BY month;

Data Quality Checks with FILTER

SELECT 
    _source_file,
    COUNT(*) FILTER (WHERE amount IS NULL) AS null_amounts,
    COUNT(*) FILTER (WHERE event_type = '') AS empty_types
FROM read_duckdb('data/*.duckdb', table_name => 'events')
GROUP BY _source_file
HAVING null_amounts > 0 OR empty_types > 0;

This query identifies all files with data quality issues in under 10 milliseconds — orders of magnitude faster than opening each file individually.

Recursive Subdirectory Matching

SELECT * FROM read_duckdb('data/**/*.duckdb', table_name => 'events');

** recursively matches subdirectories, ideal for hierarchical storage layouts.

Extended Thoughts

The design philosophy behind read_duckdb() is “let DuckDB read and write its own file format”. This opens up interesting scenarios:

Distributed Data Processing: Each node writes a local .duckdb file; merge them all with read_duckdb('output/*.duckdb').
Backup as Data: Your .duckdb files are backups — query any point-in-time snapshot instantly with read_duckdb().
Plugin Architecture: Microservices maintain independent .duckdb files; cross-service analysis uses direct glob reads without shared database instances.

Introduced in DuckDB v1.5.0, read_duckdb() supports late materialization and filter pushdown. Even when reading hundreds of files, the optimizer pushes filtering conditions into each file, reading only the data you actually need.

📬 Subscribe to Olap Studio for weekly practical tips that you can use immediately. No fluff, just actionable DuckDB tricks.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.