The Pain: Batch Reading .duckdb Files is Tedious
You have 100 monthly .duckdb database files, each containing a table called events. You need to analyze them all together.
What’s the traditional approach?
ATTACH 'data/2024-01.duckdb' AS db1;
ATTACH 'data/2024-02.duckdb' AS db2;
ATTACH 'data/2024-03.duckdb' AS db3;
-- ... 97 more files to write manually
SELECT * FROM db1.events
UNION ALL
SELECT * FROM db2.events
UNION ALL
SELECT * FROM db3.events;
-- ... 97 UNION ALL clauses
100 files → 200+ lines of SQL. Not only painful, but not scalable — every new file requires manual code changes.
One Trick to Solve It: read_duckdb() + Globbing
DuckDB v1.5.0 introduced the read_duckdb() table function, which can read tables from .duckdb files without ATTACHing them:
SELECT * FROM read_duckdb('data/*.duckdb');
That’s it — one line. Supports glob wildcards, automatically scans all matching files, and merges results.
More Practical: Specify the Table Name
If each .duckdb file contains multiple tables, use the table_name parameter:
SELECT * FROM read_duckdb(
'data/*.duckdb',
table_name => 'events'
);
Track Data Sources Automatically
Want to know which file each row came from? read_duckdb() automatically adds a _source_file column:
SELECT
_source_file,
COUNT(*) AS event_count,
SUM(amount) AS total_amount
FROM read_duckdb('data/2024-*.duckdb', table_name => 'events')
GROUP BY _source_file
ORDER BY event_count DESC;
Example output:
_source_file | event_count | total_amount
-----------------------+-------------+--------------
data/2024-03.duckdb | 152340 | 4892341.50
data/2024-01.duckdb | 148920 | 4723102.80
data/2024-02.duckdb | 151200 | 4812456.20
Using in Python
Just as simple in Python:
import duckdb
# One line to read events from all .duckdb files
df = duckdb.query("""
SELECT * FROM read_duckdb(
'/data/monthly_reports/*.duckdb',
table_name => 'events'
)
""").df()
print(f"Total rows: {len(df)}")
Quantified Impact
| Dimension | ATTACH + UNION ALL | read_duckdb() |
|---|---|---|
| Lines of code | 200+ (100 files) | 3 |
| New file adaptation | Manual SQL change | Zero changes needed |
| Readability | ❌ Hard to maintain | ✅ Clear at a glance |
| Error rate | High (easy to miss/typo) | Very low |
| Execution time | Similar (same underlying optimization) | Similar |
Core benefit: 200+ lines of hand-written SQL compressed into 3 lines of declarative query, with zero cost for new files.
Advanced Usage
Aggregate Directly Over read_duckdb Results
SELECT
DATE_TRUNC('month', event_time) AS month,
COUNT(*) AS total_events,
AVG(amount) AS avg_amount
FROM read_duckdb('data/2024-*.duckdb', table_name => 'events')
GROUP BY month
ORDER BY month;
Data Quality Checks with FILTER
SELECT
_source_file,
COUNT(*) FILTER (WHERE amount IS NULL) AS null_amounts,
COUNT(*) FILTER (WHERE event_type = '') AS empty_types
FROM read_duckdb('data/*.duckdb', table_name => 'events')
GROUP BY _source_file
HAVING null_amounts > 0 OR empty_types > 0;
This query identifies all files with data quality issues in under 10 milliseconds — orders of magnitude faster than opening each file individually.
Recursive Subdirectory Matching
SELECT * FROM read_duckdb('data/**/*.duckdb', table_name => 'events');
** recursively matches subdirectories, ideal for hierarchical storage layouts.
Extended Thoughts
The design philosophy behind read_duckdb() is “let DuckDB read and write its own file format”. This opens up interesting scenarios:
- Distributed Data Processing: Each node writes a local
.duckdbfile; merge them all withread_duckdb('output/*.duckdb'). - Backup as Data: Your
.duckdbfiles are backups — query any point-in-time snapshot instantly withread_duckdb(). - Plugin Architecture: Microservices maintain independent
.duckdbfiles; cross-service analysis uses direct glob reads without shared database instances.
Introduced in DuckDB v1.5.0, read_duckdb() supports late materialization and filter pushdown. Even when reading hundreds of files, the optimizer pushes filtering conditions into each file, reading only the data you actually need.
📬 Subscribe to DuckDB Lab for weekly practical tips that you can use immediately. No fluff, just actionable DuckDB tricks.