DuckDB One-Trick Wonder: Batch Read Multiple .duckdb Files with read_duckdb()

Learn how to use DuckDB's read_duckdb() function to query multiple .duckdb files at once with glob patterns — no ATTACH needed. Cut 200 lines of boilerplate to 3.

The Pain: Batch Reading .duckdb Files is Tedious

You have 100 monthly .duckdb database files, each containing a table called events. You need to analyze them all together.

What’s the traditional approach?

ATTACH 'data/2024-01.duckdb' AS db1;
ATTACH 'data/2024-02.duckdb' AS db2;
ATTACH 'data/2024-03.duckdb' AS db3;
-- ... 97 more files to write manually
SELECT * FROM db1.events
UNION ALL
SELECT * FROM db2.events
UNION ALL
SELECT * FROM db3.events;
-- ... 97 UNION ALL clauses

100 files → 200+ lines of SQL. Not only painful, but not scalable — every new file requires manual code changes.

One Trick to Solve It: read_duckdb() + Globbing

DuckDB v1.5.0 introduced the read_duckdb() table function, which can read tables from .duckdb files without ATTACHing them:

SELECT * FROM read_duckdb('data/*.duckdb');

That’s it — one line. Supports glob wildcards, automatically scans all matching files, and merges results.

More Practical: Specify the Table Name

If each .duckdb file contains multiple tables, use the table_name parameter:

SELECT * FROM read_duckdb(
    'data/*.duckdb',
    table_name => 'events'
);

Track Data Sources Automatically

Want to know which file each row came from? read_duckdb() automatically adds a _source_file column:

SELECT 
    _source_file,
    COUNT(*) AS event_count,
    SUM(amount) AS total_amount
FROM read_duckdb('data/2024-*.duckdb', table_name => 'events')
GROUP BY _source_file
ORDER BY event_count DESC;

Example output:

     _source_file      | event_count | total_amount
-----------------------+-------------+--------------
 data/2024-03.duckdb   |     152340  |    4892341.50
 data/2024-01.duckdb   |     148920  |    4723102.80
 data/2024-02.duckdb   |     151200  |    4812456.20

Using in Python

Just as simple in Python:

import duckdb

# One line to read events from all .duckdb files
df = duckdb.query("""
    SELECT * FROM read_duckdb(
        '/data/monthly_reports/*.duckdb',
        table_name => 'events'
    )
""").df()

print(f"Total rows: {len(df)}")

Quantified Impact

DimensionATTACH + UNION ALLread_duckdb()
Lines of code200+ (100 files)3
New file adaptationManual SQL changeZero changes needed
Readability❌ Hard to maintain✅ Clear at a glance
Error rateHigh (easy to miss/typo)Very low
Execution timeSimilar (same underlying optimization)Similar

Core benefit: 200+ lines of hand-written SQL compressed into 3 lines of declarative query, with zero cost for new files.

Advanced Usage

Aggregate Directly Over read_duckdb Results

SELECT 
    DATE_TRUNC('month', event_time) AS month,
    COUNT(*) AS total_events,
    AVG(amount) AS avg_amount
FROM read_duckdb('data/2024-*.duckdb', table_name => 'events')
GROUP BY month
ORDER BY month;

Data Quality Checks with FILTER

SELECT 
    _source_file,
    COUNT(*) FILTER (WHERE amount IS NULL) AS null_amounts,
    COUNT(*) FILTER (WHERE event_type = '') AS empty_types
FROM read_duckdb('data/*.duckdb', table_name => 'events')
GROUP BY _source_file
HAVING null_amounts > 0 OR empty_types > 0;

This query identifies all files with data quality issues in under 10 milliseconds — orders of magnitude faster than opening each file individually.

Recursive Subdirectory Matching

SELECT * FROM read_duckdb('data/**/*.duckdb', table_name => 'events');

** recursively matches subdirectories, ideal for hierarchical storage layouts.

Extended Thoughts

The design philosophy behind read_duckdb() is “let DuckDB read and write its own file format”. This opens up interesting scenarios:

  1. Distributed Data Processing: Each node writes a local .duckdb file; merge them all with read_duckdb('output/*.duckdb').
  2. Backup as Data: Your .duckdb files are backups — query any point-in-time snapshot instantly with read_duckdb().
  3. Plugin Architecture: Microservices maintain independent .duckdb files; cross-service analysis uses direct glob reads without shared database instances.

Introduced in DuckDB v1.5.0, read_duckdb() supports late materialization and filter pushdown. Even when reading hundreds of files, the optimizer pushes filtering conditions into each file, reading only the data you actually need.


📬 Subscribe to DuckDB Lab for weekly practical tips that you can use immediately. No fluff, just actionable DuckDB tricks.

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy