Merge Multiple DuckDB Files in One SQL Line: The Complete read_duckdb() Guide

Advanced Techniques

Merge Multiple DuckDB Files in One SQL Line: The Complete read_duckdb() Guide

DuckDB 1.5+ introduces read_duckdb(), a built-in function that lets you read and merge multiple .duckdb files directly in SQL — no Python loops needed. Covers syntax, performance tuning, real-world scenarios, and monetization paths.

The Pain Point That Plagues Countless Data Workers

Imagine you’re a data engineer or indie developer who needs to process a large volume of DuckDB database files every day. Common scenarios include:

Daily-sharded business databases: Each daily_20260601.duckdb file stores one day of order data
Multi-project consolidated analysis: Three teams each maintain their own .duckdb file, and you need cross-project comparisons
Backup file retrieval: Hundreds of historical backups in backup/ directory, need to quickly find a specific record
Data lake archive queries: Cold data stored as independent DuckDB files, needs temporary aggregation analysis

Before read_duckdb(), your solution looked like this:

import duckdb
import glob

results = []
for f in glob.glob("data_*.duckdb"):
    conn = duckdb.connect(f)
    results.append(conn.execute("SELECT * FROM orders").fetchdf())
    conn.close()

merged = pd.concat(results)

The problems with this code are obvious:

Each file requires a connect-close cycle — high overhead, easy to leak resources
All data loads into memory — OOM when facing hundreds of GB of files
No predicate pushdown — must read everything before filtering
Verbose — a simple merge operation takes 10+ lines of Python

read_duckdb() changes all of this.

What Is read_duckdb()?

read_duckdb() is a built-in function introduced in DuckDB 1.5.0 (Variegata) that allows you to read table data from .duckdb files directly in SQL, with glob wildcard support for multiple files.

Its core syntax is straightforward:

read_duckdb(file_pattern [, table_name])

file_pattern: File path pattern, supports * and ? wildcards
table_name (optional): The table to read; if omitted, returns all tables in the file

Basic Usage: Simple to Complex

Reading a Single File

-- Read the first table from backup.db
SELECT * FROM read_duckdb('backup.db');

-- Read a specific table
SELECT * FROM read_duckdb('backup.db', 'orders');

Merging Multiple Files

-- Merge orders tables from all daily_*.duckdb files
SELECT * FROM read_duckdb('daily_*.duckdb', 'orders');

-- Recursively read all .duckdb files in subdirectories
SELECT * FROM read_duckdb('archive/**/*.duckdb', 'metrics');

Combining with SQL Operations

-- Merged query with filters
SELECT order_id, customer_id, total_amount
FROM read_duckdb('daily_*.duckdb', 'orders')
WHERE order_date >= '2026-06-01'
  AND total_amount > 100;

-- Aggregation statistics
SELECT 
    DATE_TRUNC('day', order_date) AS day,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value
FROM read_duckdb('daily_*.duckdb', 'orders')
GROUP BY DATE_TRUNC('day', order_date)
ORDER BY day;

Python Integration in Production

In real projects, you’ll typically use DuckDB’s Python bindings:

import duckdb
import glob

# Method 1: Direct SQL usage
conn = duckdb.connect()
result = conn.execute("""
    SELECT 
        _source_file AS file_source,
        COUNT(*) AS order_count,
        SUM(total_amount) AS total_revenue
    FROM read_duckdb('daily_reports/*.duckdb', 'orders')
    GROUP BY _source_file
    ORDER BY order_count DESC
""").fetchdf()

print(result)

Key Feature: The `_source_file` Pseudo-Column

When using read_duckdb() to read multiple files, DuckDB automatically adds a column named _source_file that records which file each row came from. This is invaluable for debugging and data lineage:

SELECT 
    _source_file,
    order_id,
    customer_id,
    total_amount,
    order_date
FROM read_duckdb('daily_*.duckdb', 'orders')
LIMIT 10;

Example output:

┌───────────────────────────┬──────────┬─────────────┬──────────────┬────────────┐
│       _source_file        │ order_id │ customer_id │ total_amount │ order_date │
├───────────────────────────┼──────────┼─────────────┼──────────────┼────────────┤
│ daily_20260601.duckdb     │ 10001    │ C-5023      │ 259.99       │ 2026-06-01 │
│ daily_20260601.duckdb     │ 10002    │ C-7891      │ 89.50        │ 2026-06-01 │
│ daily_20260602.duckdb     │ 10003    │ C-3344      │ 432.00       │ 2026-06-02 │
│ daily_20260602.duckdb     │ 10004    │ C-1122      │ 67.80        │ 2026-06-02 │
└───────────────────────────┴──────────┴─────────────┴──────────────┴────────────┘

Performance Tuning: Make Merging Fly

1. Predicate Pushdown

read_duckdb() supports predicate pushdown, meaning WHERE conditions are passed to each file’s query, reading only matching data:

-- Only files with data after 2026-06-15 will be scanned
SELECT * FROM read_duckdb('daily_*.duckdb', 'events')
WHERE event_time >= '2026-06-15';

2. Verify Optimization with EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS monthly_revenue
FROM read_duckdb('monthly_*.duckdb', 'orders')
WHERE order_date BETWEEN '2026-01-01' AND '2026-06-30'
GROUP BY DATE_TRUNC('month', order_date);

You’ll see in the execution plan that each file’s read is optimized to scan only relevant partitions.

3. Control Concurrent Reads

import duckdb

# Set maximum concurrent file reads
conn = duckdb.connect(config={
    'max_threads': 8,
    'temp_directory': '/tmp/duckdb_temp'
})

result = conn.execute("""
    SELECT * FROM read_duckdb('large_set/*.duckdb', 'data')
""").fetchdf()

4. External Sort for Large Datasets

When merged data exceeds memory, leverage DuckDB’s external sorting:

SET GLOBAL memory_limit = '4GB';
SET GLOBAL temp_directory = '/tmp/duckdb_temp';

SELECT * FROM read_duckdb('archive/*.duckdb', 'transactions')
ORDER BY transaction_date DESC
LIMIT 1000;

Comparison with Traditional Approaches

Dimension	Python Loop + pandas	read_duckdb()
Code Lines	10–20 lines	1–3 lines of SQL
Memory Efficiency	Full load into memory	Lazy read + pushdown
Filtering	Filter after reading	Predicate pushdown to file level
Error Recovery	Manual exception handling	Unified SQL handling
Learning Curve	Requires Python programming	SQL knowledge only
Parallel Processing	Manual multi-threading	Automatic parallel scanning
Debug Tracing	Extra logic needed	Automatic `_source_file` column

Real-World Scenarios: Building Data Products

Scenario 1: Competitor Price Monitoring SaaS

Build an automated competitor price monitoring system using read_duckdb():

import duckdb
from datetime import datetime, timedelta

class PriceMonitor:
    def __init__(self, db_dir="price_data"):
        self.conn = duckdb.connect(
            config={'temp_directory': '/tmp/duckdb_monitor'}
        )
        self.db_dir = db_dir
    
    def get_price_trend(self, product_id, days=30):
        """Get price trend for a product over the last N days"""
        cutoff = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
        
        query = f"""
        SELECT 
            _source_file,
            price,
            competitor,
            timestamp
        FROM read_duckdb('{self.db_dir}/*.duckdb', 'prices')
        WHERE product_id = '{product_id}'
          AND timestamp >= '{cutoff}'
        ORDER BY timestamp
        """
        
        return self.conn.execute(query).fetchdf()
    
    def get_best_deal(self, category, days=7):
        """Find the best price for a category in the last N days"""
        query = f"""
        SELECT 
            competitor,
            AVG(price) AS avg_price,
            MIN(price) AS min_price,
            MAX(price) AS max_price,
            COUNT(*) AS sample_count
        FROM read_duckdb('{self.db_dir}/*.duckdb', 'prices')
        WHERE category = '{category}'
        GROUP BY competitor
        ORDER BY avg_price ASC
        LIMIT 5
        """
        
        return self.conn.execute(query).fetchdf()

# Usage example
monitor = PriceMonitor()
trend = monitor.get_price_trend("SKU-12345")
best_deals = monitor.get_best_deal("electronics")

The business model is simple: collect competitor price data from various e-commerce platforms, store it in independent DuckDB files by day, then provide API access for price trend queries and optimal purchase channels. Charge $7–28/month per subscriber.

Scenario 2: Automated Weekly Report Generator

import duckdb

def generate_weekly_report():
    """Auto-generate weekly sales reports"""
    conn = duckdb.connect()
    
    report = conn.execute("""
        WITH daily_stats AS (
            SELECT 
                _source_file AS report_date,
                COUNT(DISTINCT customer_id) AS unique_customers,
                COUNT(*) AS total_orders,
                SUM(total_amount) AS total_revenue,
                AVG(total_amount) AS avg_order_value
            FROM read_duckdb('weekly_sales/*.duckdb', 'orders')
            WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAY
            GROUP BY _source_file
        )
        SELECT 
            report_date,
            unique_customers,
            total_orders,
            total_revenue,
            avg_order_value,
            LAG(total_revenue) OVER (ORDER BY report_date) AS prev_day_revenue,
            ROUND(
                (total_revenue - LAG(total_revenue) OVER (ORDER BY report_date)) 
                / NULLIF(LAG(total_revenue) OVER (ORDER BY report_date), 0) * 100, 2
            ) AS revenue_change_pct
        FROM daily_stats
        ORDER BY report_date
    """).fetchdf()
    
    return report

report = generate_weekly_report()
print(report.to_markdown(index=False))

Scenario 3: Data Quality Audit Tool

-- Check data consistency across multiple files
SELECT 
    _source_file,
    COUNT(*) AS row_count,
    COUNT(DISTINCT id) AS unique_ids,
    COUNT(*) - COUNT(DISTINCT id) AS duplicates,
    COUNT(CASE WHEN email IS NULL THEN 1 END) AS missing_emails,
    COUNT(CASE WHEN phone IS NULL THEN 1 END) AS missing_phones
FROM read_duckdb('customer_db/*.duckdb', 'customers')
GROUP BY _source_file
HAVING duplicates > 0 OR missing_emails > 100
ORDER BY row_count DESC;

Comparison with Similar Functions

Besides read_duckdb(), DuckDB provides other file reading functions:

Function	Purpose	Use Case
`read_duckdb()`	Read .duckdb files	Merge homogeneous database files
`read_csv_auto()`	Read CSV files	Text data sources
`read_parquet()`	Read Parquet files	Large columnar datasets
`read_json_auto()`	Read JSON files	Semi-structured data
`read_ndjson()`	Read NDJSON files	Streaming JSON data

The unique advantage of read_duckdb() is that it reads DuckDB’s internal format directly, preserving full type information and index structures — something other formats cannot match.

Important Notes and Best Practices

1. Version Compatibility

read_duckdb() requires DuckDB ≥ 1.5.0. Upgrade if needed:

pip install --upgrade duckdb

2. File Path Conventions

Use relative paths relative to DuckDB’s working directory
On Windows, escape backslashes or use raw strings
* in glob patterns matches files in the current directory; ** matches recursively in subdirectories

3. Security Considerations

# ⚠️ Don't directly concatenate user input into file paths
# Unsafe
pattern = f"{user_input}/*.duckdb"

# ✅ Use whitelist validation
import pathlib
allowed_dir = pathlib.Path("/data/reports")
user_path = pathlib.Path(user_input)
if user_path.resolve().is_relative_to(allowed_dir.resolve()):
    pattern = str(user_path) + "/*.duckdb"

4. Performance Tuning Checklist

For merges exceeding 100 files, set max_threads to control concurrency
Use temp_directory for external sort scratch space
Create appropriate indexes on files; read_duckdb() leverages internal file indexes
Avoid calling read_duckdb() on the same file multiple times in one query

Monetization Paths

While read_duckdb() seems like a small feature, it solves a real and frequent need — multi-file data merging. Here are actionable monetization directions:

Direction 1: Data Integration as a Service (DIaaS)

Provide cross-system data integration for SMEs. Many companies use different tools to collect data (CRM, ERP, e-commerce platforms), each exporting independent .duckdb files. You can merge them in minutes using read_duckdb() and deliver visualized analysis reports.

Pricing: One-time integration $70–280, monthly data monitoring service $140–700/month.

Direction 2: Automated Reporting SaaS

Build automated weekly/monthly reporting systems based on read_duckdb(). Clients generate independent database files daily; you merge and analyze via scheduled tasks, auto-generating charts and insights.

Tech stack: DuckDB + Streamlit/Evidence + cron jobs = a complete reporting product.

Pricing: Basic $14/month, Pro $42/month.

Direction 3: Data Audit Tool

Serve data teams with cross-file data quality audits. Check data consistency, duplicate records, missing values, and auto-generate audit reports.

Pricing: Per-file-count pricing, $14–70 per run.

Direction 4: Education Courses

Package read_duckdb() techniques into courses teaching data analysts how to efficiently handle multi-file scenarios. Such practical tutorials are popular on knowledge payment platforms.

Pricing: Single course $28–70, bundled series $140.

Summary

read_duckdb() is one of the most practical features in DuckDB 1.5+. It reduces what used to be multiple lines of Python code for multi-file merging into a single SQL line:

SELECT * FROM read_duckdb('data_*.duckdb', 'orders');

More importantly, it delivers real performance advantages — predicate pushdown, lazy reading, automatic parallelism — that traditional Python loop solutions struggle to match.

If you’re dealing with multi-file data merging scenarios, stop writing verbose Python loops. Try read_duckdb() and you’ll see a qualitative leap in both speed and data quality.

📖 More DuckDB practical functions and real-world cases → duckdblab.org

💡 More DuckDB advanced techniques → duckdblab.org

Olap Studio · Focused on DuckDB Practical Skills · 2026-06-21

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.