Featured image of post Merge Multiple DuckDB Files in One SQL Line: The Complete read_duckdb() Guide

Merge Multiple DuckDB Files in One SQL Line: The Complete read_duckdb() Guide

DuckDB 1.5+ introduces read_duckdb(), a built-in function that lets you read and merge multiple .duckdb files directly in SQL — no Python loops needed. Covers syntax, performance tuning, real-world scenarios, and monetization paths.

The Pain Point That Plagues Countless Data Workers

Imagine you’re a data engineer or indie developer who needs to process a large volume of DuckDB database files every day. Common scenarios include:

  • Daily-sharded business databases: Each daily_20260601.duckdb file stores one day of order data
  • Multi-project consolidated analysis: Three teams each maintain their own .duckdb file, and you need cross-project comparisons
  • Backup file retrieval: Hundreds of historical backups in backup/ directory, need to quickly find a specific record
  • Data lake archive queries: Cold data stored as independent DuckDB files, needs temporary aggregation analysis

Before read_duckdb(), your solution looked like this:

import duckdb
import glob

results = []
for f in glob.glob("data_*.duckdb"):
    conn = duckdb.connect(f)
    results.append(conn.execute("SELECT * FROM orders").fetchdf())
    conn.close()

merged = pd.concat(results)

The problems with this code are obvious:

  1. Each file requires a connect-close cycle — high overhead, easy to leak resources
  2. All data loads into memory — OOM when facing hundreds of GB of files
  3. No predicate pushdown — must read everything before filtering
  4. Verbose — a simple merge operation takes 10+ lines of Python

read_duckdb() changes all of this.

What Is read_duckdb()?

read_duckdb() is a built-in function introduced in DuckDB 1.5.0 (Variegata) that allows you to read table data from .duckdb files directly in SQL, with glob wildcard support for multiple files.

Its core syntax is straightforward:

read_duckdb(file_pattern [, table_name])
  • file_pattern: File path pattern, supports * and ? wildcards
  • table_name (optional): The table to read; if omitted, returns all tables in the file

Basic Usage: Simple to Complex

Reading a Single File

-- Read the first table from backup.db
SELECT * FROM read_duckdb('backup.db');

-- Read a specific table
SELECT * FROM read_duckdb('backup.db', 'orders');

Merging Multiple Files

-- Merge orders tables from all daily_*.duckdb files
SELECT * FROM read_duckdb('daily_*.duckdb', 'orders');

-- Recursively read all .duckdb files in subdirectories
SELECT * FROM read_duckdb('archive/**/*.duckdb', 'metrics');

Combining with SQL Operations

-- Merged query with filters
SELECT order_id, customer_id, total_amount
FROM read_duckdb('daily_*.duckdb', 'orders')
WHERE order_date >= '2026-06-01'
  AND total_amount > 100;

-- Aggregation statistics
SELECT 
    DATE_TRUNC('day', order_date) AS day,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value
FROM read_duckdb('daily_*.duckdb', 'orders')
GROUP BY DATE_TRUNC('day', order_date)
ORDER BY day;

Python Integration in Production

In real projects, you’ll typically use DuckDB’s Python bindings:

import duckdb
import glob

# Method 1: Direct SQL usage
conn = duckdb.connect()
result = conn.execute("""
    SELECT 
        _source_file AS file_source,
        COUNT(*) AS order_count,
        SUM(total_amount) AS total_revenue
    FROM read_duckdb('daily_reports/*.duckdb', 'orders')
    GROUP BY _source_file
    ORDER BY order_count DESC
""").fetchdf()

print(result)

Key Feature: The _source_file Pseudo-Column

When using read_duckdb() to read multiple files, DuckDB automatically adds a column named _source_file that records which file each row came from. This is invaluable for debugging and data lineage:

SELECT 
    _source_file,
    order_id,
    customer_id,
    total_amount,
    order_date
FROM read_duckdb('daily_*.duckdb', 'orders')
LIMIT 10;

Example output:

┌───────────────────────────┬──────────┬─────────────┬──────────────┬────────────┐
│       _source_file        │ order_id │ customer_id │ total_amount │ order_date │
├───────────────────────────┼──────────┼─────────────┼──────────────┼────────────┤
│ daily_20260601.duckdb     │ 10001    │ C-5023      │ 259.99       │ 2026-06-01 │
│ daily_20260601.duckdb     │ 10002    │ C-7891      │ 89.50        │ 2026-06-01 │
│ daily_20260602.duckdb     │ 10003    │ C-3344      │ 432.00       │ 2026-06-02 │
│ daily_20260602.duckdb     │ 10004    │ C-1122      │ 67.80        │ 2026-06-02 │
└───────────────────────────┴──────────┴─────────────┴──────────────┴────────────┘

Performance Tuning: Make Merging Fly

1. Predicate Pushdown

read_duckdb() supports predicate pushdown, meaning WHERE conditions are passed to each file’s query, reading only matching data:

-- Only files with data after 2026-06-15 will be scanned
SELECT * FROM read_duckdb('daily_*.duckdb', 'events')
WHERE event_time >= '2026-06-15';

2. Verify Optimization with EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS monthly_revenue
FROM read_duckdb('monthly_*.duckdb', 'orders')
WHERE order_date BETWEEN '2026-01-01' AND '2026-06-30'
GROUP BY DATE_TRUNC('month', order_date);

You’ll see in the execution plan that each file’s read is optimized to scan only relevant partitions.

3. Control Concurrent Reads

import duckdb

# Set maximum concurrent file reads
conn = duckdb.connect(config={
    'max_threads': 8,
    'temp_directory': '/tmp/duckdb_temp'
})

result = conn.execute("""
    SELECT * FROM read_duckdb('large_set/*.duckdb', 'data')
""").fetchdf()

4. External Sort for Large Datasets

When merged data exceeds memory, leverage DuckDB’s external sorting:

SET GLOBAL memory_limit = '4GB';
SET GLOBAL temp_directory = '/tmp/duckdb_temp';

SELECT * FROM read_duckdb('archive/*.duckdb', 'transactions')
ORDER BY transaction_date DESC
LIMIT 1000;

Comparison with Traditional Approaches

DimensionPython Loop + pandasread_duckdb()
Code Lines10–20 lines1–3 lines of SQL
Memory EfficiencyFull load into memoryLazy read + pushdown
FilteringFilter after readingPredicate pushdown to file level
Error RecoveryManual exception handlingUnified SQL handling
Learning CurveRequires Python programmingSQL knowledge only
Parallel ProcessingManual multi-threadingAutomatic parallel scanning
Debug TracingExtra logic neededAutomatic _source_file column

Real-World Scenarios: Building Data Products

Scenario 1: Competitor Price Monitoring SaaS

Build an automated competitor price monitoring system using read_duckdb():

import duckdb
from datetime import datetime, timedelta

class PriceMonitor:
    def __init__(self, db_dir="price_data"):
        self.conn = duckdb.connect(
            config={'temp_directory': '/tmp/duckdb_monitor'}
        )
        self.db_dir = db_dir
    
    def get_price_trend(self, product_id, days=30):
        """Get price trend for a product over the last N days"""
        cutoff = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
        
        query = f"""
        SELECT 
            _source_file,
            price,
            competitor,
            timestamp
        FROM read_duckdb('{self.db_dir}/*.duckdb', 'prices')
        WHERE product_id = '{product_id}'
          AND timestamp >= '{cutoff}'
        ORDER BY timestamp
        """
        
        return self.conn.execute(query).fetchdf()
    
    def get_best_deal(self, category, days=7):
        """Find the best price for a category in the last N days"""
        query = f"""
        SELECT 
            competitor,
            AVG(price) AS avg_price,
            MIN(price) AS min_price,
            MAX(price) AS max_price,
            COUNT(*) AS sample_count
        FROM read_duckdb('{self.db_dir}/*.duckdb', 'prices')
        WHERE category = '{category}'
        GROUP BY competitor
        ORDER BY avg_price ASC
        LIMIT 5
        """
        
        return self.conn.execute(query).fetchdf()

# Usage example
monitor = PriceMonitor()
trend = monitor.get_price_trend("SKU-12345")
best_deals = monitor.get_best_deal("electronics")

The business model is simple: collect competitor price data from various e-commerce platforms, store it in independent DuckDB files by day, then provide API access for price trend queries and optimal purchase channels. Charge $7–28/month per subscriber.

Scenario 2: Automated Weekly Report Generator

import duckdb

def generate_weekly_report():
    """Auto-generate weekly sales reports"""
    conn = duckdb.connect()
    
    report = conn.execute("""
        WITH daily_stats AS (
            SELECT 
                _source_file AS report_date,
                COUNT(DISTINCT customer_id) AS unique_customers,
                COUNT(*) AS total_orders,
                SUM(total_amount) AS total_revenue,
                AVG(total_amount) AS avg_order_value
            FROM read_duckdb('weekly_sales/*.duckdb', 'orders')
            WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAY
            GROUP BY _source_file
        )
        SELECT 
            report_date,
            unique_customers,
            total_orders,
            total_revenue,
            avg_order_value,
            LAG(total_revenue) OVER (ORDER BY report_date) AS prev_day_revenue,
            ROUND(
                (total_revenue - LAG(total_revenue) OVER (ORDER BY report_date)) 
                / NULLIF(LAG(total_revenue) OVER (ORDER BY report_date), 0) * 100, 2
            ) AS revenue_change_pct
        FROM daily_stats
        ORDER BY report_date
    """).fetchdf()
    
    return report

report = generate_weekly_report()
print(report.to_markdown(index=False))

Scenario 3: Data Quality Audit Tool

-- Check data consistency across multiple files
SELECT 
    _source_file,
    COUNT(*) AS row_count,
    COUNT(DISTINCT id) AS unique_ids,
    COUNT(*) - COUNT(DISTINCT id) AS duplicates,
    COUNT(CASE WHEN email IS NULL THEN 1 END) AS missing_emails,
    COUNT(CASE WHEN phone IS NULL THEN 1 END) AS missing_phones
FROM read_duckdb('customer_db/*.duckdb', 'customers')
GROUP BY _source_file
HAVING duplicates > 0 OR missing_emails > 100
ORDER BY row_count DESC;

Comparison with Similar Functions

Besides read_duckdb(), DuckDB provides other file reading functions:

FunctionPurposeUse Case
read_duckdb()Read .duckdb filesMerge homogeneous database files
read_csv_auto()Read CSV filesText data sources
read_parquet()Read Parquet filesLarge columnar datasets
read_json_auto()Read JSON filesSemi-structured data
read_ndjson()Read NDJSON filesStreaming JSON data

The unique advantage of read_duckdb() is that it reads DuckDB’s internal format directly, preserving full type information and index structures — something other formats cannot match.

Important Notes and Best Practices

1. Version Compatibility

read_duckdb() requires DuckDB ≥ 1.5.0. Upgrade if needed:

pip install --upgrade duckdb

2. File Path Conventions

  • Use relative paths relative to DuckDB’s working directory
  • On Windows, escape backslashes or use raw strings
  • * in glob patterns matches files in the current directory; ** matches recursively in subdirectories

3. Security Considerations

# ⚠️ Don't directly concatenate user input into file paths
# Unsafe
pattern = f"{user_input}/*.duckdb"

# ✅ Use whitelist validation
import pathlib
allowed_dir = pathlib.Path("/data/reports")
user_path = pathlib.Path(user_input)
if user_path.resolve().is_relative_to(allowed_dir.resolve()):
    pattern = str(user_path) + "/*.duckdb"

4. Performance Tuning Checklist

  • For merges exceeding 100 files, set max_threads to control concurrency
  • Use temp_directory for external sort scratch space
  • Create appropriate indexes on files; read_duckdb() leverages internal file indexes
  • Avoid calling read_duckdb() on the same file multiple times in one query

Monetization Paths

While read_duckdb() seems like a small feature, it solves a real and frequent need — multi-file data merging. Here are actionable monetization directions:

Direction 1: Data Integration as a Service (DIaaS)

Provide cross-system data integration for SMEs. Many companies use different tools to collect data (CRM, ERP, e-commerce platforms), each exporting independent .duckdb files. You can merge them in minutes using read_duckdb() and deliver visualized analysis reports.

Pricing: One-time integration $70–280, monthly data monitoring service $140–700/month.

Direction 2: Automated Reporting SaaS

Build automated weekly/monthly reporting systems based on read_duckdb(). Clients generate independent database files daily; you merge and analyze via scheduled tasks, auto-generating charts and insights.

Tech stack: DuckDB + Streamlit/Evidence + cron jobs = a complete reporting product.

Pricing: Basic $14/month, Pro $42/month.

Direction 3: Data Audit Tool

Serve data teams with cross-file data quality audits. Check data consistency, duplicate records, missing values, and auto-generate audit reports.

Pricing: Per-file-count pricing, $14–70 per run.

Direction 4: Education Courses

Package read_duckdb() techniques into courses teaching data analysts how to efficiently handle multi-file scenarios. Such practical tutorials are popular on knowledge payment platforms.

Pricing: Single course $28–70, bundled series $140.

Summary

read_duckdb() is one of the most practical features in DuckDB 1.5+. It reduces what used to be multiple lines of Python code for multi-file merging into a single SQL line:

SELECT * FROM read_duckdb('data_*.duckdb', 'orders');

More importantly, it delivers real performance advantages — predicate pushdown, lazy reading, automatic parallelism — that traditional Python loop solutions struggle to match.

If you’re dealing with multi-file data merging scenarios, stop writing verbose Python loops. Try read_duckdb() and you’ll see a qualitative leap in both speed and data quality.


📖 More DuckDB practical functions and real-world cases → duckdblab.org

💡 More DuckDB advanced techniques → duckdblab.org

Olap Studio · Focused on DuckDB Practical Skills · 2026-06-21

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.