
The Pain Point That Plagues Countless Data Workers
Imagine you’re a data engineer or indie developer who needs to process a large volume of DuckDB database files every day. Common scenarios include:
- Daily-sharded business databases: Each
daily_20260601.duckdbfile stores one day of order data - Multi-project consolidated analysis: Three teams each maintain their own
.duckdbfile, and you need cross-project comparisons - Backup file retrieval: Hundreds of historical backups in
backup/directory, need to quickly find a specific record - Data lake archive queries: Cold data stored as independent DuckDB files, needs temporary aggregation analysis
Before read_duckdb(), your solution looked like this:
import duckdb
import glob
results = []
for f in glob.glob("data_*.duckdb"):
conn = duckdb.connect(f)
results.append(conn.execute("SELECT * FROM orders").fetchdf())
conn.close()
merged = pd.concat(results)
The problems with this code are obvious:
- Each file requires a connect-close cycle — high overhead, easy to leak resources
- All data loads into memory — OOM when facing hundreds of GB of files
- No predicate pushdown — must read everything before filtering
- Verbose — a simple merge operation takes 10+ lines of Python
read_duckdb() changes all of this.
What Is read_duckdb()?
read_duckdb() is a built-in function introduced in DuckDB 1.5.0 (Variegata) that allows you to read table data from .duckdb files directly in SQL, with glob wildcard support for multiple files.
Its core syntax is straightforward:
read_duckdb(file_pattern [, table_name])
file_pattern: File path pattern, supports*and?wildcardstable_name(optional): The table to read; if omitted, returns all tables in the file
Basic Usage: Simple to Complex
Reading a Single File
-- Read the first table from backup.db
SELECT * FROM read_duckdb('backup.db');
-- Read a specific table
SELECT * FROM read_duckdb('backup.db', 'orders');
Merging Multiple Files
-- Merge orders tables from all daily_*.duckdb files
SELECT * FROM read_duckdb('daily_*.duckdb', 'orders');
-- Recursively read all .duckdb files in subdirectories
SELECT * FROM read_duckdb('archive/**/*.duckdb', 'metrics');
Combining with SQL Operations
-- Merged query with filters
SELECT order_id, customer_id, total_amount
FROM read_duckdb('daily_*.duckdb', 'orders')
WHERE order_date >= '2026-06-01'
AND total_amount > 100;
-- Aggregation statistics
SELECT
DATE_TRUNC('day', order_date) AS day,
COUNT(*) AS order_count,
SUM(total_amount) AS revenue,
AVG(total_amount) AS avg_order_value
FROM read_duckdb('daily_*.duckdb', 'orders')
GROUP BY DATE_TRUNC('day', order_date)
ORDER BY day;
Python Integration in Production
In real projects, you’ll typically use DuckDB’s Python bindings:
import duckdb
import glob
# Method 1: Direct SQL usage
conn = duckdb.connect()
result = conn.execute("""
SELECT
_source_file AS file_source,
COUNT(*) AS order_count,
SUM(total_amount) AS total_revenue
FROM read_duckdb('daily_reports/*.duckdb', 'orders')
GROUP BY _source_file
ORDER BY order_count DESC
""").fetchdf()
print(result)
Key Feature: The _source_file Pseudo-Column
When using read_duckdb() to read multiple files, DuckDB automatically adds a column named _source_file that records which file each row came from. This is invaluable for debugging and data lineage:
SELECT
_source_file,
order_id,
customer_id,
total_amount,
order_date
FROM read_duckdb('daily_*.duckdb', 'orders')
LIMIT 10;
Example output:
┌───────────────────────────┬──────────┬─────────────┬──────────────┬────────────┐
│ _source_file │ order_id │ customer_id │ total_amount │ order_date │
├───────────────────────────┼──────────┼─────────────┼──────────────┼────────────┤
│ daily_20260601.duckdb │ 10001 │ C-5023 │ 259.99 │ 2026-06-01 │
│ daily_20260601.duckdb │ 10002 │ C-7891 │ 89.50 │ 2026-06-01 │
│ daily_20260602.duckdb │ 10003 │ C-3344 │ 432.00 │ 2026-06-02 │
│ daily_20260602.duckdb │ 10004 │ C-1122 │ 67.80 │ 2026-06-02 │
└───────────────────────────┴──────────┴─────────────┴──────────────┴────────────┘
Performance Tuning: Make Merging Fly
1. Predicate Pushdown
read_duckdb() supports predicate pushdown, meaning WHERE conditions are passed to each file’s query, reading only matching data:
-- Only files with data after 2026-06-15 will be scanned
SELECT * FROM read_duckdb('daily_*.duckdb', 'events')
WHERE event_time >= '2026-06-15';
2. Verify Optimization with EXPLAIN ANALYZE
EXPLAIN ANALYZE
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS monthly_revenue
FROM read_duckdb('monthly_*.duckdb', 'orders')
WHERE order_date BETWEEN '2026-01-01' AND '2026-06-30'
GROUP BY DATE_TRUNC('month', order_date);
You’ll see in the execution plan that each file’s read is optimized to scan only relevant partitions.
3. Control Concurrent Reads
import duckdb
# Set maximum concurrent file reads
conn = duckdb.connect(config={
'max_threads': 8,
'temp_directory': '/tmp/duckdb_temp'
})
result = conn.execute("""
SELECT * FROM read_duckdb('large_set/*.duckdb', 'data')
""").fetchdf()
4. External Sort for Large Datasets
When merged data exceeds memory, leverage DuckDB’s external sorting:
SET GLOBAL memory_limit = '4GB';
SET GLOBAL temp_directory = '/tmp/duckdb_temp';
SELECT * FROM read_duckdb('archive/*.duckdb', 'transactions')
ORDER BY transaction_date DESC
LIMIT 1000;
Comparison with Traditional Approaches
| Dimension | Python Loop + pandas | read_duckdb() |
|---|---|---|
| Code Lines | 10–20 lines | 1–3 lines of SQL |
| Memory Efficiency | Full load into memory | Lazy read + pushdown |
| Filtering | Filter after reading | Predicate pushdown to file level |
| Error Recovery | Manual exception handling | Unified SQL handling |
| Learning Curve | Requires Python programming | SQL knowledge only |
| Parallel Processing | Manual multi-threading | Automatic parallel scanning |
| Debug Tracing | Extra logic needed | Automatic _source_file column |
Real-World Scenarios: Building Data Products
Scenario 1: Competitor Price Monitoring SaaS
Build an automated competitor price monitoring system using read_duckdb():
import duckdb
from datetime import datetime, timedelta
class PriceMonitor:
def __init__(self, db_dir="price_data"):
self.conn = duckdb.connect(
config={'temp_directory': '/tmp/duckdb_monitor'}
)
self.db_dir = db_dir
def get_price_trend(self, product_id, days=30):
"""Get price trend for a product over the last N days"""
cutoff = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
query = f"""
SELECT
_source_file,
price,
competitor,
timestamp
FROM read_duckdb('{self.db_dir}/*.duckdb', 'prices')
WHERE product_id = '{product_id}'
AND timestamp >= '{cutoff}'
ORDER BY timestamp
"""
return self.conn.execute(query).fetchdf()
def get_best_deal(self, category, days=7):
"""Find the best price for a category in the last N days"""
query = f"""
SELECT
competitor,
AVG(price) AS avg_price,
MIN(price) AS min_price,
MAX(price) AS max_price,
COUNT(*) AS sample_count
FROM read_duckdb('{self.db_dir}/*.duckdb', 'prices')
WHERE category = '{category}'
GROUP BY competitor
ORDER BY avg_price ASC
LIMIT 5
"""
return self.conn.execute(query).fetchdf()
# Usage example
monitor = PriceMonitor()
trend = monitor.get_price_trend("SKU-12345")
best_deals = monitor.get_best_deal("electronics")
The business model is simple: collect competitor price data from various e-commerce platforms, store it in independent DuckDB files by day, then provide API access for price trend queries and optimal purchase channels. Charge $7–28/month per subscriber.
Scenario 2: Automated Weekly Report Generator
import duckdb
def generate_weekly_report():
"""Auto-generate weekly sales reports"""
conn = duckdb.connect()
report = conn.execute("""
WITH daily_stats AS (
SELECT
_source_file AS report_date,
COUNT(DISTINCT customer_id) AS unique_customers,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value
FROM read_duckdb('weekly_sales/*.duckdb', 'orders')
WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY _source_file
)
SELECT
report_date,
unique_customers,
total_orders,
total_revenue,
avg_order_value,
LAG(total_revenue) OVER (ORDER BY report_date) AS prev_day_revenue,
ROUND(
(total_revenue - LAG(total_revenue) OVER (ORDER BY report_date))
/ NULLIF(LAG(total_revenue) OVER (ORDER BY report_date), 0) * 100, 2
) AS revenue_change_pct
FROM daily_stats
ORDER BY report_date
""").fetchdf()
return report
report = generate_weekly_report()
print(report.to_markdown(index=False))
Scenario 3: Data Quality Audit Tool
-- Check data consistency across multiple files
SELECT
_source_file,
COUNT(*) AS row_count,
COUNT(DISTINCT id) AS unique_ids,
COUNT(*) - COUNT(DISTINCT id) AS duplicates,
COUNT(CASE WHEN email IS NULL THEN 1 END) AS missing_emails,
COUNT(CASE WHEN phone IS NULL THEN 1 END) AS missing_phones
FROM read_duckdb('customer_db/*.duckdb', 'customers')
GROUP BY _source_file
HAVING duplicates > 0 OR missing_emails > 100
ORDER BY row_count DESC;
Comparison with Similar Functions
Besides read_duckdb(), DuckDB provides other file reading functions:
| Function | Purpose | Use Case |
|---|---|---|
read_duckdb() | Read .duckdb files | Merge homogeneous database files |
read_csv_auto() | Read CSV files | Text data sources |
read_parquet() | Read Parquet files | Large columnar datasets |
read_json_auto() | Read JSON files | Semi-structured data |
read_ndjson() | Read NDJSON files | Streaming JSON data |
The unique advantage of read_duckdb() is that it reads DuckDB’s internal format directly, preserving full type information and index structures — something other formats cannot match.
Important Notes and Best Practices
1. Version Compatibility
read_duckdb() requires DuckDB ≥ 1.5.0. Upgrade if needed:
pip install --upgrade duckdb
2. File Path Conventions
- Use relative paths relative to DuckDB’s working directory
- On Windows, escape backslashes or use raw strings
*in glob patterns matches files in the current directory;**matches recursively in subdirectories
3. Security Considerations
# ⚠️ Don't directly concatenate user input into file paths
# Unsafe
pattern = f"{user_input}/*.duckdb"
# ✅ Use whitelist validation
import pathlib
allowed_dir = pathlib.Path("/data/reports")
user_path = pathlib.Path(user_input)
if user_path.resolve().is_relative_to(allowed_dir.resolve()):
pattern = str(user_path) + "/*.duckdb"
4. Performance Tuning Checklist
- For merges exceeding 100 files, set
max_threadsto control concurrency - Use
temp_directoryfor external sort scratch space - Create appropriate indexes on files;
read_duckdb()leverages internal file indexes - Avoid calling
read_duckdb()on the same file multiple times in one query
Monetization Paths
While read_duckdb() seems like a small feature, it solves a real and frequent need — multi-file data merging. Here are actionable monetization directions:
Direction 1: Data Integration as a Service (DIaaS)
Provide cross-system data integration for SMEs. Many companies use different tools to collect data (CRM, ERP, e-commerce platforms), each exporting independent .duckdb files. You can merge them in minutes using read_duckdb() and deliver visualized analysis reports.
Pricing: One-time integration $70–280, monthly data monitoring service $140–700/month.
Direction 2: Automated Reporting SaaS
Build automated weekly/monthly reporting systems based on read_duckdb(). Clients generate independent database files daily; you merge and analyze via scheduled tasks, auto-generating charts and insights.
Tech stack: DuckDB + Streamlit/Evidence + cron jobs = a complete reporting product.
Pricing: Basic $14/month, Pro $42/month.
Direction 3: Data Audit Tool
Serve data teams with cross-file data quality audits. Check data consistency, duplicate records, missing values, and auto-generate audit reports.
Pricing: Per-file-count pricing, $14–70 per run.
Direction 4: Education Courses
Package read_duckdb() techniques into courses teaching data analysts how to efficiently handle multi-file scenarios. Such practical tutorials are popular on knowledge payment platforms.
Pricing: Single course $28–70, bundled series $140.
Summary
read_duckdb() is one of the most practical features in DuckDB 1.5+. It reduces what used to be multiple lines of Python code for multi-file merging into a single SQL line:
SELECT * FROM read_duckdb('data_*.duckdb', 'orders');
More importantly, it delivers real performance advantages — predicate pushdown, lazy reading, automatic parallelism — that traditional Python loop solutions struggle to match.
If you’re dealing with multi-file data merging scenarios, stop writing verbose Python loops. Try read_duckdb() and you’ll see a qualitative leap in both speed and data quality.
📖 More DuckDB practical functions and real-world cases → duckdblab.org
💡 More DuckDB advanced techniques → duckdblab.org
Olap Studio · Focused on DuckDB Practical Skills · 2026-06-21