DuckDB Data Lake: Building Production-Grade Analytics with Iceberg and Delta Lake
TL;DR: DuckDB’s extension ecosystem now supports Apache Iceberg and Delta Lake natively. This guide shows you how to build a production-grade data lakehouse architecture that combines DuckDB’s speed with open table formats for ACID transactions and time travel.
Why Data Lakehouse?
Traditional data architecture separates storage and compute:
┌─────────────────────────────────────────────────┐
│ Traditional Architecture │
│ │
│ [Data Lake] ──> [ETL Pipeline] ──> [DW] │
│ (S3/GCS) (Spark/Airflow) (Redshift) │
│ │
│ Problems: │
│ • Data duplication (lake → warehouse) │
│ • Complex ETL pipelines │
│ • Schema enforcement at load time │
│ • No ACID transactions │
└─────────────────────────────────────────────────┘
The data lakehouse approach eliminates this complexity:
┌─────────────────────────────────────────────────┐
│ Lakehouse Architecture │
│ │
│ [DuckDB] ──> [Apache Iceberg / Delta Lake] ──> │
│ (Analytics) (S3/GCS + Metadata) │
│ │
│ Benefits: │
│ • Single source of truth │
│ • ACID transactions │
│ • Time travel │
│ • Schema evolution │
│ • Open table format │
└─────────────────────────────────────────────────┘
Apache Iceberg with DuckDB
Installation
# Install DuckDB Iceberg extension
duckdb -c "INSTALL iceberg;"
duckdb -c "LOAD iceberg;"
# Or in Python
pip install duckdb
python -c "
import duckdb
con = duckdb.connect()
con.execute('INSTALL iceberg FROM community')
con.execute('LOAD iceberg')
"
Connecting to S3
-- Configure S3 credentials
CREATE SECRET (
TYPE S3,
KEY_ID 'YOUR_AWS_KEY',
SECRET 'YOUR_AWS_SECRET',
REGION 'us-east-1'
);
-- List tables in an Iceberg catalog
SHOW TABLES IN iceberg ('s3://my-bucket/catalog/');
Reading Iceberg Tables
-- Query Iceberg table directly from S3
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spent
FROM read_iceberg('s3://my-bucket/data/orders/*.parquet')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;
Time Travel Queries
-- Query table as of a specific timestamp
SELECT * FROM iceberg_scan(
's3://my-bucket/data/orders/',
{
'snapshot_timestamp': '2026-06-20 10:00:00'
}
);
-- Query by branch
SELECT * FROM iceberg_scan(
's3://my-bucket/data/orders/',
{
'branch': 'feature-branch'
}
);
-- Query by commit ID
SELECT * FROM iceberg_scan(
's3://my-bucket/data/orders/',
{
'commit_id': 'abc123def456'
}
);
Schema Evolution
-- Add new column to existing table
ALTER TABLE iceberg_scan('s3://my-bucket/data/orders/')
ADD COLUMN loyalty_tier VARCHAR;
-- Modify column type
ALTER TABLE iceberg_scan('s3://my-bucket/data/orders/')
ALTER COLUMN amount TYPE DECIMAL(18,4);
-- Rename column
ALTER TABLE iceberg_scan('s3://my-bucket/data/orders/')
RENAME COLUMN customer_id TO user_id;
Writing to Iceberg
-- Create new Iceberg table
CREATE TABLE iceberg_scan('s3://my-bucket/data/new_orders/') AS
SELECT
order_id,
customer_id,
amount,
order_date,
status
FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
WHERE order_date >= '2026-01-01';
-- Append data
INSERT INTO iceberg_scan('s3://my-bucket/data/orders/')
SELECT * FROM read_parquet('s3://my-bucket/raw/new_orders/*.parquet');
-- Merge (upsert)
MERGE INTO iceberg_scan('s3://my-bucket/data/orders/') AS target
USING read_parquet('s3://my-bucket/raw/updates/*.parquet') AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
amount = source.amount,
status = source.status
WHEN NOT MATCHED THEN INSERT VALUES
(source.order_id, source.customer_id, source.amount, source.order_date, source.status);
Delta Lake with DuckDB
Installation
# Install Delta Lake extension
duckdb -c "INSTALL delta;"
duckdb -c "LOAD delta;"
Connecting to Delta Tables
-- Read Delta table from S3
SELECT * FROM delta_scan('s3://my-bucket/data/delta-orders/');
-- Query with time travel
SELECT * FROM delta_scan('s3://my-bucket/data/delta-orders/', {
'version': 42 -- Query by version
});
SELECT * FROM delta_scan('s3://my-bucket/data/delta-orders/', {
'timestamp': '2026-06-20' -- Query by timestamp
});
Writing Delta Tables
-- Create Delta table
CREATE TABLE delta_scan('s3://my-bucket/data/delta-products/') AS
SELECT
product_id,
name,
category,
price,
stock_quantity
FROM read_parquet('s3://my-bucket/raw/products/*.parquet');
-- Append to Delta table
INSERT INTO delta_scan('s3://my-bucket/data/delta-orders/')
SELECT * FROM read_parquet('s3://my-bucket/raw/new_orders/*.parquet');
-- Merge in Delta
CALL delta_merge(
's3://my-bucket/data/delta-orders/',
read_parquet('s3://my-bucket/raw/updates/*.parquet'),
'source.order_id = target.order_id',
['UPDATE SET amount = source.amount, status = source.status'],
['INSERT VALUES (source.order_id, source.customer_id, source.amount, source.order_date, source.status)']
);
Compaction and Optimization
-- Compact small files
CALL delta_optimize('s3://my-bucket/data/delta-orders/');
-- Vacuum old versions
CALL delta_vacuum('s3://my-bucket/data/delta-orders/', 0);
-- Z-order optimization
CALL delta_zorder('s3://my-bucket/data/delta-orders/', 'customer_id, order_date');
Production Architecture
Data Lake Structure
s3://my-company-data-lake/
├── raw/ # Raw data (immutable)
│ ├── orders/
│ ├── products/
│ ├── events/
│ └── inventory/
├── curated/ # Curated tables (Iceberg/Delta)
│ ├── customers/
│ ├── sales/
│ └── analytics/
├── staging/ # Temporary processing
│ ├── etl-temp/
│ └── ml-features/
└── archives/ # Archived data
└── historical/
ETL Pipeline with DuckDB
import duckdb
import boto3
from datetime import datetime
class DataLakePipeline:
def __init__(self, bucket='my-company-data-lake'):
self.bucket = bucket
self.s3 = boto3.client('s3')
self.con = duckdb.connect(':memory:')
self.con.execute('INSTALL iceberg FROM community')
self.con.execute('LOAD iceberg')
def ingest_raw_data(self, source_path, target_path):
"""Ingest raw data from various sources."""
# Read from CSV
self.con.execute(f"""
COPY (
SELECT * FROM read_csv_auto('{source_path}')
) TO '{target_path}' (FORMAT PARQUET)
""")
def transform_curated(self, raw_table, curated_table, transformations):
"""Apply transformations and write to curated layer."""
self.con.execute(f"""
CREATE OR REPLACE TABLE {curated_table} AS
SELECT {transformations}
FROM {raw_table}
""")
def run_analytics(self, query):
"""Run analytical queries on curated data."""
return self.con.execute(query).fetchdf()
# Usage
pipeline = DataLakePipeline()
pipeline.ingest_raw_data(
's3://my-company-data-lake/raw/orders/*.csv',
's3://my-company-data-lake/curated/orders/'
)
results = pipeline.run_analytics("""
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as revenue,
COUNT(*) as orders
FROM iceberg_scan('s3://my-company-data-lake/curated/orders/')
GROUP BY 1
ORDER BY 1 DESC
""")
Incremental Updates
-- Upsert new records
MERGE INTO iceberg_scan('s3://my-bucket/curated/orders/') AS target
USING (
SELECT * FROM read_parquet('s3://my-bucket/raw/new_orders/2026-06-27/*.parquet')
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
amount = source.amount,
status = source.status,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT VALUES
(source.order_id, source.customer_id, source.amount, source.order_date, source.status);
Performance Optimization
File Size Optimization
-- DuckDB automatically optimizes Parquet file sizes
-- But you can control it explicitly:
COPY (
SELECT * FROM iceberg_scan('s3://my-bucket/raw/orders/')
) TO 's3://my-bucket/curated/orders/' (
FILE_FORMAT PARQUET,
COMPRESSION ZSTD,
PER_THREAD_OUTPUT TRUE,
MAX_PARQUET_ROW_GROUP_SIZE 1000000
);
Partitioning Strategy
-- Partition by date for efficient filtering
CREATE TABLE iceberg_scan('s3://my-bucket/curated/orders/') AS
SELECT * FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
PARTITION BY (order_date);
-- Query with partition pruning
SELECT * FROM iceberg_scan('s3://my-bucket/curated/orders/')
WHERE order_date >= '2026-06-01' AND order_date < '2026-07-01';
-- Only scans June partitions
Caching Strategy
-- Enable result caching
SET enable_result_cache = TRUE;
SET result_cache_size = '2GB';
-- Cache frequently accessed tables
CREATE CACHE TABLE cached_monthly_sales AS
SELECT
DATE_TRUNC('month', order_date) as month,
category,
SUM(amount) as revenue
FROM iceberg_scan('s3://my-bucket/curated/orders/')
GROUP BY 1, 2;
Cost Comparison
| Metric | DuckDB + Iceberg | Traditional DW (Snowflake/BigQuery) |
|---|---|---|
| Compute cost | $0.01/GB scanned | $5-10/GB scanned |
| Storage cost | $0.023/GB/month (S3) | $0.023/GB/month |
| ETL complexity | Low (SQL only) | High (Spark/Airflow) |
| Time travel | Built-in | Additional cost |
| Schema evolution | Built-in | Limited |
| Concurrency | Limited (single process) | High (distributed) |
Best Practices
1. Use Parquet as the Storage Format
-- Always use Parquet for optimal compression and query performance
COPY data TO 's3://bucket/data/' (FORMAT PARQUET, COMPRESSION ZSTD);
2. Implement Data Quality Checks
-- Validate data before ingesting
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT order_id) as unique_orders,
COUNT(CASE WHEN amount <= 0 THEN 1 END) as invalid_amounts,
COUNT(CASE WHEN order_date > CURRENT_DATE THEN 1 END) as future_dates
FROM read_parquet('s3://bucket/raw/orders/*.parquet');
3. Set Up Lifecycle Policies
# S3 lifecycle policy for raw data
import boto3
s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
Bucket='my-company-data-lake',
LifecycleConfiguration={
'Rules': [
{
'ID': 'archive-raw-data',
'Filter': {'Prefix': 'raw/'},
'Status': 'Enabled',
'Transitions': [
{'Days': 90, 'StorageClass': 'GLACIER'},
{'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
]
}
]
}
)
4. Monitor Query Performance
-- Track query execution times
SELECT
query,
avg_time_ms,
total_runs,
p99_time_ms
FROM query_log
WHERE database = 'analytics'
GROUP BY 1, 2, 3, 4
ORDER BY avg_time_ms DESC;
Conclusion
DuckDB’s native support for Apache Iceberg and Delta Lake makes it an excellent choice for building production-grade data lakehouses. The combination of:
- DuckDB’s speed — columnar analytics at blazing speeds
- Iceberg/Delta’s ACID guarantees — reliable transactions and time travel
- S3/GCS’s scalability — petabyte-scale storage at low cost
- Open table formats — vendor-neutral, portable data
Creates a powerful, cost-effective analytics platform that eliminates the need for expensive cloud data warehouses.
For most organizations, starting with DuckDB + Iceberg on S3 provides the best balance of performance, cost, and flexibility.
This guide assumes basic familiarity with SQL and cloud storage. For production deployments, refer to the official Apache Iceberg and Delta Lake documentation.