DuckDB Data Lake: Building Production-Grade Analytics with Iceberg and Delta Lake

DuckDB Data Lake: Building Production-Grade Analytics with Iceberg and Delta Lake

TL;DR: DuckDB’s extension ecosystem now supports Apache Iceberg and Delta Lake natively. This guide shows you how to build a production-grade data lakehouse architecture that combines DuckDB’s speed with open table formats for ACID transactions and time travel.

Why Data Lakehouse?

Traditional data architecture separates storage and compute:

┌─────────────────────────────────────────────────┐
│              Traditional Architecture             │
│                                                  │
│  [Data Lake] ──> [ETL Pipeline] ──> [DW]        │
│   (S3/GCS)          (Spark/Airflow)   (Redshift) │
│                                                  │
│  Problems:                                      │
│  • Data duplication (lake → warehouse)          │
│  • Complex ETL pipelines                        │
│  • Schema enforcement at load time              │
│  • No ACID transactions                         │
└─────────────────────────────────────────────────┘

The data lakehouse approach eliminates this complexity:

┌─────────────────────────────────────────────────┐
│              Lakehouse Architecture               │
│                                                  │
│  [DuckDB] ──> [Apache Iceberg / Delta Lake] ──> │
│   (Analytics)    (S3/GCS + Metadata)              │
│                                                  │
│  Benefits:                                      │
│  • Single source of truth                       │
│  • ACID transactions                            │
│  • Time travel                                  │
│  • Schema evolution                             │
│  • Open table format                            │
└─────────────────────────────────────────────────┘

Apache Iceberg with DuckDB

Installation

# Install DuckDB Iceberg extension
duckdb -c "INSTALL iceberg;"
duckdb -c "LOAD iceberg;"

# Or in Python
pip install duckdb
python -c "
import duckdb
con = duckdb.connect()
con.execute('INSTALL iceberg FROM community')
con.execute('LOAD iceberg')
"

Connecting to S3

-- Configure S3 credentials
CREATE SECRET (
    TYPE S3,
    KEY_ID 'YOUR_AWS_KEY',
    SECRET 'YOUR_AWS_SECRET',
    REGION 'us-east-1'
);

-- List tables in an Iceberg catalog
SHOW TABLES IN iceberg ('s3://my-bucket/catalog/');

Reading Iceberg Tables

-- Query Iceberg table directly from S3
SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as total_spent
FROM read_iceberg('s3://my-bucket/data/orders/*.parquet')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;

Time Travel Queries

-- Query table as of a specific timestamp
SELECT * FROM iceberg_scan(
    's3://my-bucket/data/orders/',
    {
        'snapshot_timestamp': '2026-06-20 10:00:00'
    }
);

-- Query by branch
SELECT * FROM iceberg_scan(
    's3://my-bucket/data/orders/',
    {
        'branch': 'feature-branch'
    }
);

-- Query by commit ID
SELECT * FROM iceberg_scan(
    's3://my-bucket/data/orders/',
    {
        'commit_id': 'abc123def456'
    }
);

Schema Evolution

-- Add new column to existing table
ALTER TABLE iceberg_scan('s3://my-bucket/data/orders/') 
ADD COLUMN loyalty_tier VARCHAR;

-- Modify column type
ALTER TABLE iceberg_scan('s3://my-bucket/data/orders/') 
ALTER COLUMN amount TYPE DECIMAL(18,4);

-- Rename column
ALTER TABLE iceberg_scan('s3://my-bucket/data/orders/') 
RENAME COLUMN customer_id TO user_id;

Writing to Iceberg

-- Create new Iceberg table
CREATE TABLE iceberg_scan('s3://my-bucket/data/new_orders/') AS
SELECT 
    order_id,
    customer_id,
    amount,
    order_date,
    status
FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
WHERE order_date >= '2026-01-01';

-- Append data
INSERT INTO iceberg_scan('s3://my-bucket/data/orders/')
SELECT * FROM read_parquet('s3://my-bucket/raw/new_orders/*.parquet');

-- Merge (upsert)
MERGE INTO iceberg_scan('s3://my-bucket/data/orders/') AS target
USING read_parquet('s3://my-bucket/raw/updates/*.parquet') AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET 
    amount = source.amount,
    status = source.status
WHEN NOT MATCHED THEN INSERT VALUES 
    (source.order_id, source.customer_id, source.amount, source.order_date, source.status);

Delta Lake with DuckDB

Installation

# Install Delta Lake extension
duckdb -c "INSTALL delta;"
duckdb -c "LOAD delta;"

Connecting to Delta Tables

-- Read Delta table from S3
SELECT * FROM delta_scan('s3://my-bucket/data/delta-orders/');

-- Query with time travel
SELECT * FROM delta_scan('s3://my-bucket/data/delta-orders/', {
    'version': 42  -- Query by version
});

SELECT * FROM delta_scan('s3://my-bucket/data/delta-orders/', {
    'timestamp': '2026-06-20'  -- Query by timestamp
});

Writing Delta Tables

-- Create Delta table
CREATE TABLE delta_scan('s3://my-bucket/data/delta-products/') AS
SELECT 
    product_id,
    name,
    category,
    price,
    stock_quantity
FROM read_parquet('s3://my-bucket/raw/products/*.parquet');

-- Append to Delta table
INSERT INTO delta_scan('s3://my-bucket/data/delta-orders/')
SELECT * FROM read_parquet('s3://my-bucket/raw/new_orders/*.parquet');

-- Merge in Delta
CALL delta_merge(
    's3://my-bucket/data/delta-orders/',
    read_parquet('s3://my-bucket/raw/updates/*.parquet'),
    'source.order_id = target.order_id',
    ['UPDATE SET amount = source.amount, status = source.status'],
    ['INSERT VALUES (source.order_id, source.customer_id, source.amount, source.order_date, source.status)']
);

Compaction and Optimization

-- Compact small files
CALL delta_optimize('s3://my-bucket/data/delta-orders/');

-- Vacuum old versions
CALL delta_vacuum('s3://my-bucket/data/delta-orders/', 0);

-- Z-order optimization
CALL delta_zorder('s3://my-bucket/data/delta-orders/', 'customer_id, order_date');

Production Architecture

Data Lake Structure

s3://my-company-data-lake/
├── raw/                    # Raw data (immutable)
│   ├── orders/
│   ├── products/
│   ├── events/
│   └── inventory/
├── curated/                # Curated tables (Iceberg/Delta)
│   ├── customers/
│   ├── sales/
│   └── analytics/
├── staging/                # Temporary processing
│   ├── etl-temp/
│   └── ml-features/
└── archives/               # Archived data
    └── historical/

ETL Pipeline with DuckDB

import duckdb
import boto3
from datetime import datetime

class DataLakePipeline:
    def __init__(self, bucket='my-company-data-lake'):
        self.bucket = bucket
        self.s3 = boto3.client('s3')
        self.con = duckdb.connect(':memory:')
        self.con.execute('INSTALL iceberg FROM community')
        self.con.execute('LOAD iceberg')
    
    def ingest_raw_data(self, source_path, target_path):
        """Ingest raw data from various sources."""
        # Read from CSV
        self.con.execute(f"""
            COPY (
                SELECT * FROM read_csv_auto('{source_path}')
            ) TO '{target_path}' (FORMAT PARQUET)
        """)
    
    def transform_curated(self, raw_table, curated_table, transformations):
        """Apply transformations and write to curated layer."""
        self.con.execute(f"""
            CREATE OR REPLACE TABLE {curated_table} AS
            SELECT {transformations}
            FROM {raw_table}
        """)
    
    def run_analytics(self, query):
        """Run analytical queries on curated data."""
        return self.con.execute(query).fetchdf()

# Usage
pipeline = DataLakePipeline()
pipeline.ingest_raw_data(
    's3://my-company-data-lake/raw/orders/*.csv',
    's3://my-company-data-lake/curated/orders/'
)
results = pipeline.run_analytics("""
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        SUM(amount) as revenue,
        COUNT(*) as orders
    FROM iceberg_scan('s3://my-company-data-lake/curated/orders/')
    GROUP BY 1
    ORDER BY 1 DESC
""")

Incremental Updates

-- Upsert new records
MERGE INTO iceberg_scan('s3://my-bucket/curated/orders/') AS target
USING (
    SELECT * FROM read_parquet('s3://my-bucket/raw/new_orders/2026-06-27/*.parquet')
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
    amount = source.amount,
    status = source.status,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT VALUES
    (source.order_id, source.customer_id, source.amount, source.order_date, source.status);

Performance Optimization

File Size Optimization

-- DuckDB automatically optimizes Parquet file sizes
-- But you can control it explicitly:
COPY (
    SELECT * FROM iceberg_scan('s3://my-bucket/raw/orders/')
) TO 's3://my-bucket/curated/orders/' (
    FILE_FORMAT PARQUET,
    COMPRESSION ZSTD,
    PER_THREAD_OUTPUT TRUE,
    MAX_PARQUET_ROW_GROUP_SIZE 1000000
);

Partitioning Strategy

-- Partition by date for efficient filtering
CREATE TABLE iceberg_scan('s3://my-bucket/curated/orders/') AS
SELECT * FROM read_parquet('s3://my-bucket/raw/orders/*.parquet')
PARTITION BY (order_date);

-- Query with partition pruning
SELECT * FROM iceberg_scan('s3://my-bucket/curated/orders/')
WHERE order_date >= '2026-06-01' AND order_date < '2026-07-01';
-- Only scans June partitions

Caching Strategy

-- Enable result caching
SET enable_result_cache = TRUE;
SET result_cache_size = '2GB';

-- Cache frequently accessed tables
CREATE CACHE TABLE cached_monthly_sales AS
SELECT 
    DATE_TRUNC('month', order_date) as month,
    category,
    SUM(amount) as revenue
FROM iceberg_scan('s3://my-bucket/curated/orders/')
GROUP BY 1, 2;

Cost Comparison

Metric	DuckDB + Iceberg	Traditional DW (Snowflake/BigQuery)
Compute cost	$0.01/GB scanned	$5-10/GB scanned
Storage cost	$0.023/GB/month (S3)	$0.023/GB/month
ETL complexity	Low (SQL only)	High (Spark/Airflow)
Time travel	Built-in	Additional cost
Schema evolution	Built-in	Limited
Concurrency	Limited (single process)	High (distributed)

Best Practices

1. Use Parquet as the Storage Format

-- Always use Parquet for optimal compression and query performance
COPY data TO 's3://bucket/data/' (FORMAT PARQUET, COMPRESSION ZSTD);

2. Implement Data Quality Checks

-- Validate data before ingesting
SELECT 
    COUNT(*) as total_rows,
    COUNT(DISTINCT order_id) as unique_orders,
    COUNT(CASE WHEN amount <= 0 THEN 1 END) as invalid_amounts,
    COUNT(CASE WHEN order_date > CURRENT_DATE THEN 1 END) as future_dates
FROM read_parquet('s3://bucket/raw/orders/*.parquet');

3. Set Up Lifecycle Policies

# S3 lifecycle policy for raw data
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='my-company-data-lake',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-raw-data',
                'Filter': {'Prefix': 'raw/'},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 90, 'StorageClass': 'GLACIER'},
                    {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
                ]
            }
        ]
    }
)

4. Monitor Query Performance

-- Track query execution times
SELECT 
    query,
    avg_time_ms,
    total_runs,
    p99_time_ms
FROM query_log
WHERE database = 'analytics'
GROUP BY 1, 2, 3, 4
ORDER BY avg_time_ms DESC;

Conclusion

DuckDB’s native support for Apache Iceberg and Delta Lake makes it an excellent choice for building production-grade data lakehouses. The combination of:

DuckDB’s speed — columnar analytics at blazing speeds
Iceberg/Delta’s ACID guarantees — reliable transactions and time travel
S3/GCS’s scalability — petabyte-scale storage at low cost
Open table formats — vendor-neutral, portable data

Creates a powerful, cost-effective analytics platform that eliminates the need for expensive cloud data warehouses.

For most organizations, starting with DuckDB + Iceberg on S3 provides the best balance of performance, cost, and flexibility.

This guide assumes basic familiarity with SQL and cloud storage. For production deployments, refer to the official Apache Iceberg and Delta Lake documentation.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.