DuckDB + Apache Iceberg: From Query to Write — A Practical Data Lake Guide

1. Why DuckDB + Iceberg?

Apache Iceberg has become the de facto standard for open table formats in the data lakehouse ecosystem. It provides ACID transactions, time travel, schema evolution, and partition pruning — enterprise features that power the world’s largest data platforms.

However, the traditional path to Iceberg requires heavy infrastructure: Spark clusters, Trino workers, or Flink pipelines. DuckDB changes this equation entirely.

Starting with v1.5.0, DuckDB’s iceberg extension delivers not just high-performance reads, but full write capabilities — INSERT, UPDATE, DELETE, and MERGE. Combined with Unity Catalog and AWS Glue Catalog integration, your laptop is now a viable Iceberg lakehouse development environment.

DuckDB vs Traditional Iceberg Engines

Feature	DuckDB	Apache Spark	Trino	Flink
Deployment	Zero (embedded)	Cluster required	Cluster required	Cluster required
Startup time	< 1s	3-5 min	10-30s	1-2 min
Iceberg writes	✅ v1.5+	✅	❌ Read-only	✅
Unity Catalog	✅	✅	✅	❌
Memory	256MB+	8GB+	4GB+	4GB+
SQL support	Full	Full	Partial	Partial
Python native	✅	PySpark only	❌	❌
Single-node TB	✅	❌ (needs cluster)	❌	❌

Sources: DuckDB v1.5.2 official docs and community benchmarks.

2. Environment Setup

Install DuckDB and Load the Iceberg Extension

# Install DuckDB CLI (v1.5.2 or later)
curl https://install.duckdb.org | sh

# Or via Python
pip install duckdb

Start DuckDB and load the Iceberg extension:

-- Install the Iceberg extension (community maintained)
INSTALL iceberg FROM community;
LOAD iceberg;

-- Verify installation
SELECT version();

Generate Sample Data

-- Create mock sales data for Iceberg write testing
CREATE TABLE raw_sales AS
SELECT 
    range AS order_id,
    '2026-0' || (range % 9 + 1)::VARCHAR AS month,
    (random() * 1000)::INTEGER AS customer_id,
    CASE WHEN random() < 0.3 THEN 'Electronics'
         WHEN random() < 0.6 THEN 'Clothing'
         ELSE 'Household'
    END AS category,
    (random() * 5000 + 10)::DECIMAL(10,2) AS amount,
    DATE '2026-01-01' + INTERVAL (range % 365) DAY AS order_date
FROM range(1, 100000);

-- Quick stats
SELECT count(*) AS total_orders,
       round(sum(amount)) AS total_revenue
FROM raw_sales;

3. Creating Iceberg Tables and Writing Data

3.1 Local Iceberg Table

-- Attach a local Iceberg database (file-based)
ATTACH 'sales_iceberg' AS sales_db (TYPE iceberg);

USE sales_db;

-- Create a partitioned table (by month)
CREATE TABLE orders (
    order_id INTEGER,
    month VARCHAR,
    customer_id INTEGER,
    category VARCHAR,
    amount DECIMAL(10,2),
    order_date DATE
) PARTITION_BY (month);

-- Bulk insert
INSERT INTO orders 
SELECT * FROM raw_sales;

-- Verify
SELECT month, count(*) AS orders, round(sum(amount)) AS revenue
FROM orders
GROUP BY month
ORDER BY month;

3.2 ACID Transactions and Time Travel

One of Iceberg’s killer features is snapshot isolation and time travel:

-- View Iceberg snapshot history
SELECT snapshot_id, parent_id, timestamp, manifest_list
FROM iceberg_snapshots('sales_iceberg/orders');

-- Start a transaction: update electronics pricing
BEGIN TRANSACTION;

UPDATE orders 
SET amount = amount * 1.1
WHERE category = 'Electronics';

-- Preview changes
SELECT category, round(sum(amount)) AS revenue
FROM orders
WHERE category = 'Electronics'
GROUP BY category;

COMMIT;

-- ⏱ Time travel: query the snapshot before the update
-- Get historical snapshot IDs
SELECT snapshot_id, timestamp 
FROM iceberg_snapshots('sales_iceberg/orders')
ORDER BY timestamp DESC;

-- Query a specific version
SELECT category, round(sum(amount)) AS revenue
FROM orders
FOR SYSTEM_VERSION AS OF 1234567890
WHERE category = 'Electronics'
GROUP BY category;

-- Rollback to a specific snapshot
ALTER TABLE orders ROLLBACK TO 1234567890;

3.3 MERGE INTO (Upsert)

Iceberg supports full MERGE (upsert) operations:

-- Create incremental update data
CREATE TABLE daily_updates AS
SELECT order_id, '2026-05' AS month, 
       customer_id, 'Electronics' AS category,
       amount * 1.2 AS amount, order_date
FROM raw_sales
WHERE order_id <= 100;

-- Execute MERGE
MERGE INTO orders t
USING daily_updates s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET 
    amount = s.amount,
    category = s.category
WHEN NOT MATCHED THEN INSERT 
    (order_id, month, customer_id, category, amount, order_date)
VALUES 
    (s.order_id, s.month, s.customer_id, s.category, s.amount, s.order_date);

4. Integrating with Unity Catalog and Glue Catalog

DuckDB connects to enterprise catalogs via the REST Catalog interface:

-- Connect to Unity Catalog (requires UC endpoint)
ATTACH 'uc:my_catalog.my_schema' AS uc_db 
    (TYPE uc, 
     endpoint 'https://your-uc-instance/api/2.1/unity-catalog',
     token 'your_token_here');

-- Query Iceberg tables in Unity Catalog
SELECT * FROM uc_db.sales_region_iceberg
WHERE region = 'APAC'
LIMIT 100;

-- Connect to AWS Glue Catalog
ATTACH 'glue:my_database' AS glue_db 
    (TYPE glue, 
     region 'us-east-1');

-- Query Iceberg tables in Glue
SELECT year, count(*) AS flights
FROM glue_db.flights_iceberg
WHERE year >= 2024
GROUP BY year
ORDER BY year DESC;

Enterprise Integration Comparison

Feature	Local File	AWS Glue	Unity Catalog
Setup complexity	Low	Medium	High
Cost	Free	Per-table fee	Per-CU fee
Multi-engine sharing	Limited	✅	✅
Access control	None	IAM	RBAC
Data lineage	None	✅	✅
Cross-region replication	Manual	✅	✅

5. Performance Optimization & Best Practices

5.1 Partition Strategy

-- Good partitioning: low-cardinality fields (month, region)
CREATE TABLE orders_partitioned (
    order_id INTEGER,
    month VARCHAR,
    region VARCHAR,
    amount DECIMAL(10,2)
) PARTITION_BY (month, region);

-- Avoid: high-cardinality partitions (order_id, customer_id)
-- These create thousands of tiny files and kill performance

5.2 File Compaction

Frequent writes generate many small files. Compaction is essential:

-- Check current file count
SELECT count(*) AS file_count,
       round(sum(file_size_in_bytes) / 1024 / 1024) AS total_mb
FROM iceberg_files('sales_iceberg/orders');

-- Rewrite small files into larger ones
CALL iceberg_rewrite_data_files(
    'sales_iceberg/orders',
    strategy => 'binpack',
    target_bytes_per_file => '134217728'  -- 128MB
);

5.3 Iceberg-Optimized SQL

-- DuckDB leverages Iceberg's partition pruning automatically
EXPLAIN ANALYZE
SELECT month, round(sum(amount)) AS revenue
FROM orders
WHERE month IN ('2026-01', '2026-02', '2026-03')
GROUP BY month;
-- DuckDB skips irrelevant partitions

6. Comparison with Traditional Data Warehouses

Dimension	DuckDB + Iceberg	Snowflake	Amazon Redshift	ClickHouse
Cost	Storage only (S3 ~$23/TB/mo)	$2-4/credit	$0.25/hr+	Free self-hosted
Query speed	SQLite-level startup	Seconds	Seconds	Milliseconds
Open format	✅ Iceberg/Parquet	❌ Proprietary	❌ Proprietary	❌ Proprietary
Local dev	✅ Zero dependency	❌	❌	Partial
ACID	✅ Iceberg-guaranteed	✅	✅	❌
Data lake compat	✅ Native	Limited	Limited	❌
CI/CD integration	✅ Embedded	❌	❌	❌

Key insight: DuckDB + Iceberg excels in these scenarios:

Data lake dev/test environments (replacing expensive Spark clusters)
Small-to-medium analytical pipelines (< 100GB)
Teams needing open format portability
Cost-sensitive organizations

7. Advanced: Building Automated Iceberg Pipelines

Here’s a complete Python automation script:

import duckdb
import pandas as pd
from datetime import datetime

def ingest_to_iceberg(csv_path: str, table_path: str, partition_col: str):
    """
    Auto-ingest CSV data into Iceberg with compaction
    """
    con = duckdb.connect()
    
    # Load extensions
    con.execute("INSTALL iceberg FROM community; LOAD iceberg;")
    
    # Attach Iceberg database
    con.execute(f"ATTACH '{table_path}' AS db (TYPE iceberg)")
    
    # Read CSV, add ingest date, write to Iceberg
    con.execute(f"""
        CREATE OR REPLACE TABLE db.raw_data AS 
        SELECT *, '{datetime.now().strftime('%Y-%m-%d')}' AS ingest_date
        FROM read_csv_auto('{csv_path}')
    """)
    
    # Create partitioned table
    con.execute(f"""
        CREATE OR REPLACE TABLE db.partitioned_data 
        PARTITION_BY ({partition_col}) AS
        SELECT * FROM db.raw_data
    """)
    
    # Compact small files
    con.execute(f"CALL iceberg_rewrite_data_files('{table_path}/partitioned_data', 'binpack', 134217728)")
    
    # Stats
    rows = con.execute(f"SELECT count(*) FROM db.partitioned_data").fetchone()[0]
    print(f"✅ Successfully ingested {rows} rows into {table_path}")
    
    con.close()

# Usage
ingest_to_iceberg(
    csv_path='/data/sales_2026.csv',
    table_path='s3://my-bucket/iceberg/sales',
    partition_col='month'
)

8. Monetization Strategies 💰

Mastering DuckDB + Iceberg opens several monetization avenues:

1. Data Lake Migration Consulting

Target: SMBs migrating from Spark/Flink to lightweight alternatives
Service: Migrate existing data pipelines to DuckDB + Iceberg
Pricing: $500-$3,000 per migration, depending on data volume
Deliverables: Migration plan + automation scripts + performance report

2. Productize Your Pipeline Template

Product: CLI tool or SaaS wrapping the Python automation above
Pricing: $49/month subscription, includes auto-compaction, monitoring, alerts
Target: Data engineers in small-to-medium teams

3. Training & Workshops

Course: “DuckDB + Iceberg: Building Enterprise Data Lakes on a Laptop”
Pricing: $199 (recorded) or $499 (live with Q&A)
Curriculum: Iceberg fundamentals → DuckDB operations → Catalog integration → Pipeline automation → Production deployment

4. Open-Source Tooling

Build a DuckDB Iceberg management GUI (like pgAdmin for Iceberg)
Monetize via GitHub Sponsors or enterprise features

5. Performance Tuning Consulting

Diagnose poorly performing Iceberg deployments
Services: File layout analysis → Partition strategy optimization → Query rewriting
Rate: $200/hour

Summary

The DuckDB + Apache Iceberg combination democratizes data lake technology. Previously the domain of Spark clusters and specialized infrastructure, Iceberg tables can now be created, queried, and maintained from a single process running on any machine — from a developer laptop to a CI/CD runner to a production server.

Three rules to remember:

Partition wisely: Choose low-cardinality fields for partitioning
Compact regularly: Run iceberg_rewrite_data_files after batch writes
Leverage schema evolution: Iceberg’s schema evolution lets you add/rename/drop columns without downtime — use it aggressively

Your laptop might just be the most cost-effective data lake on the planet.