DuckDB + Apache Iceberg: From Query to Write — A Practical Data Lake Guide

DuckDB v1.5+ brings full Apache Iceberg support — query, write, time travel, and Unity Catalog integration. This hands-on guide walks through creating Iceberg tables, ACID transactions, MERGE operations, and building automated data pipelines, all from your laptop.

1. Why DuckDB + Iceberg?

Apache Iceberg has become the de facto standard for open table formats in the data lakehouse ecosystem. It provides ACID transactions, time travel, schema evolution, and partition pruning — enterprise features that power the world’s largest data platforms.

However, the traditional path to Iceberg requires heavy infrastructure: Spark clusters, Trino workers, or Flink pipelines. DuckDB changes this equation entirely.

Starting with v1.5.0, DuckDB’s iceberg extension delivers not just high-performance reads, but full write capabilities — INSERT, UPDATE, DELETE, and MERGE. Combined with Unity Catalog and AWS Glue Catalog integration, your laptop is now a viable Iceberg lakehouse development environment.

DuckDB vs Traditional Iceberg Engines

Feature | DuckDB | Apache Spark | Trino | Flink
Deployment | Zero (embedded) | Cluster required | Cluster required | Cluster required
Startup time | < 1s | 3-5 min | 10-30s | 1-2 min
Iceberg writes | ✅ v1.5+ | ❌ Read-only | |
Unity Catalog | | | |
Memory | 256MB+ | 8GB+ | 4GB+ | 4GB+
SQL support | Full | Full | Partial | Partial
Python native | | PySpark only | |
Single-node TB | | ❌ (needs cluster) | |

Sources: DuckDB v1.5.2 official docs and community benchmarks.


2. Environment Setup

Install DuckDB and Load the Iceberg Extension

# Install DuckDB CLI (v1.5.2 or later)
curl https://install.duckdb.org | sh

# Or via Python
pip install duckdb

Start DuckDB and load the Iceberg extension:

-- Install and load the Iceberg extension
INSTALL iceberg;
LOAD iceberg;

-- Verify the extension is installed and loaded
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'iceberg';

Generate Sample Data

-- Create mock sales data for Iceberg write testing
CREATE TABLE raw_sales AS
SELECT 
    range AS order_id,
    strftime(DATE '2026-01-01' + INTERVAL (range % 365) DAY, '%Y-%m') AS month,  -- matches order_date below
    (random() * 1000)::INTEGER AS customer_id,
    CASE WHEN random() < 0.3 THEN 'Electronics'
         WHEN random() < 0.6 THEN 'Clothing'
         ELSE 'Household'
    END AS category,
    (random() * 5000 + 10)::DECIMAL(10,2) AS amount,
    DATE '2026-01-01' + INTERVAL (range % 365) DAY AS order_date
FROM range(1, 100000);

-- Quick stats
SELECT count(*) AS total_orders,
       round(sum(amount)) AS total_revenue
FROM raw_sales;

3. Creating Iceberg Tables and Writing Data

3.1 Local Iceberg Table

-- Attach a local Iceberg database (file-based)
ATTACH 'sales_iceberg' AS sales_db (TYPE iceberg);

USE sales_db;

-- Create a partitioned table (by month)
CREATE TABLE orders (
    order_id INTEGER,
    month VARCHAR,
    customer_id INTEGER,
    category VARCHAR,
    amount DECIMAL(10,2),
    order_date DATE
) PARTITION_BY (month);

-- Bulk insert
INSERT INTO orders 
SELECT * FROM raw_sales;

-- Verify
SELECT month, count(*) AS orders, round(sum(amount)) AS revenue
FROM orders
GROUP BY month
ORDER BY month;

3.2 ACID Transactions and Time Travel

One of Iceberg’s killer features is snapshot isolation and time travel:

-- View Iceberg snapshot history
SELECT snapshot_id, parent_id, timestamp, manifest_list
FROM iceberg_snapshots('sales_iceberg/orders');
-- Start a transaction: update electronics pricing
BEGIN TRANSACTION;

UPDATE orders 
SET amount = amount * 1.1
WHERE category = 'Electronics';

-- Preview changes
SELECT category, round(sum(amount)) AS revenue
FROM orders
WHERE category = 'Electronics'
GROUP BY category;

COMMIT;
-- ⏱ Time travel: query the snapshot before the update
-- Get historical snapshot IDs
SELECT snapshot_id, timestamp 
FROM iceberg_snapshots('sales_iceberg/orders')
ORDER BY timestamp DESC;

-- Query a specific version
SELECT category, round(sum(amount)) AS revenue
FROM orders
FOR SYSTEM_VERSION AS OF 1234567890
WHERE category = 'Electronics'
GROUP BY category;
-- Rollback to a specific snapshot
ALTER TABLE orders ROLLBACK TO 1234567890;

3.3 MERGE INTO (Upsert)

Iceberg supports full MERGE (upsert) operations:

-- Create incremental update data
CREATE TABLE daily_updates AS
SELECT order_id, '2026-05' AS month, 
       customer_id, 'Electronics' AS category,
       amount * 1.2 AS amount, order_date
FROM raw_sales
WHERE order_id <= 100;

-- Execute MERGE
MERGE INTO orders t
USING daily_updates s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET 
    amount = s.amount,
    category = s.category
WHEN NOT MATCHED THEN INSERT 
    (order_id, month, customer_id, category, amount, order_date)
VALUES 
    (s.order_id, s.month, s.customer_id, s.category, s.amount, s.order_date);
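
DELETE goes through the same write path as INSERT, UPDATE, and MERGE. A minimal sketch reusing the orders table from above; the predicate is purely illustrative:

-- Row-level delete; Iceberg records it as a new snapshot
DELETE FROM orders
WHERE category = 'Household' AND amount < 50;

-- The removed rows stay reachable through older snapshots (see time travel in 3.2)
SELECT count(*) AS remaining
FROM orders
WHERE category = 'Household' AND amount < 50;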

4. Integrating with Unity Catalog and Glue Catalog

DuckDB connects to enterprise catalogs via the REST Catalog interface:

-- Connect to Unity Catalog (requires UC endpoint)
ATTACH 'uc:my_catalog.my_schema' AS uc_db 
    (TYPE uc, 
     endpoint 'https://your-uc-instance/api/2.1/unity-catalog',
     token 'your_token_here');

-- Query Iceberg tables in Unity Catalog
SELECT * FROM uc_db.sales_region_iceberg
WHERE region = 'APAC'
LIMIT 100;
-- Connect to AWS Glue Catalog
ATTACH 'glue:my_database' AS glue_db 
    (TYPE glue, 
     region 'us-east-1');

-- Query Iceberg tables in Glue
SELECT year, count(*) AS flights
FROM glue_db.flights_iceberg
WHERE year >= 2024
GROUP BY year
ORDER BY year DESC;

Enterprise Integration Comparison

Feature | Local File | AWS Glue | Unity Catalog
Setup complexity | Low | Medium | High
Cost | Free | Per-table fee | Per-CU fee
Multi-engine sharing | Limited | |
Access control | None | IAM | RBAC
Data lineage | None | |
Cross-region replication | Manual | |

5. Performance Optimization & Best Practices

5.1 Partition Strategy

-- Good partitioning: low-cardinality fields (month, region)
CREATE TABLE orders_partitioned (
    order_id INTEGER,
    month VARCHAR,
    region VARCHAR,
    amount DECIMAL(10,2)
) PARTITION_BY (month, region);

-- Avoid: high-cardinality partitions (order_id, customer_id)
-- These create thousands of tiny files and kill performance

5.2 File Compaction

Frequent writes generate many small files. Compaction is essential:

-- Check current file count
SELECT count(*) AS file_count,
       round(sum(file_size_in_bytes) / 1024 / 1024) AS total_mb
FROM iceberg_files('sales_iceberg/orders');

-- Rewrite small files into larger ones
CALL iceberg_rewrite_data_files(
    'sales_iceberg/orders',
    strategy => 'binpack',
    target_bytes_per_file => '134217728'  -- 128MB
);

5.3 Iceberg-Optimized SQL

-- DuckDB leverages Iceberg's partition pruning automatically
EXPLAIN ANALYZE
SELECT month, round(sum(amount)) AS revenue
FROM orders
WHERE month IN ('2026-01', '2026-02', '2026-03')
GROUP BY month;
-- DuckDB skips irrelevant partitions

6. Comparison with Traditional Data Warehouses

Dimension | DuckDB + Iceberg | Snowflake | Amazon Redshift | ClickHouse
Cost | Storage only (S3 ~$23/TB/mo) | $2-4/credit | $0.25/hr+ | Free self-hosted
Query speed | SQLite-level startup | Seconds | Seconds | Milliseconds
Open format | ✅ Iceberg/Parquet | ❌ Proprietary | ❌ Proprietary | ❌ Proprietary
Local dev | ✅ Zero dependency | | | Partial
ACID | ✅ Iceberg-guaranteed | | |
Data lake compat | ✅ Native | Limited | Limited |
CI/CD integration | ✅ Embedded | | |

Key insight: DuckDB + Iceberg excels in these scenarios:

  • Data lake dev/test environments (replacing expensive Spark clusters; see the sketch after this list)
  • Small-to-medium analytical pipelines (< 100GB)
  • Teams needing open format portability
  • Cost-sensitive organizations
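
For the dev/test scenario in particular, DuckDB can read an Iceberg table that another engine maintains without attaching any catalog, using the extension's iceberg_scan table function. A minimal sketch; the S3 path is a placeholder and assumes the httpfs extension and credentials are already configured:

-- Read an existing Iceberg table directly from its storage location (path is illustrative)
SELECT category, count(*) AS orders, round(sum(amount)) AS revenue
FROM iceberg_scan('s3://prod-lake/warehouse/sales/orders')
GROUP BY category
ORDER BY revenue DESC;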

7. Advanced: Building Automated Iceberg Pipelines

Here’s a complete Python automation script:

import duckdb
from datetime import datetime

def ingest_to_iceberg(csv_path: str, table_path: str, partition_col: str):
    """
    Auto-ingest CSV data into Iceberg with compaction
    """
    con = duckdb.connect()
    
    # Load extensions
    con.execute("INSTALL iceberg FROM community; LOAD iceberg;")
    
    # Attach Iceberg database
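    # Note: an s3:// table_path additionally requires the httpfs extension and
    # S3 credentials (e.g. via CREATE SECRET); assumed to be configured already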
    con.execute(f"ATTACH '{table_path}' AS db (TYPE iceberg)")
    
    # Read CSV, add ingest date, write to Iceberg
    con.execute(f"""
        CREATE OR REPLACE TABLE db.raw_data AS 
        SELECT *, '{datetime.now().strftime('%Y-%m-%d')}' AS ingest_date
        FROM read_csv_auto('{csv_path}')
    """)
    
    # Create partitioned table
    con.execute(f"""
        CREATE OR REPLACE TABLE db.partitioned_data 
        PARTITION_BY ({partition_col}) AS
        SELECT * FROM db.raw_data
    """)
    
    # Compact small files
    con.execute(f"CALL iceberg_rewrite_data_files('{table_path}/partitioned_data', 'binpack', 134217728)")
    
    # Stats
    rows = con.execute(f"SELECT count(*) FROM db.partitioned_data").fetchone()[0]
    print(f"✅ Successfully ingested {rows} rows into {table_path}")
    
    con.close()

# Usage
ingest_to_iceberg(
    csv_path='/data/sales_2026.csv',
    table_path='s3://my-bucket/iceberg/sales',
    partition_col='month'
)

8. Monetization Strategies 💰

Mastering DuckDB + Iceberg opens several monetization avenues:

1. Data Lake Migration Consulting

  • Target: SMBs migrating from Spark/Flink to lightweight alternatives
  • Service: Migrate existing data pipelines to DuckDB + Iceberg
  • Pricing: $500-$3,000 per migration, depending on data volume
  • Deliverables: Migration plan + automation scripts + performance report

2. Productize Your Pipeline Template

  • Product: CLI tool or SaaS wrapping the Python automation above
  • Pricing: $49/month subscription, includes auto-compaction, monitoring, alerts
  • Target: Data engineers in small-to-medium teams

3. Training & Workshops

  • Course: “DuckDB + Iceberg: Building Enterprise Data Lakes on a Laptop”
  • Pricing: $199 (recorded) or $499 (live with Q&A)
  • Curriculum: Iceberg fundamentals → DuckDB operations → Catalog integration → Pipeline automation → Production deployment

4. Open-Source Tooling

  • Build a DuckDB Iceberg management GUI (like pgAdmin for Iceberg)
  • Monetize via GitHub Sponsors or enterprise features

5. Performance Tuning Consulting

  • Diagnose poorly performing Iceberg deployments
  • Services: File layout analysis → Partition strategy optimization → Query rewriting
  • Rate: $200/hour

Summary

The DuckDB + Apache Iceberg combination democratizes data lake technology. Previously the domain of Spark clusters and specialized infrastructure, Iceberg tables can now be created, queried, and maintained from a single process running on any machine — from a developer laptop to a CI/CD runner to a production server.

Three rules to remember:

  1. Partition wisely: Choose low-cardinality fields for partitioning
  2. Compact regularly: Run iceberg_rewrite_data_files after batch writes
  3. Leverage schema evolution: Iceberg’s schema evolution lets you add/rename/drop columns without downtime — use it aggressively (see the sketch below)
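
A minimal sketch of rule 3 in SQL, assuming the standard ALTER TABLE statements are wired through to Iceberg schema evolution as the v1.5 write support implies (the channel column is illustrative):

-- Add a column; existing data files are untouched and read back as NULL for it
ALTER TABLE orders ADD COLUMN channel VARCHAR;

-- Rename a column; Iceberg tracks columns by ID, so no data rewrite is needed
ALTER TABLE orders RENAME COLUMN category TO product_category;

-- Drop a column that is no longer needed
ALTER TABLE orders DROP COLUMN channel;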

Your laptop might just be the most cost-effective data lake on the planet.