Featured image of post DuckDB + Apache Iceberg: From Query to Write — A Practical Data Lake Guide

DuckDB + Apache Iceberg: From Query to Write — A Practical Data Lake Guide

DuckDB v1.5+ brings full Apache Iceberg support — query, write, time travel, and Unity Catalog integration. This hands-on guide walks through creating Iceberg tables, ACID transactions, MERGE operations, and building automated data pipelines, all from your laptop.

1. Why DuckDB + Iceberg?

Apache Iceberg has become the de facto standard for open table formats in the data lakehouse ecosystem. It provides ACID transactions, time travel, schema evolution, and partition pruning — enterprise features that power the world’s largest data platforms.

However, the traditional path to Iceberg requires heavy infrastructure: Spark clusters, Trino workers, or Flink pipelines. DuckDB changes this equation entirely.

Starting with v1.5.0, DuckDB’s iceberg extension delivers not just high-performance reads, but full write capabilities — INSERT, UPDATE, DELETE, and MERGE. Combined with Unity Catalog and AWS Glue Catalog integration, your laptop is now a viable Iceberg lakehouse development environment.

DuckDB vs Traditional Iceberg Engines

FeatureDuckDBApache SparkTrinoFlink
DeploymentZero (embedded)Cluster requiredCluster requiredCluster required
Startup time< 1s3-5 min10-30s1-2 min
Iceberg writes✅ v1.5+❌ Read-only
Unity Catalog
Memory256MB+8GB+4GB+4GB+
SQL supportFullFullPartialPartial
Python nativePySpark only
Single-node TB❌ (needs cluster)

Sources: DuckDB v1.5.2 official docs and community benchmarks.


2. Environment Setup

Install DuckDB and Load the Iceberg Extension

# Install DuckDB CLI (v1.5.2 or later)
curl https://install.duckdb.org | sh

# Or via Python
pip install duckdb

Start DuckDB and load the Iceberg extension:

-- Install the Iceberg extension (community maintained)
INSTALL iceberg FROM community;
LOAD iceberg;

-- Verify installation
SELECT version();

Generate Sample Data

-- Create mock sales data for Iceberg write testing
CREATE TABLE raw_sales AS
SELECT 
    range AS order_id,
    '2026-0' || (range % 9 + 1)::VARCHAR AS month,
    (random() * 1000)::INTEGER AS customer_id,
    CASE WHEN random() < 0.3 THEN 'Electronics'
         WHEN random() < 0.6 THEN 'Clothing'
         ELSE 'Household'
    END AS category,
    (random() * 5000 + 10)::DECIMAL(10,2) AS amount,
    DATE '2026-01-01' + INTERVAL (range % 365) DAY AS order_date
FROM range(1, 100000);

-- Quick stats
SELECT count(*) AS total_orders,
       round(sum(amount)) AS total_revenue
FROM raw_sales;

3. Creating Iceberg Tables and Writing Data

3.1 Local Iceberg Table

-- Attach a local Iceberg database (file-based)
ATTACH 'sales_iceberg' AS sales_db (TYPE iceberg);

USE sales_db;

-- Create a partitioned table (by month)
CREATE TABLE orders (
    order_id INTEGER,
    month VARCHAR,
    customer_id INTEGER,
    category VARCHAR,
    amount DECIMAL(10,2),
    order_date DATE
) PARTITION_BY (month);

-- Bulk insert
INSERT INTO orders 
SELECT * FROM raw_sales;

-- Verify
SELECT month, count(*) AS orders, round(sum(amount)) AS revenue
FROM orders
GROUP BY month
ORDER BY month;

3.2 ACID Transactions and Time Travel

One of Iceberg’s killer features is snapshot isolation and time travel:

-- View Iceberg snapshot history
SELECT snapshot_id, parent_id, timestamp, manifest_list
FROM iceberg_snapshots('sales_iceberg/orders');
-- Start a transaction: update electronics pricing
BEGIN TRANSACTION;

UPDATE orders 
SET amount = amount * 1.1
WHERE category = 'Electronics';

-- Preview changes
SELECT category, round(sum(amount)) AS revenue
FROM orders
WHERE category = 'Electronics'
GROUP BY category;

COMMIT;
-- ⏱ Time travel: query the snapshot before the update
-- Get historical snapshot IDs
SELECT snapshot_id, timestamp 
FROM iceberg_snapshots('sales_iceberg/orders')
ORDER BY timestamp DESC;

-- Query a specific version
SELECT category, round(sum(amount)) AS revenue
FROM orders
FOR SYSTEM_VERSION AS OF 1234567890
WHERE category = 'Electronics'
GROUP BY category;
-- Rollback to a specific snapshot
ALTER TABLE orders ROLLBACK TO 1234567890;

3.3 MERGE INTO (Upsert)

Iceberg supports full MERGE (upsert) operations:

-- Create incremental update data
CREATE TABLE daily_updates AS
SELECT order_id, '2026-05' AS month, 
       customer_id, 'Electronics' AS category,
       amount * 1.2 AS amount, order_date
FROM raw_sales
WHERE order_id <= 100;

-- Execute MERGE
MERGE INTO orders t
USING daily_updates s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET 
    amount = s.amount,
    category = s.category
WHEN NOT MATCHED THEN INSERT 
    (order_id, month, customer_id, category, amount, order_date)
VALUES 
    (s.order_id, s.month, s.customer_id, s.category, s.amount, s.order_date);

4. Integrating with Unity Catalog and Glue Catalog

DuckDB connects to enterprise catalogs via the REST Catalog interface:

-- Connect to Unity Catalog (requires UC endpoint)
ATTACH 'uc:my_catalog.my_schema' AS uc_db 
    (TYPE uc, 
     endpoint 'https://your-uc-instance/api/2.1/unity-catalog',
     token 'your_token_here');

-- Query Iceberg tables in Unity Catalog
SELECT * FROM uc_db.sales_region_iceberg
WHERE region = 'APAC'
LIMIT 100;
-- Connect to AWS Glue Catalog
ATTACH 'glue:my_database' AS glue_db 
    (TYPE glue, 
     region 'us-east-1');

-- Query Iceberg tables in Glue
SELECT year, count(*) AS flights
FROM glue_db.flights_iceberg
WHERE year >= 2024
GROUP BY year
ORDER BY year DESC;

Enterprise Integration Comparison

FeatureLocal FileAWS GlueUnity Catalog
Setup complexityLowMediumHigh
CostFreePer-table feePer-CU fee
Multi-engine sharingLimited
Access controlNoneIAMRBAC
Data lineageNone
Cross-region replicationManual

5. Performance Optimization & Best Practices

5.1 Partition Strategy

-- Good partitioning: low-cardinality fields (month, region)
CREATE TABLE orders_partitioned (
    order_id INTEGER,
    month VARCHAR,
    region VARCHAR,
    amount DECIMAL(10,2)
) PARTITION_BY (month, region);

-- Avoid: high-cardinality partitions (order_id, customer_id)
-- These create thousands of tiny files and kill performance

5.2 File Compaction

Frequent writes generate many small files. Compaction is essential:

-- Check current file count
SELECT count(*) AS file_count,
       round(sum(file_size_in_bytes) / 1024 / 1024) AS total_mb
FROM iceberg_files('sales_iceberg/orders');

-- Rewrite small files into larger ones
CALL iceberg_rewrite_data_files(
    'sales_iceberg/orders',
    strategy => 'binpack',
    target_bytes_per_file => '134217728'  -- 128MB
);

5.3 Iceberg-Optimized SQL

-- DuckDB leverages Iceberg's partition pruning automatically
EXPLAIN ANALYZE
SELECT month, round(sum(amount)) AS revenue
FROM orders
WHERE month IN ('2026-01', '2026-02', '2026-03')
GROUP BY month;
-- DuckDB skips irrelevant partitions

6. Comparison with Traditional Data Warehouses

DimensionDuckDB + IcebergSnowflakeAmazon RedshiftClickHouse
CostStorage only (S3 ~$23/TB/mo)$2-4/credit$0.25/hr+Free self-hosted
Query speedSQLite-level startupSecondsSecondsMilliseconds
Open format✅ Iceberg/Parquet❌ Proprietary❌ Proprietary❌ Proprietary
Local dev✅ Zero dependencyPartial
ACID✅ Iceberg-guaranteed
Data lake compat✅ NativeLimitedLimited
CI/CD integration✅ Embedded

Key insight: DuckDB + Iceberg excels in these scenarios:

  • Data lake dev/test environments (replacing expensive Spark clusters)
  • Small-to-medium analytical pipelines (< 100GB)
  • Teams needing open format portability
  • Cost-sensitive organizations

7. Advanced: Building Automated Iceberg Pipelines

Here’s a complete Python automation script:

import duckdb
import pandas as pd
from datetime import datetime

def ingest_to_iceberg(csv_path: str, table_path: str, partition_col: str):
    """
    Auto-ingest CSV data into Iceberg with compaction
    """
    con = duckdb.connect()
    
    # Load extensions
    con.execute("INSTALL iceberg FROM community; LOAD iceberg;")
    
    # Attach Iceberg database
    con.execute(f"ATTACH '{table_path}' AS db (TYPE iceberg)")
    
    # Read CSV, add ingest date, write to Iceberg
    con.execute(f"""
        CREATE OR REPLACE TABLE db.raw_data AS 
        SELECT *, '{datetime.now().strftime('%Y-%m-%d')}' AS ingest_date
        FROM read_csv_auto('{csv_path}')
    """)
    
    # Create partitioned table
    con.execute(f"""
        CREATE OR REPLACE TABLE db.partitioned_data 
        PARTITION_BY ({partition_col}) AS
        SELECT * FROM db.raw_data
    """)
    
    # Compact small files
    con.execute(f"CALL iceberg_rewrite_data_files('{table_path}/partitioned_data', 'binpack', 134217728)")
    
    # Stats
    rows = con.execute(f"SELECT count(*) FROM db.partitioned_data").fetchone()[0]
    print(f"✅ Successfully ingested {rows} rows into {table_path}")
    
    con.close()

# Usage
ingest_to_iceberg(
    csv_path='/data/sales_2026.csv',
    table_path='s3://my-bucket/iceberg/sales',
    partition_col='month'
)

8. Monetization Strategies 💰

Mastering DuckDB + Iceberg opens several monetization avenues:

1. Data Lake Migration Consulting

  • Target: SMBs migrating from Spark/Flink to lightweight alternatives
  • Service: Migrate existing data pipelines to DuckDB + Iceberg
  • Pricing: $500-$3,000 per migration, depending on data volume
  • Deliverables: Migration plan + automation scripts + performance report

2. Productize Your Pipeline Template

  • Product: CLI tool or SaaS wrapping the Python automation above
  • Pricing: $49/month subscription, includes auto-compaction, monitoring, alerts
  • Target: Data engineers in small-to-medium teams

3. Training & Workshops

  • Course: “DuckDB + Iceberg: Building Enterprise Data Lakes on a Laptop”
  • Pricing: $199 (recorded) or $499 (live with Q&A)
  • Curriculum: Iceberg fundamentals → DuckDB operations → Catalog integration → Pipeline automation → Production deployment

4. Open-Source Tooling

  • Build a DuckDB Iceberg management GUI (like pgAdmin for Iceberg)
  • Monetize via GitHub Sponsors or enterprise features

5. Performance Tuning Consulting

  • Diagnose poorly performing Iceberg deployments
  • Services: File layout analysis → Partition strategy optimization → Query rewriting
  • Rate: $200/hour

Architecture Overview

Summary

The DuckDB + Apache Iceberg combination democratizes data lake technology. Previously the domain of Spark clusters and specialized infrastructure, Iceberg tables can now be created, queried, and maintained from a single process running on any machine — from a developer laptop to a CI/CD runner to a production server.

Three rules to remember:

  1. Partition wisely: Choose low-cardinality fields for partitioning
  2. Compact regularly: Run iceberg_rewrite_data_files after batch writes
  3. Leverage schema evolution: Iceberg’s schema evolution lets you add/rename/drop columns without downtime — use it aggressively

Your laptop might just be the most cost-effective data lake on the planet.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.