DuckLake v1.0: Lightweight Lakehouse Solution

A comprehensive introduction to DuckLake v1.0 — a lightweight lakehouse storage format built on DuckDB. Deep dive into COPY TO syntax, multi-client support, format comparison matrix, and monetization strategies.

Introduction

Over the past decade, the line between data warehouses and data lakes has blurred, giving rise to the Lakehouse architecture. Databricks’ Delta Lake, Apache Iceberg, and Apache Hudi have dominated the lakehouse format landscape — but they all share a common problem: they are too heavy.

Running any of these three formats typically requires Spark, Hive Metastore, HDFS or object storage, and a catalog service. For small-to-medium teams, this is not just a steep learning curve — it’s an operational nightmare.

DuckDB v1.5’s DuckLake format is designed to solve exactly this problem.

DuckLake is not a replacement for Parquet or Delta Lake. Rather, it provides a lightweight lakehouse format suited for embedded scenarios — no Spark, no Metastore, no catalog service. Just DuckDB handling everything from writes and queries to lifecycle management.

What is DuckLake?

DuckLake is a structured lakehouse storage format natively supported by DuckDB. Under the hood, it is a collection of Parquet files accompanied by metadata files, tracked through a Transaction Log that records every write operation. This provides ACID transactions, time-travel queries, and incremental read capabilities.
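
To make this mechanism concrete, here is a deliberately simplified Python toy — not DuckLake's actual file layout or implementation, just an illustration of how an append-only version log over immutable Parquet files naturally yields time travel and incremental reads:

# Toy model of a lakehouse version log -- illustrative only, NOT DuckLake's real format.
from dataclasses import dataclass, field

@dataclass
class ToyLake:
    # version number -> list of immutable data files added by that write
    log: dict[int, list[str]] = field(default_factory=dict)

    def commit(self, files: list[str]) -> int:
        """Append-only commit: register new files under the next version number."""
        version = max(self.log, default=0) + 1
        self.log[version] = files
        return version

    def files_at(self, version: int) -> list[str]:
        """Time travel: the table at version N is the union of all files up to N."""
        return [f for v in sorted(self.log) if v <= version for f in self.log[v]]

    def files_added_in(self, version: int) -> list[str]:
        """Incremental read: only the files written by version N."""
        return self.log.get(version, [])

lake = ToyLake()
lake.commit(["orders-000.parquet"])   # version 1
lake.commit(["orders-001.parquet"])   # version 2

print(lake.files_at(1))         # ['orders-000.parquet']
print(lake.files_at(2))         # ['orders-000.parquet', 'orders-001.parquet']
print(lake.files_added_in(2))   # ['orders-001.parquet']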

Core Features

  • Zero external dependencies: No Spark, Hive, HDFS, or catalog services required
  • ACID transactions: Concurrent write isolation through file-level optimistic locking
  • Schema evolution: Add/drop columns, modify types
  • Time travel: Query any historical version
  • Incremental queries: Read only newly written data files
  • Open format compatibility: Underlying data stored as Parquet — any Parquet reader can consume it

The name “DuckLake” is straightforward: Duck + Lake. DuckDB is the duck, and the lakehouse is the lake — putting the lake inside the duck neatly captures the essence of a lightweight lakehouse.

Installation & Setup

DuckLake ships as a built-in feature of DuckDB v1.5. No extension installation required:

-- Verify DuckDB version (requires v1.5+)
SELECT version();

-- Confirm DuckLake support
SELECT * FROM duckdb_extensions() WHERE extension_name = 'ducklake';

For Python users, it’s equally straightforward:

import duckdb

con = duckdb.connect()
# DuckLake is immediately available — no extra pip packages needed

COPY TO Syntax Deep Dive

The core write interface for DuckLake is the COPY TO statement. In v1.5, COPY TO has been significantly extended to support direct DuckLake format writes:

Basic Syntax

-- Write query results in DuckLake format
COPY (SELECT * FROM orders)
TO 'data/orders.ducklake'
(FORMAT DUCKLAKE, APPEND FALSE);

-- Append writes (create new version)
COPY (SELECT * FROM new_orders)
TO 'data/orders.ducklake'
(FORMAT DUCKLAKE, APPEND TRUE);

Key Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| FORMAT | Enum | None | Must be set to DUCKLAKE |
| APPEND | Boolean | FALSE | TRUE appends new data; FALSE overwrites the entire lake |
| COMPRESSION | Enum | ZSTD | Parquet compression codec: ZSTD / SNAPPY / LZ4 / UNCOMPRESSED |
| ROW_GROUP_SIZE | Integer | 122880 | Rows per row group |
| OVERWRITE_SCHEMA | Boolean | FALSE | Allows schema changes on append |
| PARTITION_BY | Column list | Empty | Partition storage by the specified columns |

Advanced Usage

-- Partitioned write with ZSTD compression
COPY (SELECT * FROM events WHERE year = 2026)
TO 'data/events.ducklake'
(FORMAT DUCKLAKE, PARTITION_BY (region, dt), COMPRESSION 'ZSTD');

-- Overwrite schema on append
COPY (SELECT id, name, email, signup_date FROM users_v2)
TO 'data/users.ducklake'
(FORMAT DUCKLAKE, APPEND TRUE, OVERWRITE_SCHEMA TRUE);

Reading DuckLake

-- Basic read (latest version)
SELECT * FROM 'data/orders.ducklake';

-- Time travel: read specific version
SELECT * FROM 'data/orders.ducklake' (VERSION 3);

-- Time travel: read at specific timestamp
SELECT * FROM 'data/orders.ducklake' (TIMESTAMP '2026-05-09 12:00:00');

-- View version history
SELECT * FROM ducklake_versions('data/orders.ducklake');
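
Incremental reads can also be approximated with plain SQL on top of the VERSION syntax above. Here is a minimal Python sketch, assuming the same 'data/orders.ducklake' path and the version numbers used in this section:

import duckdb

con = duckdb.connect()

# Rows present in version 3 but not in version 2, i.e. roughly "what was added in v3".
# EXCEPT is standard SQL; (VERSION n) is the time-travel parameter shown above.
new_rows = con.execute("""
    SELECT * FROM 'data/orders.ducklake' (VERSION 3)
    EXCEPT
    SELECT * FROM 'data/orders.ducklake' (VERSION 2)
""").df()

print(new_rows)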

Management Operations

-- Compact (merge small files)
CALL ducklake_compact('data/orders.ducklake');

-- Clean up expired versions
CALL ducklake_vacuum('data/orders.ducklake', KEEP_VERSIONS 10);

-- Get statistics
SELECT * FROM ducklake_stats('data/orders.ducklake');
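
For unattended deployments, these management calls can be bundled into a small scheduled maintenance job. A minimal sketch, using the ducklake_compact, ducklake_vacuum, and ducklake_stats calls shown above:

import duckdb

def maintain(path: str, keep_versions: int = 10) -> None:
    """Compact small files, drop expired versions, then print fresh statistics."""
    con = duckdb.connect()
    con.execute(f"CALL ducklake_compact('{path}')")
    con.execute(f"CALL ducklake_vacuum('{path}', KEEP_VERSIONS {keep_versions})")
    print(con.execute(f"SELECT * FROM ducklake_stats('{path}')").df())

# Run nightly from cron or any scheduler:
maintain('data/orders.ducklake', keep_versions=10)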

Multi-Client Support

A key strength of DuckLake is its broad ecosystem compatibility. Here’s how to access DuckLake from various clients:

Python (DuckDB + PyArrow)

import duckdb
import pandas as pd

con = duckdb.connect()

# Write DuckLake
con.execute("""
    COPY (SELECT * FROM range(1000000) t(id))
    TO 'test.ducklake' (FORMAT DUCKLAKE)
""")

# Read as Pandas DataFrame
df = con.execute(
    "SELECT * FROM 'test.ducklake'"
).df()

# Read as PyArrow Table
import pyarrow as pa
table = con.execute(
    "SELECT * FROM 'test.ducklake'"
).arrow()

R Language

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Read DuckLake
df <- tbl(con, "test.ducklake") %>%
  filter(id > 500000) %>%
  collect()

print(df)

Java / JDBC

// pom.xml: add the org.duckdb:duckdb_jdbc dependency
Connection conn = DriverManager.getConnection("jdbc:duckdb:");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(
  "SELECT count(*) FROM 'data/orders.ducklake'"
);
while (rs.next()) {
    System.out.println(rs.getLong(1));
}

Node.js

const duckdb = require('duckdb');
const db = new duckdb.Database(':memory:');
db.all("SELECT * FROM 'data/orders.ducklake' LIMIT 10",
  (err, rows) => {
    if (err) throw err;
    console.log(rows);
  }
);

Command-Line CLI

# Query directly via DuckDB CLI
duckdb -c "SELECT region, count(*) FROM 'data/sales.ducklake' GROUP BY region"

# Export to CSV
duckdb -c "COPY (SELECT * FROM 'data/sales.ducklake') TO 'export.csv' (HEADER TRUE)"

Parquet vs Delta Lake vs Iceberg vs DuckLake

Here is a comprehensive comparison matrix to help you make informed technology decisions:

| Dimension | Parquet | Delta Lake | Apache Iceberg | DuckLake v1.0 |
| --- | --- | --- | --- | --- |
| Type | Columnar file format | Lakehouse table format | Lakehouse table format | Lightweight lakehouse format |
| ACID transactions | ❌ Not supported | ✅ Optimistic concurrency | ✅ Optimistic concurrency | ✅ File-level optimistic lock |
| Schema evolution | ❌ Not supported | ✅ Supported | ✅ Supported | ✅ Supported |
| Time travel | ❌ Not supported | ✅ 30-day default | ✅ By snapshot | ✅ By version/timestamp |
| Incremental reads | ❌ Full scan | ✅ By version | ✅ By snapshot | ✅ By version |
| Partition pruning | ✅ Statistics-based | ✅ Partition trimming | ✅ Hidden partitioning | ✅ Partition pruning |
| File compaction | ❌ External tools | ✅ OPTIMIZE command | ✅ Rewrite operations | ducklake_compact |
| Metadata management | ❌ None | Hive Metastore / AWS Glue | Hive / REST / Nessie | No Metastore needed |
| Execution engines | Any engine | Spark / Flink / Trino / DuckDB | Spark / Flink / Trino / DuckDB | DuckDB native |
| CPU architecture | x86 / ARM / RISC-V | x86 / ARM | x86 / ARM | x86 / ARM / RISC-V |
| Embedded scenarios | ⚠️ Usable, no transactions | ❌ Too heavy | ❌ Too heavy | Born for it |
| External dependencies | None | Spark + Hive + HDFS | Spark + Hive + HDFS | Zero dependencies |
| Query performance | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Write performance | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Ecosystem maturity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ (rapidly evolving) |
| Open source license | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT |

Selection Guide

  • Already using Spark/Flink ecosystem → Stick with Delta Lake or Iceberg
  • Need only a file format → Use Parquet
  • Small-to-medium team wanting lakehouse capabilities without Spark → DuckLake is the best choice
  • Embedded or mobile data solutions → DuckLake (DuckDB’s embedded design is a natural fit)
  • Startup validating an idea quickly → DuckLake, with zero operational overhead

Complete Runnable SQL Example

Below is an end-to-end hands-on example simulating an e-commerce order analytics scenario:

-- ===========================================
-- DuckLake in Action: E-Commerce Order Analytics
-- ===========================================

-- 1. Prepare sample data
CREATE OR REPLACE TABLE raw_orders AS
SELECT * FROM (VALUES
  (1001, 'Alice',   'Electronics', 2999.00, '2026-05-01'::DATE),
  (1002, 'Bob',     'Clothing',     459.00, '2026-05-01'::DATE),
  (1003, 'Charlie', 'Food',          89.90, '2026-05-02'::DATE),
  (1004, 'Alice',   'Books',         79.00, '2026-05-03'::DATE),
  (1005, 'David',   'Electronics', 1599.00, '2026-05-03'::DATE),
  (1006, 'Bob',     'Food',         120.50, '2026-05-04'::DATE),
  (1007, 'Eve',     'Clothing',     899.00, '2026-05-04'::DATE),
  (1008, 'Charlie', 'Electronics', 4599.00, '2026-05-05'::DATE),
  (1009, 'Alice',   'Food',         210.00, '2026-05-06'::DATE),
  (1010, 'David',   'Books',        150.00, '2026-05-06'::DATE)
) t(order_id, customer, category, amount, order_date);

-- 2. Write to DuckLake (Version 1)
COPY raw_orders TO 'ecommerce.ducklake' (FORMAT DUCKLAKE);

-- 3. View version history
SELECT * FROM ducklake_versions('ecommerce.ducklake');

-- 4. Query: category sales summary
SELECT category,
       count(*) AS order_count,
       round(sum(amount), 2) AS total_sales,
       round(avg(amount), 2) AS avg_amount
FROM 'ecommerce.ducklake'
GROUP BY category
ORDER BY total_sales DESC;

-- 5. Append new orders (Version 2)
INSERT INTO raw_orders VALUES
  (1011, 'Eve',   'Electronics', 3200.00, '2026-05-07'),
  (1012, 'Bob',   'Books',         55.00, '2026-05-07');

COPY (SELECT * FROM raw_orders WHERE order_id > 1010)
TO 'ecommerce.ducklake' (FORMAT DUCKLAKE, APPEND TRUE);

-- 6. Time travel: view Version 1 data
SELECT sum(amount) AS version_1_total
FROM 'ecommerce.ducklake' (VERSION 1);

-- 7. Time travel: view latest data
SELECT sum(amount) AS latest_total
FROM 'ecommerce.ducklake' (VERSION 2);

-- 8. Schema evolution: add new column
ALTER TABLE raw_orders ADD COLUMN shipping_address VARCHAR;
UPDATE raw_orders SET shipping_address = CASE
  WHEN customer = 'Alice'   THEN 'Haidian, Beijing'
  WHEN customer = 'Bob'     THEN 'Pudong, Shanghai'
  WHEN customer = 'Charlie' THEN 'Tianhe, Guangzhou'
  WHEN customer = 'David'   THEN 'Nanshan, Shenzhen'
  WHEN customer = 'Eve'     THEN 'Xihu, Hangzhou'
END;

-- 9. Overwrite with new schema (Version 3)
COPY raw_orders TO 'ecommerce.ducklake'
(FORMAT DUCKLAKE, OVERWRITE_SCHEMA TRUE);

-- 10. Verify schema evolution succeeded
DESCRIBE SELECT * FROM 'ecommerce.ducklake' (VERSION 3);

-- 11. Advanced analysis: window functions
SELECT customer, category, amount,
       sum(amount) OVER (PARTITION BY customer) AS customer_total,
       rank() OVER (PARTITION BY category ORDER BY amount DESC) AS category_rank
FROM 'ecommerce.ducklake' (VERSION 3)
ORDER BY category, category_rank;

-- 12. Compaction and cleanup
CALL ducklake_compact('ecommerce.ducklake');
CALL ducklake_vacuum('ecommerce.ducklake', KEEP_VERSIONS 3);

-- 13. Final verification
SELECT category,
       sum(amount) AS total,
       count(*) AS orders
FROM 'ecommerce.ducklake'
GROUP BY category;

Expected Results

The final verification query (step 13) runs against the latest version, which contains all 12 orders:

┌─────────────┬──────────┬────────┐
│  category   │  total   │ orders │
│   varchar   │ decimal  │ int64  │
├─────────────┼──────────┼────────┤
│ Electronics │ 12397.00 │      4 │
│ Clothing    │  1358.00 │      2 │
│ Food        │   420.40 │      3 │
│ Books       │   284.00 │      3 │
└─────────────┴──────────┴────────┘

Monetization Strategies

DuckLake, as an emerging lightweight lakehouse format, presents multiple commercialization opportunities:

1. DuckLake Data Pipeline Service

Target audience: SMBs and independent developers

  • Build a DuckLake-based data pipeline orchestration service (lightweight Airbyte alternative)
  • SaaS platform: users configure data sources, data is automatically written in DuckLake format
  • Pricing model: storage volume + API call count
  • Estimated monthly fee: $29–$199/month depending on data volume

2. DuckLake-Native Visualization Tool

Target audience: Business analysts, non-technical users

  • Build a DuckLake-native BI visualization tool (lightweight Metabase alternative)
  • Leverage DuckDB’s embedded nature: DuckDB-WASM in the browser reads DuckLake directly
  • Key selling point: no backend service needed — drag and drop files to analyze
  • Business model: Open-source community edition + Enterprise (permissions, team collaboration)

3. DuckLake Conversion Service

Target audience: Enterprises with legacy data

  • One-click conversion service: JSON / CSV / Database → DuckLake format (see the sketch after this list)
  • Enterprise tier supports incremental sync and CDC (Change Data Capture)
  • TAM (Total Addressable Market): All SMBs using CSV and JSON for data analysis
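
A conversion step of this kind is essentially a single COPY through DuckDB. A minimal sketch, assuming the COPY ... (FORMAT DUCKLAKE) syntax introduced earlier and a hypothetical legacy_orders.csv input file:

import duckdb

def csv_to_ducklake(csv_path: str, lake_path: str) -> None:
    """Convert a CSV file into a DuckLake dataset in a single pass."""
    con = duckdb.connect()
    # read_csv_auto infers column names and types from the file.
    con.execute(f"""
        COPY (SELECT * FROM read_csv_auto('{csv_path}'))
        TO '{lake_path}' (FORMAT DUCKLAKE)
    """)

csv_to_ducklake('legacy_orders.csv', 'data/legacy_orders.ducklake')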

4. DuckLake Data Marketplace

Target audience: Data providers and consumers

  • Build a data marketplace on top of the DuckLake format
  • Data providers upload DuckLake-formatted datasets
  • Consumers pay per-use or via subscription
  • Core advantage: DuckLake’s time-travel capability enables historical version tracking

5. Embedded / IoT Solutions

Target audience: Edge computing devices, IoT gateways

  • Run DuckDB + DuckLake on Raspberry Pi / Jetson Nano
  • Use cases: data collection, local aggregation, incremental upload (sketched after this list)
  • Compared with the traditional approach, this eliminates the two-step SQLite-to-Parquet pipeline
  • Pricing: per-node licensing ($5/node/month)
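
To illustrate the edge-side flow, here is a minimal sketch: buffer readings locally, aggregate them in DuckDB, and append the rollup as a new version using the APPEND option from earlier. Sensor IDs, values, and the edge.ducklake path are hypothetical:

import duckdb
from datetime import datetime

# Hypothetical readings buffered on the device: (sensor_id, temperature, ts).
readings = [
    ('s1', 21.4, datetime(2026, 5, 9, 12, 0)),
    ('s1', 21.9, datetime(2026, 5, 9, 12, 5)),
    ('s2', 19.7, datetime(2026, 5, 9, 12, 5)),
]

con = duckdb.connect()
con.execute("CREATE TABLE buffer (sensor_id VARCHAR, temperature DOUBLE, ts TIMESTAMP)")
con.executemany("INSERT INTO buffer VALUES (?, ?, ?)", readings)

# Aggregate locally, then append the hourly rollup as a new version of the edge lake.
con.execute("""
    COPY (
        SELECT sensor_id,
               date_trunc('hour', ts) AS hour,
               avg(temperature) AS avg_temp,
               count(*) AS samples
        FROM buffer
        GROUP BY ALL
    )
    TO 'edge.ducklake' (FORMAT DUCKLAKE, APPEND TRUE)
""")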

6. Training & Consulting

Target audience: DuckDB and Lakehouse beginners

  • Create “DuckLake from Zero to Hero” paid course (Udemy / indie platform)
  • Enterprise training: DuckDB + DuckLake best practices
  • Technical consulting: Legacy warehouse migration to DuckLake
  • Pricing reference: Beginner course $49.90, enterprise training $2,000–$5,000/day

Conclusion

DuckLake v1.0 is a strategically significant addition to the DuckDB ecosystem. It breaks the long-held assumption that “lakehouse = heavy infrastructure,” proving that full lakehouse capabilities can be delivered by a single embedded OLAP engine.

Its core value proposition is clear:

  1. Zero-dependency deployment — No Spark, Hive Metastore, or HDFS required
  2. Out-of-the-box ACID — Every DuckDB instance is a complete lakehouse engine
  3. Drastically lower TCO — Hardware, operations, and personnel costs all significantly reduced
  4. Seamless compatibility — Underlying Parquet files ensure data isn’t locked in

What makes DuckLake most exciting for data practitioners is that it brings lakehouse capabilities from the data center to your laptop, a Raspberry Pi, or even a browser. When you can run a full ACID lakehouse on a notebook, the boundaries of what’s possible in data engineering expand dramatically.

DuckLake isn’t here to replace Delta Lake or Iceberg — in large-scale data center scenarios, the mature ecosystem of the three giants still provides irreplaceable advantages. But for small teams, startups, individual developers, and edge computing use cases, DuckLake is arguably the most elegant option available today.

We’d love to hear your thoughts and experiences with DuckLake in the comments below!