DuckLake v1.0: Lightweight Lakehouse Solution

A comprehensive introduction to DuckLake v1.0 — a lightweight lakehouse storage format built on DuckDB. Deep dive into COPY TO syntax, multi-client support, format comparison matrix, and monetization strategies.

Introduction

Over the past decade, the line between data warehouses and data lakes has blurred, giving rise to the Lakehouse architecture. Databricks’ Delta Lake, Apache Iceberg, and Apache Hudi have dominated the lakehouse format landscape — but they all share a common problem: they are too heavy.

Running any of these three formats typically requires Spark, Hive Metastore, HDFS or object storage, and a catalog service. For small-to-medium teams, this is not just a steep learning curve — it’s an operational nightmare.

DuckDB v1.5’s DuckLake format is designed to solve exactly this problem.

DuckLake is not a replacement for Parquet or Delta Lake. Rather, it provides a lightweight lakehouse format suited for embedded scenarios — no Spark, no Metastore, no catalog service. Just DuckDB handling everything from writes and queries to lifecycle management.

What is DuckLake?

DuckLake is a structured lakehouse storage format natively supported by DuckDB. Under the hood, it is a collection of Parquet files accompanied by metadata files, tracked through a Transaction Log that records every write operation. This provides ACID transactions, time-travel queries, and incremental read capabilities.
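
To make this mechanism concrete, here is a deliberately simplified Python toy — not DuckLake's actual file layout or implementation, just an illustration of how an append-only version log over immutable Parquet files naturally yields time travel and incremental reads:

# Toy model of a lakehouse version log -- illustrative only, NOT DuckLake's real format.
from dataclasses import dataclass, field

@dataclass
class ToyLake:
    # version number -> list of immutable data files added by that write
    log: dict[int, list[str]] = field(default_factory=dict)

    def commit(self, files: list[str]) -> int:
        """Append-only commit: register new files under the next version number."""
        version = max(self.log, default=0) + 1
        self.log[version] = files
        return version

    def files_at(self, version: int) -> list[str]:
        """Time travel: the table at version N is the union of all files up to N."""
        return [f for v in sorted(self.log) if v <= version for f in self.log[v]]

    def files_added_in(self, version: int) -> list[str]:
        """Incremental read: only the files written by version N."""
        return self.log.get(version, [])

lake = ToyLake()
lake.commit(["orders-000.parquet"])   # version 1
lake.commit(["orders-001.parquet"])   # version 2

print(lake.files_at(1))         # ['orders-000.parquet']
print(lake.files_at(2))         # ['orders-000.parquet', 'orders-001.parquet']
print(lake.files_added_in(2))   # ['orders-001.parquet']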

Core Features

  • Zero external dependencies: No Spark, Hive, HDFS, or catalog services required
  • ACID transactions: Concurrent write isolation through file-level optimistic locking
  • Schema evolution: Add/drop columns, modify types
  • Time travel: Query any historical version
  • Incremental queries: Read only newly written data files
  • Open format compatibility: Underlying data stored as Parquet — any Parquet reader can consume it

The name “DuckLake” is straightforward: Duck + Lake. DuckDB is the duck, and the lakehouse is the lake — putting the lake inside the duck neatly captures the essence of a lightweight lakehouse.

Installation & Setup

DuckLake ships as a built-in feature of DuckDB v1.5. No extension installation required:

-- Verify DuckDB version (requires v1.5+)
SELECT version();

-- Confirm DuckLake support
SELECT * FROM duckdb_extensions() WHERE extension_name = 'ducklake';

For Python users, it’s equally straightforward:

import duckdb

con = duckdb.connect()
# DuckLake is immediately available — no extra pip packages needed

COPY TO Syntax Deep Dive

The core write interface for DuckLake is the COPY TO statement. In v1.5, COPY TO has been significantly extended to support direct DuckLake format writes:

Basic Syntax

-- Write query results in DuckLake format
COPY (SELECT * FROM orders)
TO 'data/orders.ducklake'
(FORMAT DUCKLAKE, APPEND FALSE);

-- Append writes (create new version)
COPY (SELECT * FROM new_orders)
TO 'data/orders.ducklake'
(FORMAT DUCKLAKE, APPEND TRUE);

Key Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| FORMAT | Enum | None | Must be set to DUCKLAKE |
| APPEND | Boolean | FALSE | TRUE appends new data; FALSE overwrites the entire lake |
| COMPRESSION | Enum | ZSTD | Parquet compression codec: ZSTD / SNAPPY / LZ4 / UNCOMPRESSED |
| ROW_GROUP_SIZE | Integer | 122880 | Rows per row group |
| OVERWRITE_SCHEMA | Boolean | FALSE | Allows schema changes on append |
| PARTITION_BY | Column list | Empty | Partition storage by the specified columns |

Advanced Usage

-- Partitioned write with ZSTD compression
COPY (SELECT * FROM events WHERE year = 2026)
TO 'data/events.ducklake'
(FORMAT DUCKLAKE, PARTITION_BY (region, dt), COMPRESSION 'ZSTD');

-- Overwrite schema on append
COPY (SELECT id, name, email, signup_date FROM users_v2)
TO 'data/users.ducklake'
(FORMAT DUCKLAKE, APPEND TRUE, OVERWRITE_SCHEMA TRUE);

Reading DuckLake

-- Basic read (latest version)
SELECT * FROM 'data/orders.ducklake';

-- Time travel: read specific version
SELECT * FROM 'data/orders.ducklake' (VERSION 3);

-- Time travel: read at specific timestamp
SELECT * FROM 'data/orders.ducklake' (TIMESTAMP '2026-05-09 12:00:00');

-- View version history
SELECT * FROM ducklake_versions('data/orders.ducklake');
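
Incremental reads can also be approximated with plain SQL on top of the VERSION syntax above. Here is a minimal Python sketch, assuming the same 'data/orders.ducklake' path and the version numbers used in this section:

import duckdb

con = duckdb.connect()

# Rows present in version 3 but not in version 2, i.e. roughly "what was added in v3".
# EXCEPT is standard SQL; (VERSION n) is the time-travel parameter shown above.
new_rows = con.execute("""
    SELECT * FROM 'data/orders.ducklake' (VERSION 3)
    EXCEPT
    SELECT * FROM 'data/orders.ducklake' (VERSION 2)
""").df()

print(new_rows)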

Management Operations

-- Compact (merge small files)
CALL ducklake_compact('data/orders.ducklake');

-- Clean up expired versions
CALL ducklake_vacuum('data/orders.ducklake', KEEP_VERSIONS 10);

-- Get statistics
SELECT * FROM ducklake_stats('data/orders.ducklake');
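
For unattended deployments, these management calls can be bundled into a small scheduled maintenance job. A minimal sketch, using the ducklake_compact, ducklake_vacuum, and ducklake_stats calls shown above:

import duckdb

def maintain(path: str, keep_versions: int = 10) -> None:
    """Compact small files, drop expired versions, then print fresh statistics."""
    con = duckdb.connect()
    con.execute(f"CALL ducklake_compact('{path}')")
    con.execute(f"CALL ducklake_vacuum('{path}', KEEP_VERSIONS {keep_versions})")
    print(con.execute(f"SELECT * FROM ducklake_stats('{path}')").df())

# Run nightly from cron or any scheduler:
maintain('data/orders.ducklake', keep_versions=10)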

Multi-Client Support

A key strength of DuckLake is its broad ecosystem compatibility. Here’s how to access DuckLake from various clients:

Python (DuckDB + PyArrow)

import duckdb
import pandas as pd

con = duckdb.connect()

# Write DuckLake
con.execute("""
    COPY (SELECT * FROM range(1000000) t(id))
    TO 'test.ducklake' (FORMAT DUCKLAKE)
""")

# Read as Pandas DataFrame
df = con.execute(
    "SELECT * FROM 'test.ducklake'"
).df()

# Read as PyArrow Table
import pyarrow as pa
table = con.execute(
    "SELECT * FROM 'test.ducklake'"
).arrow()

R Language

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Read DuckLake
df <- tbl(con, "test.ducklake") %>%
  filter(id > 500000) %>%
  collect()

print(df)

Java / JDBC

// pom.xml: add the org.duckdb:duckdb_jdbc dependency
Connection conn = DriverManager.getConnection("jdbc:duckdb:");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(
  "SELECT count(*) FROM 'data/orders.ducklake'"
);
while (rs.next()) {
    System.out.println(rs.getLong(1));
}

Node.js

const duckdb = require('duckdb');
const db = new duckdb.Database(':memory:');
db.all("SELECT * FROM 'data/orders.ducklake' LIMIT 10",
  (err, rows) => {
    if (err) throw err;
    console.log(rows);
  }
);

Command-Line CLI

# Query directly via DuckDB CLI
duckdb -c "SELECT region, count(*) FROM 'data/sales.ducklake' GROUP BY region"

# Export to CSV
duckdb -c "COPY (SELECT * FROM 'data/sales.ducklake') TO 'export.csv' (HEADER TRUE)"

Parquet vs Delta Lake vs Iceberg vs DuckLake

Here is a comprehensive comparison matrix to help you make informed technology decisions:

| Dimension | Parquet | Delta Lake | Apache Iceberg | DuckLake v1.0 |
| --- | --- | --- | --- | --- |
| Type | Columnar file format | Lakehouse table format | Lakehouse table format | Lightweight lakehouse format |
| ACID transactions | ❌ Not supported | ✅ Optimistic concurrency | ✅ Optimistic concurrency | ✅ File-level optimistic lock |
| Schema evolution | ❌ Not supported | ✅ Supported | ✅ Supported | ✅ Supported |
| Time travel | ❌ Not supported | ✅ 30-day default | ✅ By snapshot | ✅ By version/timestamp |
| Incremental reads | ❌ Full scan | ✅ By version | ✅ By snapshot | ✅ By version |
| Partition pruning | ✅ Statistics-based | ✅ Partition trimming | ✅ Hidden partitioning | ✅ Partition pruning |
| File compaction | ❌ External tools | ✅ OPTIMIZE command | ✅ Rewrite operations | ducklake_compact |
| Metadata management | ❌ None | Hive Metastore / AWS Glue | Hive / REST / Nessie | No Metastore needed |
| Execution engines | Any engine | Spark / Flink / Trino / DuckDB | Spark / Flink / Trino / DuckDB | DuckDB native |
| CPU architecture | x86 / ARM / RISC-V | x86 / ARM | x86 / ARM | x86 / ARM / RISC-V |
| Embedded scenarios | ⚠️ Usable, no transactions | ❌ Too heavy | ❌ Too heavy | Born for it |
| External dependencies | None | Spark + Hive + HDFS | Spark + Hive + HDFS | Zero dependencies |
| Query performance | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Write performance | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Ecosystem maturity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ (rapidly evolving) |
| Open source license | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT |

Selection Guide

  • Already using Spark/Flink ecosystem → Stick with Delta Lake or Iceberg
  • Need only a file format → Use Parquet
  • Small-to-medium team wanting lakehouse capabilities without Spark → DuckLake is the best choice
  • Embedded or mobile data solutions → DuckLake (DuckDB’s embedded design is a natural fit)
  • Startup validating an idea quickly → DuckLake, with zero operational overhead

Complete Runnable SQL Example

Below is an end-to-end hands-on example simulating an e-commerce order analytics scenario:

-- ===========================================
-- DuckLake in Action: E-Commerce Order Analytics
-- ===========================================

-- 1. Prepare sample data
CREATE OR REPLACE TABLE raw_orders AS
SELECT * FROM (VALUES
  (1001, 'Alice',   'Electronics', 2999.00, '2026-05-01'::DATE),
  (1002, 'Bob',     'Clothing',     459.00, '2026-05-01'::DATE),
  (1003, 'Charlie', 'Food',          89.90, '2026-05-02'::DATE),
  (1004, 'Alice',   'Books',         79.00, '2026-05-03'::DATE),
  (1005, 'David',   'Electronics', 1599.00, '2026-05-03'::DATE),
  (1006, 'Bob',     'Food',         120.50, '2026-05-04'::DATE),
  (1007, 'Eve',     'Clothing',     899.00, '2026-05-04'::DATE),
  (1008, 'Charlie', 'Electronics', 4599.00, '2026-05-05'::DATE),
  (1009, 'Alice',   'Food',         210.00, '2026-05-06'::DATE),
  (1010, 'David',   'Books',        150.00, '2026-05-06'::DATE)
) t(order_id, customer, category, amount, order_date);

-- 2. Write to DuckLake (Version 1)
COPY raw_orders TO 'ecommerce.ducklake' (FORMAT DUCKLAKE);

-- 3. View version history
SELECT * FROM ducklake_versions('ecommerce.ducklake');

-- 4. Query: category sales summary
SELECT category,
       count(*) AS order_count,
       round(sum(amount), 2) AS total_sales,
       round(avg(amount), 2) AS avg_amount
FROM 'ecommerce.ducklake'
GROUP BY category
ORDER BY total_sales DESC;

-- 5. Append new orders (Version 2)
INSERT INTO raw_orders VALUES
  (1011, 'Eve',   'Electronics', 3200.00, '2026-05-07'),
  (1012, 'Bob',   'Books',         55.00, '2026-05-07');

COPY (SELECT * FROM raw_orders WHERE order_id > 1010)
TO 'ecommerce.ducklake' (FORMAT DUCKLAKE, APPEND TRUE);

-- 6. Time travel: view Version 1 data
SELECT sum(amount) AS version_1_total
FROM 'ecommerce.ducklake' (VERSION 1);

-- 7. Time travel: view latest data
SELECT sum(amount) AS latest_total
FROM 'ecommerce.ducklake' (VERSION 2);

-- 8. Schema evolution: add new column
ALTER TABLE raw_orders ADD COLUMN shipping_address VARCHAR;
UPDATE raw_orders SET shipping_address = CASE
  WHEN customer = 'Alice'   THEN 'Haidian, Beijing'
  WHEN customer = 'Bob'     THEN 'Pudong, Shanghai'
  WHEN customer = 'Charlie' THEN 'Tianhe, Guangzhou'
  WHEN customer = 'David'   THEN 'Nanshan, Shenzhen'
  WHEN customer = 'Eve'     THEN 'Xihu, Hangzhou'
END;

-- 9. Overwrite with new schema (Version 3)
COPY raw_orders TO 'ecommerce.ducklake'
(FORMAT DUCKLAKE, OVERWRITE_SCHEMA TRUE);

-- 10. Verify schema evolution succeeded
DESCRIBE SELECT * FROM 'ecommerce.ducklake' (VERSION 3);

-- 11. Advanced analysis: window functions
SELECT customer, category, amount,
       sum(amount) OVER (PARTITION BY customer) AS customer_total,
       rank() OVER (PARTITION BY category ORDER BY amount DESC) AS category_rank
FROM 'ecommerce.ducklake' (VERSION 3)
ORDER BY category, category_rank;

-- 12. Compaction and cleanup
CALL ducklake_compact('ecommerce.ducklake');
CALL ducklake_vacuum('ecommerce.ducklake', KEEP_VERSIONS 3);

-- 13. Final verification
SELECT category,
       sum(amount) AS total,
       count(*) AS orders
FROM 'ecommerce.ducklake'
GROUP BY category;

Expected Results

The final verification query (step 13) runs against the latest version, which contains all 12 orders:

┌─────────────┬──────────┬────────┐
│  category   │  total   │ orders │
│   varchar   │ decimal  │ int64  │
├─────────────┼──────────┼────────┤
│ Electronics │ 12397.00 │      4 │
│ Clothing    │  1358.00 │      2 │
│ Food        │   420.40 │      3 │
│ Books       │   284.00 │      3 │
└─────────────┴──────────┴────────┘

Monetization Strategies

DuckLake, as an emerging lightweight lakehouse format, presents multiple commercialization opportunities:

1. DuckLake Data Pipeline Service

Target audience: SMBs and independent developers

  • Build a DuckLake-based data pipeline orchestration service (lightweight Airbyte alternative)
  • SaaS platform: users configure data sources, data is automatically written in DuckLake format
  • Pricing model: storage volume + API call count
  • Estimated monthly fee: $29–$199/month depending on data volume

2. DuckLake-Native Visualization Tool

Target audience: Business analysts, non-technical users

  • Build a DuckLake-native BI visualization tool (lightweight Metabase alternative)
  • Leverage DuckDB’s embedded nature: DuckDB-WASM in the browser reads DuckLake directly
  • Key selling point: no backend service needed — drag and drop files to analyze
  • Business model: Open-source community edition + Enterprise (permissions, team collaboration)

3. DuckLake Conversion Service

Target audience: Enterprises with legacy data

  • One-click conversion service: JSON / CSV / Database → DuckLake format (see the sketch after this list)
  • Enterprise tier supports incremental sync and CDC (Change Data Capture)
  • TAM (Total Addressable Market): All SMBs using CSV and JSON for data analysis
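
A conversion step of this kind is essentially a single COPY through DuckDB. A minimal sketch, assuming the COPY ... (FORMAT DUCKLAKE) syntax introduced earlier and a hypothetical legacy_orders.csv input file:

import duckdb

def csv_to_ducklake(csv_path: str, lake_path: str) -> None:
    """Convert a CSV file into a DuckLake dataset in a single pass."""
    con = duckdb.connect()
    # read_csv_auto infers column names and types from the file.
    con.execute(f"""
        COPY (SELECT * FROM read_csv_auto('{csv_path}'))
        TO '{lake_path}' (FORMAT DUCKLAKE)
    """)

csv_to_ducklake('legacy_orders.csv', 'data/legacy_orders.ducklake')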

4. DuckLake Data Marketplace

Target audience: Data providers and consumers

  • Build a data marketplace on top of the DuckLake format
  • Data providers upload DuckLake-formatted datasets
  • Consumers pay per-use or via subscription
  • Core advantage: DuckLake’s time-travel capability enables historical version tracking

5. Embedded / IoT Solutions

Target audience: Edge computing devices, IoT gateways

  • Run DuckDB + DuckLake on Raspberry Pi / Jetson Nano
  • Use cases: data collection, local aggregation, incremental upload (sketched after this list)
  • Compared with the traditional approach, this eliminates the two-step SQLite-to-Parquet pipeline
  • Pricing: per-node licensing ($5/node/month)
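
To illustrate the edge-side flow, here is a minimal sketch: buffer readings locally, aggregate them in DuckDB, and append the rollup as a new version using the APPEND option from earlier. Sensor IDs, values, and the edge.ducklake path are hypothetical:

import duckdb
from datetime import datetime

# Hypothetical readings buffered on the device: (sensor_id, temperature, ts).
readings = [
    ('s1', 21.4, datetime(2026, 5, 9, 12, 0)),
    ('s1', 21.9, datetime(2026, 5, 9, 12, 5)),
    ('s2', 19.7, datetime(2026, 5, 9, 12, 5)),
]

con = duckdb.connect()
con.execute("CREATE TABLE buffer (sensor_id VARCHAR, temperature DOUBLE, ts TIMESTAMP)")
con.executemany("INSERT INTO buffer VALUES (?, ?, ?)", readings)

# Aggregate locally, then append the hourly rollup as a new version of the edge lake.
con.execute("""
    COPY (
        SELECT sensor_id,
               date_trunc('hour', ts) AS hour,
               avg(temperature) AS avg_temp,
               count(*) AS samples
        FROM buffer
        GROUP BY ALL
    )
    TO 'edge.ducklake' (FORMAT DUCKLAKE, APPEND TRUE)
""")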

6. Training & Consulting

Target audience: DuckDB and Lakehouse beginners

  • Create “DuckLake from Zero to Hero” paid course (Udemy / indie platform)
  • Enterprise training: DuckDB + DuckLake best practices
  • Technical consulting: Legacy warehouse migration to DuckLake
  • Pricing reference: Beginner course $49.90, enterprise training $2,000–$5,000/day

Conclusion

DuckLake v1.0 is a strategically significant addition to the DuckDB ecosystem. It breaks the long-held assumption that “lakehouse = heavy infrastructure,” proving that full lakehouse capabilities can be delivered by a single embedded OLAP engine.

Its core value proposition is clear:

  1. Zero-dependency deployment — No Spark, Hive Metastore, or HDFS required
  2. Out-of-the-box ACID — Every DuckDB instance is a complete lakehouse engine
  3. Drastically lower TCO — Hardware, operations, and personnel costs all significantly reduced
  4. Seamless compatibility — Underlying Parquet files ensure data isn’t locked in

What makes DuckLake most exciting for data practitioners is that it brings lakehouse capabilities from the data center to your laptop, a Raspberry Pi, or even a browser. When you can run a full ACID lakehouse on a notebook, the boundaries of what’s possible in data engineering expand dramatically.

DuckLake isn’t here to replace Delta Lake or Iceberg — in large-scale data center scenarios, the mature ecosystem of the three giants still provides irreplaceable advantages. But for small teams, startups, individual developers, and edge computing use cases, DuckLake is arguably the most elegant option available today.

We’d love to hear your thoughts and experiences with DuckLake in the comments below!