The Problem: Why Was Delta Lake “Spark-Only”?
Delta Lake is one of the most popular open table formats for data lakes, offering ACID transactions, schema evolution, and time travel. But for a long time, writing to Delta tables essentially required Apache Spark.
What does this mean in practice?
- You just want to append a few rows to a Delta table? Fire up a Spark session and wait 30+ seconds.
- Need to quickly inspect what the data looked like at a previous version? Dig through the versionAsOf documentation.
- Your company has a limited budget and can't afford a Spark cluster? Delta Lake was effectively out of reach.
This created a massive capability gap: Delta Lake’s “read” side had multiple engines (Presto, Trino, Spark SQL, DuckDB), but the “write” side was almost entirely Spark-dominated.
DuckDB’s Delta extension has broken that monopoly.
DuckDB Delta Extension: From Read-Only to Full Write Support
DuckDB’s delta extension originally only supported reading Delta tables. Starting with DuckDB v1.5.0, it received a major upgrade with full write support, graduating from experimental status in v1.5.2. Here’s the current feature matrix:
| Feature | Status | Notes |
|---|---|---|
| Read Delta Tables | ✅ Stable | All versions supported |
| Write (INSERT) | ✅ Stable | v1.5.0+ |
| Update (UPDATE) | ✅ Stable | v1.5.1+ |
| Delete (DELETE) | ✅ Stable | v1.5.1+ |
| Time Travel (by Version) | ✅ Stable | VERSION AS OF n |
| Time Travel (by Timestamp) | ✅ Stable | TIMESTAMP AS OF |
| Unity Catalog Integration | ✅ Stable | OSS version |
| Schema Evolution | ✅ Stable | Auto-merge new columns |
Setup
# Latest DuckDB (v1.5.2+)
pip install duckdb --upgrade
# Verify version
python -c "import duckdb; print(duckdb.__version__)"
# Should output 1.5.2 or higher
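To double-check that the Delta extension is actually available in your build, a quick query against duckdb_extensions() works. This is a minimal sanity check and isn't required for the rest of the walkthrough:

import duckdb

con = duckdb.connect()
con.execute("INSTALL delta;")
con.execute("LOAD delta;")
# duckdb_extensions() lists every extension together with its install/load status
print(con.execute("""
    SELECT extension_name, installed, loaded
    FROM duckdb_extensions()
    WHERE extension_name = 'delta';
""").fetchdf())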
Part 1: Create and Write to a Delta Table
This is the core scenario — writing to Delta Lake with DuckDB, no Spark involved.
import duckdb
import os
# Create connection
con = duckdb.connect()
# Install and load Delta extension
con.execute("INSTALL delta;")
con.execute("LOAD delta;")
# Clean up previous demo data
if os.path.exists("./sales_delta"):
import shutil
shutil.rmtree("./sales_delta")
# Attach a Delta directory as a DuckDB schema
con.execute("""
ATTACH './sales_delta' AS sales (TYPE DELTA);
""")
# Create the table and write initial data
con.execute("""
CREATE TABLE sales.orders (
order_id INTEGER,
product VARCHAR,
amount DECIMAL(10,2),
order_date DATE
);
""")
# Insert first batch
con.execute("""
INSERT INTO sales.orders VALUES
(1, 'Laptop', 1299.00, '2026-05-01'),
(2, 'Mechanical Keyboard', 189.00, '2026-05-01'),
(3, 'Monitor', 549.00, '2026-05-02'),
(4, 'Mouse', 39.00, '2026-05-02'),
(5, 'Headphones', 149.00, '2026-05-03');
""")
print("✅ First batch written (Version 1)")
# Insert second batch (creates Version 2)
con.execute("""
INSERT INTO sales.orders VALUES
(6, 'Tablet', 799.00, '2026-05-04'),
(7, 'Charger', 29.00, '2026-05-04'),
(8, 'External SSD', 109.00, '2026-05-05');
""")
print("✅ Second batch written (Version 2)")
# Query current data
result = con.execute("SELECT * FROM sales.orders ORDER BY order_id").fetchdf()
print("\n📊 Current Data (Version 2):")
print(result)
Expected output:
✅ First batch written (Version 1)
✅ Second batch written (Version 2)
📊 Current Data (Version 2):
order_id product amount order_date
0 1 Laptop 1299.00 2026-05-01
1 2 Mechanical Keyboard 189.00 2026-05-01
2 3 Monitor 549.00 2026-05-02
3 4 Mouse 39.00 2026-05-02
4 5 Headphones 149.00 2026-05-03
5 6 Tablet 799.00 2026-05-04
6 7 Charger 29.00 2026-05-04
7 8 External SSD 109.00 2026-05-05
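Under the hood there is no magic: a Delta table is just Parquet data files plus a _delta_log folder of JSON commit files, one per version. A quick way to peek at what the writes above produced (the exact layout under ./sales_delta may vary by extension version):

import os

# Walk the attached Delta directory and list the Parquet data files and _delta_log commits
for root, _, files in os.walk("./sales_delta"):
    for f in sorted(files):
        print(os.path.join(root, f))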
Part 2: Time Travel Queries
Time travel — querying data at any previous version — is one of Delta Lake’s killer features. DuckDB provides two intuitive syntaxes:
Query by Version Number
# Query Version 1 (only first 5 records)
result_v1 = con.execute("""
SELECT * FROM sales.orders (VERSION AS OF 1)
ORDER BY order_id;
""").fetchdf()
print("📜 Version 1 (first batch only):")
print(result_v1)
Query by Timestamp
# Query data at a specific point in time
result_ts = con.execute("""
SELECT * FROM sales.orders (
TIMESTAMP AS OF '2026-05-03 23:59:59'::TIMESTAMP
)
ORDER BY order_id;
""").fetchdf()
print(f"\n📜 Data as of 2026-05-03:")
print(result_ts)
View Version History
# List all Delta table versions
history = con.execute("""
SELECT
version,
timestamp,
operation,
operation_parameters
FROM sales.orders ('HISTORY')
ORDER BY version;
""").fetchdf()
print("\n📋 Delta Version History:")
print(history)
Expected output:
📋 Delta Version History:
version timestamp operation operation_parameters
0 1 2026-05-09 22:00:01.123 WRITE {'mode': 'Append', 'partitionBy': '[]'}
1 2 2026-05-09 22:00:01.456 WRITE {'mode': 'Append', 'partitionBy': '[]'}
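Version pinning also makes it easy to diff two versions. A small sketch that reuses the same VERSION AS OF syntax from above to show which rows arrived between version 1 and version 2:

# Rows present in version 2 but not in version 1 (i.e. the second batch)
diff = con.execute("""
    SELECT * FROM sales.orders (VERSION AS OF 2)
    EXCEPT
    SELECT * FROM sales.orders (VERSION AS OF 1)
    ORDER BY order_id;
""").fetchdf()
print("\n🔍 Added in Version 2:")
print(diff)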
Part 3: UPDATE and DELETE (Delta v3+)
If you’re using Delta Lake v3 (OSS or LakeFS), you can also perform updates and deletes:
# UPDATE: Add 10% tax to recent orders
con.execute("""
UPDATE sales.orders
SET amount = amount * 1.1
WHERE order_date >= '2026-05-04';
""")
# DELETE: Cancel an order
con.execute("""
DELETE FROM sales.orders
WHERE order_id = 7;
""")
print("✅ UPDATE + DELETE completed (Version 3)")
# Verify
result = con.execute("""
SELECT * FROM sales.orders ORDER BY order_id
""").fetchdf()
print("\n📊 After update:")
print(result)
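A nice consequence: the pre-update state stays fully queryable, because the UPDATE and the DELETE each created a new version. A quick spot check against version 2 (the last version before the changes above):

# Order 7 is gone from the current version, but version 2 still has it
old = con.execute("""
    SELECT * FROM sales.orders (VERSION AS OF 2)
    WHERE order_id = 7;
""").fetchdf()
print("\n🕰️ Order 7 as of version 2 (before deletion):")
print(old)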
Part 4: Bulk Import from Parquet/CSV to Delta
This is the most common production scenario — new data arrives daily as Parquet/CSV files and needs to be incrementally appended to a Delta table.
import pandas as pd
import numpy as np
# Simulate 1000 new orders
np.random.seed(42)
new_orders = pd.DataFrame({
'order_id': range(100, 1100),
'product': np.random.choice(
['Laptop', 'Keyboard', 'Monitor', 'Mouse', 'Headphones',
'Tablet', 'Charger', 'SSD', 'Webcam', 'Speaker'],
1000
),
'amount': np.round(np.random.uniform(10, 2000, 1000), 2),
'order_date': pd.date_range('2026-05-06', periods=1000, freq='h')  # lowercase 'h'; 'H' is deprecated in recent pandas
})
# Save as Parquet
new_orders.to_parquet('./new_orders.parquet')
# Bulk insert into Delta
con.execute("""
INSERT INTO sales.orders
SELECT * FROM read_parquet('./new_orders.parquet');
""")
print("✅ 1000 new orders written to Delta from Parquet")
# Daily summary
summary = con.execute("""
SELECT
order_date::DATE AS day,
COUNT(*) AS orders,
ROUND(SUM(amount)::NUMERIC, 0) AS revenue
FROM sales.orders
GROUP BY day
ORDER BY day;
""").fetchdf()
print("\n📊 Daily Order Summary:")
print(summary)
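In production you usually also want the daily append to be idempotent, so re-running the job doesn't duplicate rows. One common pattern is an anti-join on the business key before inserting — a sketch, assuming order_id is unique per order:

# Only insert orders whose order_id is not already in the Delta table;
# re-running this statement is a no-op for rows that already exist
con.execute("""
    INSERT INTO sales.orders
    SELECT p.*
    FROM read_parquet('./new_orders.parquet') p
    WHERE NOT EXISTS (
        SELECT 1 FROM sales.orders o
        WHERE o.order_id = p.order_id
    );
""")
print("✅ Incremental append completed (existing order_ids skipped)")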
Comparison with Traditional Approaches
DuckDB + Delta vs Spark + Delta
| Dimension | Spark | DuckDB |
|---|---|---|
| Startup Time | 30-60 seconds | < 0.1 second |
| Memory Footprint | 2-8 GB (JVM) | 50-200 MB |
| Install Size | 1-3 GB | < 10 MB |
| SQL Write to Delta | ❌ Requires a Spark session | ✅ Native SQL |
| Time Travel | ✅ Supported (complex config) | ✅ Supported (clean syntax) |
| Single-node Query | Slow (overhead) | Fast (vectorized engine) |
| Ops Complexity | High (YARN/K8s) | Low (single process) |
| Learning Curve | Steep | Gentle |
DuckDB + Delta vs Pandas + Delta
| Dimension | Pandas | DuckDB |
|---|---|---|
| 10 GB Dataset | Risk of OOM (fully in-memory) | ✅ Spills to disk for larger-than-memory data |
| Write to Delta | ❌ Not supported | ✅ Native |
| Time Travel | ❌ Not supported | ✅ Native |
| SQL Syntax | ❌ None | ✅ Full SQL |
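If the startup-time numbers above sound too good, they are easy to sanity-check on your own machine. An informal measurement, not a rigorous benchmark:

import time
import duckdb

# Time a cold connection plus a trivial first query
t0 = time.perf_counter()
con = duckdb.connect()
con.execute("SELECT 42;")
print(f"Cold start + first query: {time.perf_counter() - t0:.3f}s")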
Unity Catalog Integration
DuckDB’s Delta extension also connects to OSS Unity Catalog for metadata management:
-- Create UC secret
CREATE SECRET uc_secret (
TYPE UC,
TOKEN 'your-token-here'
);
-- Attach Unity Catalog
ATTACH 'http://localhost:8080' AS uc_catalog (TYPE UC);
-- Query UC tables
SELECT * FROM uc_catalog.my_schema.orders;
-- Cross-catalog JOIN
SELECT
o.*,
p.product_category
FROM uc_catalog.my_schema.orders o
JOIN local_schema.products p ON o.product_id = p.product_id;
This means you can use DuckDB as a lightweight query engine against your existing Unity Catalog metadata layer — no Trino or Spark required.
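From Python, the same flow is just a couple of execute calls. This sketch wraps the statements shown above and assumes a locally running OSS Unity Catalog at http://localhost:8080 plus a valid token; check the extension documentation for your DuckDB version, as the exact secret and ATTACH parameters may differ:

import duckdb

con = duckdb.connect()
con.execute("INSTALL delta;")
con.execute("LOAD delta;")
# Assumed: token and endpoint are placeholders for your own Unity Catalog deployment
con.execute("""
    CREATE SECRET uc_secret (
        TYPE UC,
        TOKEN 'your-token-here'
    );
""")
con.execute("ATTACH 'http://localhost:8080' AS uc_catalog (TYPE UC);")
print(con.execute("SELECT * FROM uc_catalog.my_schema.orders LIMIT 5;").fetchdf())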
Monetization Strategies
Option 1: Lightweight Data Lake Management
Target clients: Small-to-medium businesses using Delta Lake who can’t afford a Spark cluster
Services:
- Replace Spark with DuckDB for daily Delta writes and queries
- Set up automated ETL: CSV/Parquet/API → DuckDB → Delta Lake
- Configure cron jobs for daily sync from business databases to Delta
Pricing: $500-1,500/project (setup) + $50-150/month (maintenance)
Option 2: Data Lake Audit & Compliance
Target clients: Regulated industries (finance, healthcare, e-commerce) needing data audits
Services:
- Use Delta time travel to query data at any point in time
- Generate data change audit reports
- Provide data lineage tracing for compliance
Pricing: $800-2,500/audit engagement
Option 3: Spark-to-DuckDB Migration Consulting
Target clients: Small teams paying for underutilized Spark clusters
Services:
- Assess which Spark jobs can be migrated to DuckDB
- Migrate Delta write and query scripts
- Provide before/after TCO comparison reports
Pricing: $2,000-5,000/project (typically pays for itself in 3 months)
Automation Toolkit
# Daily sync script template
cat << 'EOF' > daily_sync.sh
#!/bin/bash
# Runs at 2 AM daily: business CSV → Delta Lake
# (date -d 'yesterday' is GNU date; on macOS use: date -v-1d +%Y-%m-%d)
duckdb -c "
INSTALL delta;
LOAD delta;
ATTACH './data_warehouse' AS dw (TYPE DELTA);
INSERT INTO dw.daily_sales
SELECT * FROM read_csv_auto('/data/sales/$(date -d 'yesterday' +%Y-%m-%d).csv');
"
EOF
chmod +x daily_sync.sh

# Add to crontab
# 0 2 * * * /path/to/daily_sync.sh
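If you would rather schedule the job from Python (cron, systemd timers, or an orchestrator), the same sync fits in a short script. A sketch — the /data/sales path and the dw.daily_sales table mirror the bash version above and are assumptions about your environment:

# daily_sync.py — append yesterday's CSV export to the Delta table
from datetime import date, timedelta
import duckdb

yesterday = (date.today() - timedelta(days=1)).isoformat()
csv_path = f"/data/sales/{yesterday}.csv"  # assumed export location

con = duckdb.connect()
con.execute("INSTALL delta;")
con.execute("LOAD delta;")
con.execute("ATTACH './data_warehouse' AS dw (TYPE DELTA);")
con.execute(f"INSERT INTO dw.daily_sales SELECT * FROM read_csv_auto('{csv_path}');")
print(f"✅ Synced {csv_path} into dw.daily_sales")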
Important Notes
- Delta Version Compatibility: DuckDB’s Delta extension is compatible with Delta v1-v3. v2+ is recommended for best performance.
- Write Mode: INSERT currently appends only; a full-table overwrite (CREATE OR REPLACE) is on the roadmap. For a non-atomic workaround, see the sketch after this list.
- Partitioned Tables: DuckDB can read partitioned Delta tables. When writing, ensure partition columns are present in the data.
- Transactions: Single SQL statements are atomic. Cross-statement transactions are not yet supported.
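Until native overwrite lands, a pseudo-overwrite is possible with a DELETE followed by an INSERT — but note 4 applies: these are two separate commits, so a reader in between can observe an empty table. A workaround sketch, reusing the connection and Parquet file from Part 4:

# Pseudo-overwrite: two separate Delta commits, NOT one atomic swap
con.execute("DELETE FROM sales.orders;")
con.execute("""
    INSERT INTO sales.orders
    SELECT * FROM read_parquet('./new_orders.parquet');
""")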
Summary
DuckDB’s Delta extension has evolved from “read-only” to “full read/write + time travel + Unity Catalog” — a significant milestone for the data lake ecosystem.
For small-to-medium teams, this means:
- No more spinning up Spark just to append a few rows to Delta
- No more maintaining expensive JVM clusters
- No more learning complex Spark configurations
A single DuckDB process with ~100 MB of memory can now do what previously required a Spark cluster.
When Spark is no longer the only gateway to Delta Lake, the barrier to entry for data lakes truly comes down.
All code verified with DuckDB v1.5.2 and Python 3.10+. Delta extension version: v0.8+ (bundled with DuckDB releases).