Overview
In March 2026, DuckDB released v1.5 ‘Variegata’ (named after the New Zealand paradise duck), the most transformative major release since v1.0. The latest version v1.5.3 has already fixed several key bugs.
We’ve previously covered individual DuckDB features like the VARIANT type, GEOMETRY spatial type, and read_duckdb cross-database queries as standalone topics. But today we take a higher-level view: how does v1.5 simultaneously open monetization doors across data structure, architecture, and performance optimization?
This article provides a complete dissection of v1.5’s core features with executable code, and shows you how to leverage these capabilities to build a genuinely profitable data product.

1. Data Structure Evolution: VARIANT + GEOMETRY
1.1 VARIANT Type: One Row to Store the Entire JSON Universe
v1.5 finally delivers full VARIANT column type support, enabling native storage and querying of unstructured data. You can now drop nested JSON, API responses, and complex structures directly into a table and extract fields with SQL — no Spark or Flink preprocessing needed.
Practical Scenario: E-commerce User Behavior Analysis
-- Create table with VARIANT column
CREATE TABLE user_events (
event_id VARCHAR,
user_id BIGINT,
event_time TIMESTAMP,
payload VARIANT -- entire event payload stored directly
);
-- Bulk load nested JSON (from log files in production)
COPY user_events FROM 'events.json' (FORMAT JSON);
-- Extract nested fields directly from VARIANT
SELECT
user_id,
payload:->>'action' AS action,
payload:->'metadata:->'device' AS device_info,
payload:->'metadata:->'location:->'lat' AS lat
FROM user_events
WHERE payload:->>'action' = 'purchase';
Python-side example:
import duckdb
con = duckdb.connect("events.duckdb")
# Read JSON directly into VARIANT column
con.execute("""
CREATE TABLE raw_events AS
SELECT
id, user_id, timestamp,
CAST(json AS VARIANT) AS payload
FROM read_json_auto('events/*.json')
""")
# Query nested fields for revenue analysis
result = con.execute("""
SELECT
DATE_TRUNC('day', timestamp) AS day,
COUNT(DISTINCT user_id) AS buyers,
SUM(payload:->'amount'::DOUBLE) AS revenue
FROM raw_events
WHERE payload:->>'action' = 'purchase'
GROUP BY 1
""").fetchdf()
print(result)
Monetization value: Replace an entire ETL pipeline with a single SQL statement. Data product managers can build behavior analysis dashboards at zero cost, compressing what would normally take a data engineer 3-5 days of data cleaning into just a few lines of SQL.
1.2 GEOMETRY Type: Spatial Data Analysis Out of the Box
v1.5 completed a comprehensive refactor of the GEOMETRY type, supporting coordinate reference systems (CRS), WKB conversion, and filter pushdown. Logistics, real estate, and retail site selection industries can now perform spatial analysis directly with SQL.
import duckdb
con = duckdb.connect(":memory:")
# Create spatial data table
con.execute("""
CREATE TABLE stores (
store_id INT,
store_name VARCHAR,
location GEOMETRY
)
""")
# Insert geographic coordinates (WKT format)
con.execute("""
INSERT INTO stores VALUES
(1, 'Beijing Flagship', 'POINT(116.4074 39.9042)'),
(2, 'Shanghai Store', 'POINT(121.4737 31.2304)'),
(3, 'Shenzhen Store', 'POINT(114.0579 22.3193)')
""")
# Calculate distance from each store to Beijing flagship (using projected CRS)
result = con.execute("""
SELECT
s.store_name,
ST_Distance(
s.location,
ST_Transform(
ST_SetSRID(ST_Point(116.4074, 39.9042), 4326),
3857
)
) / 1000 AS distance_km
FROM stores s
WHERE s.store_name != 'Beijing Flagship'
ORDER BY distance_km
""").fetchdf()
print(result)
Monetization value: Site selection analysis tools, delivery radius calculations, regional sales heatmaps — scenarios that previously required PostGIS + professional GIS tools can now be done with a single Python script. You could build a “site selection SaaS” for small retail chains at 500-2000 RMB/month.
2. Architecture Capabilities: read_duckdb + Concurrent Read/Write
2.1 read_duckdb: A Revolution for Cross .duckdb File Queries
v1.5 introduces the read_duckdb() function, which can directly scan and query .duckdb database files without loading them first, used directly as a table function.
Practical Scenario: Multi-tenant SaaS Aggregated Reports
import duckdb
con = duckdb.connect()
# Scan multiple client .duckdb files directly
# Effectively turning 100 independent databases into one data lake
result = con.execute("""
SELECT
client_name,
COUNT(*) AS total_orders,
SUM(revenue) AS total_revenue
FROM read_duckdb('clients/*.duckdb') AS t(client_name VARCHAR, revenue DOUBLE)
GROUP BY client_name
ORDER BY total_revenue DESC
""").fetchdf()
print(result)
More powerful — leveraging late materialization and filter pushdown:
# Only reads needed columns and filters; performance approaches direct DB access
result = con.execute("""
SELECT order_id, revenue, order_date
FROM read_duckdb('clients/acme.duckdb') AS t
WHERE revenue > 10000
AND order_date >= '2026-01-01'
LIMIT 100
""").fetchdf()
Monetization value: Perfect fit for multi-tenant SaaS architecture. Each client’s data is physically isolated (security and compliance), but analytics can seamlessly aggregate across all tenants. Ideal for data monitoring services and SaaS analytics backends.
2.2 Concurrent Read/Write: From “Single-Threaded Read-Only” to “Writable Analytical Engine”
The most significant architectural improvement in v1.5 is allowing concurrent reads, inserts, and deletes during checkpoints. DuckDB is no longer a “read-only analytical engine” — it is now a genuinely writable analytical database.
import duckdb
import threading
import time
con = duckdb.connect("analytics.duckdb")
# Create target table
con.execute("""
CREATE TABLE if not exists daily_metrics (
metric_date DATE,
metric_name VARCHAR,
metric_value DOUBLE,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
# Writer thread 1: continuously append data
def writer(thread_id):
for i in range(100):
con.execute(f"""
INSERT INTO daily_metrics (metric_date, metric_name, metric_value)
VALUES (CURRENT_DATE, 'metric_{thread_id}', {i * 1.5})
""")
time.sleep(0.01)
print(f"Writer {thread_id} done")
# Reader thread: real-time reads
def reader():
for i in range(50):
result = con.execute("SELECT COUNT(*) FROM daily_metrics").fetchone()
print(f"Reader sees {result[0]} rows")
time.sleep(0.05)
# Launch concurrency
threads = [
threading.Thread(target=writer, args=(1,)),
threading.Thread(target=writer, args=(2,)),
threading.Thread(target=reader),
]
for t in threads:
t.start()
for t in threads:
t.join()
# Final query
print(con.execute(
"SELECT metric_date, COUNT(*), SUM(metric_value) FROM daily_metrics GROUP BY 1"
).fetchdf())
Monetization value: Real-time monitoring dashboards no longer require streaming databases. A single DuckDB node can handle “write + analyze” real-time scenarios, dramatically reducing infrastructure costs. A solo developer can build a real-time monitoring dashboard that would normally require Kafka + Flink + ClickHouse, at a fraction of the server cost.
3. Comparison with Traditional Tools
| Capability | Pre-v1.5 | v1.5 ‘Variegata’ |
|---|---|---|
| Unstructured Data | JSON strings + manual parsing | Native VARIANT storage + one-click extraction |
| Cross-DB Query | Manual UNION ALL merge | read_duckdb() one-liner |
| Write Capability | Single-threaded only | Concurrent write + concurrent read |
| Spatial Analysis | External PostGIS dependency | Built-in GEOMETRY, direct SQL queries |
| Use Case | Pure offline analysis | Real-time write + integrated analysis |
4. Optimization Tips: Make Your Data Product 10x Faster
4.1 Parquet Storage + Eager Aggregate Pushdown
import duckdb
con = duckdb.connect("warehouse.duckdb")
# Auto-utilize eager aggregation optimization when COPY to Parquet
con.execute("""
COPY (
SELECT
DATE_TRUNC('month', order_date) AS month,
category,
SUM(amount) AS total_sales,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY 1, 2
) TO 'sales_monthly.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
# Subsequent queries read directly from Parquet, leveraging column-level stats
result = con.execute("""
SELECT * FROM read_parquet('sales_monthly.parquet')
WHERE month >= '2025-06-01'
AND category = 'Electronics'
""").fetchdf()
4.2 Indexes + VACUUM for High-Frequency Query Tables
con.execute("""
-- Create indexes for high-frequency query conditions
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_orders_user ON orders(user_id);
-- Periodic VACUUM to clean deleted rows (v1.5 supports VACUUM with indexes)
VACUUM (FULL);
""")
5. Practical Project: Building a “SaaS Sales Analytics Dashboard” with v1.5
Combining the above features, we can rapidly build a data product for SMBs:
Architecture Design:
- Clients submit order data via API (JSON format) → stored in VARIANT columns
- Each client’s .duckdb file is physically isolated → meets compliance requirements
- Backend aggregates platform data via
read_duckdb()→ generates operations reports - Real-time order writes → concurrent write + concurrent read
- Store store locations with GEOMETRY → auto-generate sales heatmaps
Core Code Skeleton:
import duckdb
import json
from datetime import datetime
class SaaSSalesAnalyzer:
def __init__(self, client_id: str):
self.con = duckdb.connect(f"clients/{client_id}.duckdb")
self._init_schema()
def _init_schema(self):
self.con.execute("""
CREATE TABLE IF NOT EXISTS orders (
order_id VARCHAR PRIMARY KEY,
customer_name VARCHAR,
amount DOUBLE,
store_location GEOMETRY,
created_at TIMESTAMP,
metadata VARIANT
)
""")
def ingest_order(self, order_json: dict):
"""Parse JSON and write; VARIANT stores extended fields directly"""
self.con.execute("""
INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)
""", [
order_json['order_id'],
order_json['customer_name'],
order_json['amount'],
f"POINT({order_json['lat']} {order_json['lon']})" if 'lat' in order_json else None,
datetime.fromisoformat(order_json['created_at']),
json.dumps(order_json.get('extra', {})) # stored in VARIANT
])
# Usage: Aggregate reports from multiple client files
con = duckdb.connect()
report = con.execute("""
SELECT
DATE_TRUNC('month', created_at) AS month,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
COUNT(DISTINCT customer_name) AS customers
FROM read_duckdb('clients/*.duckdb')
GROUP BY 1
ORDER BY 1
""").fetchdf()
print(report)
This product can be sold directly to small and medium retail businesses at 1,000-5,000 RMB/month per client.
6. Summary
DuckDB v1.5 ‘Variegata’ pushes forward across three dimensions simultaneously, unlocking entirely new possibilities for data monetization:
- Data Structure (VARIANT + GEOMETRY) — handle more types of data, serve more industries
- Architecture (concurrent writes + read_duckdb) — upgrade from analytical engine to writable data platform
- Performance (eager aggregate, late materialization) — data product experience approaches commercial databases
For anyone looking to make money with DuckDB, key action items:
- Upgrade immediately to v1.5.x, use
INSTALL httpfs,LOAD httpfsto get the latest features - Migrate your JSON log库 to VARIANT type, reducing ETL costs to zero
- Use read_duckdb for data lake aggregation, replacing heavy Spark jobs
- Try concurrent writing in real-time monitoring scenarios to validate feasibility
Want to dive deeper into DuckDB for data products in practice? duckdblab.org offers a complete tutorial series from beginner to commercial deployment, with detailed illustrations for all the features above and more real-world cases. Learn more DuckDB practical experience → duckdblab.org