What is DuckDB?
DuckDB is an open-source, embedded SQL OLAP database management system designed specifically for data analytics. It uses a columnar storage engine with vectorized execution, making analytical queries 10-100x faster than traditional row-based databases like SQLite.
5 Core Advantages
- Embedded: No server to install, runs inside your application process
- Columnar Storage: Only reads needed columns, dramatically reducing I/O
- Vectorized Execution: Processes data in batches, leveraging CPU cache
- Full SQL Support: Window functions, CTEs, GROUPING SETS, and more
- Multi-Language Bindings: Python, R, Java, Node.js, C/C++
When to Use DuckDB
| Use Case | Rating | Why |
|---|---|---|
| Data Exploration | ⭐⭐⭐ | Million-row queries in milliseconds |
| ETL Pipelines | ⭐⭐⭐ | Zero-config data processing |
| BI Reporting | ⭐⭐⭐ | Replace traditional BI backends |
| Embedded Analytics | ⭐⭐⭐ | In-process analytical engine |
| SQL Learning | ⭐⭐⭐ | Zero-install SQL practice |
| OLTP Workloads | ❌ | Not for high-concurrency writes |
Installing DuckDB
MacOS
brew install duckdb
Linux (Ubuntu/Debian)
curl https://install.duckdb.org | sh
Windows
winget install DuckDB.cli
Python
pip install duckdb
Verify
duckdb --version
# v1.5.3
SQL Query Basics
Create Table & Insert
CREATE TABLE sales (
product VARCHAR,
category VARCHAR,
amount DECIMAL(10,2),
sale_date DATE
);
INSERT INTO sales VALUES
('Laptop', 'Electronics', 1200.00, '2026-01-15'),
('Keyboard', 'Peripherals', 80.00, '2026-01-16'),
('Monitor', 'Electronics', 500.00, '2026-01-17');
Basic Queries
-- Aggregate query
SELECT category, COUNT(*) AS count, SUM(amount) AS total
FROM sales
GROUP BY category
ORDER BY total DESC;
-- Window function
SELECT product, amount,
RANK() OVER (ORDER BY amount DESC) AS rank
FROM sales;
DuckDB SQL Extensions
QUALIFY Clause: Filter directly on window functions
SELECT product, amount,
RANK() OVER (ORDER BY amount DESC) AS rank
FROM sales
QUALIFY rank <= 3;
GROUP BY ALL: Auto-group by non-aggregated columns
SELECT category, product, SUM(amount)
FROM sales
GROUP BY ALL;
COLUMNS Expression: Batch column operations
-- Exclude columns
SELECT * EXCLUDE (sale_date) FROM sales;
-- Replace columns
SELECT REPLACE(amount * 1.1 AS amount) FROM sales;
Python Integration
Connect & Query
import duckdb
# In-memory database
conn = duckdb.connect()
# Execute query
result = conn.execute('SELECT 1 + 1').fetchall()
print(result) # [(2,)]
Pandas Integration
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
result = conn.execute('''
SELECT a, SUM(b) as total
FROM df
GROUP BY a
''').fetchdf()
Query Files Directly
# Query CSV
conn.execute("SELECT * FROM 'data.csv'").fetchdf()
# Query Parquet
conn.execute("SELECT * FROM 'data.parquet'").fetchdf()
# Query JSON
conn.execute("SELECT * FROM 'data.json'").fetchdf()
Performance Optimization
1. Use Parquet Format
Columnar Parquet + DuckDB’s columnar engine = 10-50x faster than CSV.
2. Partition Pruning
SELECT * FROM read_parquet('data/*.parquet', hive_partitioning=true)
WHERE year = 2026 AND month = 5;
3. Memory Management
SET memory_limit = '4GB';
SET threads = 4;
4. Materialized Views
CREATE VIEW monthly_sales AS
SELECT category, SUM(amount) AS total
FROM sales
GROUP BY category;
DuckDB vs Other Tools
| Feature | DuckDB | SQLite | Pandas | ClickHouse |
|---|---|---|---|---|
| Analytical Queries | ⭐⭐⭐ | ⭐ | ⭐⭐ | ⭐⭐⭐ |
| Row Operations | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
| Memory Efficiency | ⭐⭐⭐ | ⭐⭐⭐ | ⭐ | ⭐⭐⭐ |
| Setup Difficulty | Zero | Zero | Needs Env | Needs Server |
| Python Integration | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐ |
| Best For | Analytics | Storage | Wrangling | Real-time |
AI / LLM Integration
DuckDB serves as the perfect “data brain” for AI agents:
- Natural Language Queries: AI analyzes questions → generates SQL → DuckDB executes
- RAG Data Prep: Clean and preprocess documents at scale
- ML Inference: Run ML models inside the database with infera extension
Production Deployment
Docker
docker run -v $(pwd)/data:/data -p 5432:5432 duckdb/duckdb
Resource Limits
SET memory_limit = '4GB';
SET threads = 4;
SET temp_directory = '/tmp/duckdb_tmp';
Backup
- Database files use
.duckdbextension - Regular file backup is sufficient
- Export to Parquet as backup format
Next Steps
- DuckDB Python Integration Guide — Complete Python examples
- DuckDB SQL Syntax Reference — From SELECT to PIVOT
- DuckDB Performance Tuning — 150x speedup on 50GB data
- DuckDB E-Commerce Dashboard — Real-world use case
