Introduction
If you’re a Python developer who works with data, you’ve likely felt the friction between Pandas’ memory limits and the need for SQL-powered analytics. DuckDB Python bridges this gap perfectly — it’s an in-process SQL OLAP database that runs inside your Python process with zero external dependencies, offering vectorized execution that crushes Pandas on performance.
This guide walks you through everything you need to integrate DuckDB into your Python workflow: from a simple pip install to advanced patterns like zero-copy DataFrame queries, multi-file Parquet analysis, and parameterized SQL for production pipelines.
If you’re new to DuckDB entirely, start with our DuckDB Installation Guide and DuckDB Beginners Guide 2026 first.
1. Installing the DuckDB Python Package
Getting started with DuckDB Python requires exactly one command:
pip install duckdb
This installs the duckdb Python package, which bundles DuckDB’s entire C++ engine as a native extension. No separate server, no JDBC driver, no Docker container — just a Python import.
Verify the Installation
import duckdb
print(duckdb.__version__)
# Output (example): 1.2.0
You can also install optional extensions for advanced functionality:
pip install duckdb duckdb-extensions # Optional: install extension management
Pro Tip: DuckDB Python works on Python 3.8 through 3.13, across Linux, macOS, and Windows. It’s a single-file download — the wheel is ~15MB on most platforms.
2. Basic Connection and Query Execution
DuckDB offers two connection modes, both accessible from Python.
In-Memory Database (Default)
For most data analysis work, you’ll use an in-memory database:
import duckdb
# Create an in-memory database
conn = duckdb.connect()
# Or use the default connection directly
result = duckdb.sql("SELECT 'Hello, DuckDB!' AS greeting")
print(result)
Output:
┌──────────────────┐
│ greeting │
│ varchar │
├──────────────────┤
│ Hello, DuckDB! │
└──────────────────┘
Persistent Database
For data that needs to survive your Python session:
conn = duckdb.connect('my_analysis.db')
conn.sql("CREATE TABLE users (id INTEGER, name VARCHAR, city VARCHAR)")
conn.sql("INSERT INTO users VALUES (1, 'Alice', 'New York'), (2, 'Bob', 'London')")
conn.sql("SELECT * FROM users").show()
Fetching Results
DuckDB provides multiple ways to retrieve query results:
result = duckdb.sql("SELECT unnest([10, 20, 30]) AS value")
# As a list of tuples
rows = result.fetchall() # [(10,), (20,), (30,)]
# As a Pandas DataFrame
df = result.fetchdf() # DataFrame with one column 'value'
# As a list of dictionaries
dicts = result.fetchdf().to_dict('records') # [{'value': 10}, ...]
# As an Arrow table
import pyarrow as pa
arrow_table = result.fetch_arrow_table()
💡
fetchdf()is the most commonly used method — it seamlessly bridges DuckDB and Pandas, letting you use DuckDB for heavy lifting and Pandas for visualization or further processing.
3. DuckDB + Pandas DataFrame Integration
This is the killer feature of DuckDB Python. You can run SQL queries directly on Pandas DataFrames without any data copying.
Query a DataFrame with SQL
import pandas as pd
import duckdb
# Create a Pandas DataFrame
df = pd.DataFrame({
'product': ['Widget A', 'Widget B', 'Widget C', 'Widget A'],
'category': ['Electronics', 'Electronics', 'Home', 'Electronics'],
'price': [29.99, 49.99, 15.99, 34.99],
'quantity': [100, 75, 200, 50]
})
# Run SQL directly on the DataFrame — zero copy!
result = duckdb.sql("""
SELECT
product,
category,
SUM(price * quantity) AS total_revenue,
AVG(price) AS avg_price,
COUNT(*) AS transaction_count
FROM df
WHERE price > 20
GROUP BY product, category
ORDER BY total_revenue DESC
""").fetchdf()
print(result)
product category total_revenue avg_price transaction_count
0 Widget B Electronics 3749.25 49.99 1
1 Widget A Electronics 3498.50 32.49 2
Multiple DataFrames in One Query
You can JOIN multiple DataFrames, or mix DataFrames with CSV files and database tables:
orders = pd.DataFrame({
'order_id': [1, 2, 3],
'customer_id': [101, 102, 101],
'amount': [250.0, 180.0, 320.0]
})
customers = pd.DataFrame({
'customer_id': [101, 102, 103],
'name': ['Alice', 'Bob', 'Charlie'],
'city': ['New York', 'London', 'Tokyo']
})
result = duckdb.sql("""
SELECT
c.name,
c.city,
COUNT(o.order_id) AS order_count,
SUM(o.amount) AS total_spent
FROM customers AS c
LEFT JOIN orders AS o ON c.customer_id = o.customer_id
GROUP BY c.name, c.city
ORDER BY total_spent DESC NULLS LAST
""").fetchdf()
print(result)
name city order_count total_spent
0 Alice New York 2 570.0
1 Bob London 1 180.0
2 Charlie Tokyo 0 NaN
How Zero-Copy Works
DuckDB’s Python client doesn’t serialize your DataFrames — it reads the underlying NumPy/Pandas columnar data directly via Apache Arrow. This means:
- No memory duplication — what you see in Pandas is what DuckDB queries
- Instant setup — no
CREATE TABLEor data loading required - Seamless round-trips — query a DataFrame → get a DataFrame back
4. Parameterized Queries
When building production pipelines or interactive applications, you need parameterized queries to avoid SQL injection and handle dynamic values safely.
Using ? Placeholders
duckdb.sql("SELECT * FROM df WHERE price > ? AND category = ?", [30.0, 'Electronics']).show()
Named Parameters
min_price = 25.0
target_category = 'Home'
duckdb.sql("""
SELECT * FROM df
WHERE price >= $min_price AND category = $target_category
""", params={'min_price': min_price, 'target_category': target_category}).show()
Parameterized INSERT
conn = duckdb.connect()
conn.execute("CREATE TABLE IF NOT EXISTS sales (product VARCHAR, amount DECIMAL(10,2), sale_date DATE)")
products = ['Widget A', 'Widget B', 'Widget C']
amounts = [99.99, 149.99, 79.99]
dates = ['2026-01-15', '2026-01-16', '2026-01-17']
for p, a, d in zip(products, amounts, dates):
conn.execute("INSERT INTO sales VALUES (?, ?, ?)", [p, a, d])
conn.sql("SELECT * FROM sales").show()
Bulk Insert with Parameterized Lists
For better performance with many rows:
data = [
('Widget D', 199.99, '2026-02-01'),
('Widget E', 249.99, '2026-02-02'),
('Widget F', 129.99, '2026-02-03'),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", data)
5. Reading and Writing CSV, Parquet, and JSON
DuckDB’s read_csv_auto, read_parquet, and read_json_auto functions make file I/O trivially easy from Python.
CSV Files
# Read a CSV file directly into a DuckDB relation
rel = duckdb.sql("SELECT * FROM read_csv_auto('data/sales_2026.csv')")
print(rel.fetchdf().head())
# Read with explicit options
rel = duckdb.sql("""
SELECT * FROM read_csv_auto(
'data/sales_2026.csv',
header=true,
delim=',',
dateformat='%Y-%m-%d',
all_varchar=true
)
""")
# Write a query result to CSV
duckdb.sql("COPY (SELECT * FROM read_csv_auto('input.csv') WHERE amount > 100) TO 'filtered_output.csv' (HEADER, DELIMITER ',')")
Parquet Files
Parquet is where DuckDB truly shines — its columnar storage matches DuckDB’s vectorized engine perfectly.
# Read a Parquet file
df = duckdb.sql("SELECT * FROM read_parquet('data/analytics.parquet')").fetchdf()
# Read multiple Parquet files with glob patterns
df = duckdb.sql("SELECT * FROM read_parquet('data/monthly/*.parquet')").fetchdf()
# Read partitioned Parquet datasets
df = duckdb.sql("""
SELECT * FROM read_parquet('data/year=2026/month=*/*.parquet')
WHERE region = 'EMEA'
""").fetchdf()
# Write a query to Parquet
duckdb.sql("""
COPY (
SELECT region, SUM(revenue) AS total
FROM read_parquet('data/*.parquet')
GROUP BY region
) TO 'region_totals.parquet' (FORMAT PARQUET)
""")
JSON Files
DuckDB supports both newline-delimited JSON and standard JSON arrays:
# NDJSON (one JSON object per line)
df = duckdb.sql("SELECT * FROM read_json_auto('data/events.ndjson')").fetchdf()
# JSON array
df = duckdb.sql("SELECT * FROM read_json_auto('data/array.json')").fetchdf()
# Nested JSON with automatic flattening
df = duckdb.sql("""
SELECT
id,
user.name AS user_name,
user.address.city AS city,
metadata.timestamp::TIMESTAMP AS event_time
FROM read_json_auto('data/complex.json')
""").fetchdf()
# Write to JSON
duckdb.sql("""
COPY (SELECT * FROM read_parquet('data.parquet') LIMIT 1000)
TO 'sample.json' (FORMAT JSON)
""")
6. Using DuckDB as a Python Library for Data Analysis
Beyond simple querying, DuckDB Python enables powerful analytical workflows.
Chaining Query Results
DuckDB uses a relational API that supports method chaining:
rel = duckdb.sql("SELECT * FROM read_csv_auto('transactions.csv')")
# Chain filters and aggregations
result = (
rel.filter("amount > 50")
.aggregate("customer_id, SUM(amount) AS total, COUNT(*) AS txns")
.order("total DESC")
.limit(10)
.fetchdf()
)
Window Functions
result = duckdb.sql("""
SELECT
product,
sale_date,
amount,
SUM(amount) OVER (PARTITION BY product ORDER BY sale_date) AS running_total,
AVG(amount) OVER (PARTITION BY product ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3
FROM read_csv_auto('daily_sales.csv')
ORDER BY product, sale_date
""").fetchdf()
Statistical Analysis
result = duckdb.sql("""
SELECT
category,
COUNT(*) AS n,
AVG(price) AS mean_price,
STDDEV(price) AS std_price,
MIN(price) AS min_price,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price,
MAX(price) AS max_price,
CORR(price, quantity) AS price_qty_correlation
FROM read_parquet('products/*.parquet')
GROUP BY category
""").fetchdf()
Creating Views for Reusable Analysis
conn = duckdb.connect()
# Register a Pandas DataFrame as a view
conn.register('orders_view', orders_df)
# Create a SQL view for reusable logic
conn.sql("""
CREATE VIEW high_value_orders AS
SELECT * FROM orders_view
WHERE amount > 500 AND status = 'completed'
""")
# Use the view in subsequent queries
hourly_stats = conn.sql("""
SELECT
DATE_TRUNC('hour', order_time) AS hour,
COUNT(*) AS orders,
SUM(amount) AS revenue
FROM high_value_orders
GROUP BY hour
ORDER BY hour
""").fetchdf()
7. Performance Tips for Python Users
1. Push Down Filters and Projections
DuckDB can push filters directly into Parquet/CSV reading. Always filter before joining:
# ❌ Slow: reads everything then filters
df = duckdb.sql("""
SELECT * FROM (
SELECT * FROM read_parquet('huge_dataset/*.parquet')
) WHERE region = 'APAC'
""").fetchdf()
# ✅ Fast: filter is pushed into the reader
df = duckdb.sql("""
SELECT * FROM read_parquet('huge_dataset/*.parquet')
WHERE region = 'APAC'
""").fetchdf()
2. Use Parquet Instead of CSV
Parquet is 10-100x faster for analytical queries:
# Slow
duckdb.sql("SELECT * FROM read_csv_auto('data.csv') WHERE date > '2026-01-01'")
# Fast
duckdb.sql("SELECT * FROM read_parquet('data.parquet') WHERE date > '2026-01-01'")
3. Set Memory Limits Explicitly
# Limit DuckDB memory usage
duckdb.sql("SET memory_limit = '4GB'")
# Set the number of threads
duckdb.sql("SET threads = 4")
4. Use fetchdf() Judiciously
Only convert results to Pandas when you need Pandas-specific functionality:
# ❌ Unnecessary conversion
df = duckdb.sql("SELECT * FROM large_table").fetchdf()
# Then do another DuckDB operation
df2 = duckdb.sql("SELECT COUNT(*) FROM df").fetchdf()
# ✅ Stay in DuckDB as long as possible
count = duckdb.sql("SELECT COUNT(*) FROM large_table").fetchone()[0]
5. Register Large DataFrames Instead of Passing Them In
For DataFrames you query repeatedly, register them once:
# ❌ Re-parsed every time
for i in range(100):
duckdb.sql("SELECT COUNT(*) FROM my_df WHERE amount > ?", [i])
# ✅ Registered once, reused efficiently
conn = duckdb.connect()
conn.register('my_df', my_df)
for i in range(100):
conn.sql("SELECT COUNT(*) FROM my_df WHERE amount > ?", [i])
Real-World Example: Complete Analysis Pipeline
Here’s a realistic Python analysis that combines everything we’ve covered:
import duckdb
import pandas as pd
from datetime import datetime
# Connect to a persistent database
conn = duckdb.connect('retail_analysis.db')
# 1. Load raw data from multiple sources
conn.sql("""
CREATE TABLE daily_sales AS
SELECT * FROM read_csv_auto('data/sales_2026.csv')
""")
# 2. Ingest Parquet data from a BI export
conn.sql("""
INSERT INTO daily_sales
SELECT * FROM read_parquet('data/bi_export/*.parquet')
""")
# 3. Run an analytical query
monthly_performance = conn.sql("""
SELECT
DATE_TRUNC('month', sale_date) AS month,
product_category,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(quantity * unit_price) AS revenue,
SUM(quantity * unit_price) / NULLIF(COUNT(DISTINCT customer_id), 0) AS revenue_per_customer
FROM daily_sales
WHERE sale_date >= '2026-01-01'
GROUP BY ALL
HAVING revenue > 10000
ORDER BY month, revenue DESC
""").fetchdf()
# 4. Blend with a Pandas DataFrame (e.g., customer segments from a CRM export)
segments = pd.read_csv('data/customer_segments.csv')
blended = conn.sql("""
SELECT
s.month,
s.product_category,
s.revenue,
cs.segment,
cs.region
FROM monthly_performance AS s
JOIN segments AS cs ON s.product_category = cs.category
WHERE cs.segment IN ('Premium', 'Enterprise')
""").fetchdf()
# 5. Export the final result
conn.sql("""
COPY (
SELECT * FROM blended
) TO 'analysis_output.parquet' (FORMAT PARQUET)
""")
print("Analysis complete! Output saved to analysis_output.parquet")
print(blended.head())
Conclusion
The DuckDB Python integration offers a uniquely powerful combination: the full analytical power of SQL with the flexibility of Python, all in a single process with zero external dependencies. Whether you’re replacing slow Pandas groupbys, building ETL pipelines across CSV/Parquet/JSON, or creating interactive analytical applications, DuckDB provides the performance and simplicity you need.
Key takeaways:
- Installation is trivial —
pip install duckdband you’re ready - Pandas integration is seamless — query DataFrames directly with SQL, zero copy
- Multi-format support — CSV, Parquet, and JSON all work natively
- Production-ready — parameterized queries, memory limits, thread control
- Performance first — vectorized execution, filter pushdown, columnar storage
Start integrating DuckDB into your Python data workflow today. For more DuckDB tutorials, check out our Installation Guide and Beginners Guide 2026. And remember: the fastest Python data code is the code that hands the heavy lifting to DuckDB.
