Featured image of post DuckDB Python Integration Guide: From Installation to Advanced Data Analysis

DuckDB Python Integration Guide: From Installation to Advanced Data Analysis

Complete DuckDB Python integration guide covering installation, SQL queries, Pandas DataFrames, Parquet/CSV/JSON I/O, parameterized queries, and performance optimization for data analysts.

Introduction

If you’re a Python developer who works with data, you’ve likely felt the friction between Pandas’ memory limits and the need for SQL-powered analytics. DuckDB Python bridges this gap perfectly — it’s an in-process SQL OLAP database that runs inside your Python process with zero external dependencies, offering vectorized execution that crushes Pandas on performance.

This guide walks you through everything you need to integrate DuckDB into your Python workflow: from a simple pip install to advanced patterns like zero-copy DataFrame queries, multi-file Parquet analysis, and parameterized SQL for production pipelines.

If you’re new to DuckDB entirely, start with our DuckDB Installation Guide and DuckDB Beginners Guide 2026 first.


1. Installing the DuckDB Python Package

Getting started with DuckDB Python requires exactly one command:

pip install duckdb

This installs the duckdb Python package, which bundles DuckDB’s entire C++ engine as a native extension. No separate server, no JDBC driver, no Docker container — just a Python import.

Verify the Installation

import duckdb
print(duckdb.__version__)
# Output (example): 1.2.0

You can also install optional extensions for advanced functionality:

pip install duckdb duckdb-extensions  # Optional: install extension management

Pro Tip: DuckDB Python works on Python 3.8 through 3.13, across Linux, macOS, and Windows. It’s a single-file download — the wheel is ~15MB on most platforms.


2. Basic Connection and Query Execution

DuckDB offers two connection modes, both accessible from Python.

In-Memory Database (Default)

For most data analysis work, you’ll use an in-memory database:

import duckdb

# Create an in-memory database
conn = duckdb.connect()

# Or use the default connection directly
result = duckdb.sql("SELECT 'Hello, DuckDB!' AS greeting")
print(result)

Output:

┌──────────────────┐
│     greeting     │
│    varchar       │
├──────────────────┤
│ Hello, DuckDB!   │
└──────────────────┘

Persistent Database

For data that needs to survive your Python session:

conn = duckdb.connect('my_analysis.db')
conn.sql("CREATE TABLE users (id INTEGER, name VARCHAR, city VARCHAR)")
conn.sql("INSERT INTO users VALUES (1, 'Alice', 'New York'), (2, 'Bob', 'London')")
conn.sql("SELECT * FROM users").show()

Fetching Results

DuckDB provides multiple ways to retrieve query results:

result = duckdb.sql("SELECT unnest([10, 20, 30]) AS value")

# As a list of tuples
rows = result.fetchall()       # [(10,), (20,), (30,)]

# As a Pandas DataFrame
df = result.fetchdf()          # DataFrame with one column 'value'

# As a list of dictionaries
dicts = result.fetchdf().to_dict('records')  # [{'value': 10}, ...]

# As an Arrow table
import pyarrow as pa
arrow_table = result.fetch_arrow_table()

💡 fetchdf() is the most commonly used method — it seamlessly bridges DuckDB and Pandas, letting you use DuckDB for heavy lifting and Pandas for visualization or further processing.


3. DuckDB + Pandas DataFrame Integration

This is the killer feature of DuckDB Python. You can run SQL queries directly on Pandas DataFrames without any data copying.

Query a DataFrame with SQL

import pandas as pd
import duckdb

# Create a Pandas DataFrame
df = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget A'],
    'category': ['Electronics', 'Electronics', 'Home', 'Electronics'],
    'price': [29.99, 49.99, 15.99, 34.99],
    'quantity': [100, 75, 200, 50]
})

# Run SQL directly on the DataFrame — zero copy!
result = duckdb.sql("""
    SELECT
        product,
        category,
        SUM(price * quantity) AS total_revenue,
        AVG(price) AS avg_price,
        COUNT(*) AS transaction_count
    FROM df
    WHERE price > 20
    GROUP BY product, category
    ORDER BY total_revenue DESC
""").fetchdf()

print(result)
    product     category  total_revenue  avg_price  transaction_count
0  Widget B  Electronics        3749.25      49.99                  1
1  Widget A  Electronics        3498.50      32.49                  2

Multiple DataFrames in One Query

You can JOIN multiple DataFrames, or mix DataFrames with CSV files and database tables:

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [101, 102, 101],
    'amount': [250.0, 180.0, 320.0]
})

customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['New York', 'London', 'Tokyo']
})

result = duckdb.sql("""
    SELECT
        c.name,
        c.city,
        COUNT(o.order_id) AS order_count,
        SUM(o.amount) AS total_spent
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    GROUP BY c.name, c.city
    ORDER BY total_spent DESC NULLS LAST
""").fetchdf()

print(result)
      name      city  order_count  total_spent
0    Alice  New York            2        570.0
1      Bob    London            1        180.0
2  Charlie     Tokyo            0          NaN

How Zero-Copy Works

DuckDB’s Python client doesn’t serialize your DataFrames — it reads the underlying NumPy/Pandas columnar data directly via Apache Arrow. This means:

  • No memory duplication — what you see in Pandas is what DuckDB queries
  • Instant setup — no CREATE TABLE or data loading required
  • Seamless round-trips — query a DataFrame → get a DataFrame back

4. Parameterized Queries

When building production pipelines or interactive applications, you need parameterized queries to avoid SQL injection and handle dynamic values safely.

Using ? Placeholders

duckdb.sql("SELECT * FROM df WHERE price > ? AND category = ?", [30.0, 'Electronics']).show()

Named Parameters

min_price = 25.0
target_category = 'Home'

duckdb.sql("""
    SELECT * FROM df
    WHERE price >= $min_price AND category = $target_category
""", params={'min_price': min_price, 'target_category': target_category}).show()

Parameterized INSERT

conn = duckdb.connect()
conn.execute("CREATE TABLE IF NOT EXISTS sales (product VARCHAR, amount DECIMAL(10,2), sale_date DATE)")

products = ['Widget A', 'Widget B', 'Widget C']
amounts = [99.99, 149.99, 79.99]
dates = ['2026-01-15', '2026-01-16', '2026-01-17']

for p, a, d in zip(products, amounts, dates):
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", [p, a, d])

conn.sql("SELECT * FROM sales").show()

Bulk Insert with Parameterized Lists

For better performance with many rows:

data = [
    ('Widget D', 199.99, '2026-02-01'),
    ('Widget E', 249.99, '2026-02-02'),
    ('Widget F', 129.99, '2026-02-03'),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", data)

5. Reading and Writing CSV, Parquet, and JSON

DuckDB’s read_csv_auto, read_parquet, and read_json_auto functions make file I/O trivially easy from Python.

CSV Files

# Read a CSV file directly into a DuckDB relation
rel = duckdb.sql("SELECT * FROM read_csv_auto('data/sales_2026.csv')")
print(rel.fetchdf().head())

# Read with explicit options
rel = duckdb.sql("""
    SELECT * FROM read_csv_auto(
        'data/sales_2026.csv',
        header=true,
        delim=',',
        dateformat='%Y-%m-%d',
        all_varchar=true
    )
""")

# Write a query result to CSV
duckdb.sql("COPY (SELECT * FROM read_csv_auto('input.csv') WHERE amount > 100) TO 'filtered_output.csv' (HEADER, DELIMITER ',')")

Parquet Files

Parquet is where DuckDB truly shines — its columnar storage matches DuckDB’s vectorized engine perfectly.

# Read a Parquet file
df = duckdb.sql("SELECT * FROM read_parquet('data/analytics.parquet')").fetchdf()

# Read multiple Parquet files with glob patterns
df = duckdb.sql("SELECT * FROM read_parquet('data/monthly/*.parquet')").fetchdf()

# Read partitioned Parquet datasets
df = duckdb.sql("""
    SELECT * FROM read_parquet('data/year=2026/month=*/*.parquet')
    WHERE region = 'EMEA'
""").fetchdf()

# Write a query to Parquet
duckdb.sql("""
    COPY (
        SELECT region, SUM(revenue) AS total
        FROM read_parquet('data/*.parquet')
        GROUP BY region
    ) TO 'region_totals.parquet' (FORMAT PARQUET)
""")

JSON Files

DuckDB supports both newline-delimited JSON and standard JSON arrays:

# NDJSON (one JSON object per line)
df = duckdb.sql("SELECT * FROM read_json_auto('data/events.ndjson')").fetchdf()

# JSON array
df = duckdb.sql("SELECT * FROM read_json_auto('data/array.json')").fetchdf()

# Nested JSON with automatic flattening
df = duckdb.sql("""
    SELECT
        id,
        user.name AS user_name,
        user.address.city AS city,
        metadata.timestamp::TIMESTAMP AS event_time
    FROM read_json_auto('data/complex.json')
""").fetchdf()

# Write to JSON
duckdb.sql("""
    COPY (SELECT * FROM read_parquet('data.parquet') LIMIT 1000)
    TO 'sample.json' (FORMAT JSON)
""")

6. Using DuckDB as a Python Library for Data Analysis

Beyond simple querying, DuckDB Python enables powerful analytical workflows.

Chaining Query Results

DuckDB uses a relational API that supports method chaining:

rel = duckdb.sql("SELECT * FROM read_csv_auto('transactions.csv')")

# Chain filters and aggregations
result = (
    rel.filter("amount > 50")
       .aggregate("customer_id, SUM(amount) AS total, COUNT(*) AS txns")
       .order("total DESC")
       .limit(10)
       .fetchdf()
)

Window Functions

result = duckdb.sql("""
    SELECT
        product,
        sale_date,
        amount,
        SUM(amount) OVER (PARTITION BY product ORDER BY sale_date) AS running_total,
        AVG(amount) OVER (PARTITION BY product ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3
    FROM read_csv_auto('daily_sales.csv')
    ORDER BY product, sale_date
""").fetchdf()

Statistical Analysis

result = duckdb.sql("""
    SELECT
        category,
        COUNT(*) AS n,
        AVG(price) AS mean_price,
        STDDEV(price) AS std_price,
        MIN(price) AS min_price,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price,
        MAX(price) AS max_price,
        CORR(price, quantity) AS price_qty_correlation
    FROM read_parquet('products/*.parquet')
    GROUP BY category
""").fetchdf()

Creating Views for Reusable Analysis

conn = duckdb.connect()

# Register a Pandas DataFrame as a view
conn.register('orders_view', orders_df)

# Create a SQL view for reusable logic
conn.sql("""
    CREATE VIEW high_value_orders AS
    SELECT * FROM orders_view
    WHERE amount > 500 AND status = 'completed'
""")

# Use the view in subsequent queries
hourly_stats = conn.sql("""
    SELECT
        DATE_TRUNC('hour', order_time) AS hour,
        COUNT(*) AS orders,
        SUM(amount) AS revenue
    FROM high_value_orders
    GROUP BY hour
    ORDER BY hour
""").fetchdf()

7. Performance Tips for Python Users

1. Push Down Filters and Projections

DuckDB can push filters directly into Parquet/CSV reading. Always filter before joining:

# ❌ Slow: reads everything then filters
df = duckdb.sql("""
    SELECT * FROM (
        SELECT * FROM read_parquet('huge_dataset/*.parquet')
    ) WHERE region = 'APAC'
""").fetchdf()

# ✅ Fast: filter is pushed into the reader
df = duckdb.sql("""
    SELECT * FROM read_parquet('huge_dataset/*.parquet')
    WHERE region = 'APAC'
""").fetchdf()

2. Use Parquet Instead of CSV

Parquet is 10-100x faster for analytical queries:

# Slow
duckdb.sql("SELECT * FROM read_csv_auto('data.csv') WHERE date > '2026-01-01'")

# Fast
duckdb.sql("SELECT * FROM read_parquet('data.parquet') WHERE date > '2026-01-01'")

3. Set Memory Limits Explicitly

# Limit DuckDB memory usage
duckdb.sql("SET memory_limit = '4GB'")

# Set the number of threads
duckdb.sql("SET threads = 4")

4. Use fetchdf() Judiciously

Only convert results to Pandas when you need Pandas-specific functionality:

# ❌ Unnecessary conversion
df = duckdb.sql("SELECT * FROM large_table").fetchdf()
# Then do another DuckDB operation
df2 = duckdb.sql("SELECT COUNT(*) FROM df").fetchdf()

# ✅ Stay in DuckDB as long as possible
count = duckdb.sql("SELECT COUNT(*) FROM large_table").fetchone()[0]

5. Register Large DataFrames Instead of Passing Them In

For DataFrames you query repeatedly, register them once:

# ❌ Re-parsed every time
for i in range(100):
    duckdb.sql("SELECT COUNT(*) FROM my_df WHERE amount > ?", [i])

# ✅ Registered once, reused efficiently
conn = duckdb.connect()
conn.register('my_df', my_df)
for i in range(100):
    conn.sql("SELECT COUNT(*) FROM my_df WHERE amount > ?", [i])

Real-World Example: Complete Analysis Pipeline

Here’s a realistic Python analysis that combines everything we’ve covered:

import duckdb
import pandas as pd
from datetime import datetime

# Connect to a persistent database
conn = duckdb.connect('retail_analysis.db')

# 1. Load raw data from multiple sources
conn.sql("""
    CREATE TABLE daily_sales AS
    SELECT * FROM read_csv_auto('data/sales_2026.csv')
""")

# 2. Ingest Parquet data from a BI export
conn.sql("""
    INSERT INTO daily_sales
    SELECT * FROM read_parquet('data/bi_export/*.parquet')
""")

# 3. Run an analytical query
monthly_performance = conn.sql("""
    SELECT
        DATE_TRUNC('month', sale_date) AS month,
        product_category,
        COUNT(DISTINCT customer_id) AS unique_customers,
        SUM(quantity * unit_price) AS revenue,
        SUM(quantity * unit_price) / NULLIF(COUNT(DISTINCT customer_id), 0) AS revenue_per_customer
    FROM daily_sales
    WHERE sale_date >= '2026-01-01'
    GROUP BY ALL
    HAVING revenue > 10000
    ORDER BY month, revenue DESC
""").fetchdf()

# 4. Blend with a Pandas DataFrame (e.g., customer segments from a CRM export)
segments = pd.read_csv('data/customer_segments.csv')
blended = conn.sql("""
    SELECT
        s.month,
        s.product_category,
        s.revenue,
        cs.segment,
        cs.region
    FROM monthly_performance AS s
    JOIN segments AS cs ON s.product_category = cs.category
    WHERE cs.segment IN ('Premium', 'Enterprise')
""").fetchdf()

# 5. Export the final result
conn.sql("""
    COPY (
        SELECT * FROM blended
    ) TO 'analysis_output.parquet' (FORMAT PARQUET)
""")

print("Analysis complete! Output saved to analysis_output.parquet")
print(blended.head())

Conclusion

The DuckDB Python integration offers a uniquely powerful combination: the full analytical power of SQL with the flexibility of Python, all in a single process with zero external dependencies. Whether you’re replacing slow Pandas groupbys, building ETL pipelines across CSV/Parquet/JSON, or creating interactive analytical applications, DuckDB provides the performance and simplicity you need.

Key takeaways:

  • Installation is trivialpip install duckdb and you’re ready
  • Pandas integration is seamless — query DataFrames directly with SQL, zero copy
  • Multi-format support — CSV, Parquet, and JSON all work natively
  • Production-ready — parameterized queries, memory limits, thread control
  • Performance first — vectorized execution, filter pushdown, columnar storage

Start integrating DuckDB into your Python data workflow today. For more DuckDB tutorials, check out our Installation Guide and Beginners Guide 2026. And remember: the fastest Python data code is the code that hands the heavy lifting to DuckDB.

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy