DuckDB + Apache Arrow: Zero-Copy Data Integration Guide

DuckDB + Apache Arrow: Zero-Copy Data Integration Guide

Master zero-copy data integration between DuckDB and Apache Arrow. Covers Arrow interfaces, ADBC protocol, PyArrow interop, and practical scenarios for data pipelines, ML feature engineering, and cross-language communication.

Overview

In data engineering, transferring data between different systems has always been one of the most challenging bottlenecks. Traditional data exchange methods—JSON serialization, CSV parsing, or even row-by-row DataFrame conversion—incur massive CPU overhead and memory waste.

Apache Arrow solves this by defining a standardized columnar memory format that enables zero-copy data sharing: data is loaded once, and all Arrow-compatible tools can read it directly without serialization or deserialization.

DuckDB, as an embedded analytical database, has deep integration with Apache Arrow. DuckDB can read Arrow data directly, return query results as Arrow RecordBatches, and even serve data through the ADBC (Arrow Database Connectivity) protocol.

This article provides a comprehensive guide to mastering DuckDB + Apache Arrow integration, covering fundamentals, practical code, and monetization strategies.

Why Arrow Matters?

Problems with Traditional Data Transfer

Consider passing DuckDB query results to a Python script for machine learning training:

import duckdb

# Traditional approach: DuckDB → CSV/JSON → Pandas
conn = duckdb.connect()
result = conn.execute("SELECT * FROM large_table")
df = result.fetchdf()  # Internal: DuckDB → Python tuple → Pandas DataFrame

Problems with this approach:

Two memory copies: DuckDB columnar data → Python row-oriented tuples → Pandas columnar DataFrame
High CPU overhead: Format conversion consumes significant CPU cycles
Memory waste: Multiple copies of the same data live in memory simultaneously
High latency: Large datasets may take tens of seconds to convert

How Arrow Solves It

Arrow defines a standard columnar memory format shared by all compatible tools—no copying needed:

┌─────────────────────────────────────────────────┐
│  Apache Arrow Columnar Format (Shared Memory)    │
│                                                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Col A    │  │ Col B    │  │ Col C    │       │
│  │ Int32    │  │ Float64  │  │ String   │       │
│  │ [1,2,3]  │  │ [4.0,...]│  │ ["a",...]│       │
│  └──────────┘  └──────────┘  └──────────┘       │
│                ▲         ▲                        │
│                │         │                        │
│      ┌─────────┘         └──────────┐             │
│      ▼                               ▼             │
│  ┌──────────┐                  ┌──────────┐       │
│  │ DuckDB   │                  │ PyArrow  │       │
│  │Zero-Copy │                  │Zero-Copy │       │
│  │  Read    │                  │  Read    │       │
│  └──────────┘                  └──────────┘       │
└─────────────────────────────────────────────────┘

DuckDB, PyArrow, Pandas (with PyArrow backend), Polars, DataFusion, and other tools can all operate on the same Arrow memory—data never moves.

DuckDB Arrow Interfaces Deep Dive

1. Query Results as Arrow RecordBatches

DuckDB’s Python API provides direct Arrow conversion:

import duckdb
import pyarrow as pa

# Create in-memory database
conn = duckdb.connect()

# Load data
conn.execute("""
    CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_large.csv')
""")

# Get query results in Arrow format
result = conn.execute("""
    SELECT 
        region,
        date_trunc('month', sale_date) AS month,
        SUM(amount) AS total_sales,
        COUNT(*) AS transaction_count
    FROM sales
    WHERE sale_date >= '2025-01-01'
    GROUP BY region, month
    ORDER BY region, month
""")

# Zero-copy: return as Arrow Table
arrow_table = result.fetch_arrow_table()
print(f"Rows: {arrow_table.num_rows}, Columns: {arrow_table.num_columns}")
print(f"Schema: {arrow_table.schema}")

The key difference: fetch_arrow_table() returns data in Arrow format, avoiding the format conversion overhead of fetchdf() (which returns a Pandas DataFrame).

2. Querying PyArrow Tables Directly

DuckDB can query PyArrow tables directly without importing data first:

import pyarrow as pa
import pyarrow.dataset as ds
import duckdb

# Create a PyArrow table
table = pa.table({
    'id': pa.array([1, 2, 3, 4, 5]),
    'name': pa.array(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']),
    'score': pa.array([95.5, 87.3, 92.1, 78.9, 88.4])
})

# DuckDB queries PyArrow table directly (zero-copy!)
conn = duckdb.connect()
result = conn.execute("""
    SELECT 
        name,
        score,
        RANK() OVER (ORDER BY score DESC) AS rank
    FROM table
    WHERE score > 85
    ORDER BY score DESC
""").fetch_arrow_table()

print(result)

Sample output:

pyarrow.Table
name: string, score: double, rank: int32
----
name: ["Alice", "Charlie", "Eve", "Bob"]
score: [95.5, 92.1, 88.4, 87.3]
rank: [1, 2, 3, 4]

The zero-copy here means: the PyArrow table’s data resides in Arrow memory, and DuckDB’s query engine reads this memory directly for analysis—no data is copied.

3. Reading Arrow IPC Files

Arrow’s IPC (Inter-Process Communication) format is an efficient binary serialization format. DuckDB reads it natively:

import duckdb
import pyarrow as pa
import pyarrow.ipc as ipc
import tempfile

# Create sample data and write to Arrow IPC file
table = pa.table({
    'timestamp': pa.array([1000, 2000, 3000, 4000]),
    'temperature': pa.array([22.5, 23.1, 21.8, 24.2])
})

with tempfile.NamedTemporaryFile(suffix='.arrow', delete=False) as f:
    writer = ipc.new_file(f, table.schema)
    writer.write_table(table)
    writer.close()
    ipc_path = f.name

# DuckDB queries Arrow IPC file directly
conn = duckdb.connect()
conn.execute(f"CREATE TABLE temps AS SELECT * FROM read_arrow_ipc('{ipc_path}')")
result = conn.execute("SELECT AVG(temperature) as avg_temp FROM temps").fetchone()
print(f"Average temperature: {result[0]}°C")

4. Streaming with Arrow RecordBatches

For very large datasets, DuckDB supports streaming Arrow processing:

import duckdb
import pyarrow as pa
import pyarrow.csv as csv

# Stream CSV → Arrow stream → DuckDB instant query
conn = duckdb.connect()

read_options = csv.ReadOptions(block_size=1024 * 1024 * 10)  # 10MB blocks
csv_stream = csv.open_csv('ultra_large.csv', read_options=read_options)

# Process batch by batch
for batch in csv_stream:
    # Zero-copy: DuckDB queries Arrow RecordBatch directly
    result = conn.execute("""
        SELECT count(*) as cnt, sum(amount) as total
        FROM batch
        WHERE status = 'completed'
    """).fetchone()
    print(f"Batch result: {result}")

ADBC: The Arrow Database Connectivity Protocol

ADBC (Arrow Database Connectivity) is a next-generation database connectivity standard driven by the Arrow community, designed to replace JDBC/ODBC’s inefficient data transfer.

Why ADBC Matters?

Feature	JDBC/ODBC	ADBC
Data Transfer Format	Row-based (row-by-row fetch)	Columnar Arrow batch transfer
Serialization Overhead	High (type conversion per row)	Low (zero-copy)
Batch Transfer	No native batch support	Native RecordBatch support
Memory Efficiency	Poor (row-oriented storage)	Excellent (columnar compression)
Cross-Language Support	Complex bindings	Native cross-language
Streaming Queries	Limited support	Full support

Using the DuckDB ADBC Driver

import adbc_driver_duckdb.dbapi as duckdb_adbc

# Connect to DuckDB via ADBC
conn = duckdb_adbc.connect()

# Create table
conn.execute("""
    CREATE TABLE orders AS
    SELECT range AS order_id, 
           random() * 1000 AS amount,
           CASE WHEN random() > 0.5 THEN 'completed' ELSE 'pending' END AS status
    FROM range(1000000)
""")

# Query and get Arrow-formatted results
cur = conn.cursor()
cur.execute("""
    SELECT status, count(*) as cnt, sum(amount) as total
    FROM orders
    GROUP BY status
""")

# Zero-copy retrieval of Arrow data
for batch in cur.fetch_record_batches():
    print(batch)

ADBC’s greatest advantage: when DuckDB runs remotely (via the Quack protocol or MotherDuck), clients can use ADBC to fetch data in Arrow batches, reducing network transfer and serialization overhead.

Practical Scenarios

Scenario 1: Cross-Language Data Pipelines

A Python data processing pipeline exchanging data with Java/Rust services:

# Python side: DuckDB processing → Arrow format output
import duckdb
import pyarrow as pa
import pyarrow.ipc as ipc

conn = duckdb.connect()

# Data cleaning
conn.execute("""
    CREATE VIEW cleaned_sales AS
    SELECT 
        sale_id,
        customer_id,
        amount,
        sale_date
    FROM read_parquet('raw_sales/*.parquet')
    WHERE amount > 0 AND customer_id IS NOT NULL
""")

# Export to Arrow IPC file (interchange format)
result = conn.execute("SELECT * FROM cleaned_sales")
arrow_table = result.fetch_arrow_table()
with open('exchange_data.arrow', 'wb') as f:
    writer = ipc.new_file(f, arrow_table.schema)
    writer.write_table(arrow_table)
    writer.close()
print(f"Exported {arrow_table.num_rows} rows to Arrow IPC file")

# Java/Rust side reads this Arrow file directly (zero-copy)

Scenario 2: ML Feature Engineering

Use DuckDB as a feature engineering engine, outputting Arrow format for ML model training:

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# DuckDB handles 1 billion log rows
conn = duckdb.connect()

# Feature engineering - all in DuckDB SQL
features = conn.execute("""
    SELECT
        customer_id,
        -- Time features
        date_diff('day', last_purchase_date, current_date) AS days_since_last_purchase,
        -- Aggregate features
        COUNT(*) AS total_orders,
        SUM(amount) AS total_spent,
        AVG(amount) AS avg_order_value,
        STDDEV(amount) AS order_amount_volatility,
        -- Categorical encoding
        CASE payment_method
            WHEN 'credit_card' THEN 1
            WHEN 'debit_card' THEN 2
            WHEN 'paypal' THEN 3
            ELSE 0
        END AS payment_method_code,
        -- Target variable
        CASE WHEN churned = true THEN 1 ELSE 0 END AS label
    FROM customer_events
    WHERE event_date >= '2025-01-01'
    GROUP BY customer_id, last_purchase_date, payment_method, churned
""").fetch_arrow_table()  # Zero-copy to Arrow

# Arrow → NumPy (zero-copy for numeric types)
X = np.column_stack([
    features.column('days_since_last_purchase').to_numpy(),
    features.column('total_orders').to_numpy(),
    features.column('total_spent').to_numpy(),
    features.column('avg_order_value').to_numpy(),
    features.column('order_amount_volatility').to_numpy(),
    features.column('payment_method_code').to_numpy(),
])
y = features.column('label').to_numpy()

# Train model
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)

Key insight: Arrow’s to_numpy() achieves zero-copy for numeric types—Arrow data maps directly to NumPy arrays without copying.

Share data between microservices using Arrow shared memory:

# Process A: Data Producer (DuckDB → Arrow → shared memory)
import duckdb
import pyarrow as pa
import pyarrow.ipc as ipc

conn = duckdb.connect()
result = conn.execute("SELECT * FROM daily_aggregation")
arrow_table = result.fetch_arrow_table()

# Write Arrow table to shared memory file
with open('/dev/shm/data.arrow', 'wb') as f:
    writer = ipc.new_file(f, arrow_table.schema)
    writer.write_table(arrow_table)
    writer.close()

# Process B: Data Consumer (millisecond read)
import pyarrow.ipc as ipc
with open('/dev/shm/data.arrow', 'rb') as f:
    reader = ipc.open_file(f)
    table = reader.read_all()
print(f"Read {table.num_rows} rows with zero copy from shared memory")

Comparison with Traditional Tools

Feature	DuckDB + Arrow	Pandas	Spark
Data Exchange Format	Columnar Arrow (zero-copy)	Row/Column hybrid (needs conversion)	Row-based JVM (needs serialization)
Cross-Language Support	Native (C++/Python/R/Java)	Python only	JVM + Python
Memory Efficiency	High (columnar, zero-copy)	Medium (high memory usage)	Low (JVM overhead)
Query Latency	Milliseconds (embedded)	Seconds (needs loading)	Seconds to minutes (cluster)
Single-Node Throughput	10-100 GB/s	1-5 GB/s	JVM-limited
Streaming	Yes (RecordBatch stream)	Limited	Yes (micro-batch)
Installation	pip install duckdb	pip install pandas	Requires Hadoop cluster
ML Tool Integration	Arrow → NumPy zero-copy	Native NumPy	Needs conversion
Remote Query	ADBC/Quack protocol	Not natively supported	Thrift RPC
Data Source Diversity	High (CSV/Parquet/Arrow/JSON)	Medium	High (HDFS/S3)

Best Practices

1. Choose the Right Interface

fetch_arrow_table(): Best for small to medium datasets (fits in memory)
fetch_record_batch(): Best for very large datasets (streaming)
Direct PyArrow table querying: When data is already in Arrow format
ADBC driver: For remote database scenarios

2. Performance Optimization Tips

Column pruning: SELECT only needed columns to reduce Arrow data volume
Predicate pushdown: Filter data at the SQL level to reduce Arrow data size
Optimal batch size: For streaming, 1M-10M rows per batch is usually optimal
Leverage Dictionary encoding: DuckDB automatically optimizes low-cardinality categorical columns

# Best practice example
conn.execute("""
    SELECT 
        -- Only needed columns
        customer_id,
        total_amount
    FROM orders
    WHERE date >= '2025-01-01'  -- Predicate pushdown
    ORDER BY total_amount DESC
    LIMIT 1000
""")

3. Common Pitfalls

Pitfall	Cause	Solution
`fetch_arrow_table()` OOM	Data exceeds available memory	Use `fetch_record_batch()` streaming
Arrow/Pandas backend conflict	Mixed backends	Standardize on `dtype_backend='pyarrow'`
String type performance drop	Arrow strings vs DuckDB VARCHAR	Enable Dictionary encoding
Timestamp precision loss	Arrow nanosecond vs DuckDB microsecond	Explicit CAST to target precision

Monetization Strategies

Mastering DuckDB + Arrow integration opens several revenue paths:

1. Enterprise Data Pipeline Optimization Consulting ($500-3,000/day)

Audit JDBC/ODBC data transfer bottlenecks and migrate to Arrow + ADBC architecture
Design zero-copy data pipelines to reduce server and memory costs
Provide performance optimization for high-throughput scenarios (finance, e-commerce)

2. Build a Data Middleware Product

Develop a lightweight data lake query engine based on DuckDB + Arrow
Offer as SaaS API: users upload CSV/Parquet, system returns analytical results via Arrow
Pricing $50-500/month per customer, targeting SMBs and startups

3. Open Source + Paid Support

Create a data migration tool leveraging DuckDB’s Arrow interface (e.g., data-arrow-sync)
Open source on GitHub for community traction, offer enterprise paid support
Follow the commercialization path of tools like dlt and dbt

4. Technical Training & Tutorials

Launch an online course: “High-Performance Data Engineering with DuckDB + Arrow”
Price $50-200 per student for data engineers and analysts
Offer corporate training ($1,000-5,000/day)

5. ML/AI Pipeline Specialist Service

Design DuckDB → Arrow → ML training pipelines for AI startups
Reduce feature engineering data conversion overhead, accelerate model iteration
Project-based pricing $3,000-15,000

Summary

DuckDB’s deep integration with Apache Arrow brings revolutionary performance improvements to modern data engineering. Through zero-copy data sharing, developers can:

Eliminate unnecessary data serialization overhead
Efficiently exchange data across languages and tools
Build high-performance data pipelines and ML feature engineering workflows

Combined with the ADBC protocol, DuckDB can serve as an Arrow-native analytical database, replacing traditional JDBC/ODBC solutions. As the Arrow ecosystem continues to grow, mastering this technology will become a core competency for data engineers.

Start today—replace fetchdf() with fetch_arrow_table() in your next project and experience the performance gains of zero-copy data processing!

References

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.