Featured image of post DuckDB + Apache Arrow: Zero-Copy Data Integration Guide

DuckDB + Apache Arrow: Zero-Copy Data Integration Guide

Master zero-copy data integration between DuckDB and Apache Arrow. Covers Arrow interfaces, ADBC protocol, PyArrow interop, and practical scenarios for data pipelines, ML feature engineering, and cross-language communication.

Overview

In data engineering, transferring data between different systems has always been one of the most challenging bottlenecks. Traditional data exchange methods—JSON serialization, CSV parsing, or even row-by-row DataFrame conversion—incur massive CPU overhead and memory waste.

Apache Arrow solves this by defining a standardized columnar memory format that enables zero-copy data sharing: data is loaded once, and all Arrow-compatible tools can read it directly without serialization or deserialization.

DuckDB, as an embedded analytical database, has deep integration with Apache Arrow. DuckDB can read Arrow data directly, return query results as Arrow RecordBatches, and even serve data through the ADBC (Arrow Database Connectivity) protocol.

This article provides a comprehensive guide to mastering DuckDB + Apache Arrow integration, covering fundamentals, practical code, and monetization strategies.

Why Arrow Matters?

Problems with Traditional Data Transfer

Consider passing DuckDB query results to a Python script for machine learning training:

import duckdb

# Traditional approach: DuckDB → CSV/JSON → Pandas
conn = duckdb.connect()
result = conn.execute("SELECT * FROM large_table")
df = result.fetchdf()  # Internal: DuckDB → Python tuple → Pandas DataFrame

Problems with this approach:

  1. Two memory copies: DuckDB columnar data → Python row-oriented tuples → Pandas columnar DataFrame
  2. High CPU overhead: Format conversion consumes significant CPU cycles
  3. Memory waste: Multiple copies of the same data live in memory simultaneously
  4. High latency: Large datasets may take tens of seconds to convert

How Arrow Solves It

Arrow defines a standard columnar memory format shared by all compatible tools—no copying needed:

┌─────────────────────────────────────────────────┐
│  Apache Arrow Columnar Format (Shared Memory)    │
│                                                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Col A    │  │ Col B    │  │ Col C    │       │
│  │ Int32    │  │ Float64  │  │ String   │       │
│  │ [1,2,3]  │  │ [4.0,...]│  │ ["a",...]│       │
│  └──────────┘  └──────────┘  └──────────┘       │
│                ▲         ▲                        │
│                │         │                        │
│      ┌─────────┘         └──────────┐             │
│      ▼                               ▼             │
│  ┌──────────┐                  ┌──────────┐       │
│  │ DuckDB   │                  │ PyArrow  │       │
│  │Zero-Copy │                  │Zero-Copy │       │
│  │  Read    │                  │  Read    │       │
│  └──────────┘                  └──────────┘       │
└─────────────────────────────────────────────────┘

DuckDB, PyArrow, Pandas (with PyArrow backend), Polars, DataFusion, and other tools can all operate on the same Arrow memory—data never moves.

DuckDB Arrow Interfaces Deep Dive

1. Query Results as Arrow RecordBatches

DuckDB’s Python API provides direct Arrow conversion:

import duckdb
import pyarrow as pa

# Create in-memory database
conn = duckdb.connect()

# Load data
conn.execute("""
    CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_large.csv')
""")

# Get query results in Arrow format
result = conn.execute("""
    SELECT 
        region,
        date_trunc('month', sale_date) AS month,
        SUM(amount) AS total_sales,
        COUNT(*) AS transaction_count
    FROM sales
    WHERE sale_date >= '2025-01-01'
    GROUP BY region, month
    ORDER BY region, month
""")

# Zero-copy: return as Arrow Table
arrow_table = result.fetch_arrow_table()
print(f"Rows: {arrow_table.num_rows}, Columns: {arrow_table.num_columns}")
print(f"Schema: {arrow_table.schema}")

The key difference: fetch_arrow_table() returns data in Arrow format, avoiding the format conversion overhead of fetchdf() (which returns a Pandas DataFrame).

2. Querying PyArrow Tables Directly

DuckDB can query PyArrow tables directly without importing data first:

import pyarrow as pa
import pyarrow.dataset as ds
import duckdb

# Create a PyArrow table
table = pa.table({
    'id': pa.array([1, 2, 3, 4, 5]),
    'name': pa.array(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']),
    'score': pa.array([95.5, 87.3, 92.1, 78.9, 88.4])
})

# DuckDB queries PyArrow table directly (zero-copy!)
conn = duckdb.connect()
result = conn.execute("""
    SELECT 
        name,
        score,
        RANK() OVER (ORDER BY score DESC) AS rank
    FROM table
    WHERE score > 85
    ORDER BY score DESC
""").fetch_arrow_table()

print(result)

Sample output:

pyarrow.Table
name: string, score: double, rank: int32
----
name: ["Alice", "Charlie", "Eve", "Bob"]
score: [95.5, 92.1, 88.4, 87.3]
rank: [1, 2, 3, 4]

The zero-copy here means: the PyArrow table’s data resides in Arrow memory, and DuckDB’s query engine reads this memory directly for analysis—no data is copied.

3. Reading Arrow IPC Files

Arrow’s IPC (Inter-Process Communication) format is an efficient binary serialization format. DuckDB reads it natively:

import duckdb
import pyarrow as pa
import pyarrow.ipc as ipc
import tempfile

# Create sample data and write to Arrow IPC file
table = pa.table({
    'timestamp': pa.array([1000, 2000, 3000, 4000]),
    'temperature': pa.array([22.5, 23.1, 21.8, 24.2])
})

with tempfile.NamedTemporaryFile(suffix='.arrow', delete=False) as f:
    writer = ipc.new_file(f, table.schema)
    writer.write_table(table)
    writer.close()
    ipc_path = f.name

# DuckDB queries Arrow IPC file directly
conn = duckdb.connect()
conn.execute(f"CREATE TABLE temps AS SELECT * FROM read_arrow_ipc('{ipc_path}')")
result = conn.execute("SELECT AVG(temperature) as avg_temp FROM temps").fetchone()
print(f"Average temperature: {result[0]}°C")

4. Streaming with Arrow RecordBatches

For very large datasets, DuckDB supports streaming Arrow processing:

import duckdb
import pyarrow as pa
import pyarrow.csv as csv

# Stream CSV → Arrow stream → DuckDB instant query
conn = duckdb.connect()

read_options = csv.ReadOptions(block_size=1024 * 1024 * 10)  # 10MB blocks
csv_stream = csv.open_csv('ultra_large.csv', read_options=read_options)

# Process batch by batch
for batch in csv_stream:
    # Zero-copy: DuckDB queries Arrow RecordBatch directly
    result = conn.execute("""
        SELECT count(*) as cnt, sum(amount) as total
        FROM batch
        WHERE status = 'completed'
    """).fetchone()
    print(f"Batch result: {result}")

ADBC: The Arrow Database Connectivity Protocol

ADBC (Arrow Database Connectivity) is a next-generation database connectivity standard driven by the Arrow community, designed to replace JDBC/ODBC’s inefficient data transfer.

Why ADBC Matters?

FeatureJDBC/ODBCADBC
Data Transfer FormatRow-based (row-by-row fetch)Columnar Arrow batch transfer
Serialization OverheadHigh (type conversion per row)Low (zero-copy)
Batch TransferNo native batch supportNative RecordBatch support
Memory EfficiencyPoor (row-oriented storage)Excellent (columnar compression)
Cross-Language SupportComplex bindingsNative cross-language
Streaming QueriesLimited supportFull support

Using the DuckDB ADBC Driver

import adbc_driver_duckdb.dbapi as duckdb_adbc

# Connect to DuckDB via ADBC
conn = duckdb_adbc.connect()

# Create table
conn.execute("""
    CREATE TABLE orders AS
    SELECT range AS order_id, 
           random() * 1000 AS amount,
           CASE WHEN random() > 0.5 THEN 'completed' ELSE 'pending' END AS status
    FROM range(1000000)
""")

# Query and get Arrow-formatted results
cur = conn.cursor()
cur.execute("""
    SELECT status, count(*) as cnt, sum(amount) as total
    FROM orders
    GROUP BY status
""")

# Zero-copy retrieval of Arrow data
for batch in cur.fetch_record_batches():
    print(batch)

ADBC’s greatest advantage: when DuckDB runs remotely (via the Quack protocol or MotherDuck), clients can use ADBC to fetch data in Arrow batches, reducing network transfer and serialization overhead.

Practical Scenarios

Scenario 1: Cross-Language Data Pipelines

A Python data processing pipeline exchanging data with Java/Rust services:

# Python side: DuckDB processing → Arrow format output
import duckdb
import pyarrow as pa
import pyarrow.ipc as ipc

conn = duckdb.connect()

# Data cleaning
conn.execute("""
    CREATE VIEW cleaned_sales AS
    SELECT 
        sale_id,
        customer_id,
        amount,
        sale_date
    FROM read_parquet('raw_sales/*.parquet')
    WHERE amount > 0 AND customer_id IS NOT NULL
""")

# Export to Arrow IPC file (interchange format)
result = conn.execute("SELECT * FROM cleaned_sales")
arrow_table = result.fetch_arrow_table()
with open('exchange_data.arrow', 'wb') as f:
    writer = ipc.new_file(f, arrow_table.schema)
    writer.write_table(arrow_table)
    writer.close()
print(f"Exported {arrow_table.num_rows} rows to Arrow IPC file")

# Java/Rust side reads this Arrow file directly (zero-copy)

Scenario 2: ML Feature Engineering

Use DuckDB as a feature engineering engine, outputting Arrow format for ML model training:

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# DuckDB handles 1 billion log rows
conn = duckdb.connect()

# Feature engineering - all in DuckDB SQL
features = conn.execute("""
    SELECT
        customer_id,
        -- Time features
        date_diff('day', last_purchase_date, current_date) AS days_since_last_purchase,
        -- Aggregate features
        COUNT(*) AS total_orders,
        SUM(amount) AS total_spent,
        AVG(amount) AS avg_order_value,
        STDDEV(amount) AS order_amount_volatility,
        -- Categorical encoding
        CASE payment_method
            WHEN 'credit_card' THEN 1
            WHEN 'debit_card' THEN 2
            WHEN 'paypal' THEN 3
            ELSE 0
        END AS payment_method_code,
        -- Target variable
        CASE WHEN churned = true THEN 1 ELSE 0 END AS label
    FROM customer_events
    WHERE event_date >= '2025-01-01'
    GROUP BY customer_id, last_purchase_date, payment_method, churned
""").fetch_arrow_table()  # Zero-copy to Arrow

# Arrow → NumPy (zero-copy for numeric types)
X = np.column_stack([
    features.column('days_since_last_purchase').to_numpy(),
    features.column('total_orders').to_numpy(),
    features.column('total_spent').to_numpy(),
    features.column('avg_order_value').to_numpy(),
    features.column('order_amount_volatility').to_numpy(),
    features.column('payment_method_code').to_numpy(),
])
y = features.column('label').to_numpy()

# Train model
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)

Key insight: Arrow’s to_numpy() achieves zero-copy for numeric types—Arrow data maps directly to NumPy arrays without copying.

Scenario 3: Cross-Process Data Sharing

Share data between microservices using Arrow shared memory:

# Process A: Data Producer (DuckDB → Arrow → shared memory)
import duckdb
import pyarrow as pa
import pyarrow.ipc as ipc

conn = duckdb.connect()
result = conn.execute("SELECT * FROM daily_aggregation")
arrow_table = result.fetch_arrow_table()

# Write Arrow table to shared memory file
with open('/dev/shm/data.arrow', 'wb') as f:
    writer = ipc.new_file(f, arrow_table.schema)
    writer.write_table(arrow_table)
    writer.close()

# Process B: Data Consumer (millisecond read)
import pyarrow.ipc as ipc
with open('/dev/shm/data.arrow', 'rb') as f:
    reader = ipc.open_file(f)
    table = reader.read_all()
print(f"Read {table.num_rows} rows with zero copy from shared memory")

Comparison with Traditional Tools

FeatureDuckDB + ArrowPandasSpark
Data Exchange FormatColumnar Arrow (zero-copy)Row/Column hybrid (needs conversion)Row-based JVM (needs serialization)
Cross-Language SupportNative (C++/Python/R/Java)Python onlyJVM + Python
Memory EfficiencyHigh (columnar, zero-copy)Medium (high memory usage)Low (JVM overhead)
Query LatencyMilliseconds (embedded)Seconds (needs loading)Seconds to minutes (cluster)
Single-Node Throughput10-100 GB/s1-5 GB/sJVM-limited
StreamingYes (RecordBatch stream)LimitedYes (micro-batch)
Installationpip install duckdbpip install pandasRequires Hadoop cluster
ML Tool IntegrationArrow → NumPy zero-copyNative NumPyNeeds conversion
Remote QueryADBC/Quack protocolNot natively supportedThrift RPC
Data Source DiversityHigh (CSV/Parquet/Arrow/JSON)MediumHigh (HDFS/S3)

Best Practices

1. Choose the Right Interface

  • fetch_arrow_table(): Best for small to medium datasets (fits in memory)
  • fetch_record_batch(): Best for very large datasets (streaming)
  • Direct PyArrow table querying: When data is already in Arrow format
  • ADBC driver: For remote database scenarios

2. Performance Optimization Tips

  • Column pruning: SELECT only needed columns to reduce Arrow data volume
  • Predicate pushdown: Filter data at the SQL level to reduce Arrow data size
  • Optimal batch size: For streaming, 1M-10M rows per batch is usually optimal
  • Leverage Dictionary encoding: DuckDB automatically optimizes low-cardinality categorical columns
# Best practice example
conn.execute("""
    SELECT 
        -- Only needed columns
        customer_id,
        total_amount
    FROM orders
    WHERE date >= '2025-01-01'  -- Predicate pushdown
    ORDER BY total_amount DESC
    LIMIT 1000
""")

3. Common Pitfalls

PitfallCauseSolution
fetch_arrow_table() OOMData exceeds available memoryUse fetch_record_batch() streaming
Arrow/Pandas backend conflictMixed backendsStandardize on dtype_backend='pyarrow'
String type performance dropArrow strings vs DuckDB VARCHAREnable Dictionary encoding
Timestamp precision lossArrow nanosecond vs DuckDB microsecondExplicit CAST to target precision

Monetization Strategies

Mastering DuckDB + Arrow integration opens several revenue paths:

1. Enterprise Data Pipeline Optimization Consulting ($500-3,000/day)

  • Audit JDBC/ODBC data transfer bottlenecks and migrate to Arrow + ADBC architecture
  • Design zero-copy data pipelines to reduce server and memory costs
  • Provide performance optimization for high-throughput scenarios (finance, e-commerce)

2. Build a Data Middleware Product

  • Develop a lightweight data lake query engine based on DuckDB + Arrow
  • Offer as SaaS API: users upload CSV/Parquet, system returns analytical results via Arrow
  • Pricing $50-500/month per customer, targeting SMBs and startups

3. Open Source + Paid Support

  • Create a data migration tool leveraging DuckDB’s Arrow interface (e.g., data-arrow-sync)
  • Open source on GitHub for community traction, offer enterprise paid support
  • Follow the commercialization path of tools like dlt and dbt

4. Technical Training & Tutorials

  • Launch an online course: “High-Performance Data Engineering with DuckDB + Arrow”
  • Price $50-200 per student for data engineers and analysts
  • Offer corporate training ($1,000-5,000/day)

5. ML/AI Pipeline Specialist Service

  • Design DuckDB → Arrow → ML training pipelines for AI startups
  • Reduce feature engineering data conversion overhead, accelerate model iteration
  • Project-based pricing $3,000-15,000

Summary

DuckDB’s deep integration with Apache Arrow brings revolutionary performance improvements to modern data engineering. Through zero-copy data sharing, developers can:

  • Eliminate unnecessary data serialization overhead
  • Efficiently exchange data across languages and tools
  • Build high-performance data pipelines and ML feature engineering workflows

Combined with the ADBC protocol, DuckDB can serve as an Arrow-native analytical database, replacing traditional JDBC/ODBC solutions. As the Arrow ecosystem continues to grow, mastering this technology will become a core competency for data engineers.

Start today—replace fetchdf() with fetch_arrow_table() in your next project and experience the performance gains of zero-copy data processing!

References

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy