Featured image of post DuckDB Async I/O Engine: The Secret Behind 1.7x Parquet Read Performance

DuckDB Async I/O Engine: The Secret Behind 1.7x Parquet Read Performance

Deep dive into DuckDB's latest asynchronous I/O engine that delivers 1.5-1.7x performance gains for Parquet queries, with real benchmarks and optimization strategies.

DuckDB Async I/O Engine: The Secret Behind 1.7x Parquet Read Performance

In June 2026, the DuckDB core team merged a remarkable PR #23142 — introducing asynchronous I/O support for the Parquet reader, delivering up to 1.68x performance improvement. This article provides an in-depth analysis of how this technology works, benchmark results, and how you can leverage these improvements in your data pipelines.

DuckDB Async I/O Engine Architecture

Why I/O Becomes the Bottleneck

In modern data analytics, Parquet has become the de facto columnar storage format. Its high compression ratio, predicate pushdown support, and suitability for large-scale data analysis make it indispensable. But when data reaches tens or hundreds of gigabytes, the traditional synchronous I/O model encounters a hidden performance bottleneck:

The CPU sits idle waiting for disk or network I/O, while the disk/network waits for the CPU to issue read commands.

This “wait-execute-wait” pattern becomes especially severe in cloud environments (S3, GCS, OSS), where network latency far exceeds local SSD access latency.

How DuckDB’s Async I/O Engine Works

Limitations of Synchronous I/O

In DuckDB’s previous architecture, the Parquet reader worked synchronously:

  1. The main thread requests to read a Parquet file
  2. The main thread waits for data to arrive from disk/network
  3. Once data arrives, the main thread parses and filters it
  4. Repeat

This means every I/O operation blocks the entire Worker thread, even when the CPU has no work to do at that moment.

New Architecture with Async I/O

DuckDB’s async I/O engine redesigns the entire read workflow:

┌─────────────────────────────────────────────────────┐
│                DuckDB Query Engine                   │
│                                                      │
│  ┌──────────┐    ┌──────────────┐    ┌──────────┐  │
│  │  Worker   │───▶│   Task       │───▶│  Result   │  │
│  │  Thread   │    │  Scheduler   │    │  Buffer   │  │
│  │  (Compute)│◀───│  (Schedule)  │    │  (Result) │  │
│  └──────────┘    └──────┬───────┘    └──────────┘  │
│                         │                           │
│                  ┌──────▼───────┐                   │
│                  │  Async I/O   │                   │
│                  │  Thread Pool │                   │
│                  │  (Disk/Net)  │                   │
│                  └──────────────┘                   │
└─────────────────────────────────────────────────────┘

The core innovation lies in introducing a dedicated Asynchronous I/O Thread Pool:

  • Task Scheduler: The main thread schedules I/O tasks to the async thread pool, then immediately returns a BLOCKED state
  • Async I/O Pool: A dedicated thread pool handles all disk/network read operations, without blocking compute threads
  • Multi-strategy Support: Supports multiple prefetch strategies including WHOLE_GROUP, COLUMN_WISE_EAGER, and PREFETCH_FILTERS

Three Prefetch Strategies Explained

StrategyWorking ModeBest For
WHOLE_GROUPReads all columns of an entire row group at onceFull column scans, aggregation queries
COLUMN_WISE_EAGERPrefetches all needed columns column-by-columnWide tables with few selected columns
PREFETCH_FILTERSReads filter columns first, checks row survival, then reads remaining columnsHigh-selectivity filtering

Benchmark Results

Let’s analyze the benchmark data provided by DuckDB’s official team. Test environment: AWS c6in.4xlarge instance, reading from S3 in the same region — 8.2 GB, 16 Parquet files, 640 row groups, 17 columns, 64 million rows.

Test 1: Full Column Scan

-- Test full column scan performance
INSTALL httpfs;
LOAD httpfs;

SELECT COUNT(*) AS total_rows
FROM read_parquet('s3://bucket/large_dataset/*.parquet');

-- Full column scan with filter condition
SELECT category, AVG(price) AS avg_price
FROM read_parquet('s3://bucket/large_dataset/*.parquet')
WHERE region = 'us-east-1'
GROUP BY category
ORDER BY avg_price DESC;
Workloadmain (sync)async (16T)async (32T)Improvement
Full scan (WHOLE_GROUP)10.19 s8.15 s6.08 s1.68×
Column-wise eager (12/17 cols)9.34 s6.12 s6.34 s1.48×
Filter prefetch (high selectivity)7.82 s4.93 s4.51 s1.73×
Aggregation query12.45 s8.21 s6.73 s1.85×

Key Findings

  1. 32-thread async I/O achieves the biggest gain in full column scans (1.68×), because I/O bandwidth utilization is highest in this scenario
  2. High-selectivity filtering scenarios see the most significant improvement (1.73×), because the PREFETCH_FILTERS strategy maximally reduces redundant I/O
  3. Aggregation queries benefit from improved CPU cache efficiency due to reduced I/O

Test 2: Real-World Scenario Replication

Let’s replicate an equivalent local benchmark using DuckDB’s built-in capabilities:

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

# Generate test data
np.random.seed(42)
n_rows = 10_000_000

data = {
    'id': range(n_rows),
    'category': np.random.choice(['A','B','C','D','E'], n_rows),
    'region': np.random.choice(['us-east-1','us-west-2','eu-west-1','ap-south-1'], n_rows),
    'timestamp': pd.date_range('2024-01-01', periods=n_rows, freq='s'),
}

# Add numeric columns
for i in range(10):
    data[f'value_{i}'] = np.random.randn(n_rows)

# Save as Parquet
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, '/tmp/test_dataset', partition_cols=['category', 'region'])

# Test synchronous read
con = duckdb.connect(':memory:')
result_sync = con.execute("""
    SELECT category, region, COUNT(*) AS cnt, AVG(value_0) AS avg_val
    FROM read_parquet('/tmp/test_dataset/**')
    WHERE region = 'us-east-1'
    GROUP BY category, region
""").fetchdf()
print("Sync result:", result_sync)

Comparison with Traditional Synchronous I/O

DimensionSync I/OAsync I/O (32 threads)
Full scan (8.2GB S3)10.19 s6.08 s
Column prefetch (12/17 cols)9.34 s6.12 s
Filter prefetch (high select)7.82 s4.51 s
Worker thread utilization~40%~85%
I/O concurrency132+
Memory usageLowerSlightly higher (buffering)
Best for networkLocal SSDNetwork storage (S3/OSS)
Ideal scenarioSmall files / local dataLarge datasets / cloud storage

How to Enable in Production

DuckDB’s async I/O is enabled by default in the latest version. You can fine-tune the behavior through the following configurations:

-- Check current configuration
SHOW all WHERE name LIKE '%io%';

-- Adjust async I/O thread count (based on CPU cores)
SET async_io_threads = 32;

-- Adjust prefetch buffer size
SET io_thread_buffer_size = '2GB';

-- Verify configuration
SELECT * FROM duckdb_settings() WHERE name LIKE '%async%';

Additional Optimizations for DuckLake and Iceberg

If you’re using DuckLake or Iceberg engines, the benefits of async I/O are even greater:

-- Iceberg table query (async I/O enabled automatically)
SELECT event_type, COUNT(*) AS cnt
FROM iceberg_scan('s3://warehouse/events/*.parquet')
WHERE event_date >= '2026-01-01'
GROUP BY event_type;

-- DuckLake table query
SELECT product_category, SUM(revenue) AS total
FROM ducklake_scan('s3://lake/sales/*.parquet')
WHERE quarter = 'Q1'
GROUP BY product_category;

Performance Tuning Recommendations

1. Choose the Right Prefetch Strategy for Your Query Pattern

-- Full aggregation: use WHOLE_GROUP (default)
SET parquet_prefer_whole_group_scan = true;

-- Select few columns: use COLUMN_WISE_EAGER
SET parquet_column_wise_eager = true;

-- High-selectivity filtering: use PREFETCH_FILTERS
SET parquet_prefetch_filters = true;

2. Set Async Thread Count Appropriately

CPU cores ≤ 8         → async_io_threads = 8
CPU cores 8-16        → async_io_threads = 16
CPU cores 16-32       → async_io_threads = 24
CPU cores 32+         → async_io_threads = min(cpu_cores - 4, 64)

Core principle: reserve a few cores for CPU-intensive computation, allocate the majority to async I/O.

3. Optimize Data Sharding and Row Group Size

-- Set appropriate row group size when creating Parquet
SET parquet_row_group_size = 128 * 1024 * 1024;  -- 128MB per row group

-- Partition large files by date, combined with predicate pushdown
CREATE TABLE events_parquet AS
SELECT * FROM read_parquet('s3://bucket/events/2026-*/data/*.parquet')
WHERE event_date = '2026-06-01';

Comprehensive Tool Comparison

ToolAsync I/OParquet OptimizationS3 Read SpeedMemory EfficiencyLearning Curve
DuckDB (async)✅ Built-inPredicate pushdown + column pruning1.7× fasterHigh (columnar)Low (SQL)
Pandas❌ NoneBasicBaselineLow (row-based)Low (Python)
Spark✅ LimitedFullBaselineMediumHigh (cluster)
Dask✅ LimitedBasicBaselineMediumMedium
ClickHouse✅ Built-inAdvancedFastHighMedium
SQLite❌ NoneBasicSlowLowLow (SQL)
Polars⚠️ PartialGoodModerateMediumLow (Rust)

Monetization Guide: How to Earn Money With This Skill

1. Productize Data Services

With async I/O tuning expertise, you can offer low-cost data analytics services to SMEs:

  • E-commerce Sales Reports: Use DuckDB to read daily Parquet sales data from S3/OSS. Async I/O reduces analysis of 50GB daily closing data from 30 minutes to 15 minutes
  • Customer Behavior Analysis: Use PREFETCH_FILTERS strategy to quickly identify high-value user segments for marketing teams
  • Monthly subscription analytics service: Offer automated data analysis reports to e-commerce/retail clients, charging ¥2,000-5,000/month

2. Build High-Performance ETL Tools

Leverage async I/O to build lightweight ETL pipelines that replace traditional Spark jobs:

# High-performance ETL example
import duckdb

con = duckdb.connect(':memory:')
con.execute("SET async_io_threads = 32")
con.execute("SET parquet_prefetch_filters = true")

# Read from S3, filter, aggregate, write results — all in SQL
result = con.execute("""
    COPY (
        SELECT date_trunc('day', ts) AS day,
               category,
               COUNT(*) AS orders,
               SUM(amount) AS revenue
        FROM read_parquet('s3://data-lake/sales/**/*.parquet')
        WHERE ts >= '2026-01-01'
        GROUP BY 1, 2
    ) TO 's3://report-bucket/daily/' (FORMAT PARQUET);
""").fetchdf()

Package this solution as a SaaS tool and charge enterprise clients handling 10-100GB of data $500-2,000/month in subscription fees.

3. Technical Consulting and Training

  • Share async I/O benchmark results and optimization insights on tech communities (Medium, Dev.to, LinkedIn)
  • Create an online course “DuckDB Performance Tuning in Practice,” covering async I/O tuning, query plan analysis, and more
  • Provide DuckDB consulting for teams with big data needs but limited budgets, charging $300-700 per consultation session

4. Cloud Cost Optimization Service

The performance gains from async I/O mean the same data processing tasks can be completed with lower-tier instances:

  • Help enterprises downgrade from c6in.4xlarge to c6in.2xlarge, saving approximately 40-50% in compute costs
  • Reduce S3/GCS API call counts and transfer latency, lowering storage expenses
  • Provide cloud cost audit and optimization reports, charging 10-20% of the savings amount

Core monetization logic: Async I/O is not just a technical optimization — it’s a cloud cost reduction tool. For enterprises spending over $10,000/year on data processing, your optimization service can pay for itself within 1-2 months.

Summary

DuckDB’s async I/O engine is one of the most significant performance improvements in 2026. It delivers a substantial 1.5-1.7x performance boost for Parquet reads, especially in cloud storage scenarios (S3, OSS). For teams using DuckDB to process large-scale Parquet data, upgrading to the latest version and correctly configuring async I/O parameters represents the most direct, lowest-cost optimization available.

Try enabling async I/O in your production environment today and observe how much your query performance improves!

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy