Advanced Tutorial

DuckDB List Functions Complete Guide: SQL-Level Processing of Nested JSON Arrays

Master DuckDB's complete list function toolkit — list_transform, list_filter, list_aggregate, list_zip and more. Process nested JSON arrays with pure SQL, no Python needed. Includes runnable code examples and monetization advice.

DuckDB List Functions Complete Guide: SQL-Level Processing of Nested JSON Arrays

Overview

In API response data, log analysis, and ETL pipelines, nested arrays are among the most common data structures. The traditional approach involves writing Python loops to manually traverse, expand, filter, and aggregate arrays — verbose code that’s error-prone and hard to maintain.

DuckDB provides a comprehensive suite of list (array) functions that let you handle all array operations with pure SQL. From expanding nested JSON to batch transformations, conditional filtering to multi-list concatenation — one line of SQL replaces dozens of lines of Python code.

DuckDB List Functions Processing Nested JSON Data Flow

Figure: DuckDB list functions processing pipeline for nested JSON data

One. Scenario Data: E-Commerce Order JSONL

Imagine you’ve exported order data from an e-commerce platform, where each order contains a product list (items) and tags (tags):

{"order_id":"ORD-001","customer":"Alice","items":[{"product":"Laptop","price":5000},{"product":"Mouse","price":150}],"tags":["premium","fast-shipping"]}
{"order_id":"ORD-002","customer":"Bob","items":[{"product":"Keyboard","price":350}],"tags":["standard"]}
{"order_id":"ORD-003","customer":"Carol","items":[{"product":"Monitor","price":2800},{"product":"Webcam","price":400},{"product":"Headphones","price":600}],"tags":["premium","gift-wrap"]}

Save as orders.jsonl. Let’s see how DuckDB handles it.

Two. Core Function 1: json_each + json_extract_string — Expand Nested Arrays

The most basic need: expand each order’s product list into individual rows.

SELECT 
    o.order_id,
    o.customer,
    json_extract_string(je.value, '$.product') AS product,
    (json_extract_string(je.value, '$.price'))::INTEGER AS price
FROM read_json_auto('orders.jsonl') AS o
CROSS JOIN json_each(o.items) AS je;

Key function explanations:

read_json_auto() — Automatically detects JSONL format without manual schema specification
json_each(array_column) — Splits an array into multiple rows, returning key and value columns
json_extract_string(json_value, '$.path') — Extracts a string field from a JSON object
::INTEGER — DuckDB’s type casting syntax

Result:

order_id	customer	product	price
ORD-001	Alice	Laptop	5000
ORD-001	Alice	Mouse	150
ORD-002	Bob	Keyboard	350
ORD-003	Carol	Monitor	2800
ORD-003	Carol	Webcam	400
ORD-003	Carol	Headphones	600

Three. Core Function 2: list_transform — Batch Transform Array Elements

Often you don’t need to expand the array — you want to operate on it as a whole. list_transform applies a Lambda expression to each element of an array.

Extract all product names:

SELECT 
    customer,
    list_transform(items, x -> json_extract_string(x, '$.product')) AS products
FROM read_json_auto('orders.jsonl');

Result: Alice → [‘Laptop’, ‘Mouse’], Bob → [‘Keyboard’], Carol → [‘Monitor’, ‘Webcam’, ‘Headphones’]

Extract all prices and cast to integers:

SELECT 
    customer,
    list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices
FROM read_json_auto('orders.jsonl');

Four. Core Function 3: list_aggregate — Array Aggregation Operations

Perform sum, average, count, and other operations on array values:

SELECT 
    customer,
    list_aggregate(
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
        'sum'
    ) AS total_value
FROM read_json_auto('orders.jsonl');

Result: Alice 5150, Bob 350, Carol 3800

Supported operations include: sum, avg, count, min, max, stddev, variance.

Five. Core Function 4: list_filter — Conditional Array Filtering

Filter array elements based on conditions — ideal for data cleaning:

-- Filter high-value items (price > 500)
SELECT 
    customer,
    list_transform(
        list_filter(items, x -> (json_extract_string(x, '$.price'))::INTEGER > 500),
        x -> json_extract_string(x, '$.product')
    ) AS expensive_products
FROM read_json_auto('orders.jsonl');

Result: Alice → [‘Laptop’], Bob → [], Carol → [‘Monitor’, ‘Headphones’]

Real-world scenario: Filter out null values, anomalous prices, or entries that don’t meet business rules.

Six. Core Function 5: list_zip — Multi-List Concatenation

Merge two or more lists into a tuple array:

SELECT 
    customer,
    list_zip(
        list_transform(items, x -> json_extract_string(x, '$.product')),
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER)
    ) AS product_price_pairs
FROM read_json_auto('orders.jsonl');

Result: Alice → [(‘Laptop’, 5000), (‘Mouse’, 150)]

This is invaluable when you need to pair related fields — like coordinates (x, y), timestamps with values, etc.

Seven. Core Function 6: list_sort + list_slice — Sorting and Slicing

SELECT 
    customer,
    list_sort(
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
        'desc'
    ) AS sorted_prices,
    list_slice(
        list_sort(
            list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
            'desc'
        ), 
        0, 2
    ) AS top_2_prices
FROM read_json_auto('orders.jsonl');

Carol’s prices sorted descending: [2800, 600, 400], Top 2 → [2800, 600].

Eight. Advanced Combination: Complete Order Analysis Pipeline

Combine all the above functions into a complete order analysis query:

WITH orders AS (
    SELECT 
        order_id,
        customer,
        items,
        tags,
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices,
        list_aggregate(
            list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER),
            'sum'
        ) AS order_total
    FROM read_json_auto('orders.jsonl')
)
SELECT 
    order_id,
    customer,
    products,
    order_total,
    list_filter(tags, t -> t = 'premium') AS premium_tags,
    CASE WHEN order_total > 2000 THEN 'high_value' ELSE 'normal' END AS customer_tier
FROM (
    SELECT 
        *,
        list_transform(items, x -> json_extract_string(x, '$.product')) AS products
    FROM orders
);

This query demonstrates the power of DuckDB’s list functions — the entire data processing pipeline is accomplished in a single SQL statement, with zero Python code.

Nine. Comparison with Traditional Tools

Operation	Python + Pandas	jq CLI	DuckDB List Functions
Expand nested arrays	Multi-level loops + append	`jq '.items[]'`	`json_each()`
Extract fields	List comprehension	`jq '.[].product'`	`list_transform(x -> ...)`
Conditional filter	`df[df.price > 500]`	`jq 'select(.price > 500)'`	`list_filter(x -> ...)`
Array sum	`sum(x['price'])`	`jq ‘[.[].price]	add’`
Multi-list pairing	`zip(a, b)`	No built-in support	`list_zip(a, b)`
Sort + slice	`sorted()[:2]`	`sort	limit 2`
Memory usage	Loads fully into RAM	Stream processing	Columnar compression + streaming
Large datasets (10GB+)	Requires chunking	Slow	Native parallel processing

Ten. Complete Function Reference Table

Function	Purpose	Example
`json_each(array)`	Expand JSON array to rows	`CROSS JOIN json_each(items)`
`json_extract_string(json, '$.path')`	Extract JSON field	`json_extract_string(x, '$.product')`
`list_transform(arr, fn)`	Apply function to each element	`list_transform(items, x -> x.price * 0.9)`
`list_filter(arr, fn)`	Filter array elements	`list_filter(items, x -> x.price > 100)`
`list_aggregate(arr, op)`	Array aggregation	`list_aggregate(prices, 'sum')`
`list_zip(arr1, arr2)`	Merge two lists into tuples	`list_zip(names, prices)`
`list_sort(arr, dir)`	Sort array	`list_sort(prices, 'desc')`
`list_slice(arr, start, end)`	Slice array	`list_slice(prices, 0, 3)`
`list_distinct(arr)`	Remove duplicates	`list_distinct(tags)`
`list_intersect(arr1, arr2)`	Array intersection	`list_intersect(tags_a, tags_b)`
`list_concat(arr1, arr2)`	Concatenate arrays	`list_concat(items_a, items_b)`
`list_count(arr)`	Count elements	`list_count(items)`

Eleven. Monetization Advice

With mastery of DuckDB list functions, you can convert these capabilities into actual revenue:

1. Data Cleaning SaaS Service

Offer JSON log cleaning services for enterprises. Traditional solutions require building Python ETL pipelines; DuckDB list functions can reduce costs by 80%. Pricing: ¥2,000-5,000/month per client.

2. API Data Monitoring Tool

Build an API response monitoring tool using read_json_auto + list_transform + list_filter to automatically detect anomalous fields and missing data. SaaS pricing: $29-99/month.

3. E-Commerce Data Analysis Reports

Use list functions to quickly process e-commerce order JSON data and automatically generate sales analysis reports. Per-report pricing: ¥500-2,000.

4. Automated Data Pipelines

Build automated data pipelines for SMEs based on DuckDB, replacing traditional Python + Pandas solutions. Project-based pricing: ¥5,000-20,000 per project.

5. Technical Training Courses

Create advanced DuckDB data processing courses focusing on list functions in real-world scenarios. Platform pricing: ¥299-999 per student.

Summary

DuckDB’s list functions enable you to process complex nested JSON data structures with pure SQL, eliminating the need for Python loop code entirely. From json_each expansion to list_transform batch conversion, from list_filter conditional filtering to list_zip multi-list concatenation — this function set can handle virtually any nested array scenario.

For data engineers, analysts, and developers, mastering these functions means significantly reducing code volume, improving processing speed, and compressing what used to be hours-long ETL tasks into minutes.

📖 Detailed tutorials and more practical cases at duckdblab.org

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.