Featured image of post DuckDB List Functions Complete Guide: SQL-Level Processing of Nested JSON Arrays

DuckDB List Functions Complete Guide: SQL-Level Processing of Nested JSON Arrays

Master DuckDB's complete list function toolkit — list_transform, list_filter, list_aggregate, list_zip and more. Process nested JSON arrays with pure SQL, no Python needed. Includes runnable code examples and monetization advice.

DuckDB List Functions Complete Guide: SQL-Level Processing of Nested JSON Arrays

Overview

In API response data, log analysis, and ETL pipelines, nested arrays are among the most common data structures. The traditional approach involves writing Python loops to manually traverse, expand, filter, and aggregate arrays — verbose code that’s error-prone and hard to maintain.

DuckDB provides a comprehensive suite of list (array) functions that let you handle all array operations with pure SQL. From expanding nested JSON to batch transformations, conditional filtering to multi-list concatenation — one line of SQL replaces dozens of lines of Python code.

DuckDB List Functions Processing Nested JSON Data Flow

Figure: DuckDB list functions processing pipeline for nested JSON data


One. Scenario Data: E-Commerce Order JSONL

Imagine you’ve exported order data from an e-commerce platform, where each order contains a product list (items) and tags (tags):

{"order_id":"ORD-001","customer":"Alice","items":[{"product":"Laptop","price":5000},{"product":"Mouse","price":150}],"tags":["premium","fast-shipping"]}
{"order_id":"ORD-002","customer":"Bob","items":[{"product":"Keyboard","price":350}],"tags":["standard"]}
{"order_id":"ORD-003","customer":"Carol","items":[{"product":"Monitor","price":2800},{"product":"Webcam","price":400},{"product":"Headphones","price":600}],"tags":["premium","gift-wrap"]}

Save as orders.jsonl. Let’s see how DuckDB handles it.


Two. Core Function 1: json_each + json_extract_string — Expand Nested Arrays

The most basic need: expand each order’s product list into individual rows.

SELECT 
    o.order_id,
    o.customer,
    json_extract_string(je.value, '$.product') AS product,
    (json_extract_string(je.value, '$.price'))::INTEGER AS price
FROM read_json_auto('orders.jsonl') AS o
CROSS JOIN json_each(o.items) AS je;

Key function explanations:

  • read_json_auto() — Automatically detects JSONL format without manual schema specification
  • json_each(array_column) — Splits an array into multiple rows, returning key and value columns
  • json_extract_string(json_value, '$.path') — Extracts a string field from a JSON object
  • ::INTEGER — DuckDB’s type casting syntax

Result:

order_idcustomerproductprice
ORD-001AliceLaptop5000
ORD-001AliceMouse150
ORD-002BobKeyboard350
ORD-003CarolMonitor2800
ORD-003CarolWebcam400
ORD-003CarolHeadphones600

Three. Core Function 2: list_transform — Batch Transform Array Elements

Often you don’t need to expand the array — you want to operate on it as a whole. list_transform applies a Lambda expression to each element of an array.

Extract all product names:

SELECT 
    customer,
    list_transform(items, x -> json_extract_string(x, '$.product')) AS products
FROM read_json_auto('orders.jsonl');

Result: Alice → [‘Laptop’, ‘Mouse’], Bob → [‘Keyboard’], Carol → [‘Monitor’, ‘Webcam’, ‘Headphones’]

Extract all prices and cast to integers:

SELECT 
    customer,
    list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices
FROM read_json_auto('orders.jsonl');

Four. Core Function 3: list_aggregate — Array Aggregation Operations

Perform sum, average, count, and other operations on array values:

SELECT 
    customer,
    list_aggregate(
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
        'sum'
    ) AS total_value
FROM read_json_auto('orders.jsonl');

Result: Alice 5150, Bob 350, Carol 3800

Supported operations include: sum, avg, count, min, max, stddev, variance.


Five. Core Function 4: list_filter — Conditional Array Filtering

Filter array elements based on conditions — ideal for data cleaning:

-- Filter high-value items (price > 500)
SELECT 
    customer,
    list_transform(
        list_filter(items, x -> (json_extract_string(x, '$.price'))::INTEGER > 500),
        x -> json_extract_string(x, '$.product')
    ) AS expensive_products
FROM read_json_auto('orders.jsonl');

Result: Alice → [‘Laptop’], Bob → [], Carol → [‘Monitor’, ‘Headphones’]

Real-world scenario: Filter out null values, anomalous prices, or entries that don’t meet business rules.


Six. Core Function 5: list_zip — Multi-List Concatenation

Merge two or more lists into a tuple array:

SELECT 
    customer,
    list_zip(
        list_transform(items, x -> json_extract_string(x, '$.product')),
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER)
    ) AS product_price_pairs
FROM read_json_auto('orders.jsonl');

Result: Alice → [(‘Laptop’, 5000), (‘Mouse’, 150)]

This is invaluable when you need to pair related fields — like coordinates (x, y), timestamps with values, etc.


Seven. Core Function 6: list_sort + list_slice — Sorting and Slicing

SELECT 
    customer,
    list_sort(
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
        'desc'
    ) AS sorted_prices,
    list_slice(
        list_sort(
            list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
            'desc'
        ), 
        0, 2
    ) AS top_2_prices
FROM read_json_auto('orders.jsonl');

Carol’s prices sorted descending: [2800, 600, 400], Top 2 → [2800, 600].


Eight. Advanced Combination: Complete Order Analysis Pipeline

Combine all the above functions into a complete order analysis query:

WITH orders AS (
    SELECT 
        order_id,
        customer,
        items,
        tags,
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices,
        list_aggregate(
            list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER),
            'sum'
        ) AS order_total
    FROM read_json_auto('orders.jsonl')
)
SELECT 
    order_id,
    customer,
    products,
    order_total,
    list_filter(tags, t -> t = 'premium') AS premium_tags,
    CASE WHEN order_total > 2000 THEN 'high_value' ELSE 'normal' END AS customer_tier
FROM (
    SELECT 
        *,
        list_transform(items, x -> json_extract_string(x, '$.product')) AS products
    FROM orders
);

This query demonstrates the power of DuckDB’s list functions — the entire data processing pipeline is accomplished in a single SQL statement, with zero Python code.


Nine. Comparison with Traditional Tools

OperationPython + Pandasjq CLIDuckDB List Functions
Expand nested arraysMulti-level loops + appendjq '.items[]'json_each()
Extract fieldsList comprehensionjq '.[].product'list_transform(x -> ...)
Conditional filterdf[df.price > 500]jq 'select(.price > 500)'list_filter(x -> ...)
Array sumsum(x['price'])`jq ‘[.[].price]add’`
Multi-list pairingzip(a, b)No built-in supportlist_zip(a, b)
Sort + slicesorted()[:2]`sortlimit 2`
Memory usageLoads fully into RAMStream processingColumnar compression + streaming
Large datasets (10GB+)Requires chunkingSlowNative parallel processing

Ten. Complete Function Reference Table

FunctionPurposeExample
json_each(array)Expand JSON array to rowsCROSS JOIN json_each(items)
json_extract_string(json, '$.path')Extract JSON fieldjson_extract_string(x, '$.product')
list_transform(arr, fn)Apply function to each elementlist_transform(items, x -> x.price * 0.9)
list_filter(arr, fn)Filter array elementslist_filter(items, x -> x.price > 100)
list_aggregate(arr, op)Array aggregationlist_aggregate(prices, 'sum')
list_zip(arr1, arr2)Merge two lists into tupleslist_zip(names, prices)
list_sort(arr, dir)Sort arraylist_sort(prices, 'desc')
list_slice(arr, start, end)Slice arraylist_slice(prices, 0, 3)
list_distinct(arr)Remove duplicateslist_distinct(tags)
list_intersect(arr1, arr2)Array intersectionlist_intersect(tags_a, tags_b)
list_concat(arr1, arr2)Concatenate arrayslist_concat(items_a, items_b)
list_count(arr)Count elementslist_count(items)

Eleven. Monetization Advice

With mastery of DuckDB list functions, you can convert these capabilities into actual revenue:

1. Data Cleaning SaaS Service

Offer JSON log cleaning services for enterprises. Traditional solutions require building Python ETL pipelines; DuckDB list functions can reduce costs by 80%. Pricing: ¥2,000-5,000/month per client.

2. API Data Monitoring Tool

Build an API response monitoring tool using read_json_auto + list_transform + list_filter to automatically detect anomalous fields and missing data. SaaS pricing: $29-99/month.

3. E-Commerce Data Analysis Reports

Use list functions to quickly process e-commerce order JSON data and automatically generate sales analysis reports. Per-report pricing: ¥500-2,000.

4. Automated Data Pipelines

Build automated data pipelines for SMEs based on DuckDB, replacing traditional Python + Pandas solutions. Project-based pricing: ¥5,000-20,000 per project.

5. Technical Training Courses

Create advanced DuckDB data processing courses focusing on list functions in real-world scenarios. Platform pricing: ¥299-999 per student.


Summary

DuckDB’s list functions enable you to process complex nested JSON data structures with pure SQL, eliminating the need for Python loop code entirely. From json_each expansion to list_transform batch conversion, from list_filter conditional filtering to list_zip multi-list concatenation — this function set can handle virtually any nested array scenario.

For data engineers, analysts, and developers, mastering these functions means significantly reducing code volume, improving processing speed, and compressing what used to be hours-long ETL tasks into minutes.

📖 Detailed tutorials and more practical cases at duckdblab.org

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.