DuckDB List Functions Complete Guide: SQL-Level Processing of Nested JSON Arrays
Overview
In API response data, log analysis, and ETL pipelines, nested arrays are among the most common data structures. The traditional approach involves writing Python loops to manually traverse, expand, filter, and aggregate arrays — verbose code that’s error-prone and hard to maintain.
DuckDB provides a comprehensive suite of list (array) functions that let you handle all array operations with pure SQL. From expanding nested JSON to batch transformations, conditional filtering to multi-list concatenation — one line of SQL replaces dozens of lines of Python code.

Figure: DuckDB list functions processing pipeline for nested JSON data
One. Scenario Data: E-Commerce Order JSONL
Imagine you’ve exported order data from an e-commerce platform, where each order contains a product list (items) and tags (tags):
{"order_id":"ORD-001","customer":"Alice","items":[{"product":"Laptop","price":5000},{"product":"Mouse","price":150}],"tags":["premium","fast-shipping"]}
{"order_id":"ORD-002","customer":"Bob","items":[{"product":"Keyboard","price":350}],"tags":["standard"]}
{"order_id":"ORD-003","customer":"Carol","items":[{"product":"Monitor","price":2800},{"product":"Webcam","price":400},{"product":"Headphones","price":600}],"tags":["premium","gift-wrap"]}
Save as orders.jsonl. Let’s see how DuckDB handles it.
Two. Core Function 1: json_each + json_extract_string — Expand Nested Arrays
The most basic need: expand each order’s product list into individual rows.
SELECT
o.order_id,
o.customer,
json_extract_string(je.value, '$.product') AS product,
(json_extract_string(je.value, '$.price'))::INTEGER AS price
FROM read_json_auto('orders.jsonl') AS o
CROSS JOIN json_each(o.items) AS je;
Key function explanations:
read_json_auto()— Automatically detects JSONL format without manual schema specificationjson_each(array_column)— Splits an array into multiple rows, returning key and value columnsjson_extract_string(json_value, '$.path')— Extracts a string field from a JSON object::INTEGER— DuckDB’s type casting syntax
Result:
| order_id | customer | product | price |
|---|---|---|---|
| ORD-001 | Alice | Laptop | 5000 |
| ORD-001 | Alice | Mouse | 150 |
| ORD-002 | Bob | Keyboard | 350 |
| ORD-003 | Carol | Monitor | 2800 |
| ORD-003 | Carol | Webcam | 400 |
| ORD-003 | Carol | Headphones | 600 |
Three. Core Function 2: list_transform — Batch Transform Array Elements
Often you don’t need to expand the array — you want to operate on it as a whole. list_transform applies a Lambda expression to each element of an array.
Extract all product names:
SELECT
customer,
list_transform(items, x -> json_extract_string(x, '$.product')) AS products
FROM read_json_auto('orders.jsonl');
Result: Alice → [‘Laptop’, ‘Mouse’], Bob → [‘Keyboard’], Carol → [‘Monitor’, ‘Webcam’, ‘Headphones’]
Extract all prices and cast to integers:
SELECT
customer,
list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices
FROM read_json_auto('orders.jsonl');
Four. Core Function 3: list_aggregate — Array Aggregation Operations
Perform sum, average, count, and other operations on array values:
SELECT
customer,
list_aggregate(
list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER),
'sum'
) AS total_value
FROM read_json_auto('orders.jsonl');
Result: Alice 5150, Bob 350, Carol 3800
Supported operations include: sum, avg, count, min, max, stddev, variance.
Five. Core Function 4: list_filter — Conditional Array Filtering
Filter array elements based on conditions — ideal for data cleaning:
-- Filter high-value items (price > 500)
SELECT
customer,
list_transform(
list_filter(items, x -> (json_extract_string(x, '$.price'))::INTEGER > 500),
x -> json_extract_string(x, '$.product')
) AS expensive_products
FROM read_json_auto('orders.jsonl');
Result: Alice → [‘Laptop’], Bob → [], Carol → [‘Monitor’, ‘Headphones’]
Real-world scenario: Filter out null values, anomalous prices, or entries that don’t meet business rules.
Six. Core Function 5: list_zip — Multi-List Concatenation
Merge two or more lists into a tuple array:
SELECT
customer,
list_zip(
list_transform(items, x -> json_extract_string(x, '$.product')),
list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER)
) AS product_price_pairs
FROM read_json_auto('orders.jsonl');
Result: Alice → [(‘Laptop’, 5000), (‘Mouse’, 150)]
This is invaluable when you need to pair related fields — like coordinates (x, y), timestamps with values, etc.
Seven. Core Function 6: list_sort + list_slice — Sorting and Slicing
SELECT
customer,
list_sort(
list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER),
'desc'
) AS sorted_prices,
list_slice(
list_sort(
list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER),
'desc'
),
0, 2
) AS top_2_prices
FROM read_json_auto('orders.jsonl');
Carol’s prices sorted descending: [2800, 600, 400], Top 2 → [2800, 600].
Eight. Advanced Combination: Complete Order Analysis Pipeline
Combine all the above functions into a complete order analysis query:
WITH orders AS (
SELECT
order_id,
customer,
items,
tags,
list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices,
list_aggregate(
list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER),
'sum'
) AS order_total
FROM read_json_auto('orders.jsonl')
)
SELECT
order_id,
customer,
products,
order_total,
list_filter(tags, t -> t = 'premium') AS premium_tags,
CASE WHEN order_total > 2000 THEN 'high_value' ELSE 'normal' END AS customer_tier
FROM (
SELECT
*,
list_transform(items, x -> json_extract_string(x, '$.product')) AS products
FROM orders
);
This query demonstrates the power of DuckDB’s list functions — the entire data processing pipeline is accomplished in a single SQL statement, with zero Python code.
Nine. Comparison with Traditional Tools
| Operation | Python + Pandas | jq CLI | DuckDB List Functions |
|---|---|---|---|
| Expand nested arrays | Multi-level loops + append | jq '.items[]' | json_each() |
| Extract fields | List comprehension | jq '.[].product' | list_transform(x -> ...) |
| Conditional filter | df[df.price > 500] | jq 'select(.price > 500)' | list_filter(x -> ...) |
| Array sum | sum(x['price']) | `jq ‘[.[].price] | add’` |
| Multi-list pairing | zip(a, b) | No built-in support | list_zip(a, b) |
| Sort + slice | sorted()[:2] | `sort | limit 2` |
| Memory usage | Loads fully into RAM | Stream processing | Columnar compression + streaming |
| Large datasets (10GB+) | Requires chunking | Slow | Native parallel processing |
Ten. Complete Function Reference Table
| Function | Purpose | Example |
|---|---|---|
json_each(array) | Expand JSON array to rows | CROSS JOIN json_each(items) |
json_extract_string(json, '$.path') | Extract JSON field | json_extract_string(x, '$.product') |
list_transform(arr, fn) | Apply function to each element | list_transform(items, x -> x.price * 0.9) |
list_filter(arr, fn) | Filter array elements | list_filter(items, x -> x.price > 100) |
list_aggregate(arr, op) | Array aggregation | list_aggregate(prices, 'sum') |
list_zip(arr1, arr2) | Merge two lists into tuples | list_zip(names, prices) |
list_sort(arr, dir) | Sort array | list_sort(prices, 'desc') |
list_slice(arr, start, end) | Slice array | list_slice(prices, 0, 3) |
list_distinct(arr) | Remove duplicates | list_distinct(tags) |
list_intersect(arr1, arr2) | Array intersection | list_intersect(tags_a, tags_b) |
list_concat(arr1, arr2) | Concatenate arrays | list_concat(items_a, items_b) |
list_count(arr) | Count elements | list_count(items) |
Eleven. Monetization Advice
With mastery of DuckDB list functions, you can convert these capabilities into actual revenue:
1. Data Cleaning SaaS Service
Offer JSON log cleaning services for enterprises. Traditional solutions require building Python ETL pipelines; DuckDB list functions can reduce costs by 80%. Pricing: ¥2,000-5,000/month per client.
2. API Data Monitoring Tool
Build an API response monitoring tool using read_json_auto + list_transform + list_filter to automatically detect anomalous fields and missing data. SaaS pricing: $29-99/month.
3. E-Commerce Data Analysis Reports
Use list functions to quickly process e-commerce order JSON data and automatically generate sales analysis reports. Per-report pricing: ¥500-2,000.
4. Automated Data Pipelines
Build automated data pipelines for SMEs based on DuckDB, replacing traditional Python + Pandas solutions. Project-based pricing: ¥5,000-20,000 per project.
5. Technical Training Courses
Create advanced DuckDB data processing courses focusing on list functions in real-world scenarios. Platform pricing: ¥299-999 per student.
Summary
DuckDB’s list functions enable you to process complex nested JSON data structures with pure SQL, eliminating the need for Python loop code entirely. From json_each expansion to list_transform batch conversion, from list_filter conditional filtering to list_zip multi-list concatenation — this function set can handle virtually any nested array scenario.
For data engineers, analysts, and developers, mastering these functions means significantly reducing code volume, improving processing speed, and compressing what used to be hours-long ETL tasks into minutes.
📖 Detailed tutorials and more practical cases at duckdblab.org