DuckDB 列表函数完全指南：嵌套 JSON 数组的 SQL 级处理

DuckDB 提供丰富的列表（数组）函数，让你用纯 SQL 处理嵌套 JSON 数组——list_transform、list_filter、list_aggregate、list_zip 等。本文覆盖 6 大核心场景，附可运行代码和变现建议。

DuckDB 列表函数完全指南：嵌套 JSON 数组的 SQL 级处理

概述

在处理 API 返回数据、日志分析和 ETL 管道时，嵌套数组是最常见的数据结构之一。传统做法是用 Python 写循环遍历数组，手动展开、过滤、聚合——代码冗长且容易出错。

DuckDB 提供了一套完整的列表（数组）函数，让你用纯 SQL 完成所有数组操作。从展开嵌套 JSON 到批量转换，从条件过滤到多列表拼接，一行 SQL 就能替代几十行 Python 代码。

DuckDB 列表函数处理嵌套 JSON 数据流

图：DuckDB 列表函数处理嵌套 JSON 数据流

一、场景数据：电商订单 JSONL

假设你从电商平台导出了订单数据，每个订单包含商品列表（items）和标签（tags）：

{"order_id":"ORD-001","customer":"Alice","items":[{"product":"Laptop","price":5000},{"product":"Mouse","price":150}],"tags":["premium","fast-shipping"]}
{"order_id":"ORD-002","customer":"Bob","items":[{"product":"Keyboard","price":350}],"tags":["standard"]}
{"order_id":"ORD-003","customer":"Carol","items":[{"product":"Monitor","price":2800},{"product":"Webcam","price":400},{"product":"Headphones","price":600}],"tags":["premium","gift-wrap"]}

保存为 orders.jsonl，让我们看看如何用 DuckDB 处理它。

二、核心函数 1：json_each + json_extract_string —— 展开嵌套数组

最基础的需求：把每个订单的商品列表展开成单独的行。

SELECT 
    o.order_id,
    o.customer,
    json_extract_string(je.value, '$.product') AS product,
    (json_extract_string(je.value, '$.price'))::INTEGER AS price
FROM read_json_auto('orders.jsonl') AS o
CROSS JOIN json_each(o.items) AS je;

关键函数说明：

read_json_auto() — 自动识别 JSONL 格式，无需手动指定 schema
json_each(array_column) — 将数组拆成多行，返回 key 和 value 两列
json_extract_string(json_value, '$.path') — 从 JSON 对象中提取字符串字段
::INTEGER — DuckDB 的类型转换语法

运行结果：

order_id	customer	product	price
ORD-001	Alice	Laptop	5000
ORD-001	Alice	Mouse	150
ORD-002	Bob	Keyboard	350
ORD-003	Carol	Monitor	2800
ORD-003	Carol	Webcam	400
ORD-003	Carol	Headphones	600

三、核心函数 2：list_transform —— 批量转换数组元素

很多时候你不需要展开数组，而是想对数组整体进行操作。list_transform 可以对数组每个元素应用一个 Lambda 表达式。

提取所有商品名：

SELECT 
    customer,
    list_transform(items, x -> json_extract_string(x, '$.product')) AS products
FROM read_json_auto('orders.jsonl');

结果：Alice → [‘Laptop’, ‘Mouse’]，Bob → [‘Keyboard’]，Carol → [‘Monitor’, ‘Webcam’, ‘Headphones’]

提取所有商品价格并转为整数：

SELECT 
    customer,
    list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices
FROM read_json_auto('orders.jsonl');

四、核心函数 3：list_aggregate —— 数组聚合运算

对数组中的数值进行求和、平均、计数等操作：

SELECT 
    customer,
    list_aggregate(
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
        'sum'
    ) AS total_value
FROM read_json_auto('orders.jsonl');

结果：Alice 5150，Bob 350，Carol 3800

支持的操作包括：sum、avg、count、min、max、stddev、variance。

五、核心函数 4：list_filter —— 条件过滤数组

筛选满足条件的数组元素，非常适合做数据清洗：

-- 筛选高价商品（单价 > 500）
SELECT 
    customer,
    list_transform(
        list_filter(items, x -> (json_extract_string(x, '$.price'))::INTEGER > 500),
        x -> json_extract_string(x, '$.product')
    ) AS expensive_products
FROM read_json_auto('orders.jsonl');

结果：Alice → [‘Laptop’]，Bob → []，Carol → [‘Monitor’, ‘Headphones’]

实战场景： 过滤掉空值、异常价格或不符合业务规则的条目。

六、核心函数 5：list_zip —— 多列表拼接

将两个或多个列表合并为元组数组：

SELECT 
    customer,
    list_zip(
        list_transform(items, x -> json_extract_string(x, '$.product')),
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER)
    ) AS product_price_pairs
FROM read_json_auto('orders.jsonl');

结果：Alice → [(‘Laptop’, 5000), (‘Mouse’, 150)]

这在需要将多个相关字段配对时非常有用，比如坐标点 (x, y)、时间戳和数值对等。

七、核心函数 6：list_sort + list_slice —— 排序与切片

SELECT 
    customer,
    list_sort(
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
        'desc'
    ) AS sorted_prices,
    list_slice(
        list_sort(
            list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER), 
            'desc'
        ), 
        0, 2
    ) AS top_2_prices
FROM read_json_auto('orders.jsonl');

Carol 的价格排序后为 [2800, 600, 400]，Top 2 → [2800, 600]。

八、高级组合：完整订单分析流水线

将上述所有函数组合起来，构建一个完整的订单分析查询：

WITH orders AS (
    SELECT 
        order_id,
        customer,
        items,
        tags,
        list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER) AS prices,
        list_aggregate(
            list_transform(items, x -> (json_extract_string(x, '$.price'))::INTEGER),
            'sum'
        ) AS order_total
    FROM read_json_auto('orders.jsonl')
)
SELECT 
    order_id,
    customer,
    products,
    order_total,
    list_filter(tags, t -> t = 'premium') AS premium_tags,
    CASE WHEN order_total > 2000 THEN 'high_value' ELSE 'normal' END AS customer_tier
FROM (
    SELECT 
        *,
        list_transform(items, x -> json_extract_string(x, '$.product')) AS products
    FROM orders
);

这个查询展示了 DuckDB 列表函数的强大之处——整个数据处理流水线只用一个 SQL 语句完成，无需任何 Python 代码。

九、与传统工具对比

操作	Python + Pandas	jq 命令行	DuckDB 列表函数
展开嵌套数组	多层循环 + append	`jq '.items[]'`	`json_each()`
提取字段	列表推导式	`jq '.[].product'`	`list_transform(x -> ...)`
条件过滤	`df[df.price > 500]`	`jq 'select(.price > 500)'`	`list_filter(x -> ...)`
数组求和	`sum(x['price'])`	`jq ‘[.[].price]	add’`
多列表配对	`zip(a, b)`	无内置支持	`list_zip(a, b)`
排序+切片	`sorted()[:2]`	`sort	limit 2`
内存占用	全量加载到 RAM	流式处理	列式压缩 + 流式
大数据集 (10GB+)	需要分块处理	慢	原生并行处理

十、完整函数速查表

函数	作用	示例
`json_each(array)`	展开 JSON 数组为多行	`CROSS JOIN json_each(items)`
`json_extract_string(json, '$.path')`	提取 JSON 字段	`json_extract_string(x, '$.product')`
`list_transform(arr, fn)`	对数组每个元素应用函数	`list_transform(items, x -> x.price * 0.9)`
`list_filter(arr, fn)`	过滤数组元素	`list_filter(items, x -> x.price > 100)`
`list_aggregate(arr, op)`	数组聚合运算	`list_aggregate(prices, 'sum')`
`list_zip(arr1, arr2)`	合并两个列表为元组	`list_zip(names, prices)`
`list_sort(arr, dir)`	数组排序	`list_sort(prices, 'desc')`
`list_slice(arr, start, end)`	数组切片	`list_slice(prices, 0, 3)`
`list_distinct(arr)`	数组去重	`list_distinct(tags)`
`list_intersect(arr1, arr2)`	求数组交集	`list_intersect(tags_a, tags_b)`
`list_concat(arr1, arr2)`	数组合并	`list_concat(items_a, items_b)`
`list_count(arr)`	数组元素计数	`list_count(items)`

十一、变现建议

掌握 DuckDB 列表函数后，你可以将这些能力转化为实际收入：

1. 数据清洗 SaaS 服务

为企业客户提供 JSON 日志清洗服务。传统方案需要开发 Python ETL 管道，使用 DuckDB 列表函数可以将成本降低 80%。定价：每月 2000-5000 元/客户。

2. API 数据监控工具

用 read_json_auto + list_transform + list_filter 构建 API 响应监控工具，自动检测异常字段和缺失数据。SaaS 定价：$29-99/月。

3. 电商数据分析报告

利用列表函数快速处理电商平台的订单 JSON 数据，自动生成销售分析报告。按报告收费：每份 500-2000 元。

4. 自动化数据管道

为中小企业搭建基于 DuckDB 的自动化数据管道，替代传统的 Python + Pandas 方案。项目制收费：5000-20000 元/项目。

5. 技术培训课程

制作 DuckDB 高级数据处理课程，重点讲解列表函数在实际场景中的应用。平台定价：¥299-999/学员。

总结

DuckDB 的列表函数让你能够用纯 SQL 处理复杂的嵌套 JSON 数据结构，无需编写任何 Python 循环代码。从 json_each 展开到 list_transform 批量转换，从 list_filter 条件过滤到 list_zip 多列表拼接——这套函数组合几乎可以处理任何嵌套数组场景。

对于数据工程师、分析师和开发者来说，掌握这些函数意味着可以显著减少代码量、提高处理速度，并将原本需要数小时的 ETL 任务压缩到几分钟内完成。

📖 详细图文教程和更多实战案例见 duckdblab.org

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ 本站为独立社区项目，与 DuckDB 基金会及 DuckDB 官方项目无任何从属、背书或赞助关系。

"DuckDB" 是 DuckDB 基金会的注册商标，本站仅以事实描述方式使用该名称。

本站内容仅供教育与社区推广用途，不构成任何商业服务。