DuckDB in Action: Complete Guide to JSON Data Processing

Deep dive into DuckDB's JSON processing capabilities including read_json_auto, json_extract_string, UNNEST operations, and nested structure flattening with a real e-commerce order analysis scenario.

Introduction

In real-world business scenarios, JSON semi-structured data is everywhere: API responses, log records, user behavior tracking… Traditional relational databases often require complex ETL pipelines to analyze such data. But with DuckDB’s built-in JSON support, you can query JSON files directly with SQL without predefining schemas.

This article walks you through core DuckDB JSON data processing skills using a complete e-commerce order analysis case study: read_json_auto, json_extract_string, UNNEST, and nested structure flattening techniques.

Environment Setup

First, verify your DuckDB version (recommended ≥ 0.10.0):

SELECT version();

DuckDB’s JSON functionality is built-in—no additional extensions required.

Step 1: Auto-Reading JSON Files

Scenario

Assume we have an e-commerce platform’s order data file orders.json, where each record contains basic order information and a nested items array:

[
  {
    "order_id": "ORD-001",
    "customer_id": "CUST-100",
    "order_date": "2026-01-15",
    "total_amount": 299.90,
    "status": "completed",
    "items": [
      {"product": "Mechanical Keyboard", "qty": 1, "price": 199.90},
      {"product": "Mouse Pad", "qty": 2, "price": 50.00}
    ],
    "shipping": {"method": "express", "cost": 15.00, "address": {"city": "Beijing", "district": "Chaoyang"}}
  },
  {
    "order_id": "ORD-002",
    "customer_id": "CUST-101",
    "order_date": "2026-01-16",
    "total_amount": 1580.00,
    "status": "shipped",
    "items": [
      {"product": "Monitor", "qty": 1, "price": 1580.00}
    ],
    "shipping": {"method": "standard", "cost": 0.00, "address": {"city": "Shanghai", "district": "Pudong"}}
  },
  {
    "order_id": "ORD-003",
    "customer_id": "CUST-100",
    "order_date": "2026-02-03",
    "total_amount": 89.50,
    "status": "completed",
    "items": [
      {"product": "USB Hub", "qty": 1, "price": 45.50},
      {"product": "Ethernet Cable", "qty": 2, "price": 22.00}
    ],
    "shipping": {"method": "standard", "cost": 0.00, "address": {"city": "Beijing", "district": "Haidian"}}
  },
  {
    "order_id": "ORD-004",
    "customer_id": "CUST-102",
    "order_date": "2026-02-10",
    "total_amount": 450.00,
    "status": "cancelled",
    "items": [
      {"product": "Headphones", "qty": 1, "price": 350.00},
      {"product": "Ear Tips", "qty": 5, "price": 20.00}
    ],
    "shipping": {"method": "express", "cost": 15.00, "address": {"city": "Guangzhou", "district": "Tianhe"}}
  },
  {
    "order_id": "ORD-005",
    "customer_id": "CUST-103",
    "order_date": "2026-03-01",
    "total_amount": 2100.00,
    "status": "completed",
    "items": [
      {"product": "Laptop", "qty": 1, "price": 2100.00}
    ],
    "shipping": {"method": "express", "cost": 20.00, "address": {"city": "Shenzhen", "district": "Nanshan"}}
  }
]

Use read_json_auto to load it directly:

SELECT * FROM read_json_auto('orders.json');

Tip: read_json_auto automatically detects JSON format (array of objects or newline-delimited JSON) and infers column types. For large JSON files, you can also specify auto_detect=true and sample_size parameters to control sampling accuracy.

Query Result

order_idcustomer_idorder_datetotal_amountstatusitemsshipping
ORD-001CUST-1002026-01-15299.90completed[…]{…}
ORD-002CUST-1012026-01-161580.00shipped[…]{…}
ORD-003CUST-1002026-02-0389.50completed[…]{…}
ORD-004CUST-1022026-02-10450.00cancelled[…]{…}
ORD-005CUST-1032026-03-012100.00completed[…]{…}

Architecture Diagram

Figure: DuckDB JSON processing architecture—from raw JSON files to structured queries

Step 2: Extracting Nested Fields

Using json_extract_string to Access Deep Properties

When we need to extract deeply nested fields, we use the json_extract_string function. It accepts two arguments: a JSON value and a JSONPath expression.

SELECT
    order_id,
    customer_id,
    json_extract_string(shipping, '$.address.city') AS city,
    json_extract_string(shipping, '$.method') AS ship_method,
    CAST(json_extract(shipping, '$.cost') AS DOUBLE) AS shipping_cost
FROM read_json_auto('orders.json');

Key distinction:

  • json_extract() returns a JSON type (quoted strings)
  • json_extract_string() returns a plain string value (unquoted)
  • CAST(json_extract(...) AS DOUBLE) converts numeric JSON fields to numbers

Running result:

order_idcustomer_idcityship_methodshipping_cost
ORD-001CUST-100Beijingexpress15.00
ORD-002CUST-101Shanghaistandard0.00
ORD-003CUST-100Beijingstandard0.00
ORD-004CUST-102Guangzhouexpress15.00
ORD-005CUST-103Shenzhenexpress20.00

Terminal Output

Figure: SQL execution result using json_extract_string to retrieve nested address information

Statistics by City

SELECT
    json_extract_string(shipping, '$.address.city') AS city,
    COUNT(*) AS order_count,
    SUM(CAST(json_extract(shipping, '$.cost') AS DOUBLE)) AS total_shipping_cost
FROM read_json_auto('orders.json')
GROUP BY city
ORDER BY total_shipping_cost DESC;
cityorder_counttotal_shipping_cost
Shenzhen120.00
Beijing215.00
Guangzhou115.00
Shanghai10.00

Step 3: Flattening Array Fields (UNNEST)

Scenario: Product Details Per Order

The items field is a JSON array. To flatten it, use UNNEST. In DuckDB, you need to assign a column alias to the UNNEST result:

SELECT
    o.order_id,
    o.customer_id,
    t.unnest_col['product'] AS product,
    t.unnest_col['qty'] AS qty,
    t.unnest_col['price'] AS price
FROM read_json_auto('orders.json') o,
UNNEST(o.items) AS t(unnest_col);

Running result:

order_idcustomer_idproductqtyprice
ORD-001CUST-100Mechanical Keyboard1199.90
ORD-001CUST-100Mouse Pad250.00
ORD-002CUST-101Monitor11580.00
ORD-003CUST-100USB Hub145.50
ORD-003CUST-100Ethernet Cable222.00
ORD-004CUST-102Headphones1350.00
ORD-004CUST-102Ear Tips520.00
ORD-005CUST-103Laptop12100.00

Business Analysis: Sales by Product Category

SELECT
    t.unnest_col['product'] AS category,
    SUM(t.unnest_col['qty']) AS total_quantity,
    SUM(t.unnest_col['qty'] * t.unnest_col['price']) AS total_revenue
FROM read_json_auto('orders.json') o,
UNNEST(o.items) AS t(unnest_col)
WHERE o.status != 'cancelled'
GROUP BY category
ORDER BY total_revenue DESC;

Step 4: Handling Dirty Data with json_valid

Real-world JSON data often contains malformed records. The json_valid function helps identify invalid entries:

-- Assuming orders_with_bad.json contains some invalid JSON records
SELECT
    order_id,
    json_valid(to_json(items)) AS items_valid,
    CASE WHEN NOT json_valid(to_json(items)) THEN items END AS invalid_items
FROM read_json_auto('orders_with_bad.json');

In production, filter first then process:

SELECT *
FROM read_json_auto('raw_orders.json')
WHERE json_valid(to_json(items));

Step 5: Performance Optimization Tips

1. Use lines=true for NDJSON

If your JSON file is newline-delimited (NDJSON), add lines=true:

SELECT * FROM read_json_auto('orders.ndjson', lines=true);

2. Column Projection to Reduce I/O

Only query the fields you need instead of loading the entire JSON structure:

-- Efficient: read only necessary columns
SELECT order_id, customer_id, total_amount
FROM read_json_auto('orders.json');

-- Less efficient: load all then filter
SELECT * FROM read_json_auto('orders.json') WHERE order_id = 'ORD-001';

3. Write JSON Query Results to Parquet

For repeatedly queried JSON data, converting to columnar storage significantly improves performance:

COPY (
    SELECT
        order_id,
        customer_id,
        order_date,
        total_amount,
        status
    FROM read_json_auto('orders.json')
) TO 'orders.parquet' (FORMAT PARQUET);

Subsequent queries against the Parquet file are typically 5-10x faster than querying JSON directly.

Real-World Case: Monthly Sales Report

Combining all the techniques above into a complete monthly sales report:

WITH order_items AS (
    SELECT
        o.order_id,
        o.customer_id,
        o.order_date,
        o.status,
        t.unnest_col['product'] AS product,
        t.unnest_col['qty'] AS qty,
        t.unnest_col['price'] AS price,
        t.unnest_col['qty'] * t.unnest_col['price'] AS line_total
    FROM read_json_auto('orders.json') o,
    UNNEST(o.items) AS t(unnest_col)
),
monthly_summary AS (
    SELECT
        DATE_TRUNC('month', order_date::DATE) AS month,
        COUNT(DISTINCT order_id) AS order_count,
        SUM(line_total) AS revenue,
        AVG(line_total) AS avg_order_value,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM order_items
    WHERE status != 'cancelled'
    GROUP BY month
)
SELECT * FROM monthly_summary
ORDER BY month;

Running result:

monthorder_countrevenueavg_order_valueunique_customers
2026-01-0121879.90626.632
2026-02-01189.5044.751
2026-03-0112100.002100.001

Note: January revenue is 1879.90 (excluding the cancelled order ORD-004 of 450.00). Including cancelled orders, it would be 2329.90.

Data Flow

Figure: Complete data pipeline—JSON → DuckDB → Monthly Sales Report

Summary

DuckDB’s JSON processing capabilities make semi-structured data analysis remarkably simple:

FeatureCore FunctionUse Case
Read JSONread_json_auto()Auto-detect format and load
Extract fieldsjson_extract_string() / json_extract()Access nested values by path
Flatten arraysUNNEST(...) + AS t(col_name)Convert JSON arrays to rows
Validate datajson_valid()Filter invalid JSON records
Type conversionto_json()Interconvert types and structures

With these skills mastered, you can complete the entire pipeline from raw JSON files to analysis reports directly in DuckDB, eliminating ETL steps and significantly improving data analysis efficiency.

For more DuckDB tips and tricks, follow DuckDB Lab (duckdblab.org).

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.