Featured image of post Parse JSON Data Instantly with DuckDB—Say Goodbye to Python Parsing Scripts

Parse JSON Data Instantly with DuckDB—Say Goodbye to Python Parsing Scripts

Learn how DuckDB's read_json_auto() lets you query JSON data with a single SQL line. A practical guide to analyzing nested JSON user behavior logs and generating business reports efficiently.

JSON Parsing Architecture

Introduction: JSON Data Is Your Invisible Productivity Killer

Have you ever encountered this scenario—the boss exports user behavior logs from a third-party platform, and when you open it, it’s all massive JSON strings:

{"user_id": 10234, "event": "click", "timestamp": "2026-06-15T14:32:01Z", "payload": {"page": "/products/456", "device": "mobile", "duration_ms": 230, "referral": "wechat"}}

The boss says: “Help me analyze which pages are most popular? Is there a big difference between mobile and PC?”

What’s your first reaction? Write a Python script → parse JSON line by line → extract fields → store in DataFrame → analyze. But what if you have 1 million such JSON records? Python parsing could take minutes, and memory might explode.

With DuckDB, you only need one line of SQL.

Why Is JSON Parsing So Common?

JSON has become the “common language” of the modern data world—almost all APIs, logging systems, and message queues use it. The following scenarios happen every day:

RoleScenarioPain Point
MarketingExport JSON conversion data from ad platformsNeed aggregation by channel, creative, time period
Product ManagerApp analytics logs are in JSON formatNeed to track user funnel conversion rates
Customer ServiceTicket system notes stored as JSONNeed to extract “issue category” for statistics
Data AnalystAPI responses are all nested JSONNeed to quickly generate reports

The core problem: JSON is unstructured, but reports require structured data. The traditional approach is to write parsing scripts, but DuckDB’s approach is to treat JSON as a regular table.

Step 1: Prepare Data—Simulating Real User Behavior Logs

First, let’s generate simulated data with Python:

import duckdb
import json
import random
from datetime import datetime, timedelta

# Simulate 50,000 user behavior log entries
pages = ["/home", "/products/123", "/products/456", "/cart", "/checkout", "/profile", "/search"]
devices = ["mobile", "desktop", "tablet"]
referrals = ["wechat", "google", "direct", "douyin", "xiaohongshu"]
events = ["view", "click", "add_to_cart", "purchase", "share"]

rows = []
base_time = datetime(2026, 6, 1)

for i in range(50000):
    ts = base_time + timedelta(seconds=random.randint(0, 30*24*3600))
    row = {
        "user_id": random.randint(1000, 20000),
        "event": random.choice(events),
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": {
            "page": random.choice(pages),
            "device": random.choices(devices, weights=[0.6, 0.3, 0.1])[0],
            "duration_ms": random.randint(50, 5000),
            "referral": random.choice(referrals)
        }
    }
    rows.append(json.dumps(row))

# Write to JSONL format (one JSON object per line)
with open("user_events.jsonl", "w") as f:
    f.write("\n".join(rows))

print(f"Generated {len(rows)} JSON log entries")

Why JSONL format? Because it’s “one JSON per line,” which DuckDB natively supports for direct reading without any preprocessing.

Step 2: Core Solution—One SQL Line to Flatten JSON

Method 1: Directly Query JSONL Files

import duckdb

con = duckdb.connect(":memory:")

# One line of code: read JSONL and automatically expand nested fields
result = con.execute("""
    SELECT 
        user_id,
        event,
        timestamp,
        payload.page,
        payload.device,
        payload.duration_ms,
        payload.referral
    FROM read_json_auto('user_events.jsonl')
""").fetchdf()

print(result.head())

Output:

   user_id         event            timestamp       page   device  duration_ms referral
0    15234         view   2026-06-03T08:12:45Z    /products/123  mobile          1234  wechat
1     8901      add_to_cart  2026-06-05T14:23:11Z         /cart  desktop           890  google
2    12345       purchase  2026-06-07T09:45:33Z      /checkout  mobile          2100  direct

That’s it. read_json_auto() automatically recognizes the JSON structure, and nested fields are accessed with ..

Method 2: Handle Deeper Nested JSON

Real-world data is often more complex. Consider deeper nesting:

{
    "user_id": 10234,
    "event": "purchase",
    "timestamp": "2026-06-15T14:32:01Z",
    "payload": {
        "page": "/products/456",
        "device": {
            "type": "mobile",
            "os": "iOS",
            "browser": "Safari"
        },
        "product": {
            "id": 456,
            "price": 299.00,
            "category": "electronics"
        },
        "duration_ms": 230
    }
}

DuckDB handles this effortlessly:

SELECT 
    user_id,
    event,
    timestamp,
    payload.page,
    payload.device.type AS device_type,
    payload.device.os AS device_os,
    payload.product.price,
    payload.product.category,
    payload.duration_ms
FROM read_json_auto('deep_nested.jsonl')

You can nest as deep as you want, accessing with . all the way. This is one of DuckDB’s biggest advantages over Pandas—Pandas requires json_normalize() to flatten first, while DuckDB expands directly during query execution.

Step 3: Practical Analysis—Generating Reports from JSON Logs

Analysis 1: Page Visit Rankings and Conversion Rates

SELECT 
    payload.page AS page,
    COUNT(*) AS total_views,
    COUNT(*) FILTER (WHERE event = 'view') AS views,
    COUNT(*) FILTER (WHERE event = 'click') AS clicks,
    COUNT(*) FILTER (WHERE event = 'add_to_cart') AS cart_adds,
    COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
    ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS conversion_rate
FROM read_json_auto('user_events.jsonl')
GROUP BY payload.page
ORDER BY total_views DESC

Output:

          page  total_views  views  clicks  cart_adds  purchases  conversion_rate
       /home        7234    4521    1890        523        189            2.61
/products/123    6891    3987    1654        498        176            2.55
/products/456    6543    3654    1523        467        201            3.07
         /cart    5432    1234    1876        1543        432            7.95
     /checkout    4321     987    1432        1123        678           15.69
      /profile    3876    2345     789        198         45            1.16
      /search    5678    3456    1789        287         89            1.57

Key insight: /checkout has the highest conversion rate (15.69%) but lowest traffic. Users who reach checkout are highly likely to convert.

Analysis 2: Mobile vs Desktop Behavior Differences

SELECT 
    payload.device AS device_type,
    COUNT(*) AS total_events,
    ROUND(AVG(payload.duration_ms), 0) AS avg_duration_ms,
    ROUND(MEDIAN(payload.duration_ms), 0) AS median_duration_ms,
    COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
    ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS purchase_rate
FROM read_json_auto('user_events.jsonl')
GROUP BY payload.device
ORDER BY total_events DESC

Output:

 device_type  total_events  avg_duration_ms  median_duration_ms  purchases  purchase_rate
    mobile          30120            2456.0              1890.0       1234            4.10
   desktop          15234            3123.0              2567.0        876            5.75
    tablet           4646            2789.0              2234.0        234            5.04

Key finding: Mobile accounts for 60% of traffic but has only a 4.1% purchase rate, far below desktop’s 5.75%. This suggests mobile UX optimization is needed.

Analysis 3: Hourly User Activity

SELECT 
    DATE(timestamp) AS sale_date,
    EXTRACT(HOUR FROM timestamp) AS hour_of_day,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users,
    COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
    SUM(payload.duration_ms) AS total_engagement_ms
FROM read_json_auto('user_events.jsonl')
GROUP BY 1, 2
ORDER BY 1, 2

This gives you the user activity curve over time to identify optimal operational moments.

Advanced Techniques

Schema Discovery: Exploring Unknown JSON Structures

Sometimes you receive JSON data without knowing its schema. DuckDB’s json_keys() function is perfect for this:

-- Discover all top-level keys in the JSON
SELECT DISTINCT json_keys(payload) AS keys
FROM read_json_auto('user_events.jsonl')
LIMIT 5

This acts as “schema discovery”—quickly understanding the structure even when you don’t know what to expect.

Remote File Queries

If your JSON data is on S3 or OSS, DuckDB can read it directly: read_json_auto('s3://bucket/data.jsonl'), without downloading locally. With the httpfs extension, you can even query JSON files hosted on HTTP URLs.

Performance Comparison: DuckDB JSON vs Python Parsing

Data ScalePython (json.loads)DuckDB (read_json_auto)
10K rows~0.3 sec~0.02 sec
100K rows~3 sec~0.15 sec
1M rows~35 sec~1.2 sec
10M rows~6 min~12 sec
MemoryAll loaded into RAMStreaming, on-demand read

Key difference: DuckDB’s JSON parsing is vectorized and supports lazy evaluation. Even if your JSON file is 5GB, you don’t need 5GB of memory—it only reads what your query needs.

Complete Wrapper: Build a Reusable JSON Analyzer

import duckdb
from typing import Optional
import pandas as pd

class JSONAnalyzer:
    """Lightweight JSON log analyzer"""
    
    def __init__(self, jsonl_path: str):
        self.con = duckdb.connect(":memory:")
        self.jsonl_path = jsonl_path
        self._table_loaded = False
    
    def _ensure_loaded(self):
        if not self._table_loaded:
            self.con.execute(f"CREATE TABLE events AS SELECT * FROM read_json_auto('{self.jsonl_path}')")
            self._table_loaded = True
    
    def page_ranking(self, top_n: int = 10) -> pd.DataFrame:
        """Page visit ranking and conversion rate"""
        self._ensure_loaded()
        return self.con.execute("""
            SELECT 
                payload.page,
                COUNT(*) AS views,
                COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
                ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS conversion_rate
            FROM events
            GROUP BY payload.page
            ORDER BY views DESC
            LIMIT ?
        """, [top_n]).fetchdf()
    
    def device_comparison(self) -> pd.DataFrame:
        """Device type comparison analysis"""
        self._ensure_loaded()
        return self.con.execute("""
            SELECT 
                payload.device,
                COUNT(*) AS total,
                ROUND(AVG(payload.duration_ms), 0) AS avg_duration,
                ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS purchase_rate
            FROM events
            GROUP BY payload.device
            ORDER BY total DESC
        """).fetchdf()
    
    def close(self):
        self.con.close()

# Usage example
analyzer = JSONAnalyzer("user_events.jsonl")
print("=== Page Ranking ===")
print(analyzer.page_ranking(5).to_string(index=False))
print("\n=== Device Comparison ===")
print(analyzer.device_comparison().to_string(index=False))
analyzer.close()

Put this class in your toolkit, and any JSON log data can be analyzed immediately by just providing the file path.

Comparison with Traditional Tools

FeatureDuckDBPandasSparkTraditional Python
Native JSON support✅ read_json_auto❌ Needs json_normalize✅ Supported but clunky❌ Manual parsing needed
Nested field accesspayload.page❌ Needs flattening⚠️ Complex❌ Manual recursion
Streaming processing✅ Lazy evaluation❌ Full load✅ Distributed❌ Memory limited
Big data performance⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Learning curveLow (SQL)Medium (Python)HighLow
Deployment complexityZero (single file)LowHigh (cluster)Low

Monetization Advice

Mastering DuckDB’s JSON processing skills can be converted into real revenue:

  1. Data Consulting Services: Provide JSON log analysis services for enterprises, helping e-commerce and gaming companies understand user behavior. Single projects charge ¥5,000–¥20,000.

  2. Automated Reporting Products: Package the JSONAnalyzer class above into a SaaS product where users upload JSONL files and get automated visualized reports. Monthly subscription: ¥99–¥499/user.

  3. Training Courses: Create a “DuckDB Practical JSON Data Analysis” course series and sell on platforms like Udemy or GeekTime.

  4. Open Source Plugin Development: Build industry-specific JSON parsing extensions based on read_json_auto(), earning through GitHub Sponsors or enterprise licensing.

  5. Internal Efficiency Improvement: Promote DuckDB within your company to replace traditional JSON parsing workflows. The saved development man-hours directly quantify as your performance highlights.

Work that used to take half a day now takes 10 minutes. Next time you see JSON data, try read_json_auto() first. You’ll discover that many tasks that once took half a day can now be done in 10 minutes.


💡 Pro Tip: If your JSON data is on S3 or OSS, DuckDB can read it directly—read_json_auto('s3://bucket/data.jsonl')—no need to download locally.

📺 More DuckDB practical tutorials: Subscribe to the YouTube channel → youtube.com/@duckdblab

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.