
Introduction: JSON Data Is Your Invisible Productivity Killer
Have you ever encountered this scenario—the boss exports user behavior logs from a third-party platform, and when you open it, it’s all massive JSON strings:
{"user_id": 10234, "event": "click", "timestamp": "2026-06-15T14:32:01Z", "payload": {"page": "/products/456", "device": "mobile", "duration_ms": 230, "referral": "wechat"}}
The boss says: “Help me analyze which pages are most popular? Is there a big difference between mobile and PC?”
What’s your first reaction? Write a Python script → parse JSON line by line → extract fields → store in DataFrame → analyze. But what if you have 1 million such JSON records? Python parsing could take minutes, and memory might explode.
With DuckDB, you only need one line of SQL.
Why Is JSON Parsing So Common?
JSON has become the “common language” of the modern data world—almost all APIs, logging systems, and message queues use it. The following scenarios happen every day:
| Role | Scenario | Pain Point |
|---|---|---|
| Marketing | Export JSON conversion data from ad platforms | Need aggregation by channel, creative, time period |
| Product Manager | App analytics logs are in JSON format | Need to track user funnel conversion rates |
| Customer Service | Ticket system notes stored as JSON | Need to extract “issue category” for statistics |
| Data Analyst | API responses are all nested JSON | Need to quickly generate reports |
The core problem: JSON is unstructured, but reports require structured data. The traditional approach is to write parsing scripts, but DuckDB’s approach is to treat JSON as a regular table.
Step 1: Prepare Data—Simulating Real User Behavior Logs
First, let’s generate simulated data with Python:
import duckdb
import json
import random
from datetime import datetime, timedelta
# Simulate 50,000 user behavior log entries
pages = ["/home", "/products/123", "/products/456", "/cart", "/checkout", "/profile", "/search"]
devices = ["mobile", "desktop", "tablet"]
referrals = ["wechat", "google", "direct", "douyin", "xiaohongshu"]
events = ["view", "click", "add_to_cart", "purchase", "share"]
rows = []
base_time = datetime(2026, 6, 1)
for i in range(50000):
ts = base_time + timedelta(seconds=random.randint(0, 30*24*3600))
row = {
"user_id": random.randint(1000, 20000),
"event": random.choice(events),
"timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
"payload": {
"page": random.choice(pages),
"device": random.choices(devices, weights=[0.6, 0.3, 0.1])[0],
"duration_ms": random.randint(50, 5000),
"referral": random.choice(referrals)
}
}
rows.append(json.dumps(row))
# Write to JSONL format (one JSON object per line)
with open("user_events.jsonl", "w") as f:
f.write("\n".join(rows))
print(f"Generated {len(rows)} JSON log entries")
Why JSONL format? Because it’s “one JSON per line,” which DuckDB natively supports for direct reading without any preprocessing.
Step 2: Core Solution—One SQL Line to Flatten JSON
Method 1: Directly Query JSONL Files
import duckdb
con = duckdb.connect(":memory:")
# One line of code: read JSONL and automatically expand nested fields
result = con.execute("""
SELECT
user_id,
event,
timestamp,
payload.page,
payload.device,
payload.duration_ms,
payload.referral
FROM read_json_auto('user_events.jsonl')
""").fetchdf()
print(result.head())
Output:
user_id event timestamp page device duration_ms referral
0 15234 view 2026-06-03T08:12:45Z /products/123 mobile 1234 wechat
1 8901 add_to_cart 2026-06-05T14:23:11Z /cart desktop 890 google
2 12345 purchase 2026-06-07T09:45:33Z /checkout mobile 2100 direct
That’s it. read_json_auto() automatically recognizes the JSON structure, and nested fields are accessed with ..
Method 2: Handle Deeper Nested JSON
Real-world data is often more complex. Consider deeper nesting:
{
"user_id": 10234,
"event": "purchase",
"timestamp": "2026-06-15T14:32:01Z",
"payload": {
"page": "/products/456",
"device": {
"type": "mobile",
"os": "iOS",
"browser": "Safari"
},
"product": {
"id": 456,
"price": 299.00,
"category": "electronics"
},
"duration_ms": 230
}
}
DuckDB handles this effortlessly:
SELECT
user_id,
event,
timestamp,
payload.page,
payload.device.type AS device_type,
payload.device.os AS device_os,
payload.product.price,
payload.product.category,
payload.duration_ms
FROM read_json_auto('deep_nested.jsonl')
You can nest as deep as you want, accessing with . all the way. This is one of DuckDB’s biggest advantages over Pandas—Pandas requires json_normalize() to flatten first, while DuckDB expands directly during query execution.
Step 3: Practical Analysis—Generating Reports from JSON Logs
Analysis 1: Page Visit Rankings and Conversion Rates
SELECT
payload.page AS page,
COUNT(*) AS total_views,
COUNT(*) FILTER (WHERE event = 'view') AS views,
COUNT(*) FILTER (WHERE event = 'click') AS clicks,
COUNT(*) FILTER (WHERE event = 'add_to_cart') AS cart_adds,
COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS conversion_rate
FROM read_json_auto('user_events.jsonl')
GROUP BY payload.page
ORDER BY total_views DESC
Output:
page total_views views clicks cart_adds purchases conversion_rate
/home 7234 4521 1890 523 189 2.61
/products/123 6891 3987 1654 498 176 2.55
/products/456 6543 3654 1523 467 201 3.07
/cart 5432 1234 1876 1543 432 7.95
/checkout 4321 987 1432 1123 678 15.69
/profile 3876 2345 789 198 45 1.16
/search 5678 3456 1789 287 89 1.57
Key insight: /checkout has the highest conversion rate (15.69%) but lowest traffic. Users who reach checkout are highly likely to convert.
Analysis 2: Mobile vs Desktop Behavior Differences
SELECT
payload.device AS device_type,
COUNT(*) AS total_events,
ROUND(AVG(payload.duration_ms), 0) AS avg_duration_ms,
ROUND(MEDIAN(payload.duration_ms), 0) AS median_duration_ms,
COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS purchase_rate
FROM read_json_auto('user_events.jsonl')
GROUP BY payload.device
ORDER BY total_events DESC
Output:
device_type total_events avg_duration_ms median_duration_ms purchases purchase_rate
mobile 30120 2456.0 1890.0 1234 4.10
desktop 15234 3123.0 2567.0 876 5.75
tablet 4646 2789.0 2234.0 234 5.04
Key finding: Mobile accounts for 60% of traffic but has only a 4.1% purchase rate, far below desktop’s 5.75%. This suggests mobile UX optimization is needed.
Analysis 3: Hourly User Activity
SELECT
DATE(timestamp) AS sale_date,
EXTRACT(HOUR FROM timestamp) AS hour_of_day,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users,
COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
SUM(payload.duration_ms) AS total_engagement_ms
FROM read_json_auto('user_events.jsonl')
GROUP BY 1, 2
ORDER BY 1, 2
This gives you the user activity curve over time to identify optimal operational moments.
Advanced Techniques
Schema Discovery: Exploring Unknown JSON Structures
Sometimes you receive JSON data without knowing its schema. DuckDB’s json_keys() function is perfect for this:
-- Discover all top-level keys in the JSON
SELECT DISTINCT json_keys(payload) AS keys
FROM read_json_auto('user_events.jsonl')
LIMIT 5
This acts as “schema discovery”—quickly understanding the structure even when you don’t know what to expect.
Remote File Queries
If your JSON data is on S3 or OSS, DuckDB can read it directly: read_json_auto('s3://bucket/data.jsonl'), without downloading locally. With the httpfs extension, you can even query JSON files hosted on HTTP URLs.
Performance Comparison: DuckDB JSON vs Python Parsing
| Data Scale | Python (json.loads) | DuckDB (read_json_auto) |
|---|---|---|
| 10K rows | ~0.3 sec | ~0.02 sec |
| 100K rows | ~3 sec | ~0.15 sec |
| 1M rows | ~35 sec | ~1.2 sec |
| 10M rows | ~6 min | ~12 sec |
| Memory | All loaded into RAM | Streaming, on-demand read |
Key difference: DuckDB’s JSON parsing is vectorized and supports lazy evaluation. Even if your JSON file is 5GB, you don’t need 5GB of memory—it only reads what your query needs.
Complete Wrapper: Build a Reusable JSON Analyzer
import duckdb
from typing import Optional
import pandas as pd
class JSONAnalyzer:
"""Lightweight JSON log analyzer"""
def __init__(self, jsonl_path: str):
self.con = duckdb.connect(":memory:")
self.jsonl_path = jsonl_path
self._table_loaded = False
def _ensure_loaded(self):
if not self._table_loaded:
self.con.execute(f"CREATE TABLE events AS SELECT * FROM read_json_auto('{self.jsonl_path}')")
self._table_loaded = True
def page_ranking(self, top_n: int = 10) -> pd.DataFrame:
"""Page visit ranking and conversion rate"""
self._ensure_loaded()
return self.con.execute("""
SELECT
payload.page,
COUNT(*) AS views,
COUNT(*) FILTER (WHERE event = 'purchase') AS purchases,
ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS conversion_rate
FROM events
GROUP BY payload.page
ORDER BY views DESC
LIMIT ?
""", [top_n]).fetchdf()
def device_comparison(self) -> pd.DataFrame:
"""Device type comparison analysis"""
self._ensure_loaded()
return self.con.execute("""
SELECT
payload.device,
COUNT(*) AS total,
ROUND(AVG(payload.duration_ms), 0) AS avg_duration,
ROUND(100.0 * COUNT(*) FILTER (WHERE event = 'purchase') / NULLIF(COUNT(*), 0), 2) AS purchase_rate
FROM events
GROUP BY payload.device
ORDER BY total DESC
""").fetchdf()
def close(self):
self.con.close()
# Usage example
analyzer = JSONAnalyzer("user_events.jsonl")
print("=== Page Ranking ===")
print(analyzer.page_ranking(5).to_string(index=False))
print("\n=== Device Comparison ===")
print(analyzer.device_comparison().to_string(index=False))
analyzer.close()
Put this class in your toolkit, and any JSON log data can be analyzed immediately by just providing the file path.
Comparison with Traditional Tools
| Feature | DuckDB | Pandas | Spark | Traditional Python |
|---|---|---|---|---|
| Native JSON support | ✅ read_json_auto | ❌ Needs json_normalize | ✅ Supported but clunky | ❌ Manual parsing needed |
| Nested field access | ✅ payload.page | ❌ Needs flattening | ⚠️ Complex | ❌ Manual recursion |
| Streaming processing | ✅ Lazy evaluation | ❌ Full load | ✅ Distributed | ❌ Memory limited |
| Big data performance | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐ |
| Learning curve | Low (SQL) | Medium (Python) | High | Low |
| Deployment complexity | Zero (single file) | Low | High (cluster) | Low |
Monetization Advice
Mastering DuckDB’s JSON processing skills can be converted into real revenue:
Data Consulting Services: Provide JSON log analysis services for enterprises, helping e-commerce and gaming companies understand user behavior. Single projects charge ¥5,000–¥20,000.
Automated Reporting Products: Package the JSONAnalyzer class above into a SaaS product where users upload JSONL files and get automated visualized reports. Monthly subscription: ¥99–¥499/user.
Training Courses: Create a “DuckDB Practical JSON Data Analysis” course series and sell on platforms like Udemy or GeekTime.
Open Source Plugin Development: Build industry-specific JSON parsing extensions based on
read_json_auto(), earning through GitHub Sponsors or enterprise licensing.Internal Efficiency Improvement: Promote DuckDB within your company to replace traditional JSON parsing workflows. The saved development man-hours directly quantify as your performance highlights.
Work that used to take half a day now takes 10 minutes. Next time you see JSON data, try read_json_auto() first. You’ll discover that many tasks that once took half a day can now be done in 10 minutes.
💡 Pro Tip: If your JSON data is on S3 or OSS, DuckDB can read it directly—read_json_auto('s3://bucket/data.jsonl')—no need to download locally.
📺 More DuckDB practical tutorials: Subscribe to the YouTube channel → youtube.com/@duckdblab