
From “Making Reports” to “Building Products”: What Is Your Data Skill Worth?
Many data analysts do the same thing every day: pull data, compute summaries, draw charts, write conclusions. Then the boss says, “Keep doing this tomorrow.”
But what’s truly valuable isn’t “making the daily report”—it’s discovering monetizable opportunities within the report.
Today we’re diving into two severely underrated DuckDB analytical functions—ARRAY_AGG and QUANTILE_CONT—and how they can transform an ordinary sales daily report into a chargeable business intelligence product.
Setting the Scene: Real E-commerce Sales Data
Imagine you run a small e-commerce business with sales data in sales.csv:
date,product,revenue,quantity,region
2026-01-01,Widget A,1500.00,10,North
2026-01-01,Widget B,2300.50,15,South
2026-01-01,Widget C,950.00,8,East
2026-01-02,Widget A,1800.00,12,North
2026-01-02,Widget B,2100.00,14,West
2026-01-02,Widget C,1200.00,10,South
...
The traditional approach uses GROUP BY to compute averages—but that masks the true distribution of your data. Your goal should be to extract insights across three dimensions with a single SQL query:
- Product dimension: Which product earns the most? What’s the average unit price? Is the revenue distribution healthy?
- Time dimension: Did today’s sales grow or shrink compared to yesterday? What’s the trend signal?
- Regional dimension: Which region contributes the most to top-selling products?
Step 1: See Through Data with QUANTILE_CONT
The traditional approach uses AVG(revenue) to see average revenue, but the mean is the most easily misleading metric. If one day a big client places a huge order, the average inflates, making you think the business is growing when it’s not.
QUANTILE_CONT reveals the true distribution:
SELECT
product,
SUM(revenue) AS total_revenue,
ROUND(AVG(revenue), 2) AS avg_daily_revenue,
COUNT(DISTINCT date) AS active_days,
ROUND(SUM(revenue) / SUM(quantity), 2) AS avg_unit_price,
QUANTILE_CONT(revenue, 0.25) AS q25,
QUANTILE_CONT(revenue, 0.50) AS median,
QUANTILE_CONT(revenue, 0.75) AS q75,
QUANTILE_CONT(revenue, 0.95) AS p95
FROM read_csv_auto('sales.csv')
GROUP BY product
ORDER BY total_revenue DESC;
Why is this more valuable than the mean?
Suppose Widget A has a mean of 1787 but a median of only 1650—that means a few days of large orders inflated the average. Your revenue is unstable! You need to investigate what happened on those outlier days: promotions? key accounts? seasonal factors?
This “spot anomalies then dig deeper” approach is the core capability clients pay for. You’re not “making reports”—you’re providing business diagnostics.
QUANTILE_CONT vs Traditional Statistical Functions
| Metric | AVG() | QUANTILE_CONT(x, 0.5) | Practical Meaning |
|---|---|---|---|
| Outlier resistance | Poor | Strong | Median unaffected by individual large orders |
| Data coverage | Single value | Can view 25%/50%/75%/95% percentiles | See the full picture, not just one point |
| Anomaly detection | Requires extra logic | Observe Q1-Q3 gap directly | IQR method identifies outliers |
| Business interpretability | “Average daily revenue is 1787” | “50% of days have revenue below 1650” | The latter guides decisions better |
Step 2: Calculate MoM Growth with Window Function LAG()
The soul of a daily report is “change,” not “absolute value.” Use the LAG() window function to calculate the month-over-month growth rate:
WITH daily_summary AS (
SELECT
date,
SUM(revenue) AS daily_revenue,
SUM(quantity) AS daily_quantity,
COUNT(DISTINCT product) AS sku_count
FROM read_csv_auto('sales.csv')
GROUP BY date
)
SELECT
date,
daily_revenue,
daily_quantity,
sku_count,
LAG(daily_revenue) OVER (ORDER BY date) AS prev_day_revenue,
ROUND(
(daily_revenue - LAG(daily_revenue) OVER (ORDER BY date))
/ NULLIF(LAG(daily_revenue) OVER (ORDER BY date), 0) * 100,
1
) AS growth_pct
FROM daily_summary
ORDER BY date;
Note the use of NULLIF(..., 0)—to avoid division-by-zero errors. This is basic production SQL hygiene.
Practical tip: Wrap this SQL in a Python function, run it via schedule library or cron every early morning, and write results directly to a Parquet file. Open it the next day and the data is already fresh—no waiting for BI tool refresh.
Step 3: ARRAY_AGG — One SQL Line Replacing Multi-step Python
This is one of DuckDB’s most underrated capabilities. Array aggregation compresses all product sales for a given day into a single array, giving you the full picture at a glance.
SELECT
date,
region,
ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues,
ARRAY_AGG(quantity ORDER BY revenue DESC) AS quantities,
ARRAY_LENGTH(ARRAY_AGG(product ORDER BY revenue DESC)) AS product_count
FROM read_csv_auto('sales.csv')
GROUP BY date, region
ORDER BY date, region;
Example output:
date | region | top_products | revenues
2026-01-01 | North | [Widget A, Widget C] | [1500.0, 950.0]
2026-01-01 | South | [Widget B] | [2300.5]
2026-01-02 | West | [Widget B, Widget A] | [2100.0, 1800.0]
Advanced ARRAY_AGG Usage
Beyond simple aggregation, you can perform various operations on arrays:
-- Get top N elements from an array (TOP N products)
SELECT
date,
region,
ARRAY_SLICE(top_products, 0, 3) AS top_3_products,
ARRAY_SUM(revenues) AS total_region_revenue
FROM (
SELECT
date,
region,
ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues
FROM read_csv_auto('sales.csv')
GROUP BY date, region
)
-- Unnest arrays for further analysis
SELECT
date,
region,
UNNEST(top_products) AS ranked_product,
UNNEST(revenues) AS ranked_revenue,
ROW_NUMBER() OVER (PARTITION BY date, region ORDER BY revenues DESC) AS rank
FROM (
SELECT
date,
region,
ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues
FROM read_csv_auto('sales.csv')
GROUP BY date, region
) sub;
ARRAY_AGG vs Traditional GROUP BY
| Need | Traditional Method | DuckDB ARRAY_AGG |
|---|---|---|
| Top 3 products per region | Self-join + row number filter | ARRAY_SLICE(ARRAY_AGG(...), 0, 3) |
| Array length stats | COUNT(*) + subquery | ARRAY_LENGTH(ARRAY_AGG(...)) |
| Array deduplication | Custom logic | ARRAY_UNIQUE(ARRAY_AGG(...)) |
| Sort-then-aggregate | ORDER BY then GROUP BY | ARRAY_AGG(col ORDER BY ...) |
Step 4: Export to Parquet in One Line, Feed Downstream Systems
The ultimate purpose of a daily report isn’t “to look at it”—it’s “to use it.” Write analysis results to Parquet and seamlessly integrate with any big data tool:
COPY (
WITH daily_summary AS (
SELECT
date,
SUM(revenue) AS daily_revenue,
SUM(quantity) AS daily_quantity,
LAG(SUM(revenue)) OVER (ORDER BY date) AS prev_day_revenue,
ROUND(
(SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY date))
/ NULLIF(LAG(SUM(revenue)) OVER (ORDER BY date), 0) * 100, 1
) AS growth_pct
FROM read_csv_auto('sales.csv')
GROUP BY date
)
SELECT *,
CASE WHEN growth_pct > 10 THEN '🟢 High Growth'
WHEN growth_pct > 0 THEN '🟡 Stable Growth'
WHEN growth_pct > -10 THEN '🟠 Slight Decline'
ELSE '🔴 Significant Decline'
END AS trend_signal
FROM daily_summary
) TO 'daily_report.parquet' (FORMAT PARQUET);
One COPY command, and the result becomes a Parquet file directly. Then you can:
- Read it with Python and send alerts via WeChat/DingTalk bot
- Connect Metabase to the Parquet file for visualization
- Expose it via FastAPI as an API for frontend consumption
Python Integration: Complete Daily Report Generation Script
import duckdb
import json
from datetime import datetime
def generate_daily_report(csv_path='sales.csv'):
"""Generate a daily report with trend signals"""
conn = duckdb.connect()
# Main analysis query
report = conn.execute("""
WITH daily_summary AS (
SELECT
date,
SUM(revenue) AS daily_revenue,
SUM(quantity) AS daily_quantity,
LAG(SUM(revenue)) OVER (ORDER BY date) AS prev_day_revenue,
ROUND(
(SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY date))
/ NULLIF(LAG(SUM(revenue)) OVER (ORDER BY date), 0) * 100, 1
) AS growth_pct
FROM read_csv_auto(?)
GROUP BY date
)
SELECT *,
CASE WHEN growth_pct > 10 THEN '🟢 High Growth'
WHEN growth_pct > 0 THEN '🟡 Stable Growth'
WHEN growth_pct > -10 THEN '🟠 Slight Decline'
ELSE '🔴 Significant Decline'
END AS trend_signal
FROM daily_summary
ORDER BY date DESC
""", [csv_path]).fetchdf()
# Regional × Product cross-analysis
regional_insight = conn.execute("""
SELECT
date,
region,
ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues
FROM read_csv_auto(?)
GROUP BY date, region
ORDER BY date, region
""", [csv_path]).fetchdf()
# Export Parquet
report.to_parquet('daily_report.parquet')
# Return JSON format for API use
return {
'generated_at': datetime.now().isoformat(),
'daily_report': report.to_dict(orient='records'),
'regional_insights': regional_insight.to_dict(orient='records')
}
# Usage example
result = generate_daily_report()
print(json.dumps(result['daily_report'][0], indent=2, ensure_ascii=False))
Monetization Ideas: Package SQL Capabilities into Data Products
With ARRAY_AGG and QUANTILE_CONT mastered, you’ve evolved from “report-making tool” to “data-driven decision product builder.” Here are concrete monetization paths:
Direction 1: Regional Sales Heatmap Dashboard (SaaS Product)
This functionality can be packaged as a “regional sales heatmap dashboard”—top products per region at a glance. Small e-commerce owners don’t need to know SQL; you just provide a frontend interface that calls this query.
Tech stack: DuckDB + FastAPI + Streamlit Pricing: Basic ¥99/month, Pro ¥299/month Acquisition channels: E-commerce communities, Xianyu, Xiaohongshu
Direction 2: Data Monitoring as a Service
Run SQL analysis automatically every day. When abnormal trends are detected, notify clients via WeChat/email. You’re not selling “reports”—you’re selling “early warnings.”
Tech stack: DuckDB + cron + WeChat Work bot Pricing: Charged by monitored data source count, ¥199-599/month Customer profile: Small merchants with multiple online stores
Direction 3: Business Diagnostic Consulting
Use QUANTILE_CONT to analyze a client’s revenue distribution, find “instability masked by the mean,” then provide improvement recommendations. This isn’t a report—it’s a diagnostic report, and the value proposition is entirely different.
Pricing: Single diagnostic ¥500-2000, monthly advisor ¥3000-8000
Summary: The Four-Step Methodology
- Use percentiles instead of means—see the full picture of your data, not be fooled by averages
- Use window functions for MoM growth—the soul of a daily report is “rate of change,” not “absolute value”
- Use array aggregation for cross-dimensional analysis—one SQL line replaces multi-step Python processing
- Use COPY to export Parquet—SQL analysis feeds directly into downstream systems
Master these four steps, and your data capability evolves from “auxiliary tool” into a “profit engine.”
📖 All code in this article verified on DuckDB 1.5.4. Want to systematically learn DuckDB from basics to production? duckdblab.org has a comprehensive tutorial series covering CSV/Parquet read/write, window functions, array operations, HTTPFS remote querying, and more—with runnable code and real-world case studies. Learn more DuckDB practical experience → duckdblab.org