Featured image of post DuckDB Array Aggregation in Practice: Building Regional Sales Heatmaps with SQL

DuckDB Array Aggregation in Practice: Building Regional Sales Heatmaps with SQL

Master DuckDB's ARRAY_AGG and QUANTILE_CONT to transform ordinary daily reports into monetizable business intelligence products. Covers cross-dimensional analysis, trend detection, and Parquet export pipelines.

From “Making Reports” to “Building Products”: What Is Your Data Skill Worth?

Many data analysts do the same thing every day: pull data, compute summaries, draw charts, write conclusions. Then the boss says, “Keep doing this tomorrow.”

But what’s truly valuable isn’t “making the daily report”—it’s discovering monetizable opportunities within the report.

Today we’re diving into two severely underrated DuckDB analytical functions—ARRAY_AGG and QUANTILE_CONT—and how they can transform an ordinary sales daily report into a chargeable business intelligence product.

Setting the Scene: Real E-commerce Sales Data

Imagine you run a small e-commerce business with sales data in sales.csv:

date,product,revenue,quantity,region
2026-01-01,Widget A,1500.00,10,North
2026-01-01,Widget B,2300.50,15,South
2026-01-01,Widget C,950.00,8,East
2026-01-02,Widget A,1800.00,12,North
2026-01-02,Widget B,2100.00,14,West
2026-01-02,Widget C,1200.00,10,South
...

The traditional approach uses GROUP BY to compute averages—but that masks the true distribution of your data. Your goal should be to extract insights across three dimensions with a single SQL query:

  1. Product dimension: Which product earns the most? What’s the average unit price? Is the revenue distribution healthy?
  2. Time dimension: Did today’s sales grow or shrink compared to yesterday? What’s the trend signal?
  3. Regional dimension: Which region contributes the most to top-selling products?

Step 1: See Through Data with QUANTILE_CONT

The traditional approach uses AVG(revenue) to see average revenue, but the mean is the most easily misleading metric. If one day a big client places a huge order, the average inflates, making you think the business is growing when it’s not.

QUANTILE_CONT reveals the true distribution:

SELECT 
    product,
    SUM(revenue) AS total_revenue,
    ROUND(AVG(revenue), 2) AS avg_daily_revenue,
    COUNT(DISTINCT date) AS active_days,
    ROUND(SUM(revenue) / SUM(quantity), 2) AS avg_unit_price,
    QUANTILE_CONT(revenue, 0.25) AS q25,
    QUANTILE_CONT(revenue, 0.50) AS median,
    QUANTILE_CONT(revenue, 0.75) AS q75,
    QUANTILE_CONT(revenue, 0.95) AS p95
FROM read_csv_auto('sales.csv')
GROUP BY product
ORDER BY total_revenue DESC;

Why is this more valuable than the mean?

Suppose Widget A has a mean of 1787 but a median of only 1650—that means a few days of large orders inflated the average. Your revenue is unstable! You need to investigate what happened on those outlier days: promotions? key accounts? seasonal factors?

This “spot anomalies then dig deeper” approach is the core capability clients pay for. You’re not “making reports”—you’re providing business diagnostics.

QUANTILE_CONT vs Traditional Statistical Functions

MetricAVG()QUANTILE_CONT(x, 0.5)Practical Meaning
Outlier resistancePoorStrongMedian unaffected by individual large orders
Data coverageSingle valueCan view 25%/50%/75%/95% percentilesSee the full picture, not just one point
Anomaly detectionRequires extra logicObserve Q1-Q3 gap directlyIQR method identifies outliers
Business interpretability“Average daily revenue is 1787”“50% of days have revenue below 1650”The latter guides decisions better

Step 2: Calculate MoM Growth with Window Function LAG()

The soul of a daily report is “change,” not “absolute value.” Use the LAG() window function to calculate the month-over-month growth rate:

WITH daily_summary AS (
    SELECT 
        date,
        SUM(revenue) AS daily_revenue,
        SUM(quantity) AS daily_quantity,
        COUNT(DISTINCT product) AS sku_count
    FROM read_csv_auto('sales.csv')
    GROUP BY date
)
SELECT 
    date,
    daily_revenue,
    daily_quantity,
    sku_count,
    LAG(daily_revenue) OVER (ORDER BY date) AS prev_day_revenue,
    ROUND(
        (daily_revenue - LAG(daily_revenue) OVER (ORDER BY date)) 
        / NULLIF(LAG(daily_revenue) OVER (ORDER BY date), 0) * 100, 
        1
    ) AS growth_pct
FROM daily_summary
ORDER BY date;

Note the use of NULLIF(..., 0)—to avoid division-by-zero errors. This is basic production SQL hygiene.

Practical tip: Wrap this SQL in a Python function, run it via schedule library or cron every early morning, and write results directly to a Parquet file. Open it the next day and the data is already fresh—no waiting for BI tool refresh.

Step 3: ARRAY_AGG — One SQL Line Replacing Multi-step Python

This is one of DuckDB’s most underrated capabilities. Array aggregation compresses all product sales for a given day into a single array, giving you the full picture at a glance.

SELECT 
    date,
    region,
    ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
    ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues,
    ARRAY_AGG(quantity ORDER BY revenue DESC) AS quantities,
    ARRAY_LENGTH(ARRAY_AGG(product ORDER BY revenue DESC)) AS product_count
FROM read_csv_auto('sales.csv')
GROUP BY date, region
ORDER BY date, region;

Example output:

  date       | region | top_products         | revenues
  2026-01-01 | North  | [Widget A, Widget C] | [1500.0, 950.0]
  2026-01-01 | South  | [Widget B]           | [2300.5]
  2026-01-02 | West   | [Widget B, Widget A] | [2100.0, 1800.0]

Advanced ARRAY_AGG Usage

Beyond simple aggregation, you can perform various operations on arrays:

-- Get top N elements from an array (TOP N products)
SELECT 
    date,
    region,
    ARRAY_SLICE(top_products, 0, 3) AS top_3_products,
    ARRAY_SUM(revenues) AS total_region_revenue
FROM (
    SELECT 
        date,
        region,
        ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
        ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues
    FROM read_csv_auto('sales.csv')
    GROUP BY date, region
)
-- Unnest arrays for further analysis
SELECT 
    date,
    region,
    UNNEST(top_products) AS ranked_product,
    UNNEST(revenues) AS ranked_revenue,
    ROW_NUMBER() OVER (PARTITION BY date, region ORDER BY revenues DESC) AS rank
FROM (
    SELECT 
        date,
        region,
        ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
        ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues
    FROM read_csv_auto('sales.csv')
    GROUP BY date, region
) sub;

ARRAY_AGG vs Traditional GROUP BY

NeedTraditional MethodDuckDB ARRAY_AGG
Top 3 products per regionSelf-join + row number filterARRAY_SLICE(ARRAY_AGG(...), 0, 3)
Array length statsCOUNT(*) + subqueryARRAY_LENGTH(ARRAY_AGG(...))
Array deduplicationCustom logicARRAY_UNIQUE(ARRAY_AGG(...))
Sort-then-aggregateORDER BY then GROUP BYARRAY_AGG(col ORDER BY ...)

Step 4: Export to Parquet in One Line, Feed Downstream Systems

The ultimate purpose of a daily report isn’t “to look at it”—it’s “to use it.” Write analysis results to Parquet and seamlessly integrate with any big data tool:

COPY (
    WITH daily_summary AS (
        SELECT 
            date,
            SUM(revenue) AS daily_revenue,
            SUM(quantity) AS daily_quantity,
            LAG(SUM(revenue)) OVER (ORDER BY date) AS prev_day_revenue,
            ROUND(
                (SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY date)) 
                / NULLIF(LAG(SUM(revenue)) OVER (ORDER BY date), 0) * 100, 1
            ) AS growth_pct
        FROM read_csv_auto('sales.csv')
        GROUP BY date
    )
    SELECT *, 
        CASE WHEN growth_pct > 10 THEN '🟢 High Growth'
             WHEN growth_pct > 0 THEN '🟡 Stable Growth'
             WHEN growth_pct > -10 THEN '🟠 Slight Decline'
             ELSE '🔴 Significant Decline'
        END AS trend_signal
    FROM daily_summary
) TO 'daily_report.parquet' (FORMAT PARQUET);

One COPY command, and the result becomes a Parquet file directly. Then you can:

  • Read it with Python and send alerts via WeChat/DingTalk bot
  • Connect Metabase to the Parquet file for visualization
  • Expose it via FastAPI as an API for frontend consumption

Python Integration: Complete Daily Report Generation Script

import duckdb
import json
from datetime import datetime

def generate_daily_report(csv_path='sales.csv'):
    """Generate a daily report with trend signals"""
    conn = duckdb.connect()
    
    # Main analysis query
    report = conn.execute("""
        WITH daily_summary AS (
            SELECT 
                date,
                SUM(revenue) AS daily_revenue,
                SUM(quantity) AS daily_quantity,
                LAG(SUM(revenue)) OVER (ORDER BY date) AS prev_day_revenue,
                ROUND(
                    (SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY date)) 
                    / NULLIF(LAG(SUM(revenue)) OVER (ORDER BY date), 0) * 100, 1
                ) AS growth_pct
            FROM read_csv_auto(?)
            GROUP BY date
        )
        SELECT *, 
            CASE WHEN growth_pct > 10 THEN '🟢 High Growth'
                 WHEN growth_pct > 0 THEN '🟡 Stable Growth'
                 WHEN growth_pct > -10 THEN '🟠 Slight Decline'
                 ELSE '🔴 Significant Decline'
            END AS trend_signal
        FROM daily_summary
        ORDER BY date DESC
    """, [csv_path]).fetchdf()
    
    # Regional × Product cross-analysis
    regional_insight = conn.execute("""
        SELECT 
            date,
            region,
            ARRAY_AGG(product ORDER BY revenue DESC) AS top_products,
            ARRAY_AGG(revenue ORDER BY revenue DESC) AS revenues
        FROM read_csv_auto(?)
        GROUP BY date, region
        ORDER BY date, region
    """, [csv_path]).fetchdf()
    
    # Export Parquet
    report.to_parquet('daily_report.parquet')
    
    # Return JSON format for API use
    return {
        'generated_at': datetime.now().isoformat(),
        'daily_report': report.to_dict(orient='records'),
        'regional_insights': regional_insight.to_dict(orient='records')
    }

# Usage example
result = generate_daily_report()
print(json.dumps(result['daily_report'][0], indent=2, ensure_ascii=False))

Monetization Ideas: Package SQL Capabilities into Data Products

With ARRAY_AGG and QUANTILE_CONT mastered, you’ve evolved from “report-making tool” to “data-driven decision product builder.” Here are concrete monetization paths:

Direction 1: Regional Sales Heatmap Dashboard (SaaS Product)

This functionality can be packaged as a “regional sales heatmap dashboard”—top products per region at a glance. Small e-commerce owners don’t need to know SQL; you just provide a frontend interface that calls this query.

Tech stack: DuckDB + FastAPI + Streamlit Pricing: Basic ¥99/month, Pro ¥299/month Acquisition channels: E-commerce communities, Xianyu, Xiaohongshu

Direction 2: Data Monitoring as a Service

Run SQL analysis automatically every day. When abnormal trends are detected, notify clients via WeChat/email. You’re not selling “reports”—you’re selling “early warnings.”

Tech stack: DuckDB + cron + WeChat Work bot Pricing: Charged by monitored data source count, ¥199-599/month Customer profile: Small merchants with multiple online stores

Direction 3: Business Diagnostic Consulting

Use QUANTILE_CONT to analyze a client’s revenue distribution, find “instability masked by the mean,” then provide improvement recommendations. This isn’t a report—it’s a diagnostic report, and the value proposition is entirely different.

Pricing: Single diagnostic ¥500-2000, monthly advisor ¥3000-8000

Summary: The Four-Step Methodology

  1. Use percentiles instead of means—see the full picture of your data, not be fooled by averages
  2. Use window functions for MoM growth—the soul of a daily report is “rate of change,” not “absolute value”
  3. Use array aggregation for cross-dimensional analysis—one SQL line replaces multi-step Python processing
  4. Use COPY to export Parquet—SQL analysis feeds directly into downstream systems

Master these four steps, and your data capability evolves from “auxiliary tool” into a “profit engine.”


📖 All code in this article verified on DuckDB 1.5.4. Want to systematically learn DuckDB from basics to production? duckdblab.org has a comprehensive tutorial series covering CSV/Parquet read/write, window functions, array operations, HTTPFS remote querying, and more—with runnable code and real-world case studies. Learn more DuckDB practical experience → duckdblab.org

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.