Featured image of post DuckDB GROUP BY & Aggregation Guide: Master Data Summarization

DuckDB GROUP BY & Aggregation Guide: Master Data Summarization

Complete DuckDB GROUP BY and aggregation guide with COUNT, SUM, AVG, MIN, MAX, HAVING, GROUPING SETS, CUBE, ROLLUP, and GROUP BY ALL. Real-world SQL examples included.

Introduction

If you work with data, GROUP BY and aggregation are the backbone of your analytical queries. Whether you’re calculating total sales per region, counting users per day, or finding average transaction values, DuckDB provides a rich set of aggregation features that go far beyond the SQL standard.

DuckDB is an in-process analytical database designed for data science and analytical workloads. Its GROUP BY implementation includes modern extensions like GROUP BY ALL, GROUPING SETS, CUBE, and ROLLUP that make complex reporting queries dramatically simpler.

In this guide, you’ll learn:

  • Basic GROUP BY syntax and common aggregate functions
  • Multi-column grouping and the HAVING clause
  • DuckDB-specific features: GROUPING SETS, CUBE, ROLLUP
  • The convenient GROUP BY ALL syntax
  • Practical examples with real-world data scenarios

This guide builds on concepts from the DuckDB SQL Syntax Guide and the DuckDB Beginner’s Guide 2026. If you’re new to DuckDB, start there first.


1. Basic GROUP BY Syntax

The GROUP BY clause groups rows that have the same values in specified columns, then applies aggregate functions to each group.

SELECT column_name, AGGREGATE_FUNCTION(column_name)
FROM table_name
GROUP BY column_name;

Let’s start with a sample dataset. Create a sales table and insert some data:

CREATE TABLE sales AS
SELECT * FROM (VALUES
  ('Electronics', 'North', 1200, '2026-01-15'),
  ('Clothing',    'North', 450,  '2026-01-16'),
  ('Electronics', 'South', 1800, '2026-01-17'),
  ('Clothing',    'South', 600,  '2026-01-18'),
  ('Electronics', 'North', 900,  '2026-01-19'),
  ('Clothing',    'North', 300,  '2026-02-01'),
  ('Electronics', 'South', 2100, '2026-02-02'),
  ('Clothing',    'South', 750,  '2026-02-03')
) AS t(category, region, amount, sale_date);

Now, a simple GROUP BY to get total sales by category:

SELECT category, SUM(amount) AS total_sales
FROM sales
GROUP BY category;

Result:

categorytotal_sales
Electronics6000
Clothing2100

2. Common Aggregate Functions

DuckDB supports all standard SQL aggregate functions. Here are the most commonly used ones:

COUNT — Count Rows

SELECT category, COUNT(*) AS num_orders, COUNT(DISTINCT region) AS regions
FROM sales
GROUP BY category;
categorynum_ordersregions
Electronics42
Clothing42

SUM — Calculate Total

SELECT region, SUM(amount) AS total_revenue
FROM sales
GROUP BY region;
regiontotal_revenue
North2850
South5250

AVG — Calculate Average

SELECT category, AVG(amount) AS avg_order_value
FROM sales
GROUP BY category;
categoryavg_order_value
Electronics1500.0
Clothing525.0

MIN / MAX — Find Extremes

SELECT category,
       MIN(amount) AS smallest_order,
       MAX(amount) AS largest_order
FROM sales
GROUP BY category;
categorysmallest_orderlargest_order
Electronics9002100
Clothing300750

Combining Multiple Aggregates

You can mix multiple aggregate functions in a single query:

SELECT category,
       COUNT(*)          AS num_orders,
       SUM(amount)       AS total_revenue,
       AVG(amount)       AS avg_order_value,
       MIN(amount)       AS min_order,
       MAX(amount)       AS max_order
FROM sales
GROUP BY category;

3. GROUP BY with Multiple Columns

Grouping by multiple columns creates a separate group for each unique combination of values:

SELECT category, region,
       SUM(amount) AS total_sales,
       COUNT(*)    AS num_orders
FROM sales
GROUP BY category, region;
categoryregiontotal_salesnum_orders
ElectronicsNorth21002
ElectronicsSouth39002
ClothingNorth7502
ClothingSouth13502

This is extremely useful for hierarchical reporting — you can see performance broken down by multiple dimensions in a single query.


4. The HAVING Clause

HAVING filters groups after aggregation, similar to how WHERE filters rows before aggregation.

SELECT category, SUM(amount) AS total_sales
FROM sales
GROUP BY category
HAVING SUM(amount) > 2500;
categorytotal_sales
Electronics6000

Key difference between WHERE and HAVING:

  • WHERE filters rows before grouping (can’t use aggregate functions)
  • HAVING filters groups after grouping (can use aggregate functions)
-- WHERE filters rows before aggregation
-- HAVING filters groups after aggregation
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE amount > 500         -- Exclude small orders before grouping
GROUP BY region
HAVING SUM(amount) > 2000; -- Only show regions over $2K total
regiontotal_sales
South4650

(Note: North’s small orders over $500 total only $1350, so North is excluded by HAVING.)


5. GROUPING SETS, CUBE, and ROLLUP

This is where DuckDB’s GROUP BY capabilities really shine. These features generate subtotals and grand totals in a single query.

GROUPING SETS

GROUPING SETS lets you specify multiple grouping levels explicitly:

SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS (
    (category, region),  -- detail level
    (category),          -- subtotal by category
    (region),            -- subtotal by region
    ()                   -- grand total
);
categoryregiontotal_sales
ElectronicsNorth2100
ElectronicsSouth3900
ClothingNorth750
ClothingSouth1350
ElectronicsNULL6000
ClothingNULL2100
NULLNorth2850
NULLSouth5250
NULLNULL8100

The GROUPING() function helps distinguish NULLs from actual data values vs. subtotal markers:

SELECT
  CASE WHEN GROUPING(category) = 0 THEN category ELSE 'ALL' END AS category,
  CASE WHEN GROUPING(region) = 0 THEN region ELSE 'ALL' END AS region,
  SUM(amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS ((category, region), (category), (region), ());

ROLLUP

ROLLUP creates a hierarchy of subtotals — perfect for time-series and hierarchical data:

-- Hierarchical: category → region → grand total
SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP (category, region);

This is equivalent to:

GROUP BY GROUPING SETS ((category, region), (category), ())

ROLLUP is ideal for reporting by year → quarter → month, or department → team → employee.

CUBE

CUBE generates all possible combination subtotals:

-- All combinations: category × region
SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE (category, region);

This is equivalent to:

GROUP BY GROUPING SETS ((category, region), (category), (region), ())

For N dimensions, CUBE generates 2^N grouping sets, while ROLLUP generates N+1.


6. GROUP BY ALL — DuckDB’s Productivity Booster

Introduced in DuckDB 0.8.0, GROUP BY ALL is a game-changer for writing quick analytical queries. It automatically groups by all non-aggregated columns in the SELECT list — you don’t need to manually list them.

Without GROUP BY ALL:

SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY category, region;  -- Must repeat column names

With GROUP BY ALL:

SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY ALL;  -- Automatically groups by category and region

This seems simple, but it saves enormous time in exploratory analysis where you’re rapidly iterating on queries:

-- Add more dimensions — GROUP BY ALL handles it automatically
SELECT category, region, sale_date, SUM(amount) AS daily_sales
FROM sales
GROUP BY ALL;

-- Complex expressions also work
SELECT category,
       year(sale_date) AS sale_year,
       SUM(amount) AS total_sales
FROM sales
GROUP BY ALL;

GROUP BY ALL follows a simple rule: every column in the SELECT list that is not wrapped in an aggregate function becomes a grouping column. This is especially powerful when you have many columns and don’t want to type them all twice.

Pro tip: GROUP BY ALL is unique to DuckDB among major databases. It’s one of the features that makes DuckDB exceptionally pleasant for interactive data exploration.


7. Practical Examples with Real Data

Example 1: E-commerce Order Analysis

-- Create an orders table representing real e-commerce data
CREATE TABLE orders AS
SELECT * FROM (VALUES
  ('ORD-001', 'Alice',   'Widget',  3,  15.99, '2026-01-05'),
  ('ORD-002', 'Bob',     'Gadget',  1,  49.99, '2026-01-06'),
  ('ORD-003', 'Alice',   'Widget',  2,  15.99, '2026-01-10'),
  ('ORD-004', 'Charlie', 'Gadget',  5,  49.99, '2026-01-12'),
  ('ORD-005', 'Bob',     'Widget',  10, 15.99, '2026-01-15'),
  ('ORD-006', 'Alice',   'Gadget',  1,  49.99, '2026-01-20'),
  ('ORD-007', 'Charlie', 'Widget',  4,  15.99, '2026-01-22'),
  ('ORD-008', 'Bob',     'Premium', 2,  199.99,'2026-02-01'),
  ('ORD-009', 'Alice',   'Premium', 1,  199.99,'2026-02-05'),
  ('ORD-010', 'Charlie', 'Premium', 3,  199.99,'2026-02-10')
) AS t(order_id, customer, product, quantity, price, order_date);

-- Total revenue per customer (with order count and average value)
SELECT customer,
       COUNT(*)                    AS num_orders,
       SUM(quantity * price)      AS total_spent,
       AVG(quantity * price)      AS avg_order_value,
       MAX(order_date)            AS last_order_date
FROM orders
GROUP BY customer
ORDER BY total_spent DESC;

-- Monthly revenue with subtotals using ROLLUP
SELECT year(order_date) AS yr,
       month(order_date) AS mo,
       SUM(quantity * price) AS revenue
FROM orders
GROUP BY ROLLUP (yr, mo)
ORDER BY yr NULLS LAST, mo NULLS LAST;

Example 2: Customer Segmentation

-- Classify customers by spending behavior
SELECT customer,
       SUM(quantity * price) AS total_spent,
       COUNT(*) AS order_count,
       CASE
         WHEN SUM(quantity * price) >= 500 THEN 'Premium'
         WHEN SUM(quantity * price) >= 200 THEN 'Regular'
         ELSE 'Budget'
       END AS segment
FROM orders
GROUP BY customer
ORDER BY total_spent DESC;

Example 3: Finding Top Categories Per Region

-- Using GROUP BY with filtering
SELECT region, category, SUM(amount) AS total_sales
FROM sales
GROUP BY region, category
HAVING SUM(amount) > 1000
ORDER BY region, total_sales DESC;

Example 4: Statistical Aggregations

DuckDB also supports statistical aggregate functions:

SELECT category,
       AVG(amount)          AS mean,
       STDDEV(amount)       AS std_dev,
       VARIANCE(amount)     AS variance,
       MEDIAN(amount)       AS median_value,
       MODE(amount)         AS most_common_value
FROM sales
GROUP BY category;

8. Performance Tips for GROUP BY in DuckDB

  1. Order of columns matters — Put high-cardinality columns first in GROUP BY for better hash-table performance.
  2. Use GROUP BY ALL in exploration — It reduces typos and speeds up iterative querying.
  3. Prefer ROLLUP over multiple UNION ALL queries — ROLLUP computes subtotals in a single pass.
  4. Filter early — Use WHERE to reduce row count before aggregation for faster queries.
  5. Consider materialized views — For repeated aggregations on large datasets, DuckDB’s materialized views can cache results.

Conclusion

DuckDB’s GROUP BY and aggregation capabilities are among the most powerful in any database system. From basic COUNT and SUM to advanced multi-dimensional analysis with GROUPING SETS, CUBE, and ROLLUP, DuckDB provides everything you need for data summarization and reporting.

The GROUP BY ALL syntax is a standout feature that dramatically improves productivity during exploratory data analysis — no other major database offers this convenience. Combined with DuckDB’s excellent performance for analytical workloads, it makes DuckDB an ideal choice for data scientists, analysts, and engineers who need to aggregate and analyze data quickly.

To continue your DuckDB journey, check out the DuckDB SQL Syntax Guide and the DuckDB Beginner’s Guide 2026 for more foundational knowledge.


Last updated: May 30, 2026

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy