Post

Aggregation and Grouping

The foundational knowledge stems from the NYU’s “CSCI-UA 479 Data Management and Analysis” course by Joe Versoza.

GROUP BY and Aggregate Functions

Adding a GROUP BY clause enables the grouping of rows and the application of aggregate functions to each group:

  • Place the GROUP BY clause after FROM and WHERE, followed by the column name(s) used for grouping.
  • It is used to group rows after the application of the WHERE clause.
  • Aggregate functions can be applied to each group.
  • Distinct from DISTINCT, which only provides unique values without aggregation.

Additional information can be found in the GROUP BY documentation.

Aggregate Functions

When grouping rows using GROUP BY, various aggregate functions can be applied to each group in different parts of the query, such as the SELECT list or ORDER BY clause:

  • AVG
  • SUM
  • MAX
  • MIN
  • COUNT

Reviewing the processing order of a SELECT statement, let’s identify the placement of GROUP BY

  1. FROM: Determines the set of all possible rows to return.
  2. WHERE: Filters out rows that do not match specified criteria.
  3. GROUP BY: Combines rows into groups, and aggregate functions are computed (HAVING can be used to filter out groups).
  4. SELECT list: Determines the actual values of resulting rows by evaluating expressions, resolving column names, etc.
  5. DISTINCT: Eliminates duplicate rows in the output.
  6. ORDER BY: Sorts the output rows.
  7. LIMIT: Restricts the output rows to a specific Reviewing

GROUP BY Examples

Assume the existence of the following table containing some data

1
2
3
4
5
6
7
CREATE TABLE student(
	netid varchar(20) PRIMARY KEY,
	first varchar(255) NOT NULL,
	last varchar(255) NOT NULL,
	midterm numeric,
	registered timestamptz DEFAULT NOW()
);

Group by first name, show counts for each group

1
SELECT first, COUNT(*) FROM student GROUP BY first;

Group by first name, show the midterm everage for students with same first name

1
SELECT first, AVG(midterm) FROM stduent GROUP BY first;

Depending on the processing order of clauses, a group may become available for use!

For instance, as ORDER BY is processed last, both GROUP BY and the SELECT list have already been processed. Consequently, the aliased column ‘c’ is available in the ORDER BY clause…

1
2
3
4
5
6
-- group by country name
-- show the highest concentration for each group
SELECT country, MAX(concentration) AS c 
    FROM product 
    GROUP BY country 
    ORDER BY c desc;

It’s possible to specify multiple columns to group by.

  • groups are formed where values for both columns are equal
  • for example, the following query counts the number of students that have the same first name and same midterm score
1
SELECT first, midterm, COUNT(*) FROM student GROUP BY first, midterm;

output may look something like (two people named alice scored a 90 on their midterm)

1
2
3
Alice | 90 | 2
Alice | 71 | 1
Bob   | 84 | 1

No GROUP BY:

  • if an aggregate function is used, but no GROUP BY is present…
  • all selected rows form a single group…
  • the aggregate function is applied to that single group

Knowing this, it’s possible to to run query that applies an aggregate function on a group that includes all rows:

1
SELECT MAX(score) FROM student;

Add a HAVING clause after GROUP BY to eliminate groups that do not satisfy a condtion. HAVING is different from WHERE. WHERE filters individual rows before the application of GROUP BY, while HAVING filters group rows created by GROUP BY. Each column referenced in condition must unambiguously reference a grouping column.

Only include groups that have an average score greater than 70:

1
2
3
4
SELECT FIRST, AVG(midterm) 
	FROM student 
	GROUP BY first
	HAVING AVG(midterm) > 70;

Additional information can be found in the HAVING documentation.

GROUP BY Errors

It is not valid for the SELECT list to reference ungrouped columns (columns not specified in GROUP BY) unless the ungrouped column name appears as an argument to an aggregate function. This restriction exists to avoid ambiguity, as there could be multiple possible values for an ungrouped column, and selecting one value might not be meaningful for the entire group.

However, there is an exception to this rule! An ungrouped column can appear in the SELECT list when the GROUP BY column is a primary key. This exception arises because each group, in this case, will have only one row, ensuring that there is a unique value for the ungrouped column within each group.

GROUP BY Error Example

1
2
3
4
5
6
7
-- group by country 
-- note that name doesn't appear in GROUP BY
-- and it's not in an aggregate
SELECT country, name, MAX(concentration) AS c 
    FROM product 
    GROUP BY country 
    ORDER BY c desc;

Foreshadowing

At times, we have a strong desire for ungrouped columns to be present in our query result. One approach to achieve this is:

  1. Execute the GROUP BY and aggregation in one query.
  2. Utilize that query as a subquery.
  3. Join it with another query, perhaps itself, to retrieve the additional columns.

MAX / MIN vs ORDER BY and LIMIT

If you want to retrive the maximum or minimum value from a column, along with potential additional column data, which method do you believe would be more efficient?

  1. Using MAX or MIN.
  2. Using ORDER BY and LIMIT (possibly with a limit of 1).

While ORDER BY may take longer since it needs to sort all rows, MAX or MIN only requires a sequential scan of the rows. However, other factors, such as the total number of rows (sorting is likely to take longer with a larger number of rows) and the involvement of indexes, also play a role in the overall performance.

Performance is not the entire narrative; there are scenarios where one approach might outshine another:

  1. ORDER BY allows sorting by more than just the grouped column and the aggregate function.
  2. ORDER BY with LIMIT 1 assumes only one minimum or maximum row, potentially overlooking ties where there might be more than one. In such cases, using MAX with a subquery could be a preferable alternative.

Note that you cannot nest aggregate functions: MAX(AVG(col))

  • each aggregate requires a grouping
  • the way around this is to use subqueries