Aggregation and Grouping
The foundational knowledge stems from the NYU’s “CSCI-UA 479 Data Management and Analysis” course by Joe Versoza.
GROUP BY and Aggregate Functions
Adding a GROUP BY
clause enables the grouping of rows and the application of aggregate functions to each group:
- Place the
GROUP BY
clause afterFROM
andWHERE
, followed by the column name(s) used for grouping. - It is used to group rows after the application of the
WHERE
clause. - Aggregate functions can be applied to each group.
- Distinct from
DISTINCT,
which only provides unique values without aggregation.
Additional information can be found in the GROUP BY documentation.
Aggregate Functions
When grouping rows using GROUP BY
, various aggregate functions can be applied to each group in different parts of the query, such as the SELECT
list or ORDER BY
clause:
AVG
SUM
MAX
MIN
COUNT
Reviewing the processing order of a SELECT statement, let’s identify the placement of GROUP BY
FROM
: Determines the set of all possible rows to return.WHERE
: Filters out rows that do not match specified criteria.GROUP BY
: Combines rows into groups, and aggregate functions are computed (HAVING
can be used to filter out groups).SELECT
list: Determines the actual values of resulting rows by evaluating expressions, resolving column names, etc.DISTINCT
: Eliminates duplicate rows in the output.ORDER BY
: Sorts the output rows.LIMIT
: Restricts the output rows to a specific Reviewing
GROUP BY Examples
Assume the existence of the following table containing some data
1
2
3
4
5
6
7
CREATE TABLE student(
netid varchar(20) PRIMARY KEY,
first varchar(255) NOT NULL,
last varchar(255) NOT NULL,
midterm numeric,
registered timestamptz DEFAULT NOW()
);
Group by first name, show counts for each group
1
SELECT first, COUNT(*) FROM student GROUP BY first;
Group by first name, show the midterm everage for students with same first name
1
SELECT first, AVG(midterm) FROM stduent GROUP BY first;
Depending on the processing order of clauses, a group may become available for use!
For instance, as ORDER BY
is processed last, both GROUP BY
and the SELECT
list have already been processed. Consequently, the aliased column ‘c’ is available in the ORDER BY
clause…
1
2
3
4
5
6
-- group by country name
-- show the highest concentration for each group
SELECT country, MAX(concentration) AS c
FROM product
GROUP BY country
ORDER BY c desc;
It’s possible to specify multiple columns to group by.
- groups are formed where values for both columns are equal
- for example, the following query counts the number of students that have the same first name and same midterm score
1
SELECT first, midterm, COUNT(*) FROM student GROUP BY first, midterm;
output may look something like (two people named alice scored a 90 on their midterm)
1
2
3
Alice | 90 | 2
Alice | 71 | 1
Bob | 84 | 1
No GROUP BY
:
- if an aggregate function is used, but no
GROUP BY
is present… - all selected rows form a single group…
- the aggregate function is applied to that single group
Knowing this, it’s possible to to run query that applies an aggregate function on a group that includes all rows:
1
SELECT MAX(score) FROM student;
Add a HAVING
clause after GROUP BY
to eliminate groups that do not satisfy a condtion. HAVING
is different from WHERE
. WHERE
filters individual rows before the application of GROUP BY
, while HAVING
filters group rows created by GROUP BY
. Each column referenced in condition must unambiguously reference a grouping column.
Only include groups that have an average score greater than 70:
1
2
3
4
SELECT FIRST, AVG(midterm)
FROM student
GROUP BY first
HAVING AVG(midterm) > 70;
Additional information can be found in the HAVING documentation.
GROUP BY Errors
It is not valid for the SELECT
list to reference ungrouped columns (columns not specified in GROUP BY) unless the ungrouped column name appears as an argument to an aggregate function. This restriction exists to avoid ambiguity, as there could be multiple possible values for an ungrouped column, and selecting one value might not be meaningful for the entire group.
However, there is an exception to this rule! An ungrouped column can appear in the SELECT
list when the GROUP BY
column is a primary key. This exception arises because each group, in this case, will have only one row, ensuring that there is a unique value for the ungrouped column within each group.
GROUP BY Error Example
1
2
3
4
5
6
7
-- group by country
-- note that name doesn't appear in GROUP BY
-- and it's not in an aggregate
SELECT country, name, MAX(concentration) AS c
FROM product
GROUP BY country
ORDER BY c desc;
Foreshadowing
At times, we have a strong desire for ungrouped columns to be present in our query result. One approach to achieve this is:
- Execute the
GROUP BY
and aggregation in one query. - Utilize that query as a subquery.
- Join it with another query, perhaps itself, to retrieve the additional columns.
MAX / MIN vs ORDER BY and LIMIT
If you want to retrive the maximum or minimum value from a column, along with potential additional column data, which method do you believe would be more efficient?
- Using
MAX
orMIN
. - Using
ORDER BY
andLIMIT
(possibly with a limit of 1).
While ORDER BY
may take longer since it needs to sort all rows, MAX
or MIN
only requires a sequential scan of the rows. However, other factors, such as the total number of rows (sorting is likely to take longer with a larger number of rows) and the involvement of indexes, also play a role in the overall performance.
Performance is not the entire narrative; there are scenarios where one approach might outshine another:
ORDER BY
allows sorting by more than just the grouped column and the aggregate function.ORDER BY
withLIMIT 1
assumes only one minimum or maximum row, potentially overlooking ties where there might be more than one. In such cases, usingMAX
with a subquery could be a preferable alternative.
Note that you cannot nest aggregate functions: MAX(AVG(col))
- each aggregate requires a grouping
- the way around this is to use subqueries