Introduction
In SQL, you often need to retrieve unique rows from a table. The DISTINCT
keyword is a powerful tool for achieving this, but what if you want to select distinct rows within groups defined by one or more columns? This tutorial explains how to achieve this, focusing on scenarios where you want to retrieve only one representative row for each group.
The Problem: Selecting One Row Per Group
Imagine you have a table of products with attributes like ID, SKU, and Product Category. You might want to list each unique product category only once, alongside the details of one product from that category. The standard DISTINCT
keyword won’t directly solve this, as it simply eliminates duplicate rows based on all selected columns.
The Solution: Combining Window Functions or Subqueries
There are several ways to select distinct rows based on a group. The most common and efficient approaches involve either window functions (specifically ROW_NUMBER()
) or subqueries with aggregate functions like MIN()
or GROUP BY
.
1. Using Window Functions (ROW_NUMBER()
)
Window functions allow you to calculate values across a set of table rows that are related to the current row. ROW_NUMBER()
assigns a unique sequential integer to each row within a partition (group).
Here’s how it works:
- Partitioning: Define the column(s) by which you want to group the rows. This creates partitions of data.
- Ordering: Within each partition, specify the order in which the rows should be numbered. This determines which row will be considered the "first" within that group.
- Filtering: Select only the rows where the
ROW_NUMBER()
is equal to 1. This effectively selects one representative row per group.
Example:
Let’s assume we have a table named Products
with the following structure:
| ID | SKU | Product Category |
|—|—|—|
| 1 | FOO-23 | Orange |
| 2 | BAR-23 | Orange |
| 3 | FOO-24 | Apple |
| 4 | FOO-25 | Orange |
To select one product per category, the following query can be used:
SELECT
ID,
SKU,
Product_Category
FROM (
SELECT
ID,
SKU,
Product_Category,
ROW_NUMBER() OVER (PARTITION BY Product_Category ORDER BY ID) AS RowNum
FROM
Products
) AS RankedProducts
WHERE
RowNum = 1;
In this query:
PARTITION BY Product_Category
divides the rows into groups based on theProduct_Category
column.ORDER BY ID
sorts the rows within each partition by theID
column. The row with the smallestID
within each category will be assignedRowNum = 1
.WHERE RowNum = 1
filters the results to include only the first row from each category.
2. Using Subqueries and Aggregate Functions
An alternative approach involves using a subquery to identify the minimum (or maximum) value of a key column (like ID
) within each group. This value can then be used to filter the main query.
Example:
Using the same Products
table as above, the following query achieves the same result:
SELECT
p.ID,
p.SKU,
p.Product_Category
FROM
Products p
INNER JOIN (
SELECT
MIN(ID) AS MinID
FROM
Products
GROUP BY
Product_Category
) AS MinIDs ON p.ID = MinIDs.MinID;
In this query:
- The subquery
SELECT MIN(ID) AS MinID FROM Products GROUP BY Product_Category
finds the minimumID
for eachProduct_Category
. - The
INNER JOIN
combines theProducts
table with the results of the subquery, effectively filtering theProducts
table to include only the rows with the minimumID
for each category.
Choosing the Right Approach
Both window functions and subqueries can achieve the desired result. Here’s a general guideline:
- Window functions are often more efficient, especially when dealing with large datasets. They avoid the need for separate subqueries and can be optimized by the database engine.
- Subqueries can be easier to understand and write, particularly for simple cases. They are a good choice when performance is not critical.