Selecting Distinct Rows Based on a Group in SQL

Introduction

In SQL, you often need to retrieve unique rows from a table. The DISTINCT keyword is a powerful tool for achieving this, but what if you want to select distinct rows within groups defined by one or more columns? This tutorial explains how to achieve this, focusing on scenarios where you want to retrieve only one representative row for each group.

The Problem: Selecting One Row Per Group

Imagine you have a table of products with attributes like ID, SKU, and Product Category. You might want to list each unique product category only once, alongside the details of one product from that category. The standard DISTINCT keyword won’t directly solve this, as it simply eliminates duplicate rows based on all selected columns.

The Solution: Combining Window Functions or Subqueries

There are several ways to select distinct rows based on a group. The most common and efficient approaches involve either window functions (specifically ROW_NUMBER()) or subqueries with aggregate functions like MIN() or GROUP BY.

1. Using Window Functions (`ROW_NUMBER()`)

Window functions allow you to calculate values across a set of table rows that are related to the current row. ROW_NUMBER() assigns a unique sequential integer to each row within a partition (group).

Here’s how it works:

Partitioning: Define the column(s) by which you want to group the rows. This creates partitions of data.
Ordering: Within each partition, specify the order in which the rows should be numbered. This determines which row will be considered the "first" within that group.
Filtering: Select only the rows where the ROW_NUMBER() is equal to 1. This effectively selects one representative row per group.

Example:

Let’s assume we have a table named Products with the following structure:

| ID | SKU | Product Category |
|—|—|—|
| 1 | FOO-23 | Orange |
| 2 | BAR-23 | Orange |
| 3 | FOO-24 | Apple |
| 4 | FOO-25 | Orange |

To select one product per category, the following query can be used:

SELECT 
    ID,
    SKU,
    Product_Category
FROM (
    SELECT 
        ID,
        SKU,
        Product_Category,
        ROW_NUMBER() OVER (PARTITION BY Product_Category ORDER BY ID) AS RowNum
    FROM 
        Products
) AS RankedProducts
WHERE 
    RowNum = 1;

In this query:

PARTITION BY Product_Category divides the rows into groups based on the Product_Category column.
ORDER BY ID sorts the rows within each partition by the ID column. The row with the smallest ID within each category will be assigned RowNum = 1.
WHERE RowNum = 1 filters the results to include only the first row from each category.

2. Using Subqueries and Aggregate Functions

An alternative approach involves using a subquery to identify the minimum (or maximum) value of a key column (like ID) within each group. This value can then be used to filter the main query.

Example:

Using the same Products table as above, the following query achieves the same result:

SELECT 
    p.ID,
    p.SKU,
    p.Product_Category
FROM 
    Products p
INNER JOIN (
    SELECT 
        MIN(ID) AS MinID
    FROM 
        Products
    GROUP BY 
        Product_Category
) AS MinIDs ON p.ID = MinIDs.MinID;

In this query:

The subquery SELECT MIN(ID) AS MinID FROM Products GROUP BY Product_Category finds the minimum ID for each Product_Category.
The INNER JOIN combines the Products table with the results of the subquery, effectively filtering the Products table to include only the rows with the minimum ID for each category.

Choosing the Right Approach

Both window functions and subqueries can achieve the desired result. Here’s a general guideline:

Window functions are often more efficient, especially when dealing with large datasets. They avoid the need for separate subqueries and can be optimized by the database engine.
Subqueries can be easier to understand and write, particularly for simple cases. They are a good choice when performance is not critical.