How to select first row in each GROUP BY group?

To select the first row in each GROUP BY group in SQL, use the ROW_NUMBER() window function (standard SQL) or database-specific methods like DISTINCT ON (PostgreSQL). Below are the most efficient approaches:

1. Using ROW_NUMBER() (Standard SQL, Works in Most Databases)

Assign a row number to each group and select the first row from each group.

Example Table (sales):

sale_idregionsale_dateamount
1East2023-01-05100
2East2023-01-01150
3West2023-01-03200
4West2023-01-02250

Query:

WITH ranked_sales AS (
  SELECT 
    *,
    ROW_NUMBER() OVER (
      PARTITION BY region 
      ORDER BY sale_date  -- Order by the column defining "first" (e.g., earliest date)
    ) AS rn
  FROM sales
)
SELECT sale_id, region, sale_date, amount
FROM ranked_sales
WHERE rn = 1;  -- Select first row per group

Result:

sale_idregionsale_dateamount
2East2023-01-01150
4West2023-01-02250

2. Using DISTINCT ON (PostgreSQL Only)

Directly select the first row per group without a CTE:

SELECT DISTINCT ON (region) *
FROM sales
ORDER BY region, sale_date;  -- Order by group column and the column defining "first"

3. Using a Subquery with MIN()/MAX() (Works in MySQL, SQLite, etc.)

Find the earliest/latest value in the group and join back to the original table:

SELECT s.*
FROM sales s
INNER JOIN (
  SELECT 
    region, 
    MIN(sale_date) AS earliest_date  -- Replace with MAX() for the latest row
  FROM sales
  GROUP BY region
) sub ON s.region = sub.region AND s.sale_date = sub.earliest_date;

Limitations:

  • Fails if multiple rows share the same MIN()/MAX() value in a group.
  • Requires a unique identifier for exact matches.

Key Considerations

  • Ordering: Use ORDER BY in the window function to define what “first” means (e.g., earliest date, smallest ID).
  • Ties: Use RANK() instead of ROW_NUMBER() if you want to include ties (multiple “first” rows).
  • Performance: The ROW_NUMBER() method is efficient for large datasets.

Handling Ties (Multiple “First” Rows)

To include all rows tied for first place, use RANK():

WITH ranked_sales AS (
  SELECT 
    *,
    RANK() OVER (PARTITION BY region ORDER BY sale_date) AS rk
  FROM sales
)
SELECT sale_id, region, sale_date, amount
FROM ranked_sales
WHERE rk = 1;

Database-Specific Notes

DatabaseBest Method
PostgreSQLDISTINCT ON
MySQLSubquery with MIN()/MAX()
SQL ServerROW_NUMBER()
OracleROW_NUMBER()
SQLiteROW_NUMBER() (v3.25+)

Summary

  • Standard Approach: Use ROW_NUMBER() with a CTE.
  • PostgreSQL: Prefer DISTINCT ON for simplicity.
  • Legacy Databases: Use subqueries with MIN()/MAX() and joins.

Leave a Reply

Your email address will not be published. Required fields are marked *