SQL Deduplication Showdown: DISTINCT vs GROUP BY vs ROW_NUMBER()
One of the key challenges I’ve faced as a data engineer has been the ongoing effort to optimize our data pipelines and conserve the computational resources of our databases. A big part of this involves cleaning up data, and one of the most common tasks is deduplication. In SQL, there are several ways to tackle this, like using
- DISTINCT,
- GROUP BY,
- or the more advanced QUALIFY ROW_NUMBER() OVER (PARTITION BY …) = 1.
Each method has its advantages, and choosing the right one can make a big difference in efficiency. When comparing the performance of DISTINCT, GROUP BY, and ROW_NUMBER() with QUALIFY, it's important to understand how each method uses database resources like CPU, memory, and disk I/O. Here's a summary and in-depth breakdown of when and how to use each:
1. DISTINCT
The DISTINCT keyword is the simplest way to remove duplicates from a result set: it eliminates rows that are identical across all the selected columns. DISTINCT has minimal overhead since it doesn't need to group or rank rows, but it can slow down on large datasets because the database must sort or hash the full result set to identify duplicates.
Syntax:
SELECT DISTINCT column1, column2, ...
FROM table;
Resource Usage:
- CPU: Low to moderate, depending on the number of columns and the complexity of comparisons. The database scans the result set and compares rows (typically via sorting or hashing) to weed out duplicates.
- Memory: Low, as DISTINCT typically doesn't need to store additional metadata beyond the unique rows.
- Disk I/O: Moderate, as the database may need to scan the entire table (or indexes, if available) to retrieve all rows.
Pros:
- Efficient for small datasets or scenarios where all columns are involved.
- Works efficiently when you only want unique rows based on all selected columns.
- If indexes exist on the columns being compared, DISTINCT can use them to speed up deduplication.
Cons:
- Limited to removing duplicates where all columns match exactly.
- Doesn’t provide flexibility if you want to keep only certain rows (e.g., the latest entry).
- For larger datasets, performance can degrade as comparisons across all rows increase.
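To make this concrete, here is a minimal, self-contained sketch using SQLite from Python (the visits table and its rows are invented for illustration):

```python
import sqlite3

# Hypothetical sample data: duplicated (user_id, country) rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (user_id INTEGER, country TEXT)")
con.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "US"), (1, "US"), (2, "DE"), (2, "DE"), (3, "US")],
)

# DISTINCT removes rows that are identical across ALL selected columns.
rows = con.execute(
    "SELECT DISTINCT user_id, country FROM visits ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 'US'), (2, 'DE'), (3, 'US')]
```

Note that user 3 survives even though their country matches user 1's: DISTINCT only drops rows that are duplicates across every selected column.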
2. GROUP BY
GROUP BY is used to group rows based on one or more columns. It's typically used for aggregation (like SUM, COUNT, etc.), but it can also deduplicate by grouping on the relevant columns and selecting one representative row per group. GROUP BY is useful for deduplication with aggregation but can become resource-intensive as the dataset or the complexity of grouping increases.
Syntax:
SELECT column1, MAX(column2) -- Or any aggregate function
FROM table
GROUP BY column1;
Resource Usage:
- CPU: Moderate to high, especially if multiple columns are grouped and aggregated. For each group, the database must apply the specified aggregation functions (e.g., MAX, MIN), which can be CPU-intensive.
- Memory: Moderate to high, as the database needs to store intermediate group data in memory until the query finishes.
- Disk I/O: High, especially when there are no relevant indexes, as the database scans the table to form groups. Sorting or hashing may be required to process groups efficiently, which can increase I/O if the dataset is large.
Pros:
- Allows for advanced deduplication with aggregate functions.
- Useful when deduplication depends on specific column values (like the most recent date or the highest score).
Cons:
- Requires using aggregate functions, which may not always fit simple deduplication needs.
- Slightly more complex than DISTINCT.
- Group formation is costly: the database has to scan the dataset, sort or hash the rows into groups, and then compute aggregate functions. This can be resource-intensive, especially for large datasets. Indexes can help, especially when they cover the GROUP BY columns.
- Performance is heavily dependent on group size: a large number of groups or uneven group sizes can cause a performance hit.
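Here is a sketch of the "keep the latest entry" pattern via SQLite in Python (the orders table and its timestamps are invented for illustration):

```python
import sqlite3

# Hypothetical table: several status rows per order; keep the latest per order.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, updated_at TEXT, status TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", "created"),
        (1, "2024-01-03", "shipped"),
        (2, "2024-01-02", "created"),
    ],
)

# GROUP BY deduplicates on order_id; MAX picks the latest timestamp per group.
rows = con.execute(
    "SELECT order_id, MAX(updated_at) FROM orders GROUP BY order_id ORDER BY order_id"
).fetchall()
print(rows)  # [(1, '2024-01-03'), (2, '2024-01-02')]
```

One caveat: standard SQL disallows selecting non-aggregated columns (like status here) alongside the aggregate, so pulling back the full "latest" row usually requires a join back to the table or a window function instead.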
3. QUALIFY ROW_NUMBER() OVER (PARTITION BY ...) = 1
ROW_NUMBER() with QUALIFY is very useful when you want precise control over which duplicate row to keep based on specific ordering (e.g., keeping the earliest or latest entry), or when no aggregate function fits your deduplication criteria. ROW_NUMBER() with QUALIFY offers the most flexibility but at a high resource cost: it requires significant CPU and memory to rank and filter rows, especially for large datasets. Indexes on the partition and order-by columns can help mitigate some of the overhead. Note that QUALIFY is dialect-specific (supported in Snowflake, BigQuery, DuckDB, Teradata, and Databricks, among others); in databases without it, you filter the window function result through a subquery or CTE instead.
Syntax:
SELECT *
FROM table
QUALIFY ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column2) = 1;
Resource Usage:
- CPU: High, as the database must evaluate the ROW_NUMBER() function for every row, partition the data, and assign ranks based on the ordering. ROW_NUMBER() involves sorting within partitions, which can be CPU-heavy, especially for large datasets.
- Memory: High, especially for large datasets, as the database needs to store row numbers and partition data in memory while ranking the rows.
- Disk I/O: High, because the database must scan the dataset to partition and order it, then filter to keep only rows where ROW_NUMBER() = 1. If the partitions are too large to sort in memory, spilling the sort to disk adds further I/O.
Pros:
- Allows complex deduplication logic, especially when determining which row to keep based on specific conditions (like the latest timestamp).
- Flexible and powerful in scenarios where you want to keep the “first” or “last” row for each group.
- Indexing on the partition and ordering columns can significantly improve performance.
Cons:
- Can be more complex and less intuitive than DISTINCT or GROUP BY.
- Potentially less performant for large datasets compared to DISTINCT.
- Resource-intensive: while ROW_NUMBER() is powerful for deduplication, the partitioning and ordering process demands significant CPU and memory, particularly for large datasets.
- Sorting and ranking each partition requires substantial work, especially if the partitions are large or the ordering is complex.
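The following sketch shows the same "latest row per order" deduplication via SQLite in Python (invented table and data). SQLite has no QUALIFY, so the rank filter goes in a subquery; on dialects that support QUALIFY, the syntax shown above works directly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, updated_at TEXT, status TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", "created"),
        (1, "2024-01-03", "shipped"),
        (2, "2024-01-02", "created"),
    ],
)

# Rank rows within each order_id partition, newest first, then keep rank 1.
# This keeps the ENTIRE latest row, something GROUP BY alone can't do cleanly.
rows = con.execute(
    """
    SELECT order_id, updated_at, status
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY updated_at DESC
               ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY order_id
    """
).fetchall()
print(rows)  # [(1, '2024-01-03', 'shipped'), (2, '2024-01-02', 'created')]
```

Flipping ORDER BY updated_at DESC to ASC would keep the earliest row per order instead; that one-keyword control over which duplicate survives is exactly what the other two methods lack.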
Summary for Performance Considerations:
- Indexes: Indexing the columns used in DISTINCT, GROUP BY, or ROW_NUMBER() can greatly improve performance by reducing the need for full table scans and minimizing disk I/O.
- Sorting and Hashing: Both GROUP BY and ROW_NUMBER() often require sorting or hashing to group or rank rows, which consumes CPU, memory, and disk I/O.
- Dataset Size: All three methods may slow down significantly for large datasets due to the increase in comparisons, sorting, and grouping operations.
- Temporary Storage: Complex queries (especially GROUP BY and ROW_NUMBER()) may require temporary storage on disk if the intermediate result set is too large to fit in memory.
In summary, the performance of these methods depends largely on the size of the dataset, the existence of relevant indexes, and the complexity of the deduplication logic. Optimizing indexes and understanding the dataset’s structure are key to improving performance.
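As a closing sanity check: when the goal is simply "one row per key," all three methods agree on the set of surviving keys, and the choice is purely about flexibility and cost. A quick SQLite sketch with invented data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k INTEGER, v TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "a"), (3, "c"), (3, "c")],
)

# Each method reduces the table to one row per key k.
distinct_keys = {r[0] for r in con.execute("SELECT DISTINCT k FROM t")}
group_keys = {r[0] for r in con.execute("SELECT k FROM t GROUP BY k")}
rownum_keys = {r[0] for r in con.execute(
    "SELECT k FROM (SELECT k, ROW_NUMBER() OVER (PARTITION BY k) AS rn FROM t) "
    "WHERE rn = 1"
)}
print(distinct_keys == group_keys == rownum_keys)  # True
```

Where the methods differ is in which row represents each key: DISTINCT gives no control, GROUP BY gives aggregate-level control, and ROW_NUMBER() gives full row-level control.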