How can I remove duplicate rows in SQL Server?

To remove duplicate rows in SQL Server, you can use several methods depending on your specific requirements. Below are the most common and effective approaches:

1. Using `ROW_NUMBER()` with a CTE (Common Table Expression)

This method identifies duplicates using a window function and deletes them while retaining one instance of each duplicate group.

Example:

WITH CTE AS (
    SELECT 
        *,
        ROW_NUMBER() OVER (
            PARTITION BY Column1, Column2, Column3 -- Columns defining duplicates
            ORDER BY (SELECT 0) -- ORDER BY is required here; (SELECT 0) means arbitrary order, so the row kept is not defined
        ) AS row_num
    FROM 
        YourTable
)
DELETE FROM CTE WHERE row_num > 1;

Steps:

PARTITION BY: List the columns that define duplicates (e.g., FirstName, LastName).
ORDER BY: Determines which duplicate to keep (e.g., CreatedDate DESC to keep the newest row).
Delete rows where row_num > 1 (retains the first row in each partition).
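
As a sketch of the ORDER BY choice described in the steps above, the following keeps the newest row for each person. The table `dbo.Customers` and the columns `FirstName`, `LastName`, and `CreatedDate` are hypothetical names chosen for illustration, so adjust them to your schema.

```sql
-- Hypothetical table: dbo.Customers(FirstName, LastName, CreatedDate, ...)
WITH RankedCustomers AS (
    SELECT
        FirstName,
        LastName,
        CreatedDate,
        ROW_NUMBER() OVER (
            PARTITION BY FirstName, LastName  -- rows sharing these values count as duplicates
            ORDER BY CreatedDate DESC         -- newest row in each group gets row_num = 1
        ) AS row_num
    FROM dbo.Customers
)
DELETE FROM RankedCustomers
WHERE row_num > 1;  -- removes every row except the newest in each group
```

If two rows in a group share the same `CreatedDate`, the one that survives is still arbitrary; adding a second ORDER BY column as a tiebreaker makes the choice deterministic.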

2. Using `GROUP BY` and `MIN/MAX` (for tables with a unique identifier)

If your table has a unique column (e.g., an ID), delete duplicates by retaining one ID per group.

Example:

DELETE FROM YourTable
WHERE ID NOT IN (
    SELECT MIN(ID) -- Keep the smallest ID in each duplicate group
    FROM YourTable
    GROUP BY Column1, Column2, Column3 -- Columns defining duplicates
);
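
The heading mentions MAX, while the example above uses MIN. A minimal sketch of the MAX variant follows, with a preview query first. It assumes `ID` is declared NOT NULL, since a NULL `ID` never satisfies `NOT IN` and such rows would be silently skipped.

```sql
-- Preview: rows that would be deleted when keeping the largest ID per group
SELECT *
FROM YourTable
WHERE ID NOT IN (
    SELECT MAX(ID)
    FROM YourTable
    GROUP BY Column1, Column2, Column3
);

-- Same predicate as a DELETE, to run once the preview looks right
DELETE FROM YourTable
WHERE ID NOT IN (
    SELECT MAX(ID)  -- keep the largest ID in each duplicate group
    FROM YourTable
    GROUP BY Column1, Column2, Column3
);
```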

3. Using a Temporary Table

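Copy distinct rows into a temporary table, truncate the original table, and re-insert the deduplicated data. Because `SELECT DISTINCT *` compares every column, this only removes rows that are identical in all columns, and `TRUNCATE TABLE` fails if a foreign key references the table.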

Example:

-- Step 1: Select distinct rows into a temp table
SELECT DISTINCT * INTO #TempTable FROM YourTable;

-- Step 2: Truncate the original table
TRUNCATE TABLE YourTable;

-- Step 3: Re-insert the deduplicated data
INSERT INTO YourTable SELECT * FROM #TempTable;

-- Step 4: Drop the temp table
DROP TABLE #TempTable;
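
Between the TRUNCATE and the re-insert the original table is empty, so a failure at that point would lose data. One way to guard against this, sketched below, is to wrap the steps in a transaction; in SQL Server, `TRUNCATE TABLE` can be rolled back inside a transaction. The sketch assumes the table has no identity column and is not referenced by a foreign key.

```sql
SET XACT_ABORT ON;  -- a run-time error rolls back the whole transaction

BEGIN TRANSACTION;

    -- Copy one instance of each fully identical row
    SELECT DISTINCT * INTO #TempTable FROM YourTable;

    -- Empty the original table (undone if the transaction rolls back)
    TRUNCATE TABLE YourTable;

    -- Put the deduplicated rows back
    INSERT INTO YourTable SELECT * FROM #TempTable;

COMMIT TRANSACTION;

-- Clean up only after the data is safely committed
DROP TABLE #TempTable;
```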

4. Using `DELETE` with a Self-Join

Join the table to itself on the columns that define duplicates, then delete every row that has a matching row with a smaller ID.

Example:

DELETE t1
FROM YourTable t1
JOIN YourTable t2 ON 
    t1.Column1 = t2.Column1 AND 
    t1.Column2 = t2.Column2 AND 
    t1.Column3 = t2.Column3 -- Columns defining duplicates
WHERE t1.ID > t2.ID; -- Keep the row with the smallest ID
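
One caveat with the join above: `=` never matches NULL to NULL, so rows whose duplicate-defining columns contain NULL are not treated as duplicates. If those columns are nullable, a NULL-aware comparison along these lines is one option (a sketch, likely slower than the plain equality join on large tables):

```sql
DELETE t1
FROM YourTable AS t1
JOIN YourTable AS t2
    -- Each pair matches when the values are equal or both are NULL
    ON  (t1.Column1 = t2.Column1 OR (t1.Column1 IS NULL AND t2.Column1 IS NULL))
    AND (t1.Column2 = t2.Column2 OR (t1.Column2 IS NULL AND t2.Column2 IS NULL))
    AND (t1.Column3 = t2.Column3 OR (t1.Column3 IS NULL AND t2.Column3 IS NULL))
WHERE t1.ID > t2.ID;  -- keep the row with the smallest ID
```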

Key Considerations

Back Up and Preview First: Back up the table (or run the delete inside a transaction), and always test with a SELECT statement to verify which rows will be deleted.

   -- Test before deleting
   WITH CTE AS (...)
   SELECT * FROM CTE WHERE row_num > 1;

Indexes: Add indexes on the columns defining duplicates to improve performance for large tables.
Unique Constraints: After deduplication, add a UNIQUE constraint to prevent future duplicates:

   ALTER TABLE YourTable 
   ADD CONSTRAINT UC_YourTable UNIQUE (Column1, Column2, Column3);
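
For the index suggestion above, a minimal sketch might look like the following. The index name `IX_YourTable_DupCols` is a hypothetical name chosen for illustration.

```sql
-- Supporting index for the deduplication queries (hypothetical name)
CREATE NONCLUSTERED INDEX IX_YourTable_DupCols
    ON YourTable (Column1, Column2, Column3);

-- Optional cleanup: the UNIQUE constraint creates its own unique index
-- on the same columns, so this helper index may become redundant.
DROP INDEX IX_YourTable_DupCols ON YourTable;
```

Run the `DROP INDEX` only after the deduplication and the constraint are both in place.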

Example Workflow

Identify Duplicates:

   SELECT 
       Column1, Column2, Column3, 
       COUNT(*) AS DuplicateCount
   FROM YourTable
   GROUP BY Column1, Column2, Column3
   HAVING COUNT(*) > 1;

Delete Duplicates (using the CTE method).
Add Constraints to prevent recurrence.
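
The three steps above could be combined into one guarded script, sketched here under the assumption that everything runs in a single batch. The variables `@expected` and `@deleted` are illustrative names; the script commits only when the number of deleted rows matches the number of surplus rows counted beforehand.

```sql
SET XACT_ABORT ON;  -- a run-time error rolls back the whole transaction
BEGIN TRANSACTION;

-- Step 1: count surplus rows (each group's size minus the one row kept)
DECLARE @expected INT = (
    SELECT COALESCE(SUM(DuplicateCount - 1), 0)
    FROM (
        SELECT COUNT(*) AS DuplicateCount
        FROM YourTable
        GROUP BY Column1, Column2, Column3
        HAVING COUNT(*) > 1
    ) AS d
);

-- Step 2: delete duplicates with the CTE method
WITH CTE AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY Column1, Column2, Column3
            ORDER BY (SELECT 0)  -- arbitrary choice of the row to keep
        ) AS row_num
    FROM YourTable
)
DELETE FROM CTE WHERE row_num > 1;

DECLARE @deleted INT = @@ROWCOUNT;  -- rows removed by the DELETE above

-- Step 3: add the constraint and commit only if the counts agree
IF @deleted = @expected
BEGIN
    ALTER TABLE YourTable
    ADD CONSTRAINT UC_YourTable UNIQUE (Column1, Column2, Column3);
    COMMIT TRANSACTION;
END
ELSE
BEGIN
    ROLLBACK TRANSACTION;
END
```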

Summary

| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| `ROW_NUMBER()` with CTE | Most scenarios, flexible ordering | Retains control over which row to keep | Requires understanding of window functions |
| `GROUP BY` & `MIN/MAX` | Tables with a unique identifier | Simple logic | Requires a unique column (e.g., `ID`) |
| Temporary Table | Rows that are identical in every column | Needs no unique column | Only removes fully identical rows; TRUNCATE resets the identity seed; slower for large tables |
| Self-Join | Tables with unique IDs | Explicit control over deletions | Join condition grows with each duplicate-defining column |

Choose the method that best fits your table structure and requirements!
