How can I remove duplicate rows in SQL Server?

To remove duplicate rows in SQL Server, you can use several methods depending on your specific requirements. Below are the most common and effective approaches:

1. Using ROW_NUMBER() with a CTE (Common Table Expression)

This method identifies duplicates using a window function and deletes them while retaining one instance of each duplicate group.

Example:

WITH CTE AS (
    SELECT 
        *,
        ROW_NUMBER() OVER (
            PARTITION BY Column1, Column2, Column3 -- Columns defining duplicates
            ORDER BY (SELECT 0) -- Optional: Order to decide which row to keep
        ) AS row_num
    FROM 
        YourTable
)
DELETE FROM CTE WHERE row_num > 1;

Steps:

  1. PARTITION BY: List the columns that define duplicates (e.g., FirstName, LastName).
  2. ORDER BY: Determines which duplicate to keep (e.g., CreatedDate DESC to keep the newest row).
  3. Delete rows where row_num > 1 (retains the first row in each partition).

2. Using GROUP BY and MIN/MAX (for tables with a unique identifier)

If your table has a unique column (e.g., an ID), delete duplicates by retaining one ID per group.

Example:

DELETE FROM YourTable
WHERE ID NOT IN (
    SELECT MIN(ID) -- Keep the smallest ID in each duplicate group
    FROM YourTable
    GROUP BY Column1, Column2, Column3 -- Columns defining duplicates
);

3. Using a Temporary Table

Copy distinct rows into a temporary table, truncate the original table, and re-insert the deduplicated data.

Example:

-- Step 1: Select distinct rows into a temp table
SELECT DISTINCT * INTO #TempTable FROM YourTable;

-- Step 2: Truncate the original table
TRUNCATE TABLE YourTable;

-- Step 3: Re-insert the deduplicated data
INSERT INTO YourTable SELECT * FROM #TempTable;

-- Step 4: Drop the temp table
DROP TABLE #TempTable;

4. Using DELETE with a Self-Join

Identify duplicates and delete them using a self-join.

Example:

DELETE t1
FROM YourTable t1
JOIN YourTable t2 ON 
    t1.Column1 = t2.Column1 AND 
    t1.Column2 = t2.Column2 AND 
    t1.Column3 = t2.Column3 -- Columns defining duplicates
WHERE t1.ID > t2.ID; -- Keep the row with the smallest ID

Key Considerations

  1. Backup First: Always test with a SELECT statement to verify which rows will be deleted.
   -- Test before deleting
   WITH CTE AS (...)
   SELECT * FROM CTE WHERE row_num > 1;
  1. Indexes: Add indexes on the columns defining duplicates to improve performance for large tables.
  2. Unique Constraints: After deduplication, add a UNIQUE constraint to prevent future duplicates:
   ALTER TABLE YourTable 
   ADD CONSTRAINT UC_YourTable UNIQUE (Column1, Column2, Column3);

Example Workflow

  1. Identify Duplicates:
   SELECT 
       Column1, Column2, Column3, 
       COUNT(*) AS DuplicateCount
   FROM YourTable
   GROUP BY Column1, Column2, Column3
   HAVING COUNT(*) > 1;
  1. Delete Duplicates (using the CTE method).
  2. Add Constraints to prevent recurrence.

Summary

MethodBest ForProsCons
ROW_NUMBER() with CTEMost scenarios, flexible orderingRetains control over which row to keepRequires understanding of window functions
GROUP BY & MIN/MAXTables with a unique identifierSimple logicRequires a unique column (e.g., ID)
Temporary TableSimple deduplicationWorks for all duplicate typesResets identity columns, slower for large tables
Self-JoinTables with unique IDsExplicit control over deletionsComplex for multiple duplicate columns

Choose the method that best fits your table structure and requirements!

Leave a Reply

Your email address will not be published. Required fields are marked *