How to find duplicate values in a SQL table ?

To find duplicate values in a SQL table, you need to identify rows where specific columns (or combinations of columns) have the same values. Below are detailed methods and examples for detecting duplicates:

1. Basic Method: GROUP BY and HAVING

Identify duplicates based on one or more columns using aggregation.

Example 1: Find Duplicates in a Single Column

SELECT column_name, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;

Example 2: Find Duplicates Across Multiple Columns

SELECT column1, column2, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;

2. List All Duplicate Rows

To see all rows that are duplicates (not just the grouped keys), join the table with the grouped duplicates.

Example:

SELECT t.*
FROM table_name t
INNER JOIN (
  SELECT column1, column2
  FROM table_name
  GROUP BY column1, column2
  HAVING COUNT(*) > 1
) dup
ON t.column1 = dup.column1 AND t.column2 = dup.column2;

3. Use ROW_NUMBER() to Identify Duplicates

Assign a row number to duplicates and filter them. This is useful for deleting duplicates later.

Example:

WITH cte AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY column1, column2
      ORDER BY (SELECT NULL)
    ) AS row_num
  FROM table_name
)
SELECT * FROM cte WHERE row_num > 1;

4. Find Entire Duplicate Rows

Check for rows where all column values are identical.

Example:

SELECT *
FROM table_name
GROUP BY column1, column2, column3, ... -- List all columns
HAVING COUNT(*) > 1;

5. Delete Duplicates (Optional)

Once duplicates are identified, you can delete them. Use a CTE or subquery for safety.

Example:

WITH cte AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY column1, column2
      ORDER BY id -- Use a unique identifier to decide which row to keep
    ) AS row_num
  FROM table_name
)
DELETE FROM cte WHERE row_num > 1;

Key Notes

  • Columns Matter: Duplicates are defined by the columns you include in GROUP BY or PARTITION BY.
  • NULL Values: NULL = NULL is treated as not equal in SQL. Adjust your logic if NULLs are present.
  • Performance: Use indexes on columns involved in grouping/partitioning for large tables.
  • Backup: Always back up data before deleting duplicates.

Example Walkthrough

Assume a table employees with duplicates in email:

idnameemail
1Alicealice@mail.com
2Bobbob@mail.com
3Alicealice@mail.com

Step 1: Identify Duplicate Emails

SELECT email, COUNT(*) AS duplicate_count
FROM employees
GROUP BY email
HAVING COUNT(*) > 1;

Result:

emailduplicate_count
alice@mail.com2

Step 2: List All Duplicate Rows

SELECT * FROM employees
WHERE email IN (
  SELECT email
  FROM employees
  GROUP BY email
  HAVING COUNT(*) > 1
);

Result:

idnameemail
1Alicealice@mail.com
3Alicealice@mail.com

Summary

  • Use GROUP BY and HAVING to find duplicates.
  • Use ROW_NUMBER() to isolate duplicates for deletion.
  • Adjust logic based on your definition of a “duplicate” (specific columns vs. entire rows).

Leave a Reply

Your email address will not be published. Required fields are marked *