Wednesday, March 24, 2010

Removing full duplicates by using row_number() function and common table expression

This blog article is inspired by another very informative one which explains really well the meaning of the new row_number() function introduced since SQL2005. Since the original article did not give a full answer to how to delete multiple fully duplicated rows in a table, I am putting my code here for the test.

The good thing about row_number() function is that it can be based on partition of columns. New row number count starts for each partition of full duplicates and those rows with row number bigger than 1 can be easily spotted and removed. Here is the example that explains everything I am talking about here.

First of all, create a table in a test db

USE [test_db]

GO

CREATE TABLE [dbo].[Employee](
[EMPID] [int] NULL,
[FNAME] [varchar](50) NULL,
[LNAME] [varchar](50) NULL
) ON [PRIMARY]


Then let's insert many duplicated rows:

INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021110, 'MICHAEL', 'POLAND')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021110, 'MICHAEL', 'POLAND')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021115, 'JIM', 'KENNEDY')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021115, 'JIM', 'KENNEDY')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021115, 'JIM', 'KENNEDY')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2121000, 'JAMES', 'SMITH')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2121000, 'JAMES', 'SMITH')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2011111, 'ADAM', 'ACKERMAN')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (1021710, 'MARIAH', 'MANDEZ')
GO

If you selete everything from this table now, the following should be there:

EMPID FNAME LNAME
1021710 MARIAH MANDEZ
2011111 ADAM ACKERMAN
2021110 MICHAEL POLAND
2021110 MICHAEL POLAND
2021115 JIM KENNEDY
2021115 JIM KENNEDY
2021115 JIM KENNEDY
2121000 JAMES SMITH
2121000 JAMES SMITH
3015670 MARTHA LEDERER
3015670 MARTHA LEDERER
3015670 MARTHA LEDERER
3015670 MARTHA LEDERER

And here comes the de-dup script with the help of common table expression:

with temp_employee as

( select row_number() over (partition by EMPID, FNAME, LNAME order by EMPID) as rowid, * from EMPLOYEE)
delete temp_employee where rowid > 1


Hope this article is illustrative enough.



No comments:

Post a Comment