What is the best way to remove duplicate rows from a fairly large
SQL Server table (i.e. 300,000+ rows)?
The rows, of course, will not be perfect duplicates because of the existence of the
RowID identity field.
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
~ Asked on 2008-08-20 21:51:29
Assuming no nulls, you GROUP BY the unique columns and take MIN (or MAX) of RowId as the row to keep. Then delete everything that didn't have a matching row id:

DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
    SELECT MIN(RowId) AS RowId, Col1, Col2, Col3
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows
    ON MyTable.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace MIN(RowId) with:

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
~ Answered on 2008-08-20 22:00:00
Another possible way of doing this is
; --Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY (SELECT 0)) RN
    FROM #MyTable
)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above, as it is arbitrary which row to preserve in the event of a tie. To preserve the latest one in RowID order, for example, you could use ORDER BY RowID DESC.
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case, however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate. The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be the lack of a useful index on the partitioning columns, and relatively fewer groups with relatively more duplicates in each group.
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then
TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
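For reference, a minimal sketch of that insert-then-TRUNCATE approach, assuming the MyTable layout from the question (this illustration is mine, not part of the original answer):

```sql
-- Keep one surviving row per duplicate group in a temp table.
SELECT MIN(RowId) AS RowId, Col1, Col2, Col3
INTO #KeepRows
FROM MyTable
GROUP BY Col1, Col2, Col3;

TRUNCATE TABLE MyTable;              -- minimally logged, unlike a mass DELETE

SET IDENTITY_INSERT MyTable ON;      -- preserve the original RowId values
INSERT INTO MyTable (RowId, Col1, Col2, Col3)
SELECT RowId, Col1, Col2, Col3 FROM #KeepRows;
SET IDENTITY_INSERT MyTable OFF;

DROP TABLE #KeepRows;
```

Note that TRUNCATE requires no foreign keys referencing the table, and the whole sequence should run inside a transaction or maintenance window since the table is briefly empty.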
~ Answered on 2010-09-29 14:52:32
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
  AND dupes.secondDupField = fullTable.secondDupField
  AND dupes.uniqueField > fullTable.uniqueField
~ Answered on 2008-08-20 21:53:14
The following query is useful for deleting duplicate rows. The table in this example has ID as an identity column, and the columns with duplicate data are Column1, Column2 and Column3.

DELETE FROM TableName
WHERE ID NOT IN (
    SELECT MAX(ID)
    FROM TableName
    GROUP BY Column1, Column2, Column3
    /* Even if ID is not null-able, SQL Server treats MAX(ID) as potentially
       nullable. Because of the semantics of NOT IN (NULL), including the
       clause below can simplify the plan. */
    HAVING MAX(ID) IS NOT NULL
)
The following script shows the use of GROUP BY, HAVING and ORDER BY in one query, and returns each duplicated column value with its count.

SELECT YourColumnName, COUNT(*) AS TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
~ Answered on 2011-11-23 15:32:19
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
  and t1.rowid > t2.rowid

delete from table t1
using table t2
where t1.columnA = t2.columnA
  and t1.rowid > t2.rowid
~ Answered on 2010-09-30 02:35:06
DELETE LU
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY rowid DESC) [Row]
    FROM mytable
) LU
WHERE [Row] > 1
~ Answered on 2014-05-21 07:54:47
This will delete duplicate rows, keeping only the first row in each group:

DELETE FROM Mytable
WHERE RowID NOT IN (
    SELECT MIN(RowID)
    FROM Mytable
    GROUP BY Col1, Col2, Col3
)
~ Answered on 2013-09-10 13:07:06
I would prefer a CTE for deleting duplicate rows from a SQL Server table. I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:

WITH CTE AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1, col2, col3) AS RN
    FROM MyTable
)
DELETE FROM CTE WHERE RN <> 1
Without keeping the original:

WITH CTE AS (
    SELECT *, R = RANK() OVER (ORDER BY col1, col2, col3)
    FROM MyTable
)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*) > 1)
~ Answered on 2015-05-19 14:35:26
To Fetch Duplicate Rows:
SELECT name, email, COUNT(*)
FROM users
GROUP BY name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM users
    GROUP BY name, email
);
~ Answered on 2016-12-29 10:31:23
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
~ Answered on 2013-02-05 21:44:52
I prefer the subquery / HAVING COUNT(*) > 1 solution to the inner join because I found it easier to read, and it was very easy to turn into a SELECT statement to verify what would be deleted before running it.

DELETE FROM table1
WHERE id IN (
    SELECT MIN(id)
    FROM table1
    GROUP BY col1, col2, col3
    -- could add a WHERE clause here to further filter
    HAVING COUNT(*) > 1
)
~ Answered on 2014-03-01 07:40:18
SELECT DISTINCT * INTO tempdb.dbo.tmpTable FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
~ Answered on 2012-10-10 11:17:38
I thought I'd share my solution since it works under special circumstances. In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).

begin transaction

-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2

-- insert distinct values into temp
insert into #temp
select distinct * from tableName

-- delete from source
delete from tableName

-- insert into source from temp
insert into tableName
select * from #temp

rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!

PS: when working on things like this I always use a transaction. This not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...
~ Answered on 2014-01-27 12:20:09
This query showed very good performance for me:
DELETE tbl
FROM MyTable tbl
WHERE EXISTS (
    SELECT *
    FROM MyTable tbl2
    WHERE tbl2.SameValue = tbl.SameValue
      AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)

It deleted 1M rows in a little more than 30 seconds from a table of 2M (50% duplicates).
~ Answered on 2014-12-10 19:36:34
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
    select min(PrimaryKey) as PrimaryKey, UniqueColumn1, UniqueColumn2
    from dbo.DuplicatesTable
    group by UniqueColumn1, UniqueColumn2
    having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on d.PrimaryKey > cte.PrimaryKey
    and d.UniqueColumn1 = cte.UniqueColumn1
    and d.UniqueColumn2 = cte.UniqueColumn2;
~ Answered on 2014-11-13 16:20:35
Yet another easy solution can be found at the link pasted here. This one is easy to grasp and seems to be effective for most similar problems. It is for SQL Server, but the concept it uses is more than applicable elsewhere.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID   ATTENDANCE_DATE
A001          2011-01-01
A001          2011-01-01
A002          2011-01-01
A002          2011-01-01
A002          2011-01-01
A003          2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE
WHERE AUTOID NOT IN (
    SELECT MIN(AUTOID)
    FROM dbo.ATTENDANCE
    GROUP BY EMPLOYEE_ID, ATTENDANCE_DATE
)
~ Answered on 2013-08-06 17:14:34
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable
WHERE RowID NOT IN (
    SELECT (
        SELECT TOP 1 RowID
        FROM MyTable mt2
        WHERE mt2.Col1 = mt.Col1
          AND mt2.Col2 = mt.Col2
          AND mt2.Col3 = mt.Col3
    )
    FROM MyTable mt
)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
~ Answered on 2008-08-20 22:27:53
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp table solution and two MySQL examples.
In the future, are you going to prevent it at the database level, or from an application perspective? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)
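To make database-level prevention concrete, here is a minimal sketch using the hypothetical users(name, email) table from another answer in this thread (my illustration, not part of the original answer):

```sql
-- Once existing duplicates are cleaned up, a unique constraint
-- stops new ones from ever being inserted.
ALTER TABLE users
ADD CONSTRAINT UQ_users_name_email UNIQUE (name, email);
-- Subsequent duplicate inserts now fail with a constraint violation.
```

This only works when the candidate key columns fit within the index size limit; the wide-column case is discussed further down in this thread.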
~ Answered on 2008-08-20 21:58:00
WITH tblTemp as (
    SELECT ROW_NUMBER() Over (PARTITION BY Name, Department ORDER BY Name) As RowNumber, *
    FROM <table_name>
)
DELETE FROM tblTemp where RowNumber > 1
~ Answered on 2015-07-23 11:42:20
I had a table where I needed to preserve non-duplicate rows. I'm not sure on the speed or efficiency.
DELETE FROM myTable
WHERE RowID IN (
    SELECT MIN(RowID) AS IDNo
    FROM myTable
    GROUP BY Col1, Col2, Col3
    HAVING COUNT(*) = 2
)
~ Answered on 2009-12-11 13:47:21
This is the easiest way to delete duplicate records:

DELETE FROM tblemp
WHERE id IN (
    SELECT MIN(id)
    FROM tblemp
    GROUP BY title
    HAVING COUNT(id) > 1
)
~ Answered on 2016-09-28 05:26:54
The other way is to create a new table with the same fields and a unique index, then move all data from the old table to the new table. SQL Server automatically ignores duplicate values (there is also an option controlling what to do when a duplicate value occurs: ignore, interrupt, etc.), so we end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables you may use DTS (an SSIS package to import/export data) in order to transfer all data rapidly to your new uniquely indexed table. For 7 million rows it takes just a few minutes.
~ Answered on 2013-09-18 06:36:12
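A hedged sketch of that technique, using the question's table layout. The "ignore" option the answer mentions corresponds to the IGNORE_DUP_KEY index setting. Note that Col2 (varchar(2048)) is too wide to index directly, so this sketch dedupes on Col1 and Col3 only; adjust the key columns to your real duplicate definition.

```sql
-- Staging table with the same shape as MyTable.
CREATE TABLE MyTableDeduped (
    RowID int NOT NULL,
    Col1  varchar(20) NOT NULL,
    Col2  varchar(2048) NOT NULL,
    Col3  tinyint NOT NULL
);

-- IGNORE_DUP_KEY = ON makes duplicate key inserts a warning, not an error.
CREATE UNIQUE INDEX UX_MyTableDeduped
    ON MyTableDeduped (Col1, Col3)
    WITH (IGNORE_DUP_KEY = ON);

-- Duplicate rows are silently skipped during the copy.
INSERT INTO MyTableDeduped (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3 FROM MyTable;
```

After verifying the copy, swap the tables (or copy back) and drop the index if it is no longer wanted.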
Create a new blank table with the same structure.

Execute a query like this:

INSERT INTO tc_category1
SELECT * FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1

Then execute this query:

INSERT INTO tc_category1
SELECT * FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
~ Answered on 2009-05-08 13:06:42
The query below can delete duplicate records based on a single column or on multiple columns. This example deletes based on two columns. The table name is testing and the column names are empno and empname.

DELETE FROM testing
WHERE empno NOT IN (
    SELECT empno
    FROM (
        SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno) AS [ItemNumber]
        FROM testing
    ) a
    WHERE ItemNumber > 1
)
OR empname NOT IN (
    SELECT empname
    FROM (
        SELECT empname, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno) AS [ItemNumber]
        FROM testing
    ) a
    WHERE ItemNumber > 1
)
~ Answered on 2012-02-08 12:06:52
Another way of doing this:

DELETE A
FROM TABLE A, TABLE B
WHERE A.COL1 = B.COL1
  AND A.COL2 = B.COL2
  AND A.UNIQUEFIELD > B.UNIQUEFIELD
~ Answered on 2016-02-02 06:59:27
I would mention this approach as well, as it can be helpful and works in all SQL Server versions. Pretty often there are only one or two duplicates, and the Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
~ Answered on 2013-01-30 19:45:37
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
AFTER INSERT, UPDATE
AS
    DECLARE @cnt AS INT

    SELECT @cnt = Count(*)
    FROM stories
    INNER JOIN inserted ON (
        stories.story = inserted.story
        AND stories.story_id != inserted.story_id
    )

    IF @cnt > 0
    BEGIN
        RAISERROR('plagiarism detected', 16, 1)
        ROLLBACK TRANSACTION
    END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
~ Answered on 2008-08-20 22:53:12
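A possible workaround for the 900-byte index limit mentioned above (my suggestion, not from the original answer): index a hash of the wide column instead of the column itself. HASHBYTES is deterministic, so the computed column can be persisted and uniquely indexed; note that before SQL Server 2016 HASHBYTES only accepts inputs up to 8000 bytes.

```sql
-- Sketch, reusing the hypothetical stories table from the trigger example.
-- A persisted computed column holds a 32-byte hash of the wide varchar,
-- which is small enough to participate in a unique index.
ALTER TABLE stories
ADD story_hash AS CAST(HASHBYTES('SHA2_256', story) AS varbinary(32)) PERSISTED;

CREATE UNIQUE INDEX UX_stories_hash ON stories (story_hash);
-- SHA-256 collisions are vanishingly unlikely in practice, so this
-- effectively enforces uniqueness of the full story text.
```

Unlike the trigger, this enforces uniqueness without scanning the table on every insert.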
DELETE FROM table_name T1
WHERE rowid > (
    SELECT min(rowid)
    FROM table_name T2
    WHERE T1.column_name = T2.column_name
);
~ Answered on 2013-10-03 06:18:06
DELETE FROM MyTable
WHERE RowID NOT IN (
    SELECT MIN(RowID)
    FROM MyTable
    GROUP BY Col1, Col2, Col3
);
~ Answered on 2014-01-02 15:27:26
If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/

with MYCTE as (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY DuplicateKey1, DuplicateKey2 -- optional
        ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
    ) RN
    FROM MyTable
)
DELETE FROM MYCTE WHERE RN > 1
~ Answered on 2015-01-01 15:32:03
CREATE TABLE car (Id int identity(1,1), PersonId int, CarId int)

INSERT INTO car (PersonId, CarId)
VALUES (1,2), (1,3), (1,2), (2,4)

--SELECT * FROM car

;WITH CTE as (
    SELECT ROW_NUMBER() over (PARTITION BY personid, carid order by personid, carid) as rn, Id, PersonID, CarId
    from car
)
DELETE FROM car
where Id in (SELECT Id FROM CTE WHERE rn > 1)
~ Answered on 2012-07-11 11:46:52
Sometimes a soft delete mechanism is used where a date is recorded to indicate the deleted date. In this case an
UPDATE statement may be used to update this field based on duplicate entries.
UPDATE MY_TABLE
SET DELETED = getDate()
WHERE TABLE_ID IN (
    SELECT x.TABLE_ID
    FROM MY_TABLE x
    JOIN (
        SELECT min(TABLE_ID) id, COL_1, COL_2, COL_3
        FROM MY_TABLE d
        GROUP BY d.COL_1, d.COL_2, d.COL_3
        HAVING count(*) > 1
    ) AS d
        ON d.COL_1 = x.COL_1
        AND d.COL_2 = x.COL_2
        AND d.COL_3 = x.COL_3
        AND d.TABLE_ID <> x.TABLE_ID
    /* WHERE x.COL_4 <> 'D' -- Additional filter */
)
This method has served me well for fairly moderate tables containing ~30 million rows with high and low amounts of duplications.
~ Answered on 2016-06-07 01:29:07
I know that this question has already been answered, but I've created a pretty useful stored procedure which builds a dynamic delete statement for a table's duplicates:
CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int = 1
AS
BEGIN
    SET NOCOUNT ON;

    IF (OBJECT_ID('tempdb..#tableMatrix') is not null)
        DROP TABLE #tableMatrix;

    SELECT ROW_NUMBER() OVER (ORDER BY name) as rn, name
    INTO #tableMatrix
    FROM sys.columns
    WHERE [object_id] = object_id(@tableName)
    ORDER BY name

    DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)

    IF (@MaxRow is null)
        RAISERROR ('I wasn''t able to find any columns for this table!', 16, 1)
    ELSE
    BEGIN
        DECLARE @i int = 1
        DECLARE @Columns varchar(max) = '';
        WHILE (@i <= @MaxRow)
        BEGIN
            SET @Columns = @Columns + (SELECT '[' + name + '],' from #tableMatrix where rn = @i)
            SET @i = @i + 1;
        END

        --- delete last comma
        SET @Columns = LEFT(@Columns, LEN(@Columns) - 1)

        DECLARE @Sql nvarchar(max) = '
        WITH cteRowsToDelte AS (
            SELECT ROW_NUMBER() OVER (PARTITION BY ' + @Columns + ' ORDER BY (SELECT 0)) as rowNumber, *
            FROM ' + @tableName + '
        )
        DELETE FROM cteRowsToDelte
        WHERE rowNumber > 1;
        '
        SET NOCOUNT OFF;

        IF (@DebugMode = 1)
            SELECT @Sql
        ELSE
            EXEC sp_executesql @Sql
    END
END
So if you create table like that:
IF (OBJECT_ID('MyLitleTable') is not null)
    DROP TABLE MyLitleTable

CREATE TABLE MyLitleTable (
    A Varchar(10),
    B money,
    C int
)

INSERT INTO MyLitleTable
VALUES ('ABC', 100, 1),
       ('ABC', 100, 1), -- only this row should be deleted
       ('ABC', 101, 1),
       ('ABC', 100, 2),
       ('ABCD', 100, 1)

exec sp_DeleteDuplicate 'MyLitleTable', 0
It will delete all duplicates from your table. If you run it without the second parameter, it will return the SQL statement to run instead of executing it.
If you need to exclude any of the columns, just run it in debug mode, get the code, and modify it however you like.
~ Answered on 2017-04-13 08:49:17
alter table MyTable add sno int identity(1,1)

delete from MyTable
where sno in (
    select sno
    from (
        select *, RANK() OVER (PARTITION BY RowID, Col3 ORDER BY sno DESC) rank
        From MyTable
    ) T
    where rank > 1
)

alter table MyTable drop column sno
~ Answered on 2015-12-16 06:11:51
If all the columns in the duplicate rows are the same, then the query below can be used to delete the duplicate records.

SELECT DISTINCT * INTO #TemNewTable FROM #OriginalTable
TRUNCATE TABLE #OriginalTable
INSERT INTO #OriginalTable SELECT * FROM #TemNewTable
DROP TABLE #TemNewTable
~ Answered on 2018-10-29 14:20:14
Now let's look at an elasticalsearch table that has duplicated rows, where Id is the unique identity field. We know that if several rows share some group criteria, we can keep one Id per group and delete the other rows in that group. My example shows this criteria.
Many cases in this thread are in a similar state to mine. Just change the target group criteria according to your case when deleting repeated (duplicated) rows.

DELETE FROM elasticalsearch
WHERE Id NOT IN (
    SELECT min(Id)
    FROM elasticalsearch
    GROUP BY FirmId, FilterSearchString
)
~ Answered on 2016-01-11 20:31:40
I think this would be helpful. Here, ROW_NUMBER() OVER (PARTITION BY res1.Title ORDER BY res1.Id) as num is used to differentiate the duplicate rows.

delete res2
FROM (
    SELECT res1.*, ROW_NUMBER() OVER (PARTITION BY res1.Title ORDER BY res1.Id) as num
    FROM (select * from [dbo].[tbl_countries]) as res1
) as res2
WHERE res2.num > 1
~ Answered on 2018-06-10 09:27:52
For the table structure
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
The query for removing duplicates:
DELETE t1
FROM MyTable t1
INNER JOIN MyTable t2
    ON t1.RowID > t2.RowID
    AND t1.Col1 = t2.Col1
    AND t1.Col2 = t2.Col2
    AND t1.Col3 = t2.Col3;
I am assuming that RowID is a kind of auto-increment column, and the rest of the columns have duplicate values.
~ Answered on 2020-08-06 04:01:39