Tuesday, November 22, 2011

How to Dedupe a SQL Database Table

We had some log files get parsed twice this morning and accidentally fed the same data in to our table twice due to a race condition.  Removing the extra records was fairly simple, I started out simply trying to find the duplicate rows and expanded from there to remove the extra records.

calllogid is the primary key in the table
uicallid is a row that _should_ be unique but is not because of data from a legacy system being imported in to the table

The logs that were imported twice were identical in every way except for the calllogid, and I chose to match on uicallid because that is a unique value that should never be duplicated.

I started out trying to get the calllogid of every record that was duplicated and I used a subquery to do this:
count(uicallid) as count,
( SELECT TOP 1 CallLogID from dbo.tablename as sub where sub.uicallid = main.uicallid) as id 
FROM dbo.tablename as main 
group by main.uicallid
having count(main.uicallid) > 1 

Manually checking the output I was able to see how many times each record was duplicated and each record from this morning showed up twice, just as i expected.  Once I was confident I had the correct data selected I asked a coworker for help and I was taught you could do a delete where fieldname IN (<a subquery>).  

Now the final query looks like this:

DELETE FROM dbo.tablename
WHERE CallLogID IN ( select
( SELECT TOP 1 CallLogID from dbo.tablename as sub where sub.uicallid = main.uicallid) as id 
FROM dbo.tablename as main 
group by main.uicallid
having count(main.uicallid) > 1 )

The query to delete the duplicate records ran on the 5.5 million row 1GB table in 15 seconds, and it took 3 seconds to do the select to find them.  Not tooo bad.  If there was more than one duplicate entry this query would only delete one of them, so the query would need to be run multiple times to delete multiple entries.

No comments:

Post a Comment