Identify "duplicated" rows

**mkallover** · 01-15-2013, 09:26 AM

This might be a pretty common query but I'm not really sure how to go about doing it.

My company has a table that identifies doctors and their forms of ID. Because of some faulty load programs, we have an issue where a single doctor may have multiple rows on the table. I'm trying to identify and correct those situations. Here is what the issue looks like:

| InternalID | ID1   | ID2    | Name           |
|------------|-------|--------|----------------|
| 1          | 12345 | AA1234 | Mike Smith, MD |
| 2          | 12345 | 12345  | Mike Smith, MD |

That second row should not be there but because of the load program, we have millions of rows on our table like that.

So what I would like my query to do is identify all of the cases where there is one row that is correct (InternalID = 1) and one row that is incorrect (InternalID = 2). The eventual goal is to run a job that deletes the incorrect rows. There are millions of rows on the table, so optimally the query should return only the InternalID of the incorrect rows.

I'd appreciate any ideas you all might have.

**alansidman** · 01-15-2013, 09:30 AM

Have you attempted to resolve this using the Duplicates Query Wizard built into Access? That would be my first step in this project.

**orange** · 01-15-2013, 11:56 AM

I agree with Alan, and hope you have a backup in a safe place.
One of the things that jumps out to me is WHY/HOW did duplicates get added? There appears to be a major glitch in some processes or procedures.
Do you have a data model? Have you NORMALIZED your tables?

"There are millions of rows on the table"

Really? How do you get millions of rows into this situation?

**alansidman** · 01-15-2013, 12:06 PM

Just recalled a video that will also work for you.

http://www.datapigtechnologies.com/f...teproblem.html

**mkallover** · 01-15-2013, 12:14 PM

I agree with Alan, and hope you have a backup in a safe place. - It is backed up. I am querying a table that sits in a data warehouse.

One of the things that jumps out to me is WHY/HOW did duplicates get added? There appears to be a major glitch in some processes or procedures. - I think I addressed this in my original post, but to clarify: We have programs that load the table based on multiple source files we receive. There are faults in the load programs that are being corrected in another effort, but I am also trying to identify the impacted rows.

Do you have a data model? Have you NORMALIZED your tables? - No, the table is not normalized and unfortunately I have no power to do anything about that.

"There are millions of rows on the table"

Really? How do you get millions of rows into this situation? - When you're dealing with a table that has records for every single doctor that practices in the US, you can build up a lot of rows quickly when you have a flaw. I was not involved with the creation of the table or the load programs but now I am part of a process to try and clean it up.

**mkallover** · 01-15-2013, 12:19 PM

To Alan:

I've tried the duplicate wizard but I'm still having a tough time identifying only the incorrect rows and not returning the "good" rows.

**orange** · 01-15-2013, 12:38 PM

Can you tell us in plain English what makes the "good" rows? Do you have a definition of what a unique/good record is?

If you are dealing with millions of records, multiple sources, I hope someone is looking at Table design and has a plan.
It sounds like there may be a lot of independent work efforts underway.

**mkallover** · 01-15-2013, 12:45 PM

A "good" row would be one where ID1 <> ID2. The rows where ID1 = ID2 are the ones caused by the faulty load programs that are being corrected.

I think I can first identify all of the duplicate rows using the query wizard and then use an additional query to pull out the rows where ID1 = ID2 and that will give me my population of incorrect rows.
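
For anyone landing on this thread later, here is a rough sketch of how those two steps could be folded into a single query. It runs against an in-memory SQLite database from Python rather than Access, purely so it can be tried anywhere. The table name `doctors`, the choice of ID1 as the key that defines a "duplicate", and sample rows 3 and 4 are assumptions made for illustration; only rows 1 and 2 come from the example above.

```python
import sqlite3

# Throwaway in-memory database; "doctors" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doctors ("
    "InternalID INTEGER PRIMARY KEY, ID1 TEXT, ID2 TEXT, Name TEXT)"
)
conn.executemany(
    "INSERT INTO doctors VALUES (?, ?, ?, ?)",
    [
        (1, "12345", "AA1234", "Mike Smith, MD"),  # good row from the thread
        (2, "12345", "12345", "Mike Smith, MD"),   # bad row from the thread
        (3, "67890", "67890", "Jane Doe, MD"),     # made up: ID1 = ID2, no good twin
        (4, "24680", "BB2468", "Ann Lee, MD"),     # made up: good row, no duplicate
    ],
)

# A row is reported only if it looks bad (ID1 = ID2) AND a good row
# (ID1 <> ID2) exists for the same ID1. Assumes ID1 identifies the doctor.
FIND_BAD_ROWS = """
SELECT bad.InternalID
FROM doctors AS bad
WHERE bad.ID1 = bad.ID2
  AND EXISTS (
      SELECT 1
      FROM doctors AS good
      WHERE good.ID1 = bad.ID1
        AND good.ID1 <> good.ID2
  )
ORDER BY bad.InternalID
"""

bad_ids = [row[0] for row in conn.execute(FIND_BAD_ROWS)]
print(bad_ids)  # prints [2]
```

Row 3 is deliberately left out of the result: it has ID1 = ID2 but no correct row to fall back on, so deleting it would remove the only record for that doctor. The query itself uses only a self-join through `EXISTS`, which should carry over to Access or the warehouse's SQL dialect with little change, though that has not been verified here. With millions of rows, an index on ID1 would probably matter.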
