De-duplicating new data

**BassettProvidentia** · 03-01-2011, 03:38 AM

Good morning,

I have a prospective client that is interested in having a User Defined Function written that will prevent duplicated data from being appended to his database.

The option I am proposing is that each of the new rows is appended to the master database where the UDF returns false.

The UDF, each interation, will go through each row in the master database and apply another UDF to field in question. If the result is 'false', the outer UDF will return false to query.

The database is quite large and I am keen to ensure that the solution I offer is as efficient as possible. Is there a better way to achieve this?

**orange** · 03-02-2011, 03:31 PM

What exactly constitutes duplicate data in this case?
Why not have a PrimaryKey on the Table involved - Access will not allow duplicates then.
How would an "append" occur? Hopefully access to tables is controlled by Forms in the application.

You could use DCount (in a BeforeUpdate event) and check for 0 records then allow append, if >0 a potential dup so bypass an append

**BassettProvidentia** · 03-03-2011, 01:38 AM

Thanks for replying orange.

The fields are a list of organisation names that come from an external source. Their is not a unique identifier within the new incoming data, so I am writing a VBA function that searches the "master table" for a similar name and reports back what the matching record is.

There are 500,000 master records and 1000 new rocords, so my function is, which searches 500,000 records is being run 1000 times. Although it is working fairly well at this stage, it is quite slow and I was wondering if there was an easier (for the machine's resources) was of doing things.

The append is being run from an append query, as the data is not input manually but somes from an external datasource. Basically, the row from the list of new companies will be added if the UDF DeDupe() returns null.

Can you think of anything quicker?

**orange** · 03-03-2011, 07:39 AM

So, not knowing anything about your business/application, I have to ask what exactly is a Master table? Master of what and to whom? If you maintain some sort of authoritative list of 500,000 records, I don't see why you can't assign a unique number within your scope.
In any case, I would build a Table of Master names, and using whatever criteria you have, Name/PostalCode or whatever, establish a list of unique records. I would hold that table -almost sacred. Then, using whatever algorithm you have for the uniqueness of names, compare incoming records with the current Master.

You could use a Dcount to check for duplicates, such that 0 means we don't have it and a 1 indicates - already present.

What do you use these Companies for? Don't you need addresses, PostalCode, etc for your needs?

A little more background on the business needs would help in any advice. I would think a list of 500,000 unique companies would be a significant list that others may have an interest in - if it is maintained and somewhat comprehensive.

**BassettProvidentia** · 03-03-2011, 07:59 AM

Thanks for the reply.

I am writing the function for a client.

They have a table within their database that contains a list of company names (and I am sure they have primary keys, etc).

Every month or so, they receive a list of new company names to add to the database. The list comes from external sources, and often contains names that are simular, but not an exact match, to names in the database.

Example:

JimBob's Widgets (Ltd)
JimBobs Widgets Limited

I am writing a VBA function that is used within an append query, to search through the database using a supplied string for "fuzzy" matches.

The concept is working well now, but about 40% of the matches that are found are not matches at all. I want to reduce this.

I am using an SQL query from a DAO object, which uses a like statement.

At the moment I have it like this:

SELECT name FROM table WHERE name LIKE "*Jimbobs* Widgets*";

An example like that works well, but if the name is like:

A J H Ltd

The result set returns all sorts of unrelated records. Is there an wildcard that I can use within the LIKE clause to represent a space or nothing?

**orange** · 03-03-2011, 09:48 AM

From my previous post

What do you use these Companies for? Don't you need addresses, PostalCode, etc for your needs?

A little more background on the business needs would help in any advice. I would think a list of 500,000 unique companies would be a significant list that others may have an interest in - if it is maintained and somewhat comprehensive.

I'm not sure I understand your "fuzzy" match approach.
You may want to investigate the following algorithms/links
to refine your test for "fuzzily similar names".

1) SOUNDEX http://en.wikipedia.org/wiki/Soundex
2) Levenstein distance http://en.wikipedia.org/wiki/Levenshtein_distance
3) metaphone http://en.wikipedia.org/wiki/Metaphone

As for wildcards and other string approaches, you could investigate
"regular expressions". http://en.wikipedia.org/wiki/Regular_expression

Good luck with whatever you decide.

**BassettProvidentia** · 03-03-2011, 10:18 AM

Thanks for the links. They have certainly given me something to work with.

**dskulman** · 03-03-2011, 02:40 PM

In the table that is being appended to, make all the fields you don't want duplicated primary keys (you can have multiple keys).

Then, an append query will NOT append anything that is duplicated.

Originally Posted by BassettProvidentia

Good morning,

I have a prospective client that is interested in having a User Defined Function written that will prevent duplicated data from being appended to his database.

The option I am proposing is that each of the new rows is appended to the master database where the UDF returns false.

The UDF, each interation, will go through each row in the master database and apply another UDF to field in question. If the result is 'false', the outer UDF will return false to query.

The database is quite large and I am keen to ensure that the solution I offer is as efficient as possible. Is there a better way to achieve this?

De-duplicating new data

Thread Tools

De-duplicating new data

Appending non-duplicated records

Similar Threads

Duplicating numbers....why??

duplicating records

labels duplicating themselves

Report duplicating itself many many times

duplicating and editing the record

Posting Permissions