"Data Type" / "Data Value" Screening Method

**skydivetom** · 03-04-2021, 08:19 AM

Experts:

I need some general recommendation for *screening* AND *validating* data values across multiple tables.

Allow me to provide some background first:
- First of all, the attached database includes only **EXAMPLE DATA**. However, I attempted to mimic the essence of the actual data.
- The sample DB includes only 4 source tables (Source1, Source2, Source3, Source4). My actual data set includes dozens of tables.
- Finally, the sample tables includes at the most up to 6 fields per table. My actual data tables *may* include up to 150 fields each.

Additional background:
- The 5th table (00_tblFields) includes the total # of fields based on the 4 source tables.
- Also, 00_tblFields contains a boolean field "PotentialDuplicate" as well as a field referencing the source table (for that particular field).

Existing Process:
- First, crosstab query "qry00_Transformation_AllFields" transposes all DISTINCT field names into a matrix and places a "●" under each source column that includes that particular field name.
- Next, crosstab query "qry02_Transformation_PotentialDuplicates" only displays those fields where the analyst placed a checkmark into field "PotentialDuplicate" (in table 00_tblFields).
- So, ultimately, based on the intial screening of **fieldnames**, it appears that "Gender" (from Source1) & "Gender1" (from Source2) may be include the same type of data. At this point no actual data screening has occurred yet.
- Similarly, there are 4 source tables which include a fieldname for last name. 3 of the 4 source tables, however, have a different spelling of last name field (Last Name, Last_Name, LastName). As a result, all 4 fields were checked [00_tblFields].[PotentialDuplicate].

I hope all above makes sense thus far.

Data Screening (Background):
a. Based on the initial hunch a) 2 gender-related fields and b) 4 last name fields may hold similar data (or data types).
b. However, based on a., I do need to validate/verify that the latter is actually true. Now, based on the example data, it would be extremely easy to quickly open up 4 tables and glance at those up to 6 fields.
c. Reality is though that step b. is difficult when looking at the actual organizational data (again, each source table may include 100+ fields and 1000s of records).
d. Also, given that each table has a different structure, it doesn't make sense to attempt to put all of the fields into a single table. Besides, Access (I think) cannot have more than ~250 fields in a single table.

Data Screening (Manual Process) -- when performing manual data screening the following should be realized:
- Table/field [Source1].[Gender] includes values = "M" and "F"; alternatively, table/field [Source2].[Gender1] includes values equal to "Male" and "Female". Ultimately though, both source tables describe the same type of data. Thus, the user/analyst should synchronize both *fieldname* and *underlying data*.
- As part of the manual review, all four source tables include actual last names EXCEPT table Source2. Here -- for whatever reason -- date values are stored in the field [Last Name].

Now, here's what I need some help with:
- I would like to use the *most efficient* way to quickly screen data of the potential duplicate fields across the 4 source tables.
- Based on the latter, I'm not entire certain if I should have 4 queries open at the same time and then quickly look/screen for the underlying data values.
- So, my question is as follows: Given the different tables structures, is there an way to somehow join/union the four tables and then quickly drag/down the candidate fields into one (1) query so that I can realize that "M/F" = "Male/Female" but that "Lastname" (date) does NOT equal "LastName" (string)?

So, any ideas as to how I could quickly identify any a) equivalent data types OR b) data type mismatches would be greatly appreciated.

Thank you in advance!
Tom

**orange** · 03-04-2021, 08:45 AM

Tom,

What is the ultimate goal? I've seen situations where we've had to bring systems together from different departments etc. So there was an effort to have an authority for field names and a range of acceptable values. A substantial data cleansing activity.

You could work with a new table or set of tables that contain the results of some manipulation/adjustment. But depends on the real requirement and "future processing" of the existing sources of data.

**skydivetom** · 03-04-2021, 08:59 AM

Orange - yes, you're spot-on. Currently, we are going through a migration effort from several legacy systems into a single target system. Some of the legacy system (across, e.g., subsidiaries) may started (10+ years) ago w/ a similar architecture but **separate** administrators maintained updates their own systems over the years.

Now, as part of the migration, we need to identify those source columns (e.g., variations of lastnames) that ultimately contain the information. If validated, then we'd synchronize the various (similar) field and finally map the same fields to the designated target field.

As you indicated, the data cleansing activity can be substantial. However, in my view, before I even begin cleansing data, I would like to remove any duplicate field names (e.g., have "LastName" vs. 3+ variations of lastname). That hopefully would reduce the number of fields from e.g., 600 to may 400 (or some other number).

Makes sense?

**orange** · 03-04-2021, 10:15 AM

Yes, absolutely.
But even though you may have several "last names", they may have different definitions. Contact last name, client last name...
We merged multiple government departments, over 25 grants and loans programs (financial/incentive/regs) into corporate database. One of the biggest issues was the resolving the details regarding meanings/definitions/range of values/codification.

I recall 50 local systems spreadsheets, dbase3.... across 8 sector branches that proponents lived by ("they were gold" in their eyes). But analysis showed that much company info was out of data, contacts were dead.... It was a "fun" time.

**skydivetom** · 03-04-2021, 10:28 AM

Orange -- concur 100%. As part of my example date, I kept it to a minimum. In actual source data, we do specify those differences. But again, there may be scenarios where I have "Contact last name" AND "ContactLastName" AND "Contact_last_name", etc. Similarly, same could be the case for "client last name" and so forth.

At this moment, my primarily goal is to ID that no, e.g., "03/01/2021" or "1234567" exist in what -- logically -- speaking should be a "Jones" or "Smith".

Any thoughts on how to quickly compare all related columns? That is, if I were to throw all candidate source tables into the same query **without** having the ability to join them, 1000 records may easily return 1 million records (compounding matching). Thus, I would have to add one table a time (Source1) to look at LastName; then do the same with Source2; then do the same with Source3, etc. Very cumbersome!! I'm hoping there's a more efficient way to compare various columns w/o having the ability to join records. Any thoughts?

**orange** · 03-04-2021, 10:41 AM

I have an old routine or 2 where I can review each table in the database and list the tablename, desc, fieldname, field desc,, ordinalposition, datatype, size etc. to a table. From that you could build/run query(s) to do some analysis.

I'm sure you're aware, but for clarity --Backup your stuff; work on copies and backup your interim results as you move along.

Attached is source code to document your tables as mentioned above.

This routine creates a table DataDictionary each time it is executed. Run it from the module/vba.

**skydivetom** · 03-04-2021, 11:09 AM

Thank you for providing the DataDictionary... I'll review it and will see how it translates in a backup copy. Many thanks for sharing your ideas & recommendations.

Cheers,
Tom

**orange** · 03-04-2021, 04:11 PM

You're welcome. Keep us abreast of your activity.
If you have specific validations etc., let us know.
You could certainly have some functions to ensure a field's values are:
alphabetic, numeric, valid dates, proper lengths.....

From the dictionary you can create queries to
check field names to identify variation
check table designs
etc.

"Data Type" / "Data Value" Screening Method

Thread Tools

"Data Type" / "Data Value" Screening Method

Similar Threads

New form gives me "Compile Error: Method or data member not found"

Simple table relationships ("faces" to "spaces" to "chairs")

Best Practice for Creating a "Playlist" or "Album" type function within Access

Going "bonkers" with "data member not found" error

"package" reports into a form to allow "print", "choose Printer" etc easier for user

Posting Permissions