Results 1 to 6 of 6
  1. #1
    alvinsmode is offline Novice
    Windows 10 Access 2016
    Join Date
    Nov 2017
    Posts
    4

    Post Need advise and help for managing million lines of data with lots of misspelled and typo

    Hi guys, so I have a pretty large data of a company's purchase history(approximately 3 million lines). Unfortunately, some of the product names had either misspelled or spelled differently when being entered, but they are literally the same product (ex. Chicken Breast 2LB, Chicken Brest 1LB, Chicken Brast 1LB). Since there are hundreds of thousands of different products, I'm wondering if there is any better/smart way to let me group all these products with similar spelling into one category rather than look into every single line manually? I would greatly appreciate it for any advise or suggestion. Thank you!

  2. #2
    June7's Avatar
    June7 is offline VIP
    Windows 10 Access 2010 32bit
    Join Date
    May 2011
    Location
    The Great Land
    Posts
    53,646
    No, there is no better/smart way. GIGO syndrome (garbage in, garbage out) data. Data saved should have been a ProductID, not the descriptive names.
    How to attach file: http://www.accessforums.net/showthread.php?t=70301 To provide db: copy, remove confidential data, run compact & repair, zip w/Windows Compression.

  3. #3
    orange's Avatar
    orange is offline Moderator
    Windows 10 Access 2010 32bit
    Join Date
    Sep 2009
    Location
    Ottawa, Ontario, Canada; West Palm Beach FL
    Posts
    16,850
    alvin,

    What exactly are you intending to do with the data when you get it cleansed? There may not be a better method, but there may be some options if readers understood your requirement in detail.

    Many organizations work with productCodes/productids that have specific names (and often descriptions) and always use the productCode in related tables --that way typos and spelling errors are kept to a minimum.

    see UNSPSC as an example and NAICS for another

  4. #4
    alvinsmode is offline Novice
    Windows 10 Access 2016
    Join Date
    Nov 2017
    Posts
    4
    Hi Orange,
    So my company is a first-tier supplier, and we bought a lot of different components (or raw material) for 30+ of our own production plants from hundreds of different suppliers within the US. However, since most of the production plants have their own procurement team and basically can buy the components from wherever they feel like, it's very hard to standardize the data they have already created in the main system. Each plant would create their own description of the product in our company's system when they make a purchase, therefore, so the same product could have over 30 different names. Now I'm trying to find what products/category each plant usually buy (or the size of the product, which usually included in the product description, ex. chicken breast 5LB, or chicken breast 4LB, etc...) and where do they buy it from. so that I can probably centralize the procurement process for the company, and make the data cleaner in the future.

  5. #5
    Join Date
    Apr 2017
    Posts
    1,776
    Then it is even worse! There is the possibility that same product name in different plants is used for different products!

    Using LIKE operator is no solution too - having accidentally renamed 2 different products as same is probably much worse case than a couple weeks of manual work.

    As you anyway have to edit all products, then maybe you can use it to some upgrading. Usually in such cases the product name isn't an identificator at all - article number is most widely used instead. When you don't have it yet, my advice is to consider implementation of it.

    1. Create a table of all different products (different products, not different product names). The table has a field ArticleNo among other info;
    2. Create a table of all different product names used currently, and add a field for ArticleNo (Maybe you must add a a field for plant too). Insert a proper ArticleNo for every row in table;
    3. Add a field for ArticleNo into your purchases table, and fill it using update query from table where you set link between ArticleNo and product names (and plants).

    As you'll implement ArticleNo from zero, use the occasion and structure the ArticleNo so you can easily get various reports in future. Probably something like is used in European Union for statistical reports:
    The code is multi-level;
    Each level is 2-number code (probably the 1st level is 1-number code). When you use numbers, then you can describe 99 different values (9 different values for 1-number code), when you use 32-bit coding system, then over 1000 different values for level.

    So p.e.
    the 1st level determines 1 - meat; 2 - vegetables; 3 - drinks, ...;
    the 2nd level determines for meat 101 - raw; 102 fabricated; 103 - cooked;
    the 3rd level determines for raw meat 10101 - chicken; 10102 - geese; 10110 - beef; 10120 - pig;

    All ArticleNo's will be of same length, for level groups the rest of number is filled with 0's. P.e. for the example above, when you'll have 5 levels, the according codes will be
    100000000 - meat;
    200000000 - vegetables;
    101010000 - raw chicken;
    101010011 - raw chicken (1 kg vacuum package)
    etc.

    So when you need to get all possiblle products of raw chicken, you simply select for ArticleNo >= 101010000 and ArticleNo < 101020000

  6. #6
    orange's Avatar
    orange is offline Moderator
    Windows 10 Access 2010 32bit
    Join Date
    Sep 2009
    Location
    Ottawa, Ontario, Canada; West Palm Beach FL
    Posts
    16,850
    Alvin,
    For an organization that has 30+ production plants that deal with hundreds of suppliers, and fits this description of tier 1

    A tier one company is the most important member of a supply chain, supplying components directly to the original equipment manufacturer (OEM) that set up the chain. Creating a tiered supply chain is part of supply chain management. Its aim is to link important business functions and processes in the supply chain into an integrated business model, according to the Council of Supply Chain Management Professionals.
    A top tier organization is one of the leading organizations in a category, such as manufacturing, education or an industry sector. In supply chain management, the term has a specific meaning. A top tier supplier, also known as a tier one supplier, is the most important member of the supply chain in a hierarchy of suppliers and subcontractors. (from http://smallbusiness.chron.com)

    one would expect better discipline on its data management, purchase and spend/financial analysis.

    Before you undertake any action with the data, I recommend you get a clear statement of requirement for your "project'. Once the purpose of the project and its scope have been clarified and quantified, I suggest you do some analysis and get an objective view of the data management practices throughout your organization. You may be facing an "issue/opportunity" that management/politics does not want to solve. If you really are charged with bringing the purchase and spending areas to some disciplined management and consistency, then you need a corporate sponsor from within your senior management --this is not a project for a database/data management newbie. You may play a role, but this sort of thing isn't driven by someone experimenting with MS Access. (And I mean no disrespect to you).

    Do NOT invent your own coding mechanism before the requirement and scope of the project have been determined.

    NOTE: What Ari has suggested is not wrong in concept. That's exactly how codification and ontologies work. But that's why and how NAICS, UNSPSC, FSC, NATO Stock Number etc became standards for use throughout organizations, industries, governments and trade among countries. The scope of the project/issue/opportunity was seen as "global" or definitely bigger than an industry or country and these codes evolved. Your project may be much smaller, but don't create a coding scheme until you know the "sphere of influence".

    Here is one small piece of UNSPSC I found online while searching "chicken". May be overkill for you, but I'm sure you'll find other Tier 1 companies in the food products ( or whatever industry your organization is in/interacts with) has some codification scheme in place or under development. To be effective all the players, especially those with electronic systems for procurement, logistics, spend analysis..., have to be participants.

    Click image for larger version. 

Name:	SampleUNSPSC_Chicken.png 
Views:	13 
Size:	50.3 KB 
ID:	31350

Please reply to this thread with any new information or opinions.

Similar Threads

  1. Replies: 5
    Last Post: 04-22-2017, 06:13 AM
  2. Replies: 5
    Last Post: 04-25-2014, 11:40 AM
  3. Replies: 8
    Last Post: 02-27-2013, 04:56 PM
  4. Replies: 3
    Last Post: 05-16-2012, 02:56 PM
  5. Replies: 13
    Last Post: 10-26-2011, 03:49 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Other Forums: Microsoft Office Forums