Results 1 to 12 of 12
  1. #1
    Juan_007 is offline Novice
    Windows 8 Access 2016
    Join Date
    Feb 2018
    Posts
    4

    Setting up DB for distances between all entries in the DB

    Hey, Hans here. New to the forum. Thank you for welcoming me in advance

    I'm relatively new to Access, way more familiar with Excel. I've run quite a bit of Access tutorials over the past weeks, but I'm starting to question if I will find a solution to my challenge this way. So I want to present my thoughts here in the hope someone can help or show me where to go look for my solutions.

    I'm looking at the following.
    I have data for about 8,000 'locations'. They are not physical locations, but let's call it that for simplification; as I'm interested in the 'distance' between locations, which is also not a physical distance, but a distance nonetheless. What I want to create is a database where I can record and store the distance between any unique combination of locations. So that is (8,000 * 8,000 - 8,000) / 2 = 31,996,000 distances.

    When I started with this I thought it was going to be easy. Create a matrix with the names of locations on a vertical header and a horizontal header and voila: I can start recording my distances. Let's simplify it to 8 instead of 8,000 locations and it would look like this (where x’s are non-unique or distance to itself which both are irrelevant):

    --A B C D E F G H
    A x
    B x x


    C x x x
    D x x x x
    E x x x x x
    F x x x x x x
    G x x x x x x x
    H x x x x x x x x

    Right?
    But then I figured that would not be useful, even in Excel, because pivots are only used to present rather than record data (right? Though it seems intuitive in cases like this…). And I learned that pivots in Access have been discontinued since Access 2013 (why?!?), so I haven’t even taken the effort to start to try to figure out if it could have been done in there with that function.

    So am I looking at making a normalized table with 30+ million rows now? Is that my only option? In the simplified (8 locs) example that would look like this:

    Loc1 Loc2 Distance
    A A X
    B A
    C A
    D A
    E A
    F A
    G A
    H A
    A B
    B B X
    C B
    D B
    E B
    F B
    G B
    A C
    etc etc

    It feels tedious to me.
    You can also see why I find this by not writing out the full table like I did with the pivot. It’s too much work.
    If such table is my only option, is there some automatic or systematic way of creating such table? Having to copy the 8,000 locations 8,000 times (for each location that is) to create all my rows and then another column with 8000 * 8000 rows with each location against that seems crazy and inefficient. There must be a better way…
    A second reason it feels tedious to me, after I successfully create it, is that I will have to look up through the list to find the combination of locations I want to edit, every time I want to record or edit a distance. This as opposed to having a pivot table where looking up the distance I want to record feels (at least) easier to me. But then again, I don’t believe you can even edit your values/data in an Excel pivot table, so I don’t know how feasible that idea was in the first place (I am wondering if that was possible with Access pivots, but its discontinuation makes the question irrelevant now I guess). I won’t be recording all 30+ million distances (not all distances actually exist, or a great portion of it can be set to an arbitrary set number, likely 0 or a maximum, but depends). It will always be a manual way of entering a distance and there is no way of knowing beforehand which distances will interest me. Well, it is reasonable to suspect the distances within a class (the locations are classified/structured some way) will interest me more than between classes, but it is not exclusively the case.

    Finally, some additional, more general questions.
    Is it good conduct to use several tables in Access? I have additional information about every location. Let’s say, an ID, what country and continent it is in (again, it isn’t physical locations, but there is some sort of classification/structure going on), alternative names, population. I’m just making things up now, but there is much more to say about these ‘locations’ and maybe there will be some additional information I will realize I want to store too while creating the database. If I would record the information in Excel I would simply have another column for any additional information. But if I’m going to make that huge table with all combinations of locations, it doesn’t seem right to repeat all that information as well.
    So is that solved by having a separate table with the distances and one with other information and then create relationships? Should I split up other information as well besides a table for the distances and one for the rest of the info about the locations? I guess you can sense that I’m a bit new to Access from these questions. I suspect the answer to these additional questions is basically ‘yes’…
    If I have 30+ million rows for my distances, is it still possible to represent this data from my database in Excel with a pivot table? Excel has a limit of just over 1 million rows… I’ve also heard about an Excel add-in called PowerPivot, should I look into that?

    Because it has become such a long post, a summary by the most important questions:

    • How should I set up a database with distances between 8,000 locations (30+ million distances)?
    • If I need a normalized table with 30+ million rows, how can I make this table in an efficient way (without having to copy-paste all loc names/ID's thousands and thousands of times)?
    • After creating such table, how can I efficiently edit the distances given that distances need to be recorded manually?
    • Am I right in thinking that creating multiple tables with relationships in between is good conduct in Access?
    • Can I still present my 30+ million distances in a pivot table in Excel? Or can I use such large database in Excel at all in meaningful ways?


    Best wishes, and thanks so much for taking the effort to read through my post.
    Last edited by Juan_007; 02-03-2018 at 07:28 PM. Reason: Cleaned up a sentence

  2. #2
    CJ_London is offline VIP
    Windows 10 Access 2010 32bit
    Join Date
    Mar 2015
    Posts
    11,398
    in the context that your location data is in the form of x/y coordinates on a two dimensional surface then you just need to store these in a table once and use a query to calculate distances. e.g.

    tblLocations
    LocationPK autonumber
    LocationName text
    Xcoord long
    Ycoord long

    Assuming your distance is calculated by the sum of the difference between the x an y coordinates, to show all possible combinations you use a Cartesian query (no join) with the table on itself. Or you might want to calculate the hypotenuse for a right angled triangle.

    SELECT A.LocationName as LocA, B.LocationName, abs(B.Xcoord-A.Xcoord)+abs(B.Ycoord-A.Ycoord) as Distance
    FROM tblLoactions A, tblLocations B

    So if you have 10 locations, you will generate 100 rows in a query

    But if your distance is a specific measure and cannot be calculated, the only way is how you have it.

    Is it good conduct to use several tables in Access?
    in principle yes. To understand the theory, google/bing normalisation to find out more but simplistically the objective is to store data only once. Also google about indexes they are just as important for a successful implementation.

    30 million rows is large but quite possible in access - or if necessary upscale to sql server/express

    with regards your other questions, manual input of 30 million rows means just that - you have a job for life. Despite a long post, you have not provided enough information so we can help. For example if the data you receive is in electronic form, then you can automate the process (you mention copy/paste). What that automation looks like depends on the data.

    For example I recently developed a system for a client who recorded around 120,000 journeys a day (very large workforce), the people provided their start and end postcode for each journey (well the start for journey 2 was the end for journey 1). I automated it by checking whether each journey had been previously undertaken (and if not passing the postcodes to google maps (or at least one of the services they provided) who returned the appropriate mileage which was then added to the database.

    Re excel, the 64bit version may handle it, but without indexing it would be extremely slow

  3. #3
    Juan_007 is offline Novice
    Windows 8 Access 2016
    Join Date
    Feb 2018
    Posts
    4
    All right, thanks for answering. Sorry for not being clear or comprehensive enough despite the long post.

    I did tell they are no physical locations or distances. There are therefore also no coordinates. And I used 'location' only to try keeping things simple.

    Trying to clarify without giving away what I'm doing (which I can't give away cause it's sensitive information) might only make things more difficult to understand, but:
    The 'locations' are certain concepts. And the distances are distances between these concepts (so nothing physical really). These distances come in the form of a number. This number can't be calculated within the database I'm setting up, but will come from scientific literature where these concepts have been studied and the numbers that form these distances have been compiled by sometimes varying methodology. So there is no way to automate the population of the distances table I'm afraid. Unless there is a possibility to write a program that goes through scientific articles (and even books, which I don't even know have all been digitalized) looking for these very specific numbers, which I doubt is possible before the singularity happens.
    I will have to find these distances myself in the literature. But there won't be 30+ million of them (phew!), but likely several 1000s nonetheless (but I won't be working on my own).

    The most prancing question that is left to me is:
    How do I create this table with 30+ million rows as quickly and efficiently as possible?
    I have a list with about 8,000 names or IDs of these concepts. And I want to end up with all these concepts against every other concept.
    So in my mind I'm copying the first list of 8,000, then copy paste 8,000 of the first name/ID in the column next to it. Then again the full list of 8,000, then copy paste the second name/ID in the second column... And so on... 8,000 times? That's crazy...
    And I don't even believe you can simply copy-paste a list into an Access database. So do I need to create them in Excel? Which means over 30 Excel spreadsheets with a million row each?

  4. #4
    ssanfu is offline Master of Nothing
    Windows 7 32bit Access 2010 32bit
    Join Date
    Sep 2010
    Location
    Anchorage, Alaska, USA
    Posts
    9,664
    Quote Originally Posted by Juan_007 View Post
    How do I create this table with 30+ million rows as quickly and efficiently as possible?
    Quote Originally Posted by Juan_007 View Post
    And I don't even believe you can simply copy-paste a list into an Access database. So do I need to create them in Excel? Which means over 30 Excel spreadsheets with a million row each?
    Use VBA to create the records. I realize you don't have the "distances", but you can still create all of the loc1/loc2 pairs and add the distances later.
    Once yo have to Locations table filled out with at least the location name, you can use two loops to create the records.
    Or you might be able to use a Cartesian join in a query and append the results to the Locations table.

    Consider tables like these:
    Click image for larger version. 

Name:	Locations1.png 
Views:	23 
Size:	13.3 KB 
ID:	32503
    "tblLocations" holds all 8,000 names
    "tblAltNames" holds alternate names for locations/allows for multiple names
    "tblDistances" holds.... well, "distances" between any two "locations".


    Quote Originally Posted by Juan_007 View Post
    I have a list with about 8,000 names or IDs of these concepts. And I want to end up with all these concepts against every other concept.
    So in my mind I'm copying the first list of 8,000, then copy paste 8,000 of the first name/ID in the column next to it. Then again the full list of 8,000, then copy paste the second name/ID in the second column... And so on... 8,000 times? That's crazy...
    Especially since the maximum number of fields (what you called "columns") over the lifetime of a table is 255.

  5. #5
    CJ_London is offline VIP
    Windows 10 Access 2010 32bit
    Join Date
    Mar 2015
    Posts
    11,398
    So in my mind I'm copying the first list of 8,000, then copy paste 8,000 of the first name/ID in the column next to it. Then again the full list of 8,000, then copy paste the second name/ID in the second column... And so on... 8,000 times? That's crazy...
    this implies your data is available electronically, so perhaps complex automation is possible, however not knowing what the data looks like can't really advise further. From the little we now know, I agree with Steve's suggestion as a starting point

  6. #6
    orange's Avatar
    orange is offline Moderator
    Windows 10 Access 2010 32bit
    Join Date
    Sep 2009
    Location
    Ottawa, Ontario, Canada; West Palm Beach FL
    Posts
    16,716
    Perhaps you can use a more focused analogy so readers have some idea of what you are dealing with.
    If you are talking about "fuzzy logic" or "similarity index" etc, then some examples might be warranted if we are to offer more direct advice.

  7. #7
    Juan_007 is offline Novice
    Windows 8 Access 2016
    Join Date
    Feb 2018
    Posts
    4
    Thanks Steve. That gives me something I can delve into more.
    I'm going to play around with those ideas.

    Is it important to know what "_PK" and "_FK" stand for? I didn't fully grasp that.
    Why would you set up a separate table for alternative names? For clarity of the base table?

    I will btw also want to use a column where the source of a certain similarity/distance can be noted. This means a column with lots of empty cells. Will that form any problem in Access?
    Or should I perhaps also assign ID's to all unique loc1/loc2 pairs and make a separate table for the sources just referring to the relevant pair ID"s?

    Quote Originally Posted by orange View Post
    Perhaps you can use a more focused analogy so readers have some idea of what you are dealing with.
    If you are talking about "fuzzy logic" or "similarity index" etc, then some examples might be warranted if we are to offer more direct advice.
    I'm sorry for my inability to come up with better examples
    Maybe we can say the locations are bird sounds and we come up with a way to calculate numbers on how similar they sound/are? I didn't suspect the analogy would be all important.
    Quote Originally Posted by Ajax View Post
    this implies your data is available electronically, so perhaps complex automation is possible, however not knowing what the data looks like can't really advise further. From the little we now know, I agree with Steve's suggestion as a starting point
    I have the list with names sitting in an Excel sheet, together with columns with additional information. I wanted to import that sheet into Access. That would then form the basic/mother table in Steve's diagram.
    The distances, or indeed similarities (that's indeed what they are, stupid I couldn't come up with similarity index), are all over scientific literature. Getting the data on this will encompass searching for certain articles, reading them, and recording similarity numbers (the distances) as we find them.

  8. #8
    orange's Avatar
    orange is offline Moderator
    Windows 10 Access 2010 32bit
    Join Date
    Sep 2009
    Location
    Ottawa, Ontario, Canada; West Palm Beach FL
    Posts
    16,716
    Juan_007,

    Here is a link to a variety of information on Database Planning and Design. It would be an excellent place to start - you need to understand some database concepts before jumping in to physical Access. For your own benefit, try to describe in very simple English what your database will be about - the subjects involved and how they relate to one another. That is a key first step --know your requirements.


    As for a potential algorithm(damerau-levenshtein) see this
    Good luck.

  9. #9
    Juan_007 is offline Novice
    Windows 8 Access 2016
    Join Date
    Feb 2018
    Posts
    4
    Thanks! That looks really useful. Wondering why all the basic tutorials about Access I used never mentioned stuff like this...

    The similarities actually do usually make use of levenshtein's. But again, I won't be calculating them myself. They will come from scientific literature and are subject to specific methodology from one instant to another (one set of 'bird sounds' to another).
    The only way I could automate that is to have all the data on all the 'bird sounds' and then come up with my own methodology that is applicable to all the different sets of 'bird sounds'. That's very likely simply not possible even if I had all the data on the 'bird sounds' in this case because of the inherent differences that occur from one set of 'bird sounds' to another.
    So automating that was never my question. We are going to look up the numbers by hand through the literature.

  10. #10
    isladogs's Avatar
    isladogs is offline MVP / VIP
    Windows 10 Access 2010 32bit
    Join Date
    Jan 2014
    Location
    Somerset, UK
    Posts
    5,954
    I realise this is no more about bird sounds and their differences than the original analogy about locations and distances apart.
    However I'm going to dispense with the quote marks below.

    If you know the difference values between A&B and between A&C, is there a formula that can use those two values to get the difference between B&C?
    If a formula does exist and can be used in a procedure then the number of records needed in tblDistances drops from 31 million+ (7999 factorial) to 7999 which is much more manageable?

    If not, how big is the army who will be dragged in to populate 7999 factorial results and how long is it going to take?
    Say you have 10 people each doing 500 records per day, that is over 6000 days.
    If each does that 5 days a week with 2 weeks off a year that will take around 24 years
    And each time one new item/bird/place/thing is added think about how many extra records that will require.

    There's clearly something more to this than you are telling us....

    Furthermore consider the practicalities of searching that number of records.
    Using a pair of combo boxes for each item won't work for millions of records.
    You will need to break each down into a series of steps.

    To take 2 examples from my own databases:

    1. A student attendance marks table has about 2 million records each year
    That's 1500 students for 7 lessons/sessions per day and around 190 school days per year.
    To search for a specific record, first select the student then the date and finally the lesson using 3 combos.

    2. A postal address database has around 28 million records.
    To select the postcode from the total of 2.6 million, you could use 5 cascading combos - area, district, sector, zone, postcode.
    Once you have that, a list box can be used to get the list of addresses.
    Obviously if you know the exact address the process can be shortened.

    Whilst these two examples may have nothing to do with your situation, you will still need a similar solution to search for records
    Colin, Access MVP, Website, email
    The more I learn, the more I know I don't know. When I don't know, I keep quiet!
    If I don't know that I don't know, I don't know whether to answer

  11. #11
    CJ_London is offline VIP
    Windows 10 Access 2010 32bit
    Join Date
    Mar 2015
    Posts
    11,398
    Wondering why all the basic tutorials about Access I used never mentioned stuff like this...
    probably because they are about how to use access, not designing an efficient database which is a completely different topic.

    Is it important to know what "_PK" and "_FK" stand for? I didn't fully grasp that.
    PK means 'Primary Key' or 'Parent Key', FK means 'Family Key' or in db parlance 'Foreign Key'

    Primary keys are unique to identify a specific record, they do not usually have any meaning in a record other than to identify it, users will not normally see it and it is usually numeric. It is used to identify the specific record and the related records by matching on the foreign key.

    Consider using a location name as a primary key - it should be unique and ticks that box. But perhaps when entered it was spelt wrong, or there have been a reclassification and you need to rename it. When you do that, you break the link to the related records or have a big task to update those family records as well. Why numeric? because they take up less space which matters for indexing performance. There are always potential exceptions, airports have a 3 letter code which is unique, US states have 2 letter codes. People have social security numbers etc. For the first two it may be acceptable to just use the 2/3 letter code as a primary (and foreign) key because a) they are short, b) limited in scope and c) are unlikely to change. But some people don't have social security numbers, or there is potential for change for some reason, so not a good candidate for a primary key.

    A typical primary key will be a long numeric type and takes up 4 bytes. A 3 letter code takes up 2 bytes per char, plus 1 - 7 bytes in all and nearly twice the size. Disk space is cheap, so why does it matter, you are storing the code anyway? It's because of indexing.

    An index consists of effectively a table containing the key and a pointer to the record. Lets assume the pointer is 2 bytes, it might only be one.

    So you have a record with a primary key and say 10 related foreign records (you are talking about 8000 foreign records), the math is easy.

    With numeric primary key

    Primary record 4 bytes (for key) plus 7 bytes for code) =11 bytes
    Index for primary record 4 bytes (for key) + 2 bytes (for pointer) =6 bytes
    foreign record 4 bytes * 10=40 bytes
    index for foreign record 4 bytes (for foreign key) + 2 bytes (for pointer)*10=60 bytes
    Total 117 bytes

    using code as the primary key

    Primary record 7 bytes for code =7 bytes
    Index for primary record 7 bytes (for key) + 2 bytes (for pointer) =9 bytes
    foreign record 7 bytes * 10=70 bytes
    index for foreign record 7 bytes (for foreign key) + 2 bytes (for pointer)*10=90 bytes
    Total 176 bytes

    OK, space is cheap, but now consider how indexing actually works. Everything else being equal, the governing factor in index performance is the speed of disk reading - pretty fast now, but what happens is the computer reads a block of data (think it is around 4096 bytes) looks through what it has loaded and if not found, reads the next block.

    So in the above example for a numeric datatype it will load 682 index records in a block read and for a 3 character code it will load 455. If your character code was say 10 characters long, it will only load 186 index records. You have 30m records, then on average it will take some 80,000 disk reads to find your record (v circa 22,000 for a numeric code)

    Clearly search algorithms have improved over the years to reduce the number of reads required to find a specific record, but the principle remains.


    The only way I could automate that is to have all the data on all the 'bird sounds' and then come up with my own methodology that is applicable to all the different sets of 'bird sounds'.
    with respect, this sounds like you have pictured in your mind the only way this can be done. And this would appear to be based on a limited experience of Access, no knowledge of database design and what appears to be an excel view of how things should be done.

    Perhaps the solution is to have an 8k x 8k worksheet in excel and use excel vba to do your analysis. 64bit excel allows 1m rows x 16k columns which should be more than enough for your purposes. Perhaps access is not the right tool for your requirement

  12. #12
    ssanfu is offline Master of Nothing
    Windows 7 32bit Access 2010 32bit
    Join Date
    Sep 2010
    Location
    Anchorage, Alaska, USA
    Posts
    9,664
    Quote Originally Posted by Juan_007 View Post
    Is it important to know what "_PK" and "_FK" stand for? I didn't fully grasp that.
    Alex explained what they mean. I use "_PK" and "_FK" suffixes to make it easier to see which fields are linking/linked fields without opening the table in design view to find linking field names.


    Quote Originally Posted by Juan_007 View Post
    Why would you set up a separate table for alternative names? For clarity of the base table?
    How many "alternative names" could there be for each location? 3? 5? You can't (shouldn't) add, say 10 fields for alternative names, in the location table. Lots of wasted space.
    If you have a separate table for alternative names, you can have unlimited alternative names for each location.
    The relationship between the Locations table and the "alternative names" table is 1-to-many. Using this design, a Location could have 0 to 10,000 names WITHOUT changing the design of the dB. (I know 10,000 is an exaggeration, but the idea is the same if you have an unknown number of alternative names per location)

Please reply to this thread with any new information or opinions.

Similar Threads

  1. Replies: 2
    Last Post: 05-07-2015, 02:55 PM
  2. Replies: 11
    Last Post: 03-13-2014, 09:54 AM
  3. Replies: 0
    Last Post: 09-12-2013, 01:19 PM
  4. validation setting for duplicate entries
    By noweyout in forum Forms
    Replies: 1
    Last Post: 02-18-2011, 06:14 AM
  5. Y and N entries
    By Drisconsult in forum Access
    Replies: 4
    Last Post: 08-10-2010, 02:16 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Other Forums: Microsoft Office Forums