Complex Survey: Table Structure Design and Normalization

**kevin007** · 06-28-2010, 02:38 PM

I am a newbie moving Excel DB into ACCESS as it has become unwieldy. I am going to provide here a comprehensive background and am about to be very detailed so that you can help me please! I have seen a million forums on survey db design and nothing that answers my questions on proper table structure design for my needs (which are to analyze the mean responses using many different criteria such as age, gender, time, date, demographic, etc) Mostly we have multiple choice (with a few open ended questions) and a few yes/no. However, we have several different scales that the respondent’s use depending on which question they were answering as you will see! I am wondering how to house this data in the right table and how to relate the tables together.

We currently have data on around 10000 surveys (records) collected over the last few years. We also have 4 different survey versions/types. As the survey evolved (still within the Excel format) questions were added and moved around. We currently have 2 Surveys: one for our "Bus" fleet and one for our "LRT" (Light Rail Transit fleet). But the BUS survey evolved into 3 versions as it was updated -There are Bus A Bus B and Bus A/B. So in Excel we have a field called Survey_Version where it has A, A/B or B, and a field called MODE where it has 1 or 2 where 1 means it’s the Bus survey and 2 means its the LRT survey.

We are a service based industry and we are mostly gauging customer satisfaction here in 7 key performance areas for our Bus and LRT (LIGHT RAIL TRANSIT) services).The 7 areas are:

1 Facilities
2 Fleet
3 Operators
4 Safety and Security
5 Information
6 Service Delivery
7 Routes

Within those 7 categories of service we ask a total of ask 23 questions (23 fields for these questions) with a 1-7 "Likert Scale" (1 means "poor" 4 means "acceptable" and 7 means "excellent" with 2-6 being generic levels of satisfaction) then 8 means "don't know" and 9 means"no response"

We also ask 2 "loyalty" questions using a different Likert scale where it is still 1-9 with 8 as "don't know" and 9 as "no response" but in this case the scale is (1 means "not likely" 4 means "likely" and 7 means "very likely"; with 2-6 being generic levels of likelihood to use our service again)

In addition to analyzing the "Mean" responses to these questions we also use this survey to gather a whole bunch of data on Demographics, Routes and Travel patterns of our customers, as well as asking what Fare "products" that people used when they traveled with us (i.e., ticket, monthly pass, cash etc), how many “transfers” they used to get there, and we view the mean satisfaction to the service areas grouped by these factors.

We also record in each survey the weather and the temperature, at the time of the interview (the person doing the Survey notes this). The surveyor also records the time and date and the location of the interview.

We use this additional information to analyze the "mean" responses based on…I.e. what was the average satisfaction level toward operators, in "xxx" location, in quarter 2 of 2008 when it was very hot outside? Etc etc. and then we can perform cross analysis to see mean satisfaction/loyalty rates for respondents on certain routes compared to others.

In total we have 88 fields that we set up to house all of this data in the original Excel format. That includes fields for all 4 survey versions (as they evolved). I consolidated all the versions into 1 flat file Excel db and after tedious countless hours of cleaning managed to import it successfully into ACCESS. Into 1 BIG table called tbl_responses. NOW THAT THE DATA IS IMPORTED, FROM HERE ON IN, WE WANT TO BUILD A FORM FOR OUR INPUT INTERFACE. NO MORE EXCEL!

23 of these questions relate to the 7 key service areas (poor-excellent mentioned above) and the rest are calculated fields that are based off of the Date of Birth record, the date record, the time record, etc and also data recorded for each responded by the Interviewer (Interviewer record)

As you can probably see a number of queries will need to be run to analyze the means based on gender, age, time of response (quarter, year, month, time of day), weather, temperature, demographics (senior, student, etc). After a long and difficult self thought lesson in SQL I was able to finally able return the right means using an Iif operator to remove any Reponses that have an 8 or a 9 (we don't want to confuse the averages by including the don't know and no response records)

SO I hope you’re still with me! The SQL I had to use to correctly exclude THE 8 AND 9 FROM THE MEANS AND PERFORM AN ANALYSIS ON THE 23 service area questions (in the 7 key service area categories) simply GROUPING BY YEAR AND BY QUARTER from each question looks like this.

CODE
SELECT tbl_responses_2_0.YEAR, tbl_responses_2_0.QUARTER, tbl_responses_2_0.MODE,
Avg(IIf([A_facilities_benches] In (0,8,9),Null,[A_facilities_benches])) AS A_facilities_benchesAvg,
Avg(IIf([A_facilities_clean] In (0,8,9),Null,[A_facilities_clean])) AS A_facilities_cleanAvg,
Avg(IIf([A_facilities_condition] In (0,8,9),Null,[A_facilities_condition])) AS A_facilities_conditionAvg,
Avg(IIf([A_fleet_seats] In (0,8,9),Null,[A_fleet_seats])) AS A_fleet_seatsAvg,
Avg(IIf([A_fleet_floors] In (0,8,9),Null,[A_fleet_floors])) AS A_fleet_floorsAvg,
Avg(IIf([A_fleet_exterior] In (0,8,9),Null,[A_fleet_exterior])) AS A_fleet_exteriorAvg,
Avg(IIf([A_fleet_noise] In (0,8,9),Null,[A_fleet_noise])) AS A_fleet_noiseAvg,
Avg(IIf([A_ops_traffic] In (0,8,9),Null,[A_ops_traffic])) AS A_ops_trafficAvg,
Avg(IIf([A_ops_speed] In (0,8,9),Null,[A_ops_speed])) AS A_ops_speedAvg,
Avg(IIf([A_ops_crteous] In (0,8,9),Null,[A_ops_crteous])) AS A_ops_crteousAvg,
Avg(IIf([A_ops_appear] In (0,8,9),Null,[A_ops_appear])) AS A_ops_appearAvg,
Avg(IIf([B_safety_riding] In (0,8,9),Null,[B_safety_riding])) AS B_safety_ridingAvg,

Avg(IIf([B_safety_waiting] In (0,8,9),Null,[B_safety_waiting])) AS B_safety_waitingAvg,
Avg(IIf([B_safety_measures] In (0,8,9),Null,[B_safety_measures])) AS B_safety_measuresAvg,
Avg(IIf([B_info_on_bus] In (0,8,9),Null,[B_info_on_bus])) AS B_info_on_busAvg,
Avg(IIf([B_info_at_stations] In (0,8,9),Null,[B_info_at_stations])) AS B_info_at_stationsAvg,
Avg(IIf([B_service_seats] In (0,8,9),Null,[B_service_seats])) AS B_service_seatsAvg,
Avg(IIf([B_service_ovrcrwd] In (0,8,9),Null,[B_service_ovrcrwd])) AS B_service_ovrcrwdAvg,
Avg(IIf([B_service_value] In (0,8,9),Null,[B_service_value])) AS B_service_valueAvg,
Avg(IIf([B_routes_on_sched] In (0,8,9),Null,[B_routes_on_sched])) AS B_routes_on_schedAvg,
Avg(IIf([B_routes_trnsfr] In (0,8,9),Null,[B_routes_trnsfr])) AS B_routes_trnsfrAvg,
Avg(IIf([B_routes_time] In (0,8,9),Null,[B_routes_time])) AS B_routes_timeAvg,
Avg(IIf([B_routes_dist] In (0,8,9),Null,[B_routes_dist])) AS B_routes_distAvg,
Avg(IIf([C_loyalty_1] In (0,8,9),Null,[C_loyalty_1])) AS C_loyalty_1Avg,
Avg(IIf([C_loyalty_2] In (0,8,9),Null,[C_loyalty_2])) AS C_loyalty_2Avg
FROM tbl_responses_2_0
GROUP BY tbl_responses_2_0.YEAR, tbl_responses_2_0.QUARTER, tbl_responses_2_0.MODE
HAVING (((tbl_responses_2_0.MODE)=1))
ORDER BY tbl_responses_2_0.YEAR DESC , tbl_responses_2_0.QUARTER DESC;

ALL THOSE AVG Iif operators were necessary to correctly remove the 8's and 9's from each of the 23 “service area” questions as well as the 2 loyalty questions. Otherwise ACCESS's “mean” function in query wizard would not exclude the 8's and 9's correctly.

So, if I start to build queries to filter out by GENDER, AGE, DEMOGRAPHIC, LOCATION, WEATHER, TIME OF DAY ETC, you can see where I’ve realized this is NOT going to work with this big flat table structure.

I NEED SOME PROPER STRUCTURE AND NORMALIZATION.

Please let me now post the exact field names and structures of the one big flat file table we have that I built in order to import the Excel data into Access.

First it looks like this….

| survey_version (A, A/B or B)| Respondent_ID (Primary Key) | agent_name (person who administered the survey (not an important field [NO analysis is done by agent!!] | MODE (Survey 1 or survey 2 [Bus or LRT] )

Then comes...

| started_at_date | started_at_time | finished_at_time | time_taken_in_minutes |

These are Interviewer Records, the rest of the Interviewer records come at the end i.e. weather, temperature, etc. You will also see at the end Quarter, Year, Month etc which I will note here were previously caluculated in Excel from the | started_at_date | started_at_time | finished_at_time | fields just shown above.

Then comes a series of “Screener” Questions if they answered yes their responses are not used in the “mean” analysis but their demographics are kept for other uses. Yes or No

Screener_Q_1 (yes or no are you old enough to take survey) | screener_Q_2 (yes [1] or no [2] are you an employee of the company) | year_of_birth |

Then some question on what routes they traveled which can have a 1-200 response that relate to 1 of 200 possible routes)

| route_1_bus (what route did you take 1-200) | route_2_bus (what route did you take 1-200) | route_3_bus (what route did you take 1-200) |

The next three questions [priorities] holds the responses where they ranked their top 3 priorities using the 7 key performance areas (These are one of those questions where the 1-7 scale WHERE 1-7 SIMPLY PERTAINS TO the 7 key service areas Facilities, Fleet, Operators, Routes etc)Which is most important to them,. This allows us to cross examine where importance is correlated to actual satisfaciton levels. This was we can say “they thought the Operators were “Excellent” but they didn’t think it was very “Important” that the operator BE excellent.

Q1_priority_1 | Q2_priority_1 | Q3_priority_3 |

The next MAIN 25 (or so) questions are the bulk of the essential data. They all use the same 7 point “Likert” scale to judge their satisfaction level with 8 for don’t know and 9 for no response) These are the questions (as well as the 2 “Loyalty ones”) where we analyze the mean responses to each one.

Q1 | Q2 | Q3 | ….Q25 |

Then we have a question that asks Yes [1] or No [2] if they had any additional concerns)

| Q_concerns |

Then we coded their concerns to categorize them into the 7 service areas that they best fit with i.e. safety, fleet, driver, etc)

| code1 | code 2 | code |
After that is a text field where we store the verbatim answer to the “concerns” question

|Q_Verbatim |

then we have a section on what the weather was like that day and the location of the interview that looks like this

| location | weather | temperature |

…then comes the section of "when" that the survey administrator fills out that was calculated in Excel through the date and time field at the start of this table structure.

MONTH | YEAR | QUARTER | DAY| WEEK | TIMEOFDAY |

…then comes the section on the respondents demographics and their market segment that looks like this

AGE (Derived from the birth date field) | SENIOR | YOUTH | POSTSECONDARY | ...etc | These again are yes and no [1or2] records that say if the respondent is one of these that field gets a 1 if not it gets a 0)

So we have a 1-200 scale for routes, we have a 1-9 scale for the main questions, we have a 1-7 scale for priorities (which service area is most “important”), we have a bunch of different yes or no q's (Yes/No are they an employee of our company?, are they disabled, are they a senior, are they a student? Some yes/no’s have a “Don’t know” option and some do NOT! , we have scales to categorize the temperature 1=+25 degrees, 2- -25degress etc. We have scales to designate the location of the interview and the whether (1=snow, 2=rain etc), we have 1 text field for the verbatim responses to the concerns, then we have the scale to record the "coded" "service area" categories where their open ended concern would fall. We have a scale to categorize the "location" of the interview. Then we have field like AGE where any number can be, and YEAR and Date of Birth where allot of different numbers can be. So theres a whole bunch of “Answer Types” that need tables (I think?) that need to be related to each respondent, and then to the correct question to show what the chose for their response!! Ahhhhh!

THAT’S IT!

IM SORRY THIS GOT SO LONG! I REALLY NEED YOU TO UNDERSTAND where I’m at so you can help me build these tables correctly. Please keep in mind I am very new. I am really hoping someone can walk me through which tables to build, how to group these 88 fields, what table to put them in, what to include in the table, what the primary keys are for each table and where the foreign keys are they relate to.

WHERE IM AT NOW

I started to come up with a table structure and quickly got stumped. Remember I am really new and inexperienced with this. This is what I have: (FK = "Foreign Key") (PK = "Primary Key"

Table for Question Categories (7 performance area, screener quesitons, LRT quesitons etc)Tbl_QCategorypkQ_CategoryIDCategory Name1Facilities2Fleet3Operators4Safety Security5Information6Service Delivery7Routes8How many transfers9BusScreener10LRTScrnr_&_TravelInfo11Priorities12Demographics (age, type of commuter Q’s)13Interviewer Record (Weather, time, etc)14Open Ended

Table for All the Questions Asked Tbl_SurveyQuestionspkQuestionIDQuestion TextMODEfkQuestionCatID1Consent to complete survey (yes/no)192What year were you born? (YYYY)193What bus route(s) did you take today? (1st Mention)174What bus route(s) did you take today? (2nd Mention)175What bus route(s) did you take today? (3rd Mention)176How many transfers in total did you need to make to complete… 187Which LRT station was your starting point for this trip?2108How did you get to that LRT station?2109LRT previous question, Other - SPECIFY (open)21010How many minutes did it take you to WALK/BIKE to the LRT station?21011What are the top 3 most important factors for you to have the best possible transit system ETS can offer? (1st Mention)211""""""""""""""""""""""

Table to indicate the Survey Version tblSurveys

pkSurveyID (autonumber)txtSurveyName

1CER Bus A2CER Bus B3CER Bus A/B4CER LRT

Mabey some tbales for the demographics???I dunno?

tbl_DemographicspkDemoRespIDDemoName1Grade School Student2University, college or trade school student3Full or part-time weekday worker (5PM-8AM) - including self-employed4Full or part-time shift worker (5PM - 8 AM) - including self employed5Not currently employed 6Retired8Don’t Know9No Response

The Ill need tables for all the time and date feilds, and tables for the age and date of birth, and what routes they traveled, and how many transfers they used, and the locaiton of the interview. I don't have the foggiest on how to build these, what to include as fields, and how to relate them together.

Sorry I’m stumped here. How do I build the right tables for the respon”dents” themselves, table for the respons”es” they gave to each question? The actual raw data (answers) where does this go? How do I desing a table for the Demographics, Etc? What table to a build to group the data on what Fare Product they chose? etc etc.

I’m really stumped here. PLEASE HELP ME. I need to design and relate this data so that is can be analyzed based on Demographic, Age, Time, Date, Quarter, Year, Weather, Temp, and all the rest.

Like I say we will build a form to do entries in the future but as it stands I had to import this data into the one flat file with exact matching field names. Now that I have it I figure I can use APPEND queries to get into the right tables and normalize this. But how do I build THOSE tables? Which table do I build? What are their structure? How do they relate (foreign keys to primary keys) Etc.

PLEASE, PLEASE HELP ME! I AM ATAROADBLOCK!

PLEASE DO NOT REFER ME TO DWAYNEHOOKUMS AT YOUR SURVEY. IT DOES NOT HELP ME.!!

**kevin007** · 06-28-2010, 03:06 PM

I MUST APOLOGIZE FOR THE ABOVE "SAMPLES" of the table structures I began building. When I clicked post the example "records" under the "feild names" spead out across the screen and I just relaized they dont look right.

TO CLARIFY: In BLUE is the prespective table "Name" in BOLD are the prospective table "field names". The rest is a sample of the data I would have as records in these feilds. CER means (Customer Expectations Research) and is the name of this program. for tblSurveys the ID is 1, 2, 3, 4 and the prospective names (records)are Bus A, Bus B, Bus A/B and LRT.

I hope that makes sence. I don;t know if the structure is correct and still need to build all the rest. Mabey these three "sample" tables are not even on the right track....

**kevin007** · 07-06-2010, 09:21 AM

does anyone have any advice for me. I really need to get this db restructured. I have an fundamental academic knowledge of Access. I don;t have much of a working knowledge of Access and apparently a questionaire db is one of the more complex db's. I will gladly foreward a copy of the db if you can helkp me build a table structure to house the data I have imported into the 1 main flat table. I am at my wits end. At this point I would be willing to even compensate someone from my own paycheck if they can jsut build a table structure for me. The idea is that I need to be able to query mean and proportion responses based on a number of factors. Including survey version, MODE, Year, Quarter, Age, Demographic etc. Im sure its really quite simple for an experienced programmer/access guy. For me its a nightmare! Anyone interested. I really do need some help with this.

Complex Survey: Table Structure Design and Normalization

Thread Tools

Complex Survey: Table Structure Design and Normalization

Similar Threads

Survey DB Query not correctly Calculating COUNTS and AVERAGES by year and quarter.

table design: one big table vs. multiple small tables

Simple linked table. Complex underlying SQL statement

An import question above the table structure

Table Normalization Help

Posting Permissions