Hi!
I am trying to develop a good base table for reporting purposes, that will attempt to reconcile data from 3 sources and I was hoping to get some much valued outside input. I am very downstream from the data sources, and am attempting what is likely a non standard approach. Any alternative suggestions are appreciated.
Unfortunately, the data sources are not normalized and I was hoping to achieve some degree of normalization from my efforts. Another motivating factor is that each one of these data sets is up to a million records. However, once we omit certain fields and they are grouped down the biggest one is about 9,000 records. Linking them up to different tables for reporting is cumbersome (especially when reporting against all 3) and we usually have to kick out a bunch of fields anyway. Also, a lot of the attributes in each table are very similar(Cust#, ST, PurchaseDate, etc). So my goal is to consolidate into one table, while minimizing excess columns and enforcing some kind of additional integrity.
I have started by creating 3 group by queries on each data source (MassPurchases, GroupPurchases, IndividualPurchases), which effectively limits the ID field in each table to one record. I use these queries to create 3 tables. Although each resultant table has an ID field, they are not related to one another and so I was planning on using that as a Primary key in each table. From there, I was going to add a field that simply specifies the data Source to each table, and then make combine that with ID for a Composite Key.
So for instance, I have made the tables MassPurchases, GroupPurchases, IndividualPurchases.
The composite keys in each table would be ID and Source (Source is essentially the source table name). From here, I would do a make table with IndividualPurchases, and then append the other two (Mass and Group)to the final Purchases table. Then, the final table would have the same composite key structure of ID-Source. The idea being I have all three sources in one table, and could do summaries, crosstabs, and ad hoc stuff more readily from that base table.
Is all of this overkill? Does it even make sense? Are there any alternative suggestions?
I have a feeling that I am getting into ETL/Datawarehousing waters, and so apologize if this is less of a design question.
Thanks