Need help cleaning a dirty dataset. Don't know the best way to attack the problem.
Let's say I have a database of a baseball season. Every game is recorded with an ID, and in a table of season games, each game carries two records: One for each team that competes in the game.
Game001 | Giants | Home ...
Game001 | Mets | Away ...
Game002 | Twins | Home ...
Game002 | Yankees | Away ...
etc.
Every time a game is played, each team is required to submit a player roster. That roster is supposed to list the teams for which each player has previously played. Sometimes the teams comply. Sometimes they don't. Sometimes they list different combinations of past teams. Thus, it is possible, thanks to human error, to have records like this:
A1234 | Jerry Jones | "Phillies, Mets, Braves" | Giants | Game001
A1235 | Jerry Jones | "Phillies, Mets" | Giants | Game002
A1236 | Jerry Jones | "Braves" | Giants | Game003
A1237 | Jerry Jones | NULL | Giants | Game004
I want to know how many players on the Giants have played for other teams. I'm willing to assume that, if I observe,
A1236 | Jerry Jones | "Braves" | Giants | Game003
A1237 | Jerry Jones | NULL | Giants | Game004
That record (A1237) contains the error. That is, it's easier to forget to list past playing history than it is to insert erroneous history.
One thought I had was to run an update query on the player table to set all non-null values to 1, and then run a SELECT DISTINCT query on that updated table WHERE experience = 1. Would that work? Other ideas?