Play-by-play data from Project Retrosheet includes the pitch sequence for each batter's appearance in more recent years but the pitch sequences include a number of extraneous characters representing more than just balls, strikes and balls put into play. They also include throws made to bases by pitchers and catchers, pitches in the dirt blocked by catchers, et al.
For example: B1C1F*B*B>X shows a ball, throw to 1st by pitcher, called strike, throw to first by pitcher, foul ball, two balls in the dirt blocked by catcher, the runner going on the pitch as the batter puts the ball into play. If I just want to count the number of pitches thrown, I need to strip out the extraneous symbols, which is what the following SQL statement does: BCFBBX. LEN[clean] is the number of pitches. The query also pairs the catcher with a given pitcher and vice versa.
SELECT Val(Mid$([gameID],4,4)) AS [Year], Retro10.pitch_seq, Replace(Replace(Replace(Replace(Replace(Replace(Re place(Replace([pitch_seq],"N",""),"1",""),"2",""),"3",""),".",""),"*","")," +",""),">","") AS clean, Len([clean]) AS pitches, Retro10.pitcher, Retro10.catcher, [pitcher] & "_" & [catcher] AS PandC, [catcher] & "_" & [pitcher] AS CandP
FROM Retro10;
My problem is that I expected to write a second query to sum the number of pitches for each catcher/pitcher and pitcher/catcher pairs:
SELECT CLEAN_pitch_seq.CandP, Sum(CLEAN_pitch_seq.pitches) AS SumOfPitches
FROM CLEAN_pitch_seq
GROUP BY CLEAN_pitch_seq.CandP;
However, I get a "Data type mismatch in criteria expression" when I run this second query and I have no idea why. Similarly, if I change the query to sum PandC.
It looked to me like a straightforward aggregate. It's a simple subtotal or SUMIF in Excel but I have over 750K records. What am I missing?
Thanks.