Thursday, August 4, 2011

redlegs and descriptive stats

Honestly, my appreciation of baseball is pathetic. That is a veritable reality (as those acquainted with me can attest). However, as a youth I followed Redlegs baseball compulsively. Sunday mornings were great because the Cincinnati Enquirer would have an expanded sports section with a full breakdown of player and team stats (this was before the dawn of ESPN.com/mlb/statistics... in a former age when all fans waited with bated breath for our information to arrive through the paper delivery man). Batting averages, ERAs, standings, etc. Typical of my analytic approach to things, I would perform my own calculations to see how many hits Hal Morris would need to break .400 over the next 50 at bats. Much has changed since my youth, but I still gain a good deal of satisfaction from performing calculations. For those who might find such a thought foreign or even perverse, I will frame this post on summary statistics around baseball. Perhaps no one will even notice that math is being conducted.

Data collection is the process of compiling bits of information (e.g.: number of hits, innings pitched) in an organized system (I personally prefer Excel for its ubiquitous use). Without at least some basic calculations the data are not very meaningful. Summary statistics are employed to describe several qualities of the data: location, spread, and shape. Location captures where the majority of the data lie (e.g.: the team batting average represents a central number around which most of the players on a baseball team are averaging). Spread captures how widely the data vary (e.g.: once the location is known we want to know how big the difference is between the worst batting average and the best batting average). Shape captures how closely a graph of the data resembles a bell-shaped curve (e.g.: if the batting average is relatively high compared to the spread of individual batting averages then the shape will be abnormally tall and depart from a normal bell curve). In statistical jargon these three qualities are measured by: mean, variance and skewness/kurtosis. (WARNING: those who experience anxiety when exposed to mathematical formulas should skip over the next paragraph...you won't miss much).



  • Mean is the sum of the data divided by the number of data points. It is denoted by an upper case "Y" with a line over it (written YBAR here):

      YBAR = SUM[Y(i)]/N where the summation is from 1 to N.

  • Variance is the squared distance of each data point from the mean, summed over all data points and divided by the total number of data points minus 1. The square root of the variance, the standard deviation s, reflects the typical distance of the data points from the mean:

      s = SQRT[SUM[(Y(i) - YBAR)**2]/(N-1)] where the summation is from 1 to N.

  • Skewness is obtained by cubing (each data point minus the mean), summing over all data points, and then dividing by the product of (total number of data points minus 1) and (the cube of the standard deviation):

      skewness = SUM[(Y(i) - YBAR)**3]/[(N-1)*s**3] where the summation is from 1 to N.

    When this number is significantly positive, the data are not normally shaped and have a long tail stretching to the right; when this number is significantly negative, the long tail stretches to the left.

  • Kurtosis is obtained similarly to skewness, only the numerator is taken to the 4th power and the standard deviation in the denominator is taken to the 4th power:

      kurtosis = SUM[(Y(i) - YBAR)**4]/[(N-1)*s**4] where the summation is from 1 to N.

    A normal bell curve gives a kurtosis of about 3, so the number is usually judged after subtracting 3. A significantly negative result (kurtosis well below 3) indicates a flat curve and a significantly positive result (kurtosis well above 3) indicates a tall, lanky curve.
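For those who would rather let a computer do the arithmetic, here is a minimal Python sketch of the four formulas above, using the same (N-1) denominators. The function name `describe` and the batting averages in the example are my own inventions for illustration, not real Redlegs numbers. The sketch also reports kurtosis minus 3 (often called excess kurtosis), on the assumption that you want a number that sits near zero for a bell curve.

```python
import math


def describe(values):
    """Mean, standard deviation, skewness, and kurtosis of a list of numbers.

    Uses the (N - 1) denominators from the formulas in this post.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")

    # Location: YBAR = SUM[Y(i)] / N
    mean = sum(values) / n
    deviations = [y - mean for y in values]

    # Spread: s = SQRT[SUM[(Y(i) - YBAR)**2] / (N - 1)]
    variance = sum(d ** 2 for d in deviations) / (n - 1)
    std_dev = math.sqrt(variance)
    if std_dev == 0:
        raise ValueError("all values are identical; shape is undefined")

    # Shape: third and fourth powers of the deviations, scaled by s
    skewness = sum(d ** 3 for d in deviations) / ((n - 1) * std_dev ** 3)
    kurtosis = sum(d ** 4 for d in deviations) / ((n - 1) * std_dev ** 4)

    return {
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "skewness": skewness,
        "kurtosis": kurtosis,
        # Assumption: subtract 3 so a bell curve scores roughly 0.
        "excess_kurtosis": kurtosis - 3,
    }


if __name__ == "__main__":
    # Hypothetical batting averages for nine made-up players.
    batting_averages = [0.231, 0.248, 0.255, 0.262, 0.270,
                        0.274, 0.281, 0.295, 0.338]
    for name, value in describe(batting_averages).items():
        print(f"{name}: {value:.4f}")
```

With only a handful of players the shape numbers bounce around a lot, so treat them as a rough description rather than a verdict.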


    Why are descriptive statistics important? There is a phenomenon out there in the sports world known as fanaticism. A true baseball fanatic will go to extreme lengths to stand by her team. It is even conceivable that during a pleasant interchange over brews she may overstate or even understate the skill level of her team. In the world of statistics we call this bias. Descriptive statistics help us to reduce the influence of bias and see things more like they really are. No matter how devoted I am to the Redlegs offense, if I compare a variety of reliable offensive indicators to the same indicators for the Cardinals, I will find out the truth of how well they compare. If I compare the two teams and find that the Redlegs have better numbers and that the difference between the two teams is not due to chance, then I may brag not only as a fanatic but as a man who knows the statistical truth ("statistical truth," is that an oxymoron?).

    So that raises a question: how would I know if the better Redlegs numbers were due to chance or skill? Tune in to the next post for a breathtaking answer.
