We often think of players’ boxscore statistics in terms of cumulative sums or averages, but these statistics, while they tell us about prolificness and what one might expect from any player in a given game, tell us very little else about the player’s output. Consider three hypothetical players in an 82 game season: Player A scores 5 points in 41 of his games, and 35 points in the other 41 games, player B scores 20 points in each of the 82 games, and player C scores 19 points in 80 of his games, and scores 60 in each of the remaining two. Each of these players ends the season averaging 20 ppg, each scored a total of 1,640 points. However, there should be no question that they are very different players, even without considering non-scoring contributions. B is extremely consistent, C is pretty consistent but has rare scoring outbursts, and A is either a big threat or hardly a threat at all. (Please keep in mind that I could be doing the same with per-minute statistics–it’s just a little easier conceptually to discuss per-game stats, while making an equivalent point.) Opposing teams would need to plan differently when facing each of these three players, and their value to their own team is a function not only of their scoring average, but their entire scoring distribution. Since it is much easier to keep track of cumulative totals, and since the simple mean can be calculated by dividing total points (ast, reb, etc.) by total games, we have all been raised on means and sums–which are useful as far as they go, but don’t tell the whole story. So, into the plethora of other “modern” statistics, I would like to add several statistics that have been with us the entire time, but hidden behind season sums and means: the standard deviation, the geometric mean, and the distribution.
The standad deviation is a summary statistic like the mean, but it measures dispersion. Essentially, it attempts to capture the typical deviance from the mean of each data point. So, players whose per-game boxscore stats vary a lot from game-to-game will have a higher standard deviation than will players who are more consistently close to their own mean. Whether a high or low standard deviation is a good thing is a normative question, although I tend to think that consistency (indicated by a low standard deviation) is a good thing. Bear in mind also, that typically, the greater the mean, the more room there is for variance, and thus the more potential for a larger standard deviation. Thus, another statistic, the coefficient of variation, can be used to give an idea of variation while controlling for the magnitude of the mean.
The geometric mean is similar to the arithmetic mean, in that it is a measure of centrality. However, it seems to emphasize consistency more than does a simple arithmetic mean. Where the arithmetic mean is the sum of the data divided by the number of data points, the geometric mean is the product of the data exponentiated by the inverse of the number of data points. Thus, in our above example, each player has the same mean (20 ppg), but B has a geometric mean of 20, C’s is 19.54, and A’s is 13.23. According to the geometric mean, then, player A is valued almost exactly the same as player D, who scores 13 points in each of 63 games, and 14 points in every other game. Both of their g.means are around 13.23, but player A’s arithmetic mean is 20, while player D’s is 13.23. As such, the geometric mean, especially when presented alongside the arithmetic mean, may tell us even more about a player’s output.*
Finally, there is the entire distribution of per-game point totals. This encapsulates all of the information about a player’s production, because it is the player’s entire production. It’s not a numerical statistic, but can be represented as a graphic, or even (theoretically) an equation. The distribution is represents essentially the same thing as does a histogram or bar chart of each statistic’s frequency at each level of output. In the graphic below, I display each of four players’ distributions on six different per-game statistics. This should give the viewer a very complete idea of each players’ production. I also include the summary statistics I’ve described, which individually give some information about the distribution, and taken together represent a partial but informative view of player production.
This graphic presents the output of four potential MVP candidates through about 60 games of this season. Note that LeBron James tops Kobe Bryant in arithmetic means across every category, and seems to be a more consistent scorer (on a per-game level, at least)… I hope you find this depiction of production useful and informative–please don’t hesitate to participate in the ongoing MVP debate (see this post).
* A note about geometric means: since a player might have zero points, or assists or blocks, etc. in any given game, there is the potential that this zero would “wipe out” their geometric mean for that statistic, making it relatively uninformative. Thus, I have replaced each instance of 0 with 0.9 — which penalizes the player for having a low figure, but maintains valuable information. This is probably not a perfect solution, but I’ve applied it consistently, so it should at least be “fair” in some sense. Let me know in the comments if there is a better way of doing this.