# Tag Archives: statistics

## The MVP debate, part II

We often think of players’ boxscore statistics in terms of cumulative sums or averages. These statistics tell us about prolificness and what one might expect from a player in a given game, but they tell us very little else about the player’s output. Consider three hypothetical players in an 82-game season: Player A scores 5 points in 41 of his games and 35 points in the other 41; Player B scores 20 points in each of the 82 games; and Player C scores 19 points in 80 of his games and 60 in each of the remaining two. Each of these players ends the season averaging 20 ppg, and each scored a total of 1,640 points. However, there should be no question that they are very different players, even without considering non-scoring contributions. B is extremely consistent, C is pretty consistent but has rare scoring outbursts, and A is either a big threat or hardly a threat at all. (Please keep in mind that I could be doing the same with per-minute statistics; it’s just a little easier conceptually to discuss per-game stats while making an equivalent point.)

Opposing teams would need to plan differently when facing each of these three players, and their value to their own team is a function not only of their scoring average, but of their entire scoring distribution. Since it is much easier to keep track of cumulative totals, and since the simple mean can be calculated by dividing total points (ast, reb, etc.) by total games, we have all been raised on means and sums, which are useful as far as they go but don’t tell the whole story. So, into the plethora of other “modern” statistics, I would like to add several statistics that have been with us the entire time, but hidden behind season sums and means: the standard deviation, the geometric mean, and the distribution.

The standard deviation is a summary statistic like the mean, but it measures dispersion. Essentially, it attempts to capture the typical deviation of each data point from the mean. So, players whose per-game boxscore stats vary a lot from game to game will have a higher standard deviation than will players who are more consistently close to their own mean. Whether a high or low standard deviation is a good thing is a normative question, although I tend to think that consistency (indicated by a low standard deviation) is a good thing. Bear in mind also that, typically, the greater the mean, the more room there is for variance, and thus the more potential for a larger standard deviation. Thus, another statistic, the coefficient of variation (the standard deviation divided by the mean), can be used to give an idea of variation while controlling for the magnitude of the mean.
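To make this concrete, here is a minimal sketch that computes the mean, standard deviation, and coefficient of variation for the three hypothetical players described above. It assumes the population standard deviation (treating the 82 games as the whole season rather than a sample); the helper name `summarize` is made up for illustration.

```python
from statistics import mean, pstdev

# Per-game scoring for the three hypothetical players described above.
players = {
    "A": [5] * 41 + [35] * 41,
    "B": [20] * 82,
    "C": [19] * 80 + [60] * 2,
}

def summarize(points):
    """Return (mean, standard deviation, coefficient of variation)."""
    avg = mean(points)
    sd = pstdev(points)  # assumption: population SD, the season is the full set of games
    return avg, sd, sd / avg

for name, points in players.items():
    avg, sd, cv = summarize(points)
    print(f"Player {name}: mean={avg:.2f}  sd={sd:.2f}  cv={cv:.2f}")
```

All three players have a mean of 20, but the standard deviations come out to 15 for A, 0 for B, and about 6.3 for C, which matches the intuition that B is the steadiest and A the most erratic.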

The geometric mean is similar to the arithmetic mean, in that it is a measure of centrality. However, it seems to emphasize consistency more than does a simple arithmetic mean. Where the arithmetic mean is the sum of the data divided by the number of data points, the geometric mean is the product of the data raised to the power of one over the number of data points (that is, the nth root of the product of n values). Thus, in our above example, each player has the same arithmetic mean (20 ppg), but B has a geometric mean of 20, C’s is 19.54, and A’s is 13.23. According to the geometric mean, then, player A is valued almost exactly the same as player D, who scores 13 points in each of 63 games and 14 points in each of the remaining 19 games. Both of their geometric means are around 13.23, but player A’s arithmetic mean is 20, while player D’s is 13.23. As such, the geometric mean, especially when presented alongside the arithmetic mean, may tell us even more about a player’s output.*
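The figures in this comparison can be checked with a short sketch using Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The player lists simply restate the hypothetical scoring lines from the text.

```python
from statistics import geometric_mean, mean

# Hypothetical per-game scoring lines from the text.
players = {
    "A": [5] * 41 + [35] * 41,
    "B": [20] * 82,
    "C": [19] * 80 + [60] * 2,
    "D": [13] * 63 + [14] * 19,
}

for name, points in players.items():
    arith = mean(points)
    geo = geometric_mean(points)  # nth root of the product of the n values
    print(f"Player {name}: arithmetic={arith:.2f}  geometric={geo:.2f}")
```

The gap between the two means is a rough indicator of inconsistency: it is zero for B, small for C and D, and large for A.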

Finally, there is the entire distribution of per-game point totals. This encapsulates all of the information about a player’s production, because it is the player’s entire production. It’s not a numerical statistic, but can be represented as a graphic, or even (theoretically) an equation. The distribution represents essentially the same thing as does a histogram or bar chart of each statistic’s frequency at each level of output. In the graphic below, I display each of four players’ distributions on six different per-game statistics. This should give the viewer a very complete idea of each player’s production. I also include the summary statistics I’ve described, which individually give some information about the distribution, and taken together represent a partial but informative view of player production.

This graphic presents the output of four potential MVP candidates through about 60 games of this season. Note that LeBron James tops Kobe Bryant in arithmetic means across every category, and seems to be a more consistent scorer (on a per-game level, at least)… I hope you find this depiction of production useful and informative–please don’t hesitate to participate in the ongoing MVP debate (see this post).

* A note about geometric means: since a player might have zero points, or assists or blocks, etc. in any given game, there is the potential that this zero would “wipe out” their geometric mean for that statistic, making it relatively uninformative. Thus, I have replaced each instance of 0 with 0.9 — which penalizes the player for having a low figure, but maintains valuable information. This is probably not a perfect solution, but I’ve applied it consistently, so it should at least be “fair” in some sense. Let me know in the comments if there is a better way of doing this.
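The zero-replacement rule from this note could look something like the following sketch. The 0.9 substitute is the value stated above; the function name and the sample block totals are invented for illustration.

```python
import math
from statistics import geometric_mean

ZERO_REPLACEMENT = 0.9  # substitute value stated in the note above

def adjusted_geometric_mean(values, replacement=ZERO_REPLACEMENT):
    """Geometric mean after swapping each 0 for a small positive value."""
    patched = [replacement if v == 0 else v for v in values]
    return geometric_mean(patched)

# Made-up per-game block totals containing two zero games.
blocks = [0, 2, 1, 0, 3, 1]

# Without the adjustment, a single zero drives the product (and the mean) to 0.
naive = math.prod(blocks) ** (1 / len(blocks))
print(f"naive geometric mean:    {naive:.3f}")
print(f"adjusted geometric mean: {adjusted_geometric_mean(blocks):.3f}")
```

Because the replacement is below 1, a zero game still drags the geometric mean down more than a one-block game would, which is the penalty the note describes.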

## NBA Players in their prime

It has been suggested that I look at players’ statistics from only the primes of their careers. This is a good idea, given that both very inexperienced and very old players will “regress to the mean” in terms of their performance and, possibly, playing style. As such, I generated a sum of each player’s boxscore statistics during the modern era across only their best seasons. My definition of “best” was simple: not their worst. For each player, I found their mean seasonal winshr, as well as the standard deviation of their seasonal winshr. Any season in which a player’s winshr was greater than the mean minus one standard deviation was included in this analysis. This way, I excluded seasons in which a player was injured or relatively underused because of age or because of a minor role on their team. Chris Webber’s current and previous seasons, for example, would not be included. In this way, I hope to get at the “pure” essence of each player for an even better comparison. You will probably not be surprised to see that the diagram looks very similar to the non-peak-performance versions:

NBA players at peak performance [pdf]

A few interesting things to note, however: at their peak, Michael Jordan and Larry Bird are now among each other’s closest matches. Also, taking a macro view of the whole network, it is now easy to identify several different clusters: In bluish purple at top left, we can see defensive-minded, “dirty work” bigs, while at the bottom in blue are more scoring bigs. To their right is a reddish group of primarily scorers, while going north from there in green we see “pure point guards” and then more scoring point guards. Etc., etc. Let me know in the comments if you notice any other interesting connections or clusters.
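For anyone who wants to reproduce the “prime seasons” filter described at the top of this post, here is a minimal sketch. It keeps every season whose winshr exceeds the player’s mean minus one standard deviation. The sample standard deviation is an assumption (the post does not say which version was used), and the function name and season values are made up.

```python
from statistics import mean, stdev

def prime_seasons(winshr_by_season):
    """Keep seasons whose winshr is above (mean - 1 standard deviation)."""
    values = list(winshr_by_season.values())
    # Assumption: sample standard deviation; requires at least two seasons.
    cutoff = mean(values) - stdev(values)
    return {season: w for season, w in winshr_by_season.items() if w > cutoff}

# Made-up career: one injury-shortened season drags winshr down in 2004.
career = {"2001": 10.0, "2002": 12.0, "2003": 11.0, "2004": 3.0, "2005": 9.0}
print(sorted(prime_seasons(career)))
```

In this invented career the cutoff is roughly 5.5, so only the 2004 season is dropped before the boxscore statistics would be summed.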

## Senate partisanship history timeline discussion

A few days ago, I posted a history of partisanship in the Senate, which you should check out if you haven’t; it’s pretty fancy. Nathan Yau, author of the blog FlowingData, posted a helpful critique on his blog. I responded in the comments, and reproduce those comments below. If you have any comments or suggestions, please let me know. I am trying to optimize it, and any feedback is useful.

Thanks a lot for soliciting comments. You raise a lot of good questions here, and I thought I might try to respond to some of them. My answers aren’t the final answer; mostly I’d like to try to do an initial justification of some of my design choices:

* I wasn’t immediately sure what each visual cue represented e.g. size of state abbrev. until I reached the bottom. It might be worth making the annotation more prominent either by position, size, or color or all three.

This is a pretty good point. It may help to move the key. Mostly, I put it at the bottom to minimize its obtrusiveness.

* To me, the congress numbers don’t matter so much, but that just might be I don’t have a lot of learning on the history of American government.

The congress numbers and years are in some ways redundant, but congress scholars often refer to congresses by their number. In fact, the years are only there for those less familiar with the congress number, to give a sense of where you are in history.

* I’m wondering if there’s some way to make the labeling of the years more concise? If you just labeled with the first year of the two-year term, would it be obvious that you’re describing a two-year term? What if you took away the alternating gray background and just made it all white and then had a bar timeline-type thing on top (and bottom)?

I may be able to do without both years, since it is known that there are always two years to each congress. The gray and white bars are somewhat useful, though. It’s not labeled (it should be), but within each session, the points all have a certain left-right jitter. This jitter makes the plot easier to read, and actually conveys in a very subtle way the second dimension of the ideological scale on which each Senator is plotted. If you read more about DW-NOMINATE, you will find that the primary dimension dominates, but for certain time periods, a second dimension becomes important. I thought I would include it subtly, because it also helped with readability.

* What if you tried to use a color scheme? I mean, you have the red and blue for the reps and dems (which I think is right), but the gradient for the senate counts turns very bright pink and purple which doesn’t go too well. Then there’s the cyan, yellow, and green which doesn’t seem to have any specific significance other than each color represents something. What I mean is… is there a reason you chose those colors?

The colors chose themselves: red and blue have come to be identified with each of the parties. Green was my remaining option out of the RGB set, and I made all Southerners’ green value equal to 255. Then, every Democrat’s B value varies as a function of their party unity (the degree to which they voted with the party). The same for Republicans and Red. Thus you can read members’ party loyalty into their color. The interesting thing is, disloyal northerners look dark, even blackish, but disloyal southerners’ lack of R and B makes them increasingly only green. Thus, for example, the very disloyal Southern Democrats of the mid-20th Century can obviously stick out as very green, where other Southern dems are various shades of teal and greenish-blue. This reflects a very important shift in the history of Congress, and it’s all indicated right there, just as a function of geography and loyalty transformed to color values.
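As a rough sketch of that color encoding: green is 255 for Southerners, a Democrat’s blue channel tracks party unity, and a Republican’s red channel does the same. The linear 0 to 255 scaling, the zero green value for non-Southerners, and the function and argument names are all assumptions, since the exact mapping isn’t spelled out here.

```python
def senator_color(party, southern, unity):
    """Return an (R, G, B) tuple for one senator.

    party:    "D" or "R"
    southern: True if the senator represents a Southern state
    unity:    party unity score, assumed here to lie in [0, 1]
    """
    # Assumption: loyalty maps linearly onto the 0-255 channel range.
    level = round(255 * max(0.0, min(1.0, unity)))
    red = level if party == "R" else 0
    green = 255 if southern else 0  # assumption: non-Southerners get no green
    blue = level if party == "D" else 0
    return (red, green, blue)

print(senator_color("D", southern=True, unity=0.95))   # loyal Southern Democrat: teal
print(senator_color("D", southern=True, unity=0.10))   # disloyal Southern Democrat: nearly pure green
print(senator_color("R", southern=False, unity=0.10))  # disloyal Northern Republican: nearly black
```

Under these assumptions, the disloyal Southern Democrat ends up almost pure green while the disloyal Northerner ends up almost black, which is the contrast described above.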

* It might be worth making the annotations bigger so that you don’t have to “zoom in” to read.

Also possibly valid, although part of the reason I made them small is that my original intent was to design for print, where the poster will be about 24×36 inches, and the labels will be fairly legible.

* I think I would make the median lines a bit more prominent, but that’s just me.

Not a bad idea, but I a) don’t want them to completely dominate, and b) want to maintain legibility of the overlaid state names as much as possible. I may be able to make the medians wider, but then in a sense, one loses accuracy.

* There’s a lot of cool stuff getting represented here, and I wonder if anything might benefit as a separate graph. Would this benefit at all as a series of graphs instead of one large graphic?

Possible, except one of the things I like most about it is that it tells almost the entire story of partisanship and something called conditional party government (which relates to the density graphs at the bottom), all in one place. So it’s a very comprehensive and relatively quick way to get all of it “at a glance” if you know what to look for.

I hope this isn’t getting repetitive, because I’ve got a diagram that will blow your mind: it’s like the entire NBA in a petri dish, with all different phyla and genera of player types represented. I used the same methodology I’ve been using (with the per-minute, rather than ratio statistics), but generated the graph with fewer connections (just the single closest match) per player. As a result, there are a whole lot of isolated clusters instead of one completely interconnected network. Also, I went ahead and did 1,000 players at once, instead of the standard 250. What I got astounded me–they look like microorganisms swimming around on the microscope slide that is the NBA. I apologize for the tiny font–if you zoom in to 125%, it should be readable–but had I made the names any larger, they would have overlapped to an illegible degree.

The NBA “petri dish” diagram [pdf]
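For the curious, the “single closest match” idea can be sketched as follows: link each player to his one nearest neighbor, then see which groups of players end up connected. Euclidean distance on raw per-minute stats is an assumption on my part for this sketch (it is not necessarily the similarity measure behind the diagram), and the player names and stat lines are invented.

```python
import math

def closest_match_edges(stats):
    """Link each player to the single player with the nearest stat line."""
    edges = []
    for name, vec in stats.items():
        # Assumption: Euclidean distance stands in for the similarity measure.
        candidates = [
            (math.dist(vec, other_vec), other)
            for other, other_vec in stats.items()
            if other != name
        ]
        edges.append((name, min(candidates)[1]))
    return edges

def clusters(edges):
    """Group players into connected components of the undirected graph."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Invented (points, rebounds, assists) lines: two bigs and three guards.
stats = {
    "Big 1": (10, 12, 1), "Big 2": (11, 11, 1.5),
    "Guard 1": (15, 3, 8), "Guard 2": (14, 4, 9), "Guard 3": (16, 3, 7),
}
for group in clusters(closest_match_edges(stats)):
    print(sorted(group))
```

With only one link per player, the bigs and the guards never connect to each other, which is why this style of graph breaks into isolated clusters rather than one big network.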

I would be very interested in collectively coming up with a sort of “baller’s taxonomy,” wherein we try and identify the different clusters using some more subjective terms. I think we could come up with a better vocabulary to describe players and define playing styles. If you have any ideas, please put them in the comments, and if there is sufficient interest, I may come up with a more formalized process, in the hopes of putting together a follow-up diagram with labels.

Since I had already run the algorithm anyway (it takes a lot of cycles to do 1,000 players), I went ahead and made a completely connected version of the 1,000 player diagram. Warning: this one is pretty hard to parse.

1000 player network diagram [pdf]

Keep in mind that the search function (ctrl-f) will be really useful for these.

## NBA player similarities matrix revisited

In response to some questions at the APBRmetrics forum, I’ve put together a new NBA similarities network (Top 250 players version), wherein I use per-minute statistics, instead of my “patented” ratios method, just to see how it looks. In a lot of ways, this looks just as good or even better than the ratios version… I’m still somewhat torn, though: The ratios method, by ignoring time statistics completely, attempts to match players who, given a possession (or given an opponent with a possession), will do similar things with it, while the per-minute method does a better job of representing “substitutability.” I suppose I will let history be the judge, but I don’t think anyone loses when more pretty graphs are made:

NBA player similarities [pdf]

Another version with Extremely High Contrast Labels for Easy Reading: [pdf]