
The MVP debate, part II

We often think of players’ boxscore statistics in terms of cumulative sums or averages. These statistics tell us how prolific a player is and what one might expect from the player in a given game, but they tell us very little else about the player’s output. Consider three hypothetical players in an 82-game season: player A scores 5 points in 41 of his games and 35 points in the other 41; player B scores 20 points in each of the 82 games; and player C scores 19 points in 80 of his games and 60 in each of the remaining two. Each of these players ends the season averaging 20 ppg, and each scored a total of 1,640 points. However, there should be no question that they are very different players, even without considering non-scoring contributions. B is extremely consistent, C is pretty consistent but has rare scoring outbursts, and A is either a big threat or hardly a threat at all. (Please keep in mind that I could be doing the same with per-minute statistics; it’s just a little easier conceptually to discuss per-game stats, while making an equivalent point.) Opposing teams would need to plan differently when facing each of these three players, and their value to their own team is a function not only of their scoring average, but of their entire scoring distribution.

Since it is much easier to keep track of cumulative totals, and since the simple mean can be calculated by dividing total points (ast, reb, etc.) by total games, we have all been raised on means and sums, which are useful as far as they go but don’t tell the whole story. So, into the plethora of other “modern” statistics, I would like to add several statistics that have been with us the entire time, but hidden behind season sums and means: the standard deviation, the geometric mean, and the distribution.

The standard deviation is a summary statistic like the mean, but it measures dispersion. Essentially, it attempts to capture the typical deviation of each data point from the mean. So, players whose per-game boxscore stats vary a lot from game to game will have a higher standard deviation than will players who are more consistently close to their own mean. Whether a high or low standard deviation is a good thing is a normative question, although I tend to think that consistency (indicated by a low standard deviation) is a good thing. Bear in mind also that, typically, the greater the mean, the more room there is for variance, and thus the more potential for a larger standard deviation. Thus, another statistic, the coefficient of variation (the standard deviation divided by the mean), can be used to give an idea of variation while controlling for the magnitude of the mean.
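As a rough illustration, here is a minimal Python sketch that computes the mean, standard deviation, and coefficient of variation for the three hypothetical players above. The `summarize` helper is a name made up for this example, and using the population standard deviation (rather than the sample version) is an assumption; it is not necessarily how the figures in the graphic were computed.

```python
import statistics

# Per-game point totals for the three hypothetical players described above.
player_a = [5] * 41 + [35] * 41
player_b = [20] * 82
player_c = [19] * 80 + [60] * 2


def summarize(points):
    """Return (mean, population standard deviation, coefficient of variation)."""
    mean = statistics.fmean(points)
    sd = statistics.pstdev(points)  # assumption: population, not sample, SD
    cv = sd / mean if mean else float("nan")
    return mean, sd, cv


for name, points in [("A", player_a), ("B", player_b), ("C", player_c)]:
    mean, sd, cv = summarize(points)
    print(f"Player {name}: mean={mean:.2f} sd={sd:.2f} cv={cv:.2f}")
# Player A: mean=20.00 sd=15.00 cv=0.75
# Player B: mean=20.00 sd=0.00 cv=0.00
# Player C: mean=20.00 sd=6.32 cv=0.32
```

All three players share the same mean, but the dispersion measures separate them immediately.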

The geometric mean is similar to the arithmetic mean, in that it is a measure of centrality. However, it emphasizes consistency more than a simple arithmetic mean does, because low-output games pull it down more than high-output games pull it up. Where the arithmetic mean is the sum of the data divided by the number of data points, the geometric mean is the product of the data raised to the power of one over the number of data points (that is, the nth root of the product). Thus, in our above example, each player has the same arithmetic mean (20 ppg), but B has a geometric mean of 20, C’s is 19.54, and A’s is 13.23. According to the geometric mean, then, player A is valued almost exactly the same as player D, who scores 13 points in each of 63 games and 14 points in each of the remaining 19. Both of their geometric means are around 13.23, but player A’s arithmetic mean is 20, while player D’s is 13.23. As such, the geometric mean, especially when presented alongside the arithmetic mean, may tell us even more about a player’s output.*
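The geometric means quoted above can be checked with a short Python sketch. The `geometric_mean` function here is a hand-rolled helper for illustration; it works in log space so the product of 82 game totals does not overflow, and it assumes every value is strictly positive.

```python
import math


def geometric_mean(values):
    """n-th root of the product of values, computed via logarithms.

    Assumes every value is strictly positive (log(0) is undefined).
    """
    return math.exp(sum(math.log(v) for v in values) / len(values))


player_a = [5] * 41 + [35] * 41
player_b = [20] * 82
player_c = [19] * 80 + [60] * 2
player_d = [13] * 63 + [14] * 19

for name, points in [("A", player_a), ("B", player_b),
                     ("C", player_c), ("D", player_d)]:
    arithmetic = sum(points) / len(points)
    print(f"Player {name}: mean={arithmetic:.2f} g.mean={geometric_mean(points):.2f}")
# Player A: mean=20.00 g.mean=13.23
# Player B: mean=20.00 g.mean=20.00
# Player C: mean=20.00 g.mean=19.54
# Player D: mean=13.23 g.mean=13.23
```

For player A the calculation reduces to the square root of 5 × 35 = 175, which is where the 13.23 comes from.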

Finally, there is the entire distribution of per-game point totals. This encapsulates all of the information about a player’s production, because it is the player’s entire production. It’s not a numerical statistic, but it can be represented as a graphic, or even (theoretically) an equation. The distribution represents essentially the same thing as a histogram or bar chart of each statistic’s frequency at each level of output. In the graphic below, I display each of four players’ distributions on six different per-game statistics. This should give the viewer a very complete idea of each player’s production. I also include the summary statistics I’ve described, which individually give some information about the distribution, and taken together represent a partial but informative view of player production.
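To make the idea of a distribution concrete, here is a small Python sketch that tabulates how often each point total occurs for the hypothetical players from the earlier example. The `distribution` helper is an illustrative name, and this simple frequency table is only a stand-in for the smoothed curves in the graphic, not the method used to draw them.

```python
from collections import Counter


def distribution(points):
    """Map each distinct per-game total to the share of games in which it occurred."""
    counts = Counter(points)
    n = len(points)
    return {value: counts[value] / n for value in sorted(counts)}


player_a = [5] * 41 + [35] * 41
player_b = [20] * 82
player_c = [19] * 80 + [60] * 2

for name, points in [("A", player_a), ("B", player_b), ("C", player_c)]:
    shares = ", ".join(f"{v} pts: {share:.1%}" for v, share in distribution(points).items())
    print(f"Player {name}: {shares}")
```

Three players with identical averages produce three visibly different tables, which is exactly the information a season mean hides.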

mvpdensities.png

This graphic presents the output of four potential MVP candidates through about 60 games of this season. Note that LeBron James tops Kobe Bryant in arithmetic means across every category, and seems to be a more consistent scorer (on a per-game level, at least)… I hope you find this depiction of production useful and informative–please don’t hesitate to participate in the ongoing MVP debate (see this post).

* A note about geometric means: since a player might have zero points, or assists or blocks, etc. in any given game, there is the potential that this zero would “wipe out” their geometric mean for that statistic, making it relatively uninformative. Thus, I have replaced each instance of 0 with 0.9 — which penalizes the player for having a low figure, but maintains valuable information. This is probably not a perfect solution, but I’ve applied it consistently, so it should at least be “fair” in some sense. Let me know in the comments if there is a better way of doing this.
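The zero-replacement rule in this note can be sketched in a few lines of Python. The 0.9 value comes from the note itself; the function name and the sample block totals are made up for illustration.

```python
import math

ZERO_REPLACEMENT = 0.9  # substitute value described in the note above


def geometric_mean_with_floor(values, replacement=ZERO_REPLACEMENT):
    """Geometric mean after swapping each 0 for a small positive value.

    Without the swap, a single zero makes the whole product (and so the
    geometric mean) equal to zero.
    """
    adjusted = [replacement if v == 0 else v for v in values]
    return math.exp(sum(math.log(v) for v in adjusted) / len(adjusted))


# Hypothetical per-game block totals, including two games with zero blocks.
blocks = [2, 0, 1, 3, 0, 2]
print(round(geometric_mean_with_floor(blocks), 2))
```

Because 0.9 is below 1, each zero still drags the result down, just not all the way to zero.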

Dimesworth of difference?

Using roll call votes from the 110th Senate through the end of last year, I have constructed a network diagram based on maximum similarities between Senators’ voting records. Essentially, distances were calculated by assigning a 1 to yes votes and a 0 to no votes, then summing the differences between each pair of Senators across every roll call vote. Thus, two Senators who vote identically have a distance of 0, while two Senators who vote completely opposite ways have a distance equal to the total number of roll calls. Based on these distances, I constructed a network diagram linking each Senator to their two most-similarly-voting counterparts. I also colored each vertex according to how similar each Senator is to “all Republicans” and “all Democrats” collectively. The result revealed the highly polarized nature of the Senate: there is only a single strand linking Republicans to Democrats:

110thnetthumb.png 110th Senate Roll Call Network Diagram [pdf]
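For readers who want to try this themselves, the distance and linking steps described above might look something like the following Python sketch. The senator names and votes are invented toy data, the function names are hypothetical, and breaking ties alphabetically is an assumption. Absences and abstentions are not handled here.

```python
# Toy roll-call records: 1 = yes, 0 = no. Names and votes are invented.
votes = {
    "Sen. A": [1, 1, 0, 1, 0, 1],
    "Sen. B": [1, 1, 0, 1, 1, 1],
    "Sen. C": [0, 0, 1, 0, 1, 0],
    "Sen. D": [0, 0, 1, 0, 0, 0],
}


def distance(x, y):
    """Number of roll calls on which two voting records differ."""
    return sum(abs(a - b) for a, b in zip(x, y))


def nearest_neighbor_edges(votes, k=2):
    """Link each senator to their k closest counterparts (undirected edges)."""
    edges = set()
    for name, record in votes.items():
        others = sorted(
            (other for other in votes if other != name),
            # Assumption: ties in distance are broken alphabetically.
            key=lambda other: (distance(record, votes[other]), other),
        )
        for other in others[:k]:
            edges.add(tuple(sorted((name, other))))
    return edges


print(sorted(nearest_neighbor_edges(votes, k=2)))
print(sorted(nearest_neighbor_edges(votes, k=1)))
```

With `k=2` the two toy blocs are joined by cross-bloc links; with `k=1` they fall apart into isolated pairs, which is the same effect as cutting the diagram down to each Senator’s single closest match.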

I then decided to reduce the number of connections to only the single closest match for each Senator, and found something interesting that you will hear only rarely from the media: Senators Clinton and Obama are each other’s closest match, based, at least, on roll call votes in the 110th Senate through the end of 2007. This would seem to indicate that the wide disparities perceived between them in the eyes of the media and the public have little to do with actual policy or ideological divides, and that personality and framing (and possibly demographics) instead make up the bulk of voting preferences in many Americans’ minds.

110thisothumb.png 110th Senate Roll Call Isolated Networks [pdf]

I was aware, to some extent, of the constructed, rather than actual, nature of the differences between the two Democratic competitors, but to see the roll call evidence fall out so starkly was surprising.

MLB Batter network diagram by statistical proximity

The next diagram in this series covers MLB batters from 1955-2007 (because the modern set of statistics has not changed since the 1955 season). I think these statistics lend themselves less well to this sort of analysis, but it may be interesting to you baseball enthusiasts out there.

batnetthumb.png Batter Statistical Proximity [pdf]

NBA season network diagram

It was suggested that I compare players on single-season data, rather than career sum data, both as a validity test and to gain other insight. It goes without saying that players’ styles change over their careers: often, scorers become less effective and try to do other things well. Sometimes (as with Jordan, for example), we see players add dimensions to their game over time. So, I present yet another network diagram, one which illustrates the changing nature of each player. A few notes: this set is somewhat scorer-heavy, because of the way I generated the list of best seasons (using a Euclidean distance metric). Also, when looking at this, it helps to keep in mind that this is a two-dimensional rendering of a high-dimensional network. Unless players are actually connected, visual proximity doesn’t necessarily mean anything, although it may not mean nothing. It would appear, given the degree to which players’ seasons cluster together, that the proximity algorithm functions fairly well.

NBA Seasons Proximity Network [PDF]

Senate partisanship history timeline discussion

A few days ago, I posted a history of partisanship in the Senate, which you should check out if you haven’t; it’s pretty fancy. Nathan Yau, author of the blog FlowingData, posted a helpful critique on his blog. I responded in the comments, and reproduce those comments below. If you have any comments or suggestions, please let me know. I am trying to optimize it, and any feedback is useful.

Thanks a lot for soliciting comments. You raise a lot of good questions here, and I thought I might try to respond to some of them. My answers aren’t the final answer; mostly I’d like to try to give an initial justification of some of my design choices:

* I wasn’t immediately sure what each visual cue represented e.g. size of state abbrev. until I reached the bottom. It might be worth making the annotation more prominent either by position, size, or color or all three.

This is a pretty good point. It may help to move the key. Mostly, I put it at the bottom to minimize its obtrusiveness.

* To me, the congress numbers don’t matter so much, but that just might be I don’t have a lot of learning on the history of American government.

The congress numbers and years are in some ways redundant, but congress scholars often refer to congresses by their number. In fact, the years are only there for those less familiar with the congress number, to give a sense of where you are in history.

* I’m wondering if there’s some way to make the labeling of the years more concise? If you just labeled with the first year of the two-year term, would it be obvious that you’re describing a two-year term? What if you took away the alternating gray background and just made it all white and then had a bar timeline-type thing on top (and bottom)?

I may be able to do without both years, since it is known that there are always two years to each congress. The gray and white bars are somewhat useful, though. It’s not labeled (it should be), but within each session, the points all have a certain left-right jitter. This jitter makes the plot easier to read, and actually conveys in a very subtle way the second dimension of the ideological scale on which each Senator is plotted. If you read more about DW-NOMINATE, you will find that the primary dimension dominates, but for certain time periods, a second dimension becomes important. I thought I would include it subtly, because it also helped with readability.

* What if you tried to use a color scheme? I mean, you have the red and blue for the reps and dems (which I think is right), but the gradient for the senate counts turns very bright pink and purple which doesn’t go too well. Then there’s the cyan, yellow, and green which doesn’t seem to have any specific significance other than each color represents something. What I mean is… is there a reason you chose those colors?

The colors chose themselves: red and blue have come to be identified with each of the parties. Green was my remaining option out of the RGB set, and I made all Southerners’ green value equal to 255. Then, every Democrat’s B value varies as a function of their party unity (the degree to which they voted with the party). The same for Republicans and Red. Thus you can read members’ party loyalty into their color. The interesting thing is, disloyal northerners look dark, even blackish, but disloyal southerners’ lack of R and B makes them increasingly only green. Thus, for example, the very disloyal Southern Democrats of the mid-20th Century can obviously stick out as very green, where other Southern dems are various shades of teal and greenish-blue. This reflects a very important shift in the history of Congress, and it’s all indicated right there, just as a function of geography and loyalty transformed to color values.
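A minimal Python sketch of this color scheme follows. The linear scaling from party unity to a 0-255 channel value is an assumption (the exact function isn’t given here), as is setting the green channel to 0 for non-Southerners; the function name and inputs are hypothetical.

```python
def senator_color(party, southern, unity):
    """Return an (R, G, B) tuple for one senator.

    party    -- "D" or "R"
    southern -- True if the senator represents a Southern state
    unity    -- party-unity score in [0, 1] (share of votes with the party)

    Assumptions: unity maps linearly onto 0-255, and non-Southerners get
    a green value of 0.
    """
    level = round(255 * max(0.0, min(1.0, unity)))
    red = level if party == "R" else 0
    blue = level if party == "D" else 0
    green = 255 if southern else 0
    return (red, green, blue)


print(senator_color("D", southern=False, unity=1.0))  # loyal Northern Democrat
print(senator_color("D", southern=True, unity=1.0))   # loyal Southern Democrat
print(senator_color("D", southern=True, unity=0.2))   # disloyal Southern Democrat
print(senator_color("R", southern=False, unity=0.2))  # disloyal Northern Republican
```

Under these assumptions a loyal Southern Democrat comes out blue-green, a disloyal one drifts toward pure green, and a disloyal Northerner of either party ends up nearly black, matching the pattern described above.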

* It might be worth making the annotations bigger so that you don’t have to “zoom in” to read.

Also possibly valid, although part of the reason I made them small is that my original intent was to design for print, where the poster will be about 24×36 inches, and the labels will be fairly legible.

* I think I would make the median lines a bit more prominent, but that’s just me.

Not a bad idea, but I a) don’t want them to completely dominate, and b) want to maintain legibility of the overlaid state names as much as possible. I may be able to make the medians wider, but then in a sense, one loses accuracy.

* There’s a lot of cool stuff getting represented here, and I wonder if anything might benefit as a separate graph. Would this benefit at all as a series of graphs instead of one large graphic?

Possible, except one of the things I like most about it is that it tells almost the entire story of partisanship and something called conditional party government (which relates to the density graphs at the bottom), all in one place. So it’s a very comprehensive and relatively quick way to get all of it “at a glance” if you know what to look for.