Regarding the previously mentioned People’s Statistic Project (in which you should participate, if you have not already): The gracious gentleman who runs the 3 Shades of Blue blog has seen fit to interview me, and the transcript is available at that site. As you can tell from reading it, I have a lot to say about nearly every subject…
I have been honing the Winshares formula into a finely-tuned machine, and will unveil it shortly in this space; stay tuned. I will also see if there is anything interesting to be learned from the People’s Statistic returns (there should be…), and post on that, as well. For some reason, everything seems busier these days, even though classes are over.
Categories: basketball · nba · statistics
Categories: Uncategorized
To begin, here is my pick/prediction for the 2008 NBA MVP award: Chris Paul of the New Orleans Hornets. Second most valuable is Kobe Bryant, followed by LeBron James and Paul Pierce. How did I decide this? Read on…
I have discussed the concept of Winshares previously in this space, and I believe that this measure is the most parsimonious and theoretically satisfying way to estimate player value. If you are unfamiliar with the construction, here is the formula:
- valuable contributions = pts + as*2 + tr + st + bk - to
- winshares = (valuable contributions / team valuable contributions) * team wins
The very simple motivating theory is that each player is responsible for some fraction of his team’s success (and here I define success as winning, plain and simple–value is a separate concept from quality or talent, and value in athletics is commonly gauged by game outcomes and the contribution of individuals thereto). The better the player doing the contributing, the more successful the team, and so contributions should be weighted by team success to reward those players whose efforts result in winning.
Picture a team with one player who contributes substantially more than his teammates (say, Minnesota with Al Jefferson, or Cleveland with LeBron James). It stands to reason that win or lose, that player deserves a large share of the credit for that team’s outcomes. Now picture a team for which valuable contributions are more evenly made (say, Chicago, Sacramento, or Boston). It similarly stands to reason that credit for the success of those teams ought to be more evenly attributed to the several players who contribute.
This means that a great player doing all the work for an otherwise very poor team should be worth about the same amount, in terms of wins, as a great player doing a smaller part of the work for an otherwise very good team. This makes sense: both are great players, so both should be able to generate similar levels of success. LeBron James should be approximately as valuable as Kevin Garnett, since although the quality of their teammates is different, so is the amount they are required to contribute to their teams’ success.
So this is how I arrived at my formulation of player value: essentially add up all the good things a player has done for his team, and divide that by the total number of good things his team did. Multiply this percentage by the number of team wins, and there you have it–a per-player number of Winshares.
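For anyone who wants to try the arithmetic themselves, here is a minimal sketch of the formula in Python. The field names, the two-player roster, and its totals are made up for illustration, and I am assuming that "team valuable contributions" is simply the sum over the team's players.

```python
from dataclasses import dataclass


@dataclass
class BoxLine:
    """Season box-score totals for one player (field names are illustrative)."""
    pts: int  # points
    ast: int  # assists ("as" in the formula)
    trb: int  # total rebounds ("tr")
    stl: int  # steals ("st")
    blk: int  # blocks ("bk")
    tov: int  # turnovers ("to")


def valuable_contributions(line: BoxLine) -> int:
    # valuable contributions = pts + as*2 + tr + st + bk - to
    return line.pts + 2 * line.ast + line.trb + line.stl + line.blk - line.tov


def winshares(roster: dict[str, BoxLine], team_wins: int) -> dict[str, float]:
    vc = {name: valuable_contributions(line) for name, line in roster.items()}
    team_vc = sum(vc.values())  # assumption: team total = sum over its players
    # winshares = (valuable contributions / team valuable contributions) * team wins
    return {name: v / team_vc * team_wins for name, v in vc.items()}


# Hypothetical totals for a two-player "team" that won 54 games.
roster = {
    "star": BoxLine(pts=2000, ast=500, trb=400, stl=100, blk=50, tov=200),
    "sidekick": BoxLine(pts=1000, ast=200, trb=600, stl=50, blk=100, tov=100),
}
shares = winshares(roster, team_wins=54)
print(shares)
```

By construction the Winshares on a team add up to that team's win total, which is a handy sanity check on any spreadsheet version of this.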
Now, there are several downsides to this operationalization. It takes no account of intangibles, or anything besides basic boxscore statistics. Kevin Garnett’s incredible intensity and defensive leadership don’t count in this formulation (except as they are expressed in the boxscore–no doubt they contributed to team wins), so Paul Pierce comes through as slightly more valuable. Keep in mind, however, that this (Pierce for MVP) is what Garnett himself has told us all year long, and also keep in mind that this is not a per-minute or per-possession measure. Garnett played 2329 minutes to Pierce’s 2873, a substantial difference. Garnett had less time to add wins, even though he may have been more valuable per-minute than Pierce. However, for the MVP award, the focus ought to be on total value over the season, not player quality or efficiency. I am as big a Garnett fan as anyone, but no one would argue that the injured Gilbert Arenas has been more valuable to the Wizards this year than Jamison or Butler, even if he is more valuable in some per-minute sense (though this is questionable).
The other problem with Winshares is that it does not take into account the specific possessions, minutes or games in which the valuable contributions came. I’m working on this, but in the meantime, you’ll want to use something like plus/minus figures if this is what you’re looking for. This disadvantage is most marked in attempting to measure the value of players traded during the season, but let’s face it–it is unlikely that an MVP-level player will be traded in the midst of an MVP-type season, and it’s even more unlikely that a player who was traded in the midst of the season would be in the running for MVP.
Any questions or critiques on this methodology are welcome; please feel free to leave a comment. I submit, though, that as far as elegance, parsimony, accessibility, and theoretical validity go, Winshares as measured here are an optimal conceptualization of value.
After all that, here is the payoff: I’ve constructed a visualization depicting each player’s value in Winshares: their percent of valuable contributions is depicted on the vertical axis, and team success along the horizontal. Multiplying these two figures together results in Winshares, and each player is listed with their Winshare value and represented as a rectangle, the area of which is exactly proportional to his value. (Color is derived from my favorite way to capture playing type–the RGB scorer/perimeter/interior quasi-trichotomy.)
In a new twist, I’ve got it set up in a Google-Maps-style interface, so you can get as big a picture or as much detail as you’d like. Enjoy! (You’ll probably want to zoom in when the page first loads…)
Winshare Area Graph:

If that’s not the coolest, most straightforward way to envision basketball value, I don’t know what is!
Categories: analysis · basketball · graphics · infovis · metrics · nba · sports
As you might have heard, the Sixth Man of the Year is Manu Ginobili of the San Antonio Spurs. Apparently, it was pretty much unanimous, too. It is interesting–from an “institutions matter” standpoint, if you wanted a player on your team to win Sixth Man of the Year every year, you’d just pick your best player, make him sit for the first minute, and then sub him in and play him for regular starter playing time. This is, in effect, what the Spurs are doing with Ginobili, and given that he’s their second (or possibly third) best player, he’s essentially a shoo-in for the award. I thought it might be interesting to look at best sixth men according to my favorite value metric, Winshares, and so here is a plot of percentage of games started versus Winshares, for all players who started less than 100% of the games in which they played:

Ginobili sticks out like a sore thumb. In fact, the only player higher on the Winshares dimension in that graph is LeBron James, who started merely 74 of his 75 games played. A reasonable criterion for qualifying as a sixth man, say, starting less than half of your games, quickly eliminates James, leaving Ginobili as the no-brainer choice. I’ll leave you with a Winshares ranking of the top ten players who started less than half of their games:
| Player | Team | startpct | Winshares |
|---|---|---|---|
| ginobili,manu | san | 0.311 | 9.25 |
| terry,jason | dal | 0.415 | 6.30 |
| barbosa,leandro | pho | 0.134 | 5.97 |
| diaw,boris | pho | 0.244 | 5.90 |
| scola,luis | hou | 0.476 | 5.72 |
| outlaw,travis | por | 0.073 | 5.17 |
| millsap,paul | uta | 0.024 | 5.10 |
| maxiell,jason | det | 0.085 | 5.09 |
| posey,james | bos | 0.027 | 5.07 |
| turiaf,ronny | lal | 0.269 | 4.44 |
Categories: analysis · basketball · graphics · nba · sports
I wasn’t invited to the TrueHoop Stat Geek Smackdown (see also), but I figure I’m just as capable of making wild, semi-empirically based predictions as anyone else, so I have done so. I’ll try to keep this up, round-by-round, and we’ll see how I do against more well-known Stat Geeks. Perhaps if I do well, someday I will be a TrueHoop-acknowledged geek…
Using just True Winning Percentages and Bernoulli probabilities, I’ve calculated the probabilities of each possible series outcome, and then normalized to sum to one. (See my spreadsheet) For the first round, I have:
BOS in 4
CLE in 7
ORL in 6
DET in 5
LAL in 6
HOU in 7
SAN in 7
NOR in 6
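The spreadsheet has the real numbers, but the general idea of turning a per-game win probability into series outcomes can be sketched as follows. This is only an illustration under my own simplifying assumptions: every game is an independent Bernoulli trial with the same probability `p` (no home-court effect), and `p` itself is just a made-up input here, since how it is derived from two True Winning Percentages is not covered in this post.

```python
from math import comb


def series_outcomes(p: float, wins_needed: int = 4) -> dict[tuple[str, int], float]:
    """Probability of each (winner, number of games) outcome in a best-of-7.

    p is team_a's assumed chance of winning any single game.
    """
    outcomes = {}
    for games in range(wins_needed, 2 * wins_needed):
        losses = games - wins_needed
        # The winner must take the final game; the earlier games can fall in any order.
        ways = comb(games - 1, wins_needed - 1)
        outcomes[("team_a", games)] = ways * p**wins_needed * (1 - p) ** losses
        outcomes[("team_b", games)] = ways * (1 - p) ** wins_needed * p**losses
    total = sum(outcomes.values())
    # Normalize so the outcomes sum to one (they already should, up to rounding).
    return {k: v / total for k, v in outcomes.items()}


probs = series_outcomes(0.6)  # hypothetical single-game probability
series_win = sum(v for (team, _), v in probs.items() if team == "team_a")
for outcome, prob in sorted(probs.items()):
    print(outcome, round(prob, 4))
print("team_a wins series:", round(series_win, 4))
```

A pick like "BOS in 4" is then just the single most probable cell in that table for the matchup.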
Probabilities, as sparklines:

Categories: Uncategorized
An alert reader pointed out that some of the predictions I made earlier were total nonsense. The first group, where I pick series winners and probabilities of winning, I stand by, but the second group, where I place odds on winning it all, those are not so good.
So I redid them, and actually tried this time, instead of being lazy, and here is what I came up with:
| team | p(title) |
|---|---|
| bos | 0.272126 |
| det | 0.116574 |
| lal | 0.091853 |
| san | 0.082671 |
| nor | 0.078856 |
| hou | 0.071202 |
| pho | 0.062373 |
| orl | 0.054235 |
| uta | 0.046801 |
| dal | 0.03739 |
| den | 0.030089 |
| cle | 0.018941 |
| was | 0.013058 |
| tor | 0.010618 |
| phi | 0.009039 |
| atl | 0.004175 |
These make a little bit more sense… the Western Conference teams hurt each other’s odds a little bit, and we’re left with Detroit and Boston taking advantage of a somewhat easier time of it in the East. The spreadsheet I used can be seen here. Apologies for doing such a shoddy job the first time around. I’m sticking with these as my official word, and I still don’t give Utah much of a chance.
Categories: basketball · nba · sports · statistics
Since the NBA playoffs are starting today, I figured I’d throw out my own predictions. I used something I call “True Winning Percentage” (which I will explain some other time, but essentially, it takes into account opponents’ records, and opponents’ opponents’ records, and opponents’ opponents’ opponents’ records, and so on, and determines what a team’s “true” winning percentage would be if they played each other team infinitely many times, at least, theoretically), to calculate odds for each first round matchup, and then compared those projected winners and so on. Here are my predictions, along with my estimated probabilities of each team I pick to win actually winning the series:
| Likely victor | Probability of likely victor winning series |
|---|---|
| BOS>ATL | 0.8509 |
| CLE>WAS | 0.5313 |
| ORL>TOR | 0.6414 |
| DET>PHI | 0.7266 |
| LAL>DEN | 0.6114 |
| HOU>UTA | 0.543 |
| SAN>PHO | 0.5303 |
| NOR>DAL | 0.5753 |
| | |
| BOS>CLE | 0.7829 |
| DET>ORL | 0.5894 |
| LAL>HOU | 0.5257 |
| SAN>NOR | 0.513 |
| | |
| BOS>DET | 0.6384 |
| SAN>LAL | 0.5004 |
| | |
| BOS>SAN | 0.6198 |
Note that the matchup between San Antonio and LA is essentially a dead heat. However, Boston has essentially identical odds against both, so the final outcome is not so much in doubt.
I also calculated overall odds of each playoff team winning the title, given their probability of winning against each other playoff team. These odds are as follows:
| Team | True Win% | Prob of championship |
|---|---|---|
| bos | 0.8106 | 0.7753 |
| sa | 0.7242 | 0.0496 |
| lal | 0.7239 | 0.0491 |
| no | 0.7137 | 0.0352 |
| det | 0.708 | 0.0292 |
| hou | 0.7029 | 0.0247 |
| pho | 0.6993 | 0.0219 |
| uta | 0.6657 | 0.0071 |
| dal | 0.6479 | 0.0039 |
| orl | 0.6281 | 0.002 |
| den | 0.625 | 0.0018 |
| cle | 0.5427 | 9E-05 |
| was | 0.5115 | 3E-05 |
| tor | 0.4857 | 9E-06 |
| phi | 0.477 | 6E-06 |
| atl | 0.4285 | 8E-07 |
As you can see, Boston has better than 3:1 odds of winning before any basketball has even been played. They are just that much better (even taking into account that they played most of their games against weaker competition) than everyone else. Incidentally, to see the effect of calculating True Win%, notice that Detroit, which had the second best win-loss record, falls to fifth when you take into account their opponents’ (and so on…) success. To conclude, given that the odds of any other team winning come in at 22.47%, I feel pretty safe picking the Celtics this year.
Notice that I don’t predict much success for Utah, contrary to several other prognostications that have them performing very well in the playoffs. I understand that these other folks are using scoring efficiencies and the like, but I’m sticking to my guns. Consider this my bold prediction: Houston will actually beat Utah in the first round (although it will be close).
Categories: basketball · nba · sports · statistics
I don’t want this blog to deal very much with the work of other people, and I am not interested generally in being a statistical or graphical critic, but some things are just abhorrent to a true Arbitrarian, and I feel compelled to discuss them.
<rant>
It is a fact of life in sports journalism in general that people just make things up all of the time. This is neither the time nor place to discuss this fully, but people make money when consumers watch their programs, listen to their shows, and read their blogs. Thus, there is an incentive to publish anything and everything you can think of, regardless of merit, just to get it in front of eyeballs. This is especially true of particularly strong or controversial opinions–more discussion and debate just generates greater attention and revenue. This is one of the reasons that so many shows (not just sports shows) offer an adversarial format in which each of several individual personalities adopts a certain position and attempts to justify it. It is not that they care about their position, or even necessarily believe it; it is just that there are always at least two positions that could be taken on any subjective matter, and arguments to be made. Television is not the only guilty party, however, in this shouting match.
We are in the midst now of a legitimate debate about who might be the most deserving of the NBA regular season MVP award. Depending on the criteria you use, certain players may rise to the top of your consideration (I will post on this in the very near future, with my own contention). The NBA loves this, just as the NCAA loves the BCS rankings and bowl determinations — the more ambiguous and arbitrary the process, the more discussion there will be, the more profit for the leagues and the media that report on the leagues. If, for example, the MVP award was automatically awarded to the player with the highest pts/gp on the team with the best record (which some may contend is a valid way of determining it), there might be some drama during the season if the race is close, but generally, there would be nothing to talk about, because the numbers alone would have it. Instead, the NBA holds a vote, so that any number of idiosyncratic factors may go into the determination of who’s “Most Valuable.” For the Arbitrarian, this is obnoxious, but to many, this is profitable.
I am always interested in reading a well-reasoned argument for some player as most valuable. I enjoy seeing assumptions and criteria stated, and a rational argument of some sort that those criteria are appropriate, and that one player or another is the best choice, given these criteria. What bothers me is when no real criteria are stipulated. What bothers me more is when “statistics” are used to offer the illusion of objectivity, while actually only covering up subjectivity.
A particularly egregious example of this, and the impetus for my writing this, can be found here: http://www.realgm.com/src_goaltending/136/20080416/finding_the_true_mvp/ (I hesitate to post the link, as traffic merely incentivizes this type of article). The author begins by ostensibly defining his criteria: his analysis will be “One that looks strictly at what advanced stats can tell us about which player actually has the most value to his team. Emphasis on ‘his team.’” He then proceeds to use one statistic, Pythagorean Wins Differential, which seems a reasonable choice, although he does not explicitly justify it, to construct a top five list.
However, this list transparently does not satisfy the author. He then performs a number of completely arbitrary transformations on the data, multiplying win differential by percentage of minutes played (?), multiplying by team winning percentage (which I thought would have already been considered in the win difference statistic), and adding (!) PER (!!!). None of these transformations make a lot of sense to me; at the very least, they are not fully explained by the author. Why don’t we also add the square root of each player’s blocks/home game, because, you know, we need to consider defense on the home court? The obvious reason for the author’s machinations is that he had in his mind a single player, or group of players, who should be deserving of MVP, and his preferred statistic (the Pythagorean Win Differential) did not confirm his initial preferences. So, he added and subtracted and multiplied “advanced” statistics, until he found some transformation that supported his initial (unstated) opinion. Then, of course, he makes it official by calling his choice a 100 out of 100, and scaling everyone down from there. This scaling helps to make the reader forget all the nonsense that came before, and effectively decimates the convoluted units he had arrived at (Playing-time-and-Team-Success-weighted-Pythagorean Win Differential… Now with PER!). All we really need to know, of course, is that LeBron had a perfect season. “Remember, this is a measure of a player’s value to his particular team.” Or something like that. Remember to tune in next time, for “a look at the MVP scores for each individual team.”
I have nothing personal against this author, but his is an example of a larger phenomenon: subjectivity masquerading as objectivity. It is my belief that honest analysis is marked by commitment to methodology and consistency in rhetoric, and as an Arbitrarian, I hate to see the arbitrariness of so much that gets published.
</rant>
Update: Apparently, Slate has a somewhat similar take.
Categories: basketball · nba · theory
Bill Simmons wonders if having what he calls a “cooler” on a team helps that team win. What he means by cooler, I believe, is someone who is called upon to make free throws for a team at the end of games, and who doesn’t miss those late-game free throws. Unfortunately, I don’t really have the data to focus exclusively on late game situations, but if we assume that teams expect their best overall free throw shooters to be their cooler, we can look at whether good team free throw shooting in general, or having an exceptionally good free throw shooter (a Cooler), helps teams win games.
First, using data from 1979-2006, I regressed team wins on offensive efficiency (pts/pos) and defensive efficiency (opts/opos). As it turns out, every extra point scored per 100 possessions is associated with about three and a quarter more wins, and every extra point allowed per 100 opponent possessions with about three and a quarter more losses. Offensive and defensive efficiency account for 93.9% of variation in team wins over this period, which is extremely high.
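If you want to see the mechanics of that kind of regression, here is a bare-bones ordinary least squares sketch in plain Python. To be clear about what is assumed: the six "teams" below are synthetic, and their win totals are constructed to follow a slope of exactly 3.25 wins per point of efficiency, so the fit simply recovers what I put in. It is not the 1979-2006 data, and the `ols` helper is my own illustration rather than the software used for the actual analysis.

```python
def ols(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0, *r] for r in X]  # prepend a constant for the intercept
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * c for x, c in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Synthetic (offensive, defensive) efficiencies in points per 100 possessions.
teams = [(102.0, 106.0), (105.0, 104.0), (108.0, 103.0),
         (110.0, 108.0), (104.0, 109.0), (107.0, 101.0)]
# Synthetic wins built from an assumed 3.25 wins per point of efficiency margin.
wins = [41 + 3.25 * (off - deff) for off, deff in teams]

intercept, b_off, b_def = ols(teams, wins)
print(round(b_off, 2), round(b_def, 2))
```

Adding team free throw percentage to the model amounts to appending a third number to each tuple in `teams` and checking whether its coefficient is distinguishable from zero, which requires standard errors that this sketch does not compute.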
Now, add to the model team free throw percentage: You get nothing. There is absolutely no significant effect on team success from team free throw percentage when controlling for efficiency. What about best individual player free throw percentage (that is, the ft% of the player who would be the Cooler)? Still nothing. In fact, here is a scatterplot of team max ft% (for players with at least the league median number of attempts) versus team winning percentage:

For the uninitiated, that scatter leads us to believe that there is no relationship at all, even without controlling for efficiency. It’s a blob. Sorry, Coolers.
Categories: analysis · basketball · graphics · nba · sports · statistics · theory
If you haven’t seen them already, check out these amazing graphics by Stefanie Posavec. It looks like she spent an incredible amount of time hand-coding, and possibly hand-counting (!) paragraphs and words in the works she covered. I set out to try to replicate some of her styles, and while some are proving easier than others, I’ve put together my own set of sentence diagrams, centering around the State of the Union Addresses given by President Bush over his eight years in office. Please note that when I say “most frequently used uncommon words,” I mean that I eliminated from the count all words in this handy list of the 500 most commonly used words in English (which may or may not be accurate, but it’s good enough for me), as well as the words “America”, “American”, and “Nation”, because those are apparently pretty high usage words for presidents. Let me know what you think in the comments, as well as any critiques or suggestions for future analysis you might have.
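The "uncommon words" count described above can be sketched in a few lines of Python. The function name, the tokenizing regex, and the tiny stand-in word list are all my own placeholders: the real filter is the linked 500-word list, which is not reproduced here, and the sample sentence is invented rather than taken from any address.

```python
import re
from collections import Counter


def top_uncommon_words(text, common_words, extra_stopwords=(), n=10):
    """Count words after dropping common words and any extra exclusions."""
    stop = {w.lower() for w in common_words} | {w.lower() for w in extra_stopwords}
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(w for w in words if w not in stop)
    return counts.most_common(n)


# Stand-in for the 500-word list, plus the three extra exclusions from the post.
common = {"the", "must", "and", "for"}
extra = ("America", "American", "Nation")
sample = "The nation must secure freedom; freedom and security for America."
top = top_uncommon_words(sample, common, extra, n=3)
print(top)
```

Matching is case-insensitive, so "Nation" and "nation" are excluded alike.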

Bush’s Legacy [png]
Categories: graphics
Yesterday, I linked to projections I’d made for the Championship game between Memphis and Kansas. I predicted a Kansas win, 71.07 to 70.74. On the plus side, these projections round to a tie, which we had going in to the first overtime. Also on the plus side, the projected sum of points (141.81) is very close to the actual sum of points (143), and I did have Kansas winning. On the negative side, I needed an overtime (which I did not actually predict) to get point totals anywhere near my prediction, and I had it ending a little closer than it was. We did have a buzzer beater, though, which I am willing to say I predicted. Actually, having reasons to root for both teams, I was mostly pulling hard for a final score of 71 to 70 or something like that, and I’m not too disappointed. Here is how I did on the rest of my projections… none too well, unsurprisingly:
http://spreadsheets.google.com/pub?key=pjtolzxemBV6kYuIHIE9ZGA
Categories: analysis · basketball · ncaa · sports · statistics
Regarding tonight’s game: The road to the NCAA Championship, brought to you by ESPN. For the record, I have Kansas winning 71.07 to 70.74. I suppose this is just a statistical way of saying that it’s a toss-up, but I’m sticking to it. Enjoy the game!
Categories: basketball · graphics · infovis · ncaa · sports
Categories: analysis · graphics · infovis · links · metrics · nba · ncaa · sports · statistics
By now, you’ve probably seen the NYT article depicting MLB managerial styles as Chernoff faces. Well, I, of course, could not let it go by without producing one of my own: NBA Chernoff Faces! There is a key included in the graphic describing what statistical factors impact which facial features, and I tried to assign these somewhat intuitively (which was hard: I actually just managed to make better passers have bigger eyes, high scorers/high minutes guys have bigger heads, well-rounded players have, well, rounded heads, good shooters have bigger smiles, cheesy things like that). The colors are coded just like I always do: red means more scoring, green implies more passing/steals and blue indicates more rebounding/blocks. I thought I’d start with just 25 players to give you all an idea of how it looks; I might bust out some huge 225-player monstrosity in the near future. Stay tuned…
Faces of the NBA (click above for full size) [PNG]
PS. The best part is that Jason Kidd looks seriously wigged out. I think these should replace the official portraits always associated with a player when they appear on TV or when you go to their player page on ESPN, etc.
PPS. I had to edit the code to allow me to change the font size and make use of color. If you would like the new (R) code, leave a comment, or email me.
Categories: basketball · graphics · infovis · nba · sports · statistics
Here is my contribution to a growing “literature” on Mechanical Turk color-naming: Dolores Labs paid MechaTurks to apply labels to 10,000 color swatches, and offer a cool color explorer (hat tip to Infosthetics). They generously made the data available, and some public-minded soul cleaned up the data. A Mr. Wattenberg took the first shot at aesthetic presentation, and as FlowingData has noted, his version is an improvement. Neoformix produced an even more kicked-up version. Thus concludes my lit review…
I took the cleaned-up data and narrowed it down to those color names applied at least twice. For each unique label, I found the mean RGB values, and estimated a distance matrix based on each colorname’s characteristics. This distance matrix I then fed into the two newest ways of visualizing the Dolores Labs Mechanical Turk Color Data (DLMTCD): network and cluster diagrams!

DLMTCD cluster diagram [pdf]

Network with white background, different algorithm [pdf]

DLMTCD network diagram [pdf]
The size of the vertices is a function of the number of times that color label is applied. Note that the network diagram employs transparency for easy reading, so the color you see is not exactly the color presented to the MTs. The nice thing about the pdf format is that you can zoom in and out and pan around as much as you want, and ctrl-f allows you to find any term in the network. Let me know what you think in the comments, and many thanks to Dolores Labs.
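For the curious, the data-preparation step described earlier (keep labels applied at least twice, average their RGB values, build a distance matrix) can be sketched like this. I am assuming plain Euclidean distance between mean RGB triples, which may not be the exact metric used for the diagrams, and the function names and five sample rows are invented for illustration.

```python
from collections import defaultdict
from math import dist


def mean_rgb_by_label(rows, min_count=2):
    """Average the RGB triples for each label applied at least min_count times."""
    groups = defaultdict(list)
    for label, rgb in rows:
        groups[label].append(rgb)
    return {
        label: tuple(sum(channel) / len(values) for channel in zip(*values))
        for label, values in groups.items()
        if len(values) >= min_count
    }


def distance_matrix(means):
    """Pairwise Euclidean distances between labels' mean colors (assumed metric)."""
    labels = sorted(means)
    return labels, [[dist(means[a], means[b]) for b in labels] for a in labels]


# Invented (label, (r, g, b)) rows standing in for the cleaned-up data.
rows = [
    ("red", (255, 0, 0)), ("red", (245, 10, 0)),
    ("blue", (0, 0, 255)), ("blue", (0, 10, 245)),
    ("teal", (0, 128, 128)),  # applied only once, so it is dropped
]
means = mean_rgb_by_label(rows)
labels, matrix = distance_matrix(means)
print(labels)
print(matrix)
```

A matrix like this is the kind of input that hierarchical clustering and network layout routines generally accept.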
Categories: graphics · infovis
Tagged: color, diagrams, graphics, matching, network, proximity
A Mr. Thorpe has written an article on ESPN (Insider, but it’s a free preview for us plebeians) entitled, “Rookie Watch: Which veterans should the rookie class study?” The article is a reasonably interesting read, but largely consists of the author recommending each of 22 rookies attempt to add dimensions to their game that are possessed by 22 veteran players. For example, Mr. Thorpe suggests that Kevin Durant adopt some of Allen Iverson’s “fire”, and that Thaddeus Young study Kevin Garnett’s “intensity”… I am not sure, however, why Garnett’s “intensity” would be better for Young than Iverson’s “fire.” Which brings me to my central critique of the article, and many others like it: the comparisons are entirely subjective and appear to be made in an arbitrary or convenient way. Of course, this is where statistics might help.
Using the same methodology as previously, I’ve put together a similarity network for this year’s NBA season, with all of Mr. Thorpe’s rookies (except for Greg Oden, who has not played) highlighted. I cannot advise any of these players which specific veterans to emulate, except to recommend they learn to score like Michael Jordan, pass like John Stockton, rebound like Dennis Rodman and defend like Hakeem Olajuwon. Instead, we can learn which non-rookies in the league the rookies most resemble statistically, which may in itself be informative. Al Horford, for example, somewhat similar to Anderson Varejao, Zydrunas Ilgauskas, and Drew Gooden, might be a good fit in Cleveland someday. Kevin Durant, proximate to Dirk Nowitzki, might do well to learn from the flaws in Nowitzki’s game, that he might obviate them. Joakim Noah seems to have already found his niche in the ecosystem along with Dikembe Mutombo and Ben Wallace.
2007-08 NBA similarities, with rookies highlighted [pdf]
Categories: analysis · basketball · graphics · infovis · nba · statistics
Which better applies to your favorite team? Using NCAA basketball data collected by Facebook, I’ve thrown together a scatterplot of the teams which elicit most passion (measured by number of opinions expressed), contrasted with the favorability with which each team is viewed. Unsurprisingly, several of the larger state schools rank among the top in terms of number of opinions expressed, and just as obviously, Duke elicits the greatest number of opinions. Princeton, Yale and Harvard all rank toward the bottom in terms of favorability, although this is likely not due to their fearsome basketball reputations. I have to feel sorry for the Bethune-Cookman Wildcats, who appear to have a small, but hateful, following. The most beloved team appears to be the Wake Forest Demon Deacons, followed closely by St. John’s Red Storm. Between Wake, NC State (also well-liked), UNC and Duke, North Carolina is well represented at the extremes. Enough prologue; here’s the graphic:
NCAA Men’s Basketball Fans and Haters [pdf]
And, for those interested, here is a listing of teams by percent favorable opinions:
NCAA Men’s Basketball Favorability
I would love to see crosstabs for the fans/haters. If one wanted to operationalize “greatest rivalry” I think this would be an excellent way to do so.
Categories: basketball · graphics · metrics · ncaa · sports · statistics
Tagged: basketball, fans, graphics, ncaa, scatter
Using the data provided here (which you should visit for all of the data caveats and other interesting findings), I’ve constructed a plot of projected general election outcomes, depending on whether it is Obama or Clinton facing McCain in November. As the Pollster.com story tells us, survey data indicate a Democratic victory either way, but based on the location of each Democratic candidate’s projected potential victories, electoral college outcomes could be somewhat different. As such, I’ve constructed a basic scatterplot of McCain’s margin in the survey over Clinton and Obama. At a glance, this depicts the states in which McCain has a clear advantage over both Democrats (bottom left, more red), the states in which either Democrat is heavily favored (upper right, cyan), and the states in which Clinton has more of an advantage than does Obama (upper left, more blue/purple), or vice-versa (lower right, more green/yellow). It is easy to see, for example, that Clinton has a sizable advantage over McCain in Florida (with a large number of electoral votes), while Obama has a negative margin there in the survey data. On the other hand, Obama fares better in Michigan and Texas (where both still trail McCain). Anyway, thanks to Pollster.com for providing the spreadsheet, and let me know if you come to any interesting conclusions from looking at the scatter.

Categories: graphics · infovis · politics
We often think of players’ boxscore statistics in terms of cumulative sums or averages, but these statistics, while they tell us about prolificness and what one might expect from any player in a given game, tell us very little else about the player’s output. Consider three hypothetical players in an 82 game season: Player A scores 5 points in 41 of his games, and 35 points in the other 41 games, player B scores 20 points in each of the 82 games, and player C scores 19 points in 80 of his games, and scores 60 in each of the remaining two. Each of these players ends the season averaging 20 ppg, each scored a total of 1,640 points. However, there should be no question that they are very different players, even without considering non-scoring contributions. B is extremely consistent, C is pretty consistent but has rare scoring outbursts, and A is either a big threat or hardly a threat at all. (Please keep in mind that I could be doing the same with per-minute statistics–it’s just a little easier conceptually to discuss per-game stats, while making an equivalent point.) Opposing teams would need to plan differently when facing each of these three players, and their value to their own team is a function not only of their scoring average, but their entire scoring distribution. Since it is much easier to keep track of cumulative totals, and since the simple mean can be calculated by dividing total points (ast, reb, etc.) by total games, we have all been raised on means and sums–which are useful as far as they go, but don’t tell the whole story. So, into the plethora of other “modern” statistics, I would like to add several statistics that have been with us the entire time, but hidden behind season sums and means: the standard deviation, the geometric mean, and the distribution.
The standard deviation is a summary statistic like the mean, but it measures dispersion. Essentially, it attempts to capture the typical deviation from the mean of each data point. So, players whose per-game boxscore stats vary a lot from game-to-game will have a higher standard deviation than will players who are more consistently close to their own mean. Whether a high or low standard deviation is a good thing is a normative question, although I tend to think that consistency (indicated by a low standard deviation) is a good thing. Bear in mind also, that typically, the greater the mean, the more room there is for variance, and thus the more potential for a larger standard deviation. Thus, another statistic, the coefficient of variation (the standard deviation divided by the mean), can be used to give an idea of variation while controlling for the magnitude of the mean.
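To make this concrete, here is a small sketch that computes both statistics for hypothetical players A, B, and C from the example above. One assumption to flag: I use the population standard deviation (`pstdev`), treating the 82 games as the whole season rather than a sample; the sample version would come out slightly larger.

```python
from statistics import mean, pstdev

# Per-game point totals for the three hypothetical players.
players = {
    "A": [5] * 41 + [35] * 41,
    "B": [20] * 82,
    "C": [19] * 80 + [60] * 2,
}


def coefficient_of_variation(points):
    """Standard deviation scaled by the mean (population version, an assumption)."""
    return pstdev(points) / mean(points)


for name, points in players.items():
    print(
        name,
        "mean:", mean(points),
        "sd:", round(pstdev(points), 2),
        "cv:", round(coefficient_of_variation(points), 3),
    )
```

All three average 20 points, but A's typical game lands 15 points away from that average, B's lands exactly on it, and C sits in between.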
The geometric mean is similar to the arithmetic mean, in that it is a measure of centrality. However, it emphasizes consistency more than does a simple arithmetic mean. Where the arithmetic mean is the sum of the data divided by the number of data points, the geometric mean is the product of the data raised to the power of the inverse of the number of data points. Thus, in our above example, each player has the same arithmetic mean (20 ppg), but B has a geometric mean of 20, C’s is 19.54, and A’s is 13.23. According to the geometric mean, then, player A is valued almost exactly the same as player D, who scores 13 points in each of 63 games, and 14 points in each of the other 19 games. Both of their g.means are around 13.23, but player A’s arithmetic mean is 20, while player D’s is 13.23. As such, the geometric mean, especially when presented alongside the arithmetic mean, may tell us even more about a player’s output.*
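To make the comparison concrete, here is a minimal Python sketch that computes the arithmetic mean, standard deviation, coefficient of variation, and geometric mean for the four hypothetical players. The function names and the choice of the population standard deviation are assumptions made for illustration; they are not taken from the code behind the graphic.

```python
import math
import statistics


def geometric_mean(values):
    """Product of the values raised to the power 1/n, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(points):
    """Return the four summary statistics discussed in the text."""
    mean = statistics.fmean(points)
    # Assumption: population SD, treating the season as the whole population.
    sd = statistics.pstdev(points)
    return {
        "mean": mean,
        "sd": sd,
        "cv": sd / mean,  # coefficient of variation
        "gmean": geometric_mean(points),
    }


# Hypothetical 82-game scoring logs from the example in the text.
players = {
    "A": [5] * 41 + [35] * 41,
    "B": [20] * 82,
    "C": [19] * 80 + [60] * 2,
    "D": [13] * 63 + [14] * 19,
}

summary = {name: summarize(points) for name, points in players.items()}
for name, s in summary.items():
    print(f"{name}: mean={s['mean']:.2f} sd={s['sd']:.2f} "
          f"cv={s['cv']:.2f} gmean={s['gmean']:.2f}")
```

Players A, B, and C all come out with an arithmetic mean of 20, while their geometric means separate them in the way described above, and player A's standard deviation (15 points) dwarfs the others.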
Finally, there is the entire distribution of per-game point totals. This encapsulates all of the information about a player’s production, because it is the player’s entire production. It’s not a numerical statistic, but can be represented as a graphic, or even (theoretically) an equation. The distribution represents essentially the same thing as does a histogram or bar chart of each statistic’s frequency at each level of output. In the graphic below, I display each of four players’ distributions on six different per-game statistics. This should give the viewer a very complete idea of each player’s production. I also include the summary statistics I’ve described, which individually give some information about the distribution, and taken together represent a partial but informative view of player production.

This graphic presents the output of four potential MVP candidates through about 60 games of this season. Note that LeBron James tops Kobe Bryant in arithmetic means across every category, and seems to be a more consistent scorer (on a per-game level, at least)… I hope you find this depiction of production useful and informative–please don’t hesitate to participate in the ongoing MVP debate (see this post).
* A note about geometric means: since a player might have zero points, or assists or blocks, etc. in any given game, there is the potential that this zero would “wipe out” their geometric mean for that statistic, making it relatively uninformative. Thus, I have replaced each instance of 0 with 0.9 — which penalizes the player for having a low figure, but maintains valuable information. This is probably not a perfect solution, but I’ve applied it consistently, so it should at least be “fair” in some sense. Let me know in the comments if there is a better way of doing this.
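The zero-replacement rule in this note can be sketched in a few lines of Python. The function name and the sample block totals are hypothetical; only the 0.9 replacement value comes from the note.

```python
import math


def geometric_mean_with_floor(values, zero_replacement=0.9):
    """Geometric mean in which each 0 is replaced before taking logs."""
    adjusted = [zero_replacement if v == 0 else v for v in values]
    return math.exp(sum(math.log(v) for v in adjusted) / len(adjusted))


# Hypothetical block totals over ten games, three of them with zero blocks.
blocks = [0, 2, 1, 0, 3, 1, 2, 0, 1, 2]
gm = geometric_mean_with_floor(blocks)
print(f"adjusted geometric mean: {gm:.2f}")
```

Without the replacement, the three zero-block games would force the geometric mean to 0 no matter what happened in the other seven.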
Categories: analysis · basketball · metrics · nba · sports · statistics
Tagged: basketball, debate, graphic, MVP, NBA, statistics
The zeitgeist (MSM, blogs) seems to suggest that this year’s NBA MVP debate is a particularly interesting one, centering around Kobe Bryant and LeBron James. Who is more deserving? Who is better? Who’s made their team better? Who has been snubbed for too long? Who has more support? Etc, etc. In the next few weeks, I will be presenting several different ways of comparing these two, often in contrast with the other two players most often mentioned as potential MVPs, Kevin Garnett and Chris Paul. I plan on using traditional/modern, conventional/unconventional, statistical/subjective means of comparison, and I hope that you readers will help me arbitrate between them. To help in our collective decision making, I have set up a Yahoo! Versus vote, where anyone can make an argument for either player, and then everyone gets to vote for the arguments they find most convincing, leading to a reasonably good collective choice outcome. In fact, in the hopes of web-wide participation, I have created a button linking directly to the vote, and I’m providing code for anyone to add this same button to their web page or blog:

To put this button on your site, just copy and paste this code into your html editor:
<a href="http://versus.bix.yahoo.com/vs/LeBron_James-vs-Kobe_Bryant">
<img src="http://arbitrarian.files.wordpress.com/2008/03/mvpdebate.png" alt="MVP debate: LeBron James vs. Kobe Bryant"> </a>
I encourage you to add your own arguments and vote, and check back often, as more arguments will be added over time, and you can reallocate your votes as many times as you want.
Categories: Uncategorized
Tagged: comparison, debate, MVP, voting
It has been suggested that I look at players’ statistics from only the primes of their careers. This is a good idea, given that both very inexperienced and very old players will “regress to the mean” in terms of their performance and, possibly, playing style. As such, I generated a sum of each player’s boxscore statistics during the modern era across only their best seasons. My definition of “best” was simple: not their worst. For each player, I found their mean seasonal winshr, as well as the standard deviation of their seasonal winshr. Any season for which a player’s winshr was greater than the mean less one standard deviation was included in this analysis. This way, I excluded seasons in which a player was injured or relatively underused because of age or because of a minor role on their team. Chris Webber’s current and previous seasons, for example, would not be included. In this way, I hope to get at the “pure” essence of each player for an even better comparison. You will probably not be surprised to see that the diagram looks very similar to the non-peak-performance versions:
NBA players at peak performance [pdf]
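The season filter described above (keep any season whose winshr exceeds the player's mean less one standard deviation) might look like the following Python sketch. The career numbers are invented, and the use of the sample standard deviation is an assumption, since the post does not say which version was used.

```python
import statistics


def peak_seasons(winshares):
    """Keep seasons whose win shares exceed the career mean minus one SD."""
    mean = statistics.fmean(winshares)
    # Assumption: sample SD; the post does not specify sample vs. population.
    sd = statistics.stdev(winshares)
    cutoff = mean - sd
    return [ws for ws in winshares if ws > cutoff]


# Hypothetical career: a slow rookie year, a long prime, and a late decline.
career = [3.0, 9.5, 11.0, 12.5, 12.0, 10.5, 6.0, 2.5]
kept = peak_seasons(career)
print(kept)
```

In this made-up career the rookie season and the final season fall below the cutoff and are dropped; the boxscore totals for the comparison would then be summed over the remaining seasons only.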
A few interesting things to note, however: at their peak, Michael Jordan and Larry Bird are now among each other’s closest matches. Also, taking a macro view of the whole network, it is now easy to identify several different clusters: In bluish purple at top left, we can see defensive-minded, “dirty work” bigs, while at the bottom in blue are more scoring bigs. To their right is a reddish group of primarily scorers, while going north from there in green we see “pure point guards” and then more scoring point guards. Etc, etc. Let me know if you notice any other interesting connections or clusters in the comments.
Categories: Uncategorized
Tagged: basketball, diagram, graphics, NBA, network, statistics
Using roll call votes from the 110th Senate through the end of last year, I have constructed a network diagram based on maximum similarities between Senators’ voting records. Essentially, distances were calculated by assigning a 1 to yes votes, and a 0 to no votes, and finding the difference between each pair of Senators on each possible roll call vote. Thus, two Senators who vote identically have a distance of 0, while two Senators who vote completely opposite ways have a distance equal to the total number of roll calls. Based on these distances, I constructed a network diagram linking each Senator to their two most-similarly-voting counterparts. I also colored each vertex according to how similar each Senator is to “all Republicans” and “all Democrats” collectively. The result revealed the highly polarized nature of the Senate: there is only a single strand linking Republicans to Democrats:
110th Senate Roll Call Network Diagram [pdf]
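As a rough illustration of the distance calculation described above, the following Python sketch codes yes votes as 1 and no votes as 0, sums the absolute differences for each pair, and links each Senator to the two closest records. The senator labels and votes are hypothetical, and the sketch assumes every Senator voted on every roll call, since the post does not say how absences were handled.

```python
def vote_distance(a, b):
    """Sum of absolute differences between two 0/1 voting records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_matches(records, k=2):
    """Map each senator to the k senators with the smallest vote distance."""
    matches = {}
    for name, votes in records.items():
        others = [
            (vote_distance(votes, other_votes), other)
            for other, other_votes in records.items()
            if other != name
        ]
        matches[name] = [other for _, other in sorted(others)[:k]]
    return matches


# Hypothetical senators and five roll calls (1 = yes, 0 = no).
records = {
    "Sen. A": [1, 1, 0, 1, 0],
    "Sen. B": [1, 1, 0, 1, 1],
    "Sen. C": [0, 0, 1, 0, 1],
    "Sen. D": [0, 0, 1, 0, 0],
}

edges = closest_matches(records)
for name, linked in edges.items():
    print(name, "->", linked)
```

Setting `k=1` gives the single-closest-match version of the network, and two records that disagree on every vote (such as A and C here) have a distance equal to the number of roll calls.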
I then decided to reduce the number of connections to only the single closest match for each Senator, and found something interesting that you will hear only rarely from the media: Senators Clinton and Obama are each other’s closest match, based, at least, on roll call votes in the 110th Senate through the end of 2007. This would seem to indicate that the wide disparities perceived between them in the eyes of the media and the public have little to do with actual policy/ideological divides, but rather that personality and framing (and possibly demographics) are making up the bulk of voting preferences in many Americans’ minds.
110th Senate Roll Call Isolated Networks [pdf]
I was aware, to some extent, of the constructed, rather than actual, nature of the differences between the two Democratic competitors, but to see the roll call evidence fall out so starkly was surprising.
Categories: Uncategorized
Tagged: diagram, graphic, network, nomination, politics, roll call
The next in a series consists of batters in the MLB from 1955-2007 (because the modern set of statistics has not changed since the 1955 season). I think these statistics lend themselves less well to this sort of analysis, but it may be interesting to you baseball enthusiasts out there.
Batter Statistical Proximity [pdf]
Categories: Uncategorized
Tagged: baseball, batting, diagram, graphic, MLB, network, proximity
I’ve had requests for my data and for the code I used to make these plots. So, in the spirit of openness, I’m posting them. If you would like to use them, please adhere to the Creative Commons license I’ve chosen, and let me know what you come up with. The .csv is the top 1000 careers over the last quarter century-or-so, determined by a playing-time-based statistic, and the .R file will run in R, and requires you to install the package sna. The sna package is awesome; it makes network diagramming essentially idiot-proof. Note that I currently have this code writing to a PDF, and that it cannot write to the PDF if a PDF with the same filename is open. Also, remember to change the csv’s file directory in the R code, or it won’t work. Please let me know if it’s not working for you, or if you know of a more efficient way of doing the same thing.
1000 Top Careers [csv]
Network Diagram Example [R]
Categories: Uncategorized
Tagged: basketball, code, diagram, NBA, network, R, sharing
It was suggested that I compare players on single season data, rather than career sum data, both as a validity test and to gain other insight. It goes without saying that players’ styles change over their career–often, scorers become less effective and try to do other things well. Sometimes (as with Jordan, for example), we see players add dimensions to their game over time. So, I present yet another network diagram, one which illustrates the changing nature of each player. A few notes: this set is somewhat scorer-heavy, because of the way I generated the list of best seasons (using a Euclidean distance metric). Also, when looking at this, it helps to keep in mind that this is a two-dimensional rendering of a hyperdimensional network–unless players are actually connected, visual proximity doesn’t necessarily mean anything, although it may not mean nothing. It would appear, given the degree to which players’ seasons cluster together, that the proximity algorithm functions fairly well.
NBA Seasons Proximity Network [PDF]
Categories: Uncategorized
Tagged: basketball, graphic, NBA, network, proximity, seasons
A few days ago, I posted a history of partisanship in the Senate, which you should check out if you haven’t; it’s pretty fancy. Nathan Yau, author of the blog FlowingData, posted a helpful critique on his blog. I responded in the comments, and reproduce those comments below. If you have any comments or suggestions, please let me know. I am trying to optimize it, and any feedback is useful.
Thanks a lot for soliciting comments. You raise a lot of good questions here, and I thought I might try to respond to some of them. My answers aren’t the final answer; mostly I’d like to try to do an initial justification of some of my design choices:
* I wasn’t immediately sure what each visual cue represented e.g. size of state abbrev. until I reached the bottom. It might be worth making the annotation more prominent either by position, size, or color or all three.
This is a pretty good point. It may help to move the key. Mostly, I put it at the bottom to minimize its obtrusiveness.
* To me, the congress numbers don’t matter so much, but that just might be I don’t have a lot of learning on the history of American government.
The congress numbers and years are in some ways redundant, but congress scholars often refer to congresses by their number. In fact, the years are only there for those less familiar with the congress number, to give a sense of where you are in history.
* I’m wondering if there’s some way to make the labeling of the years more concise? If you just labeled with the first year of the two-year term, would it be obvious that you’re describing a two-year term? What if you took away the alternating gray background and just made it all white and then had a bar timeline-type thing on top (and bottom)?
I may be able to do without both years, since it is known that there are always two years to each congress. The gray and white bars are somewhat useful, though: it’s not labeled (it should be), but within each session, the points all have a certain left-right jitter–this jitter makes it easier to read, and actually conveys in a very subtle way the second dimension of the ideological scale on which each Senator is plotted. If you read more about DW-NOMINATE, you will find that the primary dimension dominates, but for certain time periods, a second dimension becomes important. I thought I would include it subtly, because it also helped with readability.
* What if you tried to use a color scheme? I mean, you have the red and blue for the reps and dems (which I think is right), but the gradient for the senate counts turns very bright pink and purple which doesn’t go too well. Then there’s the cyan, yellow, and green which doesn’t seem to have any specific significance other than each color represents something. What I mean is… is there a reason you chose those colors?
The colors chose themselves: red and blue have come to be identified with each of the parties. Green was my remaining option out of the RGB set, and I made all Southerners’ green value equal to 255. Then, every Democrat’s B value varies as a function of their party unity (the degree to which they voted with the party). The same for Republicans and Red. Thus you can read members’ party loyalty into their color. The interesting thing is, disloyal northerners look dark, even blackish, but disloyal southerners’ lack of R and B makes them increasingly only green. Thus, for example, the very disloyal Southern Democrats of the mid-20th Century can obviously stick out as very green, where other Southern dems are various shades of teal and greenish-blue. This reflects a very important shift in the history of Congress, and it’s all indicated right there, just as a function of geography and loyalty transformed to color values.
* It might be worth making the annotations bigger so that you don’t have to “zoom in” to read.
Also possibly valid, although part of the reason I made them small is that my original intent was to design for print, where the poster will be about 24×36 inches, and the labels will be fairly legible.
* I think I would make the median lines a bit more prominent, but that’s just me.
Not a bad idea, but I a) don’t want them to completely dominate, and b) want to maintain legibility of the overlaid state names as much as possible. I may be able to make the medians wider, but then in a sense, one loses accuracy.
* There’s a lot of cool stuff getting represented here, and I wonder if anything might benefit as a separate graph. Would this benefit at all as a series of graphs instead of one large graphic?
Possible, except one of the things I like most about it is that it tells almost the entire story of partisanship and something called conditional party government (which relates to the density graphs at the bottom), all in one place. So it’s a very comprehensive and relatively quick way to get all of it “at a glance” if you know what to look for.
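For anyone curious about the color scheme described in the replies above (green fixed at 255 for Southerners, and the party’s own channel scaled by party unity), here is a minimal Python sketch. The function name, the "D"/"R" party codes, the 0-to-1 unity scale, and the linear mapping onto 0-255 are all assumptions for illustration; the original graphic may scale things differently.

```python
def senator_color(party, southern, unity):
    """Return an (R, G, B) tuple from party, region, and party unity in [0, 1]."""
    # Assumption: unity maps linearly onto the 0-255 channel range.
    level = round(255 * max(0.0, min(1.0, unity)))
    red = level if party == "R" else 0    # Republicans vary in red
    blue = level if party == "D" else 0   # Democrats vary in blue
    green = 255 if southern else 0        # Southerners get full green
    return (red, green, blue)


print(senator_color("D", southern=False, unity=1.0))  # loyal Northern Democrat
print(senator_color("D", southern=True, unity=1.0))   # loyal Southern Democrat
print(senator_color("D", southern=True, unity=0.2))   # disloyal Southern Democrat
print(senator_color("D", southern=False, unity=0.2))  # disloyal Northern Democrat
```

Under these assumptions a disloyal Northerner ends up close to black, while a disloyal Southerner keeps the full green channel and so reads as nearly pure green, which is the effect described for the mid-20th Century Southern Democrats.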
Categories: Uncategorized
Tagged: graphic, history, partisanship, politicalscience, senate, statistics, timeline
I hope this isn’t getting repetitive, because I’ve got a diagram that will blow your mind: it’s like the entire NBA in a petri dish, with all different phyla and genera of player types represented. I used the same methodology I’ve been using (with the per-minute, rather than ratio statistics), but generated the graph with fewer connections (just the single closest match) per player. As a result, there are a whole lot of isolated clusters instead of one completely interconnected network. Also, I went ahead and did 1,000 players at once, instead of the standard 250. What I got astounded me–they look like microorganisms swimming around on the microscope slide that is the NBA. I apologize for the tiny font–if you zoom in to 125%, it should be readable–but had I made the names any larger, they would have overlapped to an illegible degree.
The NBA “petri dish” diagram [pdf]
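To see why a single closest match per player produces isolated clusters rather than one connected network, consider this small Python sketch. It links each item to its nearest neighbor and then groups everything that ends up connected. The player labels and the one-dimensional "style scores" are invented for illustration, and the union-find grouping is just one convenient way to recover the clusters, not the method used by the plotting package.

```python
def nearest_neighbor_clusters(distances):
    """Link each item to its single closest match and return the clusters.

    `distances` maps each name to a dict of {other_name: distance}.
    """
    parent = {name: name for name in distances}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for name, row in distances.items():
        closest = min(row, key=row.get)
        parent[find(name)] = find(closest)  # merge the two clusters

    clusters = {}
    for name in distances:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical one-dimensional style scores, used only to build distances.
scores = {"Guard 1": 0.0, "Guard 2": 1.0, "Guard 3": 2.5,
          "Big 1": 10.0, "Big 2": 10.5}
distances = {
    a: {b: abs(scores[a] - scores[b]) for b in scores if b != a}
    for a in scores
}

clusters = nearest_neighbor_clusters(distances)
print(clusters)
```

Because nobody in the guard group has a big as a closest match (and vice versa), no edge ever crosses between the two groups, and they float apart as separate "organisms."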
I would be very interested in collectively coming up with a sort of “baller’s taxonomy,” wherein we try and identify the different clusters using some more subjective terms. I think we could come up with a better vocabulary to describe players and define playing styles. If you have any ideas, please put them in the comments, and if there is sufficient interest, I may come up with a more formalized process, in the hopes of putting together a follow-up diagram with labels.
Since I had already run the algorithm anyway (it takes a lot of cycles to do 1,000 players), I went ahead and made a completely connected version of the 1,000 player diagram. Warning: this one is pretty hard to parse.
1000 player network diagram [pdf]
Keep in mind that the search function (ctrl-f) will be really useful for these.
Categories: Uncategorized
Tagged: basketball, cluster, graphics, NBA, network, proximity, statistics
In response to some questions at the APBRmetrics forum, I’ve put together a new NBA similarities network (Top 250 players version), wherein I use per-minute statistics, instead of my “patented” ratios method, just to see how it looks. In a lot of ways, this looks just as good or even better than the ratios version… I’m still somewhat torn, though: The ratios method, by ignoring time statistics completely, attempts to match players who, given a possession (or given an opponent with a possession), will do similar things with it, while the per-minute method does a better job of representing “substitutability.” I suppose I will let history be the judge, but I don’t think anyone loses when more pretty graphs are made:
NBA player similarities [pdf]
Another version with Extremely High Contrast Labels for Easy Reading: [pdf]
Categories: Uncategorized
Tagged: basketball, diagram, graphics, NBA, network, statistics, visualization
For more information, just refer back to my last post, which dealt with this same methodology applied to the NBA. I’ve done the same thing here with some of the top quarterbacks in NFL history. This time, I used propensity to rush vs. pass, and yds/att and yds/rush to color-code them. I’ll let you make your own analysis; football’s not my specialty:

Quarterback Network Diagram [pdf]
Categories: Uncategorized
Tagged: diagram, football, graphics, network, nfl, statistics, visualization
This is my entry to the DataPortability.org Logo Competition: a dynamically changing graph that represents the pursuit of universal Data Portability. Here’s the description I submitted:
Both the plot and the colors in this logo change dynamically as a function of time, and a slightly different graph is generated with every pageload. This may be seen to represent our pursuit of universal Data Portability. As time passes, the chart in the logo shows our current status (colored line) slowly coming to match our eventual goal (gray line), representing both the accomplishments and setbacks we face in this endeavor. To the left are several snapshots taken throughout the day. You can see how the tail end of the graph is changing in each, and over time, the entire line shifts and realigns itself, creating a dynamic yet consistent visual identity.
Additionally, this logo can be implemented anywhere by anyone, and the data it conveys will be personalized to each individual’s time zone, further reinforcing the data portability metaphor.
To see a working version of the logo (WordPress.com won’t allow the JavaScript), please go here.
Categories: Uncategorized
Tagged: charts, competition, dynamic, logo
Having covered my operationalization of statistical similarity, and offered some evidence of its usefulness, I’d like to share what I perceive as the best part of the whole endeavor, the pictures. Using R and the sna package, along with the distances I’d previously computed [zip], I’ve put together a network diagram of player similarity. Basically, each player has two or three arrows coming out of him, pointing to the players that are most similar to him. Then, using some brilliant algorithm I don’t fully grasp, each player is plotted so that they all cluster together in groups, by similarity. I’ve then colored each player/node according to the usual formula, meaning that each is colored according, basically, to how their contributions are distributed. Past analysis has indicated that propensity to take shots, post-area stats, and perimeter-area stats (to apply somewhat arbitrary characterizations), are a good way of determining colors. See other posts for more on this. Anyway, I have two versions each of two different networks: Both .png and .pdf versions of the Top 250 players of the modern era, and then .png and .pdf versions of players 251-500, the second tier. (I recommend looking at the .pdfs first, because they’re higher-resolution, and easier to scroll around. Note that the .png and .pdf versions are different because of the way the plotting algorithm works… it’s the same data, shown in a somewhat different way.) I hope you find this interesting and/or useful, and please feel free to comment on the validity of this approach.
Tier One [pdf] [png]
Tier Two [pdf] [png]
Update: If you like those graphs, you will really really like these:
Categories: Uncategorized
Tagged: basketball, distance, NBA, network, similarity, statistics, visualization

I’ve created a timeline of the ebb-and-flow of party politics in the US Senate since the beginning of the modern (Democrat & Republican-dominated) two-party era. Beginning with the antebellum 35th Congress, and progressing through to the 109th, this timeline tells the story of the evolution of politics in America as played out on the floor of the Senate.
Political Scientists, Historians, and even casual observers of political history have long noted the shift in ideological nature of the two major parties since the Civil War, and within the 20th Century alone–this timeline conveys a sense of that shift by depicting the scaled left-right ideological positions of each Senator along the vertical axis: a macro view of the entire time period illustrates the great distance between the parties up until the mid-20th Century, at which time Civil Rights-related issues began to create crosscutting cleavages within the parties, especially in the South. The bright green of Southern Democrats, voting with the Republicans in a Conservative Coalition, is readily apparent, as the distance between party medians converges.
Just as apparent is the realignment beginning in the late 1970s-early 1980s, where, as Political Scientists have documented, the polarization of the electorate and of elected officials became a dominant trend. This is illustrated both in the main timeline, and also by the series of density plots just below the main frame. At various times, the ideological distribution is obviously unimodal, or obviously bimodal: from the 79th to the 109th Congress we can witness the polarizing of the Senate, which according to the theory of Conditional Party Government, has led to skewed policy outcomes.
This visualization rewards careful inspection–there are stories to be found everywhere: follow the positions of the Presidents, relative to their own party members; note how party leaders are typically close to their own party median, find patterns in individual states’ ideology over time–at any level of inspection, the data offer up a rational yet compelling history.
The ideological dimensions are determined based on Senate Roll-Call votes, and scaled to be historically consistent, such that comparisons may be made across historical eras. The measure is called DW-NOMINATE, and was developed by Keith Poole and Howard Rosenthal. Please leave any comments or questions, as this work is constantly in progress.
Categories: Uncategorized
Tagged: graphics, parties, senate, timeline, visualization
I have in this space previously discussed how to find how similar any two players are, based solely on their boxscore statistics, and attempted, to some extent, to justify myself theoretically. Now, to unveil the results: For my dataset of all modern (1979-2007) NBA players, I subsetted the top 500 according to the formula (min^(10/9))/gp, which is a kind-of weighted minutes-per game statistic that values both playing time and longevity. Thus, I could extract some of the best (admittedly measured poorly, by playing time) younger players, and a good number of vet