The Arbitrarian

Entries categorized as ‘Uncategorized’

Mr. Consistency

June 24, 2008 · 3 Comments

Who are the most consistent scorers in the NBA? This is a question of some interest for those who participate in fantasy leagues, as consistency might be a virtue in determining the value of a player on your roster. For various reasons, a player might be worth more to you if they score 20 points every game, rather than alternate between 10 and 30 every other game. Further, some measure of consistency may highlight a player’s ability to impose their will on a game: a player able to get his scoring in, regardless of the opposition, could be said to be more of a game-defining player.

I’ve managed to estimate, for players since the 86-87 season, each individual’s mean points per 48 minutes, as well as the standard deviation of said statistic, and thus the coefficient of variation (sd/mean) and 95% confidence interval. Here’s a spreadsheet of the top (634) players in the league, by mean pts/48, sorted by coefficient of variation. Thus, the players at top could be said, in some way, to be more consistent scorers than those at the bottom.

Most consistent scorers, 1986-2008
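For readers who want to try this on their own numbers, here is a minimal sketch of the coefficient of variation. It assumes a plain list of per-game pts/48 values and uses the sample standard deviation; the function name and the two toy players are mine, and the spreadsheet may have been computed slightly differently.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return sd/mean for a sequence of per-game pts/48 values."""
    m = mean(values)
    if len(values) < 2 or m == 0:
        return None  # undefined with a single game or a zero mean
    return stdev(values) / m  # sample sd (n - 1 denominator), an assumption

# Toy players from the opening paragraph: both average 20, only one is steady.
steady = [20, 20, 20, 20]
streaky = [10, 30, 10, 30]

print(coefficient_of_variation(steady))
print(coefficient_of_variation(streaky))
```

A lower value means steadier scoring relative to the player's own average, which is why the list is sorted this way.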

Below is another way to view the same question. Using each player’s mean and standard deviation pts/48, along with the sample size, we can construct a 95% confidence interval for our estimate of their true mean. In the graphic linked below, each player is ranked by their mean pts/48, and the x-axis indicates how they fare under this measure of scoring. Each mean is surrounded by a line indicating the 95% confidence interval. This means, essentially, that we can be 95% sure that the player’s true mean pts/48 lies within the span of their colored line. For players with smaller samples or greater variance, the error bars will be wider.

NBA Pts/48 min means with error bars
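The error bars can be approximated along these lines. This is a sketch under a normal approximation (z = 1.96); the function name is hypothetical, and the graphic itself may have used a t-based interval instead.

```python
import math
from statistics import mean, stdev

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% CI for the mean: mean +/- z * sd / sqrt(n)."""
    n = len(values)
    m = mean(values)
    if n < 2:
        return (m, m)  # one observation: no spread to estimate, so no error bar
    half_width = z * stdev(values) / math.sqrt(n)
    return (m - half_width, m + half_width)

low, high = mean_confidence_interval([10, 30, 10, 30])
print(low, high)
```

Because the half-width shrinks with the square root of the sample size and grows with the standard deviation, players with few games or erratic scoring get the widest bars.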

As you can see, some players have no error bars at all–this means that they only have one observation. Others’ error bars go down past zero. This means that we can be 95% sure that their mean pts/48 is in a range that includes zero, which doesn’t tell us very much. Anyway, here is the same graphic, for the 2007-08 season only:

Note that Carl Landry (#73) has a greater variance than most players around him, but he ranks as a better per-48 scorer than Shaquille O’Neal.

Finally, here’s a regular-season 2007-08 graphic for players’ MEV (or model-estimated value, using regression-derived weights like those seen here). Landry does even better here (18th), in terms of his mean, but his confidence interval is very large. This estimate suggests, though, that at worst, he’s about as good as Odom, Andre Miller, and Kirilenko; while at best, he is in rarefied air. Keep in mind that this is still just a 95% confidence interval, so statistically, there’s still a 1 in 20 chance the true mean isn’t even in this interval. All should be taken with a grain of salt. One of the things I like most about this presentation is that it’s a per-minute stat, which controls for playing time (although not pace), but still reminds us that estimates for those players with little playing time should be taken with large grains of salt, and might not really mean much of anything. Josh McRoberts, for example, is probably not the 406th, much less the 6th, most valuable player in the NBA, even though his simple arithmetic mean indicates as much–his confidence interval reminds us of this, while maintaining the simple ordering.

I suppose this is also the public debut of any sort of official MEV ordering for 2007-08. I’d be interested to hear what people thought about this… this is something similar to Berri’s estimates, but I think the weightings are a little more appropriate. Let me know in the comments if they seem, at least, per-minute, to be reasonable estimates and orderings of player value.

Categories: analysis · basketball · graphics · infovis · metrics · nba · sports · statistics

Dennis, Eddie, Frank, Gus, Joe, Kevin, Linton and Neil

June 21, 2008 · 1 Comment

All Johnsons, all Phoenix Suns players. In fact, some of the greatest Johnsons to ever play the game played some of their best seasons for the Suns. Looking at Winshares, over the history of professional basketball, approximately 2.4 percent of all wins can be attributed to players with the Johnson surname. For the Suns franchise, however, that number jumps to 7.8 percent. A look at the Suns’ Winshare franchise history gives a sense of just how pivotal these Johnsons have been:

Barkley had the all-time most valuable season for a Sun in 1992-93, but it certainly looks like Stoudemire has the potential to take that title away. Amare had a huge rookie year in terms of Winshares, and was duly recognized with the Rookie of the Year award. Since then, he has essentially doubled his win production, and his best years are likely still ahead of him.

Another pattern of interest in this visualization is the recent history of all-star quality point guards. Kevin Johnson, Jason Kidd, Stephon Marbury (when he was a productive player), and Steve Nash all played large roles in their teams’ success. However, it’s equally interesting to note that they played very different types of games. Just looking at their playing type spectrum coloration (see this post for more detail), it is possible to see that KJ and Nash are much purer perimeter players, while Kidd, as evidenced by his slightly bluish tinge, was more of a rebounder, and mustard-colored Marbury shows evidence of a proclivity toward scoring along with his perimeter play–at least more so than the other three.

What other trends do you notice in this history? Is it possible that Nash hasn’t ever been the Suns’ most valuable player, even in his MVP years? Can any of you basketball historians comment on the Westphal, Davis and Nance years?

Note: Since this post was published, the Winshares formula has undergone some revisions of some substantive import. To see the most current iteration and accurate tables and graphs, please see the Winshares page.

Categories: Uncategorized

Who won the game for Boston?

June 9, 2008 · 3 Comments

Here’s the game two estimate of who deserves credit for the win:

tm   Player                   MP  PTS     MEV     PVC     PtC  Credit   G/B
lal  Kobe Bryant           40.47   30   27.63   0.292   29.00   0.276  2.48
bos  Paul Pierce           41.47   28   25.75   0.223   24.77   0.236  2.63
lal  Pau Gasol             40.42   17   23.55   0.249   24.71   0.235  4.81
bos  Rajon Rondo           41.83    4   20.64   0.179   19.86   0.189  3.32
bos  Leon Powe             14.65   21   18.33   0.159   17.63   0.168  5.36
lal  Vladimir Radmanovic   30.50   13   14.27   0.151   14.98   0.143  2.55
bos  Kevin Garnett         39.12   17   12.88   0.111   12.39   0.118  1.69
lal  Derek Fisher          29.90    9   10.94   0.116   11.49   0.109  2.77
bos  Ray Allen             40.75   17   11.55   0.100   11.11   0.106  2.25
bos  James Posey           19.72    8    9.92   0.086    9.54   0.091  9.00
lal  Lamar Odom            32.27   10    7.77   0.082    8.16   0.078  1.77
bos  P.J. Brown            22.62    6    8.42   0.073    8.10   0.077  7.79
lal  Jordan Farmar         18.10    9    7.72   0.082    8.10   0.077  3.36
bos  Kendrick Perkins      13.68    7    7.41   0.064    7.13   0.068  2.97
lal  Ronny Turiaf            8.70    4    3.56   0.038    3.73   0.036  9.05
lal  Sasha Vujacic         19.53    8    2.52   0.027    2.64   0.025  1.38
bos  Sam Cassell             6.17    0    0.68   0.006    0.65   0.006  1.33
lal  Luke Walton           12.80    2   -1.60  -0.017   -1.68  -0.016  0.61
lal  Trevor Ariza            7.32    0   -1.86  -0.020   -1.96  -0.019  0.36
Totals                       480  210  210.06   2.000  210.34   2.003  2.58

I’ve added a column since last time, G/B, which stands for “Good over Bad,” meaning I divide the linear-weighted sum of good things the player did over the linear-weighted sum of the bad things he did. It’s a playing-time-independent measure, and it highlights especially those players who were a “spark” off the bench, like Posey and Powe (who sounded on the radio like he had an incredible game), and Turiaf.
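A rough sketch of the G/B idea follows. The weights behind the table above are not listed in this post, so the sketch borrows the coefficients from the Winshares val formula purely as a stand-in; the dictionary names, the `good_over_bad` helper, and the sample stat line are all hypothetical, and the output will not reproduce the G/B column.

```python
# Stand-in weights (assumption): taken from the Winshares val formula,
# not from the regression-derived weights used for the table above.
GOOD_WEIGHTS = {"pts": 1.0, "as": 0.7697530, "or": 0.8709732,
                "dr": 0.7111727, "st": 0.9190908, "bk": 0.9495596}
BAD_WEIGHTS = {"fgx": 0.5603802, "ftx": 0.9345311,
               "to": 0.8473544, "pf": 0.7729732}

def good_over_bad(line):
    """Weighted sum of good events divided by weighted sum of bad events."""
    good = sum(w * line.get(k, 0) for k, w in GOOD_WEIGHTS.items())
    bad = sum(w * line.get(k, 0) for k, w in BAD_WEIGHTS.items())
    return good / bad if bad > 0 else None  # undefined with no bad events

# Hypothetical bench line: 10 points, 2 missed field goals, nothing else.
ratio = good_over_bad({"pts": 10, "fgx": 2})
print(ratio)
```

Since minutes never enter the calculation, a short but clean stint can produce a very high ratio, which is what makes the measure useful for spotting bench sparks.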

Two things to worry about if you’re a Boston fan and be happy about if you’re a Lakers fan: Kobe Bryant almost did enough to get the win for his team–he finally had a quarter and a half-ish in which he really took over and made his team compete. Kevin Garnett had a pretty poor game last night–missing as many shots (with a worse percentage) as did Kobe, and turning it over four times. He rebounded well, but the Celtics as a team out-rebounded the Lakers by only one. Put it this way for Garnett: Leon Powe, in just over a third of Garnett’s playing time, outplayed Garnett (in terms of Credit for the win) by half. We’ll see how things pan out in LA.

I also thought I’d look into home-away free throw and personal foul disparities. Over the 1986-87 to 2007-08 period, in regular season games, home teams were called for an average of 22.17 personal fouls, compared to 23.04 on the visitors. Home teams shot 27.08 free throws to away teams’ 25.71. Difference of means tests for both of these were significant. Interestingly, free throw percentages are also significantly different: 0.752 for the home team, 0.750 away.

Categories: Uncategorized

Winshares: Player contributions to team success

May 20, 2008 · 3 Comments

Note: Since this post was published, the Winshares formula has undergone some revisions of some substantive import. To see the most current iteration and accurate tables and graphs, please see the Winshares page.

This post is a lengthy discussion of the theory and methodology behind the Winshares player value metric. If you are already familiar enough with Winshares, or are impatient, read the “In brief” section just below, and then you might want to skip ahead to the payoff graphics at the very end of this post. As always, comments and criticisms are encouraged!

In brief

Winshares are a statistic developed to estimate a player’s value in terms of wins. Combining individual statistics with team performance, Winshares allocate credit for team wins according to each team member’s contributions to team total production. As of the end of the 2007-08 regular season, Winshares are calculated as follows:

winshr = (val / team val) * team wins

val = pts - fgx*0.5603802 - ftx*0.9345311 + as*0.7697530 + or*0.8709732 + dr*0.7111727 + st*0.9190908 + bk*0.9495596 - to*0.8473544 - pf*0.7729732

Motivation

Why create yet another statistic that attempts to reduce all of player value to one number? Especially when there are so many other good and widely accepted measures already in use? Because the theory is sound, the operationalization is elegant, and the results appear valid.

Why use boxscore stats, ignoring plus/minus and everything that modern science now knows about possessions and efficiency, especially since defense is so poorly captured and other statistics, like assists, are arbitrary? Because boxscore stats go back to the beginning of professional basketball. Plus/minus is extremely data-intensive to calculate, and we have no way of getting that kind of data for most historical games. I’m ignoring possessions, and not emphasizing defense, because it is my belief that comparing one player’s boxscore stats to those of his team gives a reasonable estimate of player contributions–sometimes overestimating, other times underestimating, but on average, getting it approximately right. Mostly, though, calculating Winshares is possible as long as the same stats are tracked for all players on a team, and we know how many times the team won–meaning it can be applied very generally.

Why even try to use statistics to measure player value? You can’t capture that with a number! There is much to be said on both sides of this issue. I am of the opinion that statistics ought to be considered within a larger context of other data, qualitative and quantitative. However, I do feel strongly that numbers have a lot to tell us–they allow us the hope of greater objectivity, and therefore possibly less subjective, more accurate assessments. When applied identically to all players, Winshares will adjudicate “fairly,” paying no attention to max contracts, shoe endorsements, nicknames, or “intangibles.” Intangibles are tricky–they may indeed be part of player value, but they are also, by definition, immeasurable, and may therefore expand to fill the role required of them. Was your favorite player not voted league MVP? Certainly they failed to consider his intangibles, which would have easily put him over the top…

Why are Winshares measured in that specific way? Don’t you know that linear weights are no good, or that assists are worth much more than you give them credit for? Read on…

Theory

Imagine a cooperative grocery store, owned by those who work there. At the end of one year, the store’s revenues exceed its expenditures by a large margin, and the workers are to be paid out of this surplus. One concept of fairness might dictate that a worker who worked p% of the total man-hours for that year ought to receive p% of the surplus. Arguably, he contributed p% of whatever effort determined whether or not the store would succeed, and should be rewarded accordingly. A worker working a large number of hours could be said to have contributed more to the store’s success or failure than another who only worked one shift a month–if the store profits by a large margin, that employee should receive a larger share of the windfall, just as if the store loses money, that employee should be held culpable for a larger share of the deficit.

Now imagine another similar store competing in the same market. Its surplus at the end of the year is twice that of the first store. Is it possible to compare the value, in terms of surplus, of employees from the two different stores? I would argue that it is possible: if pay is allocated in the same manner in both stores, with worker i in store j receiving payment in proportion to his labor contribution, the worker who receives the highest paycheck is the most valuable. That is, if pay is equal to worker man-hours over store total man-hours times store surplus, we can compare employees across any two firms in the same market.

But wait–what if some employees are more efficient workers than others? What if Alice can generate three times the revenue that Bob can generate in the same number of hours? Doesn’t our payment formula then overpay Bob and under-reward Alice, and doesn’t this complicate yet again the comparison across firms? Yes it does, and so we might try to find better measures of worker contributions to the surplus. Perhaps we could keep statistics on the number of cans shelved, or the number of transactions tendered, or the number of smiles flashed–if we could figure out even just the relative value of each of these things (that is, not necessarily how they each translate into surplus, but whether one smile is worth two cans shelved, etc.), then we are back on track. It doesn’t matter whether or not we can measure exactly how much revenue is brought in by each additional shelf stocked (although this would be interesting and useful), but if we know that it’s worth more (by some scalar factor) to clean the bathroom than it is to check receipts at the door, we can still estimate each worker’s contribution to the total amount of valuable work being done at the store.

This analogy carries over very well to sports, and specifically here, to basketball. A player who plays fully 1/5th of total team minutes played (that is 48 minutes per game for 82 games) ought to be credited with approximately 1/5th of his team’s success or failure–both of which can be measured in terms of wins. Using minutes to assess contributions runs into the same problem as in the stores above–they say nothing about efficiency–and as such, it is useful to find other statistics that more accurately estimate contributions to team success. The statistics employed in Winshares are boxscore stats, such as points, rebounds, assists, missed shots, etc. These are imperfect measures, but to the extent their relative value can be assessed, they may be useful in estimating each player’s contribution.

Calculation

Unfortunately, this relative evaluation is very difficult. It is often claimed by more “sophisticated” observers of the game that most fans fail to look past points-per-game numbers, giving infinitely more weight to scoring than to any other contributions. Yet, it is exceedingly difficult to identify just what the appropriate weights might be. Multiple regression analysis yields somewhat unsatisfactory results when applied in a straightforward manner–typically finding, for example, that offensive rebounds are actually detrimental to team success. Other work, including that done by Berri and Hollinger, is much more thorough, but leaves something to be desired (a topic which has been covered better elsewhere than can be possibly done by this author in this exposition).

As for Winshares, it would be disingenuous to claim that the ideal and true set of values has been found, but it is my belief that the reasoning is sound, and the results pass the “laugh test,” that is, given a subjective assessment of the sport, the relative importance of each boxscore statistic seems to be, at the very least, in the right order.

To identify the weights used, we may begin with a simple but strong assumption: the most valuable “good things” are those that opponents are most resistant to allowing, and thus are relatively rare, while the most detrimental “bad things” are those that a player is most trying to avoid, and thus are similarly relatively rare. With this in mind, I present counting sums for each of 10 boxscore counting stats from 1979-80 through 2007-08 (which I call the Modern era, characterized by the introduction of the three point shot to NBA play):

pts fgx* ftx* as or dr st bk to pf
6384067 2806562 417958 1469912 823716 1843893 516530 322015 974500 1449354

* field goals missed and free throws missed

Dividing each of these totals by the sum of the totals (17,008,507), we arrive at the following frequencies:

pts fgx ftx as or dr st bk to pf
0.37535 0.16501 0.0246 0.08642 0.0484 0.10841 0.0304 0.0189 0.0573 0.08521

Normalizing these frequencies to that of points, we get:

pts fgx ftx as or dr st bk to pf
1 0.43962 0.0655 0.23025 0.129 0.28883 0.0809 0.0504 0.1526 0.22703

Then, subtract each of the above from 1, so we are placing more weight on the rarer occurrences, and set the points coefficient to 1, because the ultimate aim of all defense is to prevent scoring, and the ultimate aim of all offense is to score:

pts fgx ftx as or dr st bk to pf
1 0.56038 0.9345 0.76975 0.871 0.71117 0.9191 0.9496 0.8474 0.77297
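The three steps above (frequencies, normalization to points, subtraction from 1) can be checked with a few lines of code. This is only a sketch: the totals are copied from the first table, while the function and variable names are my own.

```python
# Modern-era totals, copied from the first table above.
TOTALS = {
    "pts": 6384067, "fgx": 2806562, "ftx": 417958, "as": 1469912,
    "or": 823716, "dr": 1843893, "st": 516530, "bk": 322015,
    "to": 974500, "pf": 1449354,
}

def rarity_weights(totals, base="pts"):
    """Turn raw event counts into rarity-based weights."""
    grand_total = sum(totals.values())
    freq = {k: v / grand_total for k, v in totals.items()}  # share of all events
    rel = {k: f / freq[base] for k, f in freq.items()}      # relative to points
    weights = {k: 1.0 - r for k, r in rel.items()}          # rarer means heavier
    weights[base] = 1.0                                     # points fixed at 1
    return weights

weights = rarity_weights(TOTALS)
for stat, w in weights.items():
    print(stat, round(w, 7))
```

Note that the grand total cancels out when normalizing to points, so each weight reduces to 1 minus the stat's count divided by the points count.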

Assign positivity and negativity according to whether each is helpful or deleterious to team success, and we arrive at a set of scalars for estimating valuable contributions (often abbreviated val):

val = pts - fgx*0.5603802 - ftx*0.9345311 + as*0.7697530 + or*0.8709732 + dr*0.7111727 + st*0.9190908 + bk*0.9495596 - to*0.8473544 - pf*0.7729732

Any player’s val less than zero is then set to zero, but val is rarely a large negative number. Compared to the difficulty of valuable contribution assessment, the final steps in Winshare calculation are extremely simple: merely find each player’s percent contribution to his team’s total sum of valuable contributions from all players, and multiply this by team wins:

winshr = (val / team val) * team wins
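Put together, the two formulas might be coded as below. This is a sketch, assuming that team val is the sum of the players' floored vals; the function names and the toy roster are hypothetical.

```python
# Signed coefficients from the val formula above.
WEIGHTS = {
    "pts": 1.0, "fgx": -0.5603802, "ftx": -0.9345311, "as": 0.7697530,
    "or": 0.8709732, "dr": 0.7111727, "st": 0.9190908, "bk": 0.9495596,
    "to": -0.8473544, "pf": -0.7729732,
}

def val(stats):
    """Weighted sum of a player's season boxscore totals, floored at zero."""
    raw = sum(w * stats.get(k, 0) for k, w in WEIGHTS.items())
    return max(raw, 0.0)

def winshares(team_stats, team_wins):
    """Split team wins among players in proportion to their val."""
    vals = {player: val(s) for player, s in team_stats.items()}
    team_val = sum(vals.values())  # assumption: sum of the floored player vals
    if team_val == 0:
        return {player: 0.0 for player in vals}
    return {player: v / team_val * team_wins for player, v in vals.items()}

# Toy roster (hypothetical numbers) on a 10-win team.
roster = {
    "A": {"pts": 300},
    "B": {"pts": 100},
    "C": {"pts": 50, "to": 100},  # negative raw val, floored to zero
}
shares = winshares(roster, team_wins=10)
print(shares)
```

By construction the shares always add up to the team's win total, so a player can only gain Winshares at the expense of teammates or by the team winning more.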

We are left with an estimate of individual player value that combines individual contributions and team success, and allocates the most credit to those players who did the most to win the most. There is just one adjustment made to allow comparisons across all NBA seasons: for seasons prior to the official distinction between offensive and defensive rebounds, the formula is adjusted to incorporate total rebounds in their stead.

Discussion

The first thing to note is that as we apply the formula increasingly further back in time, we might become somewhat less certain of its absolute accuracy as the boxscore statistics on which it is based drop from the official record. Thus, for the very earliest years of the BAA, we might not be as confident in our estimate as for most years since, but the results are still very compelling, and seem to hold up to scrutiny despite the relative dearth of data. One of the merits of Winshares as a measure is that it is relatively flexible across a variety of situations, relying as it does on player percent contributions, which can almost always be measured in some manner.

Another caveat is to bear in mind that Winshares is a season-cumulative statistic, and so the ceiling varies by the number of games played in a season. Winshares for the lockout-shortened season of 1998-99 are much lower than other contemporary seasons, due to the fact that all teams won fewer games than they normally would have. Adjustments can easily be made, however, by finding per-game or per-minute Winshare rates, and making comparisons at that level. This helps, too, in determining the impact of an injured player, given that he has played fewer games. However, the initial impetus for constructing Winshares was to estimate player value in terms of wins, and this is best done on a season-cumulative scale.

One thing done relatively poorly by Winshares in its current iteration is measurement of the value of players traded during the season. To do this completely accurately, it would be useful to isolate only the games the player appeared in for each of his several teams, looking at individual statistics and team wins within those sub-season units. However, this sort of analysis requires data not generally available in convenient form, and truly, the logical extension of this idea is fairly well captured by the plus/minus statistic. As it stands, Winshares still does a relatively good job (subjectively assessed) in measuring traded players’ value, but it is something worth noting.

Winshares in application

Often understanding is best achieved through application, and so I present

The Top 1,000 Winshare Seasons

covering the NBA, ABA, and BAA from 1946-2008. Keep in mind the above caveats about data availability, especially for seasons prior to 1951-52. In a similar vein, here is a list of

The Top 100 Winshare Careers

again, this is cumulative across the entirety of each player’s career, and so players with longevity are advantaged. I have included games played in this listing, to allow the reader to make his or her own adjustments.

Finally, every player, every team played for, 2007-08 season.

Geometric representation

One of the more useful ways to conceptualize Winshares is as player percent valuable contributions * team success. This has a particularly interesting expression in geometric terms, where Winshares can be thought of as the area of the rectangle created by multiplying valpct by team wins. The following series of visualizations depicts Winshares as a geometric comparison of player value. The color scheme is based on playing style–more detail on this classification may be found here.

2007-08 NBA: Chris Paul edges out Kobe Bryant as most valuable player according to Winshares, Kevin Garnett and Paul Pierce turn in stellar seasons for the Celtics, and LeBron James carries a huge load for his team, and is rewarded in terms of Winshares, if not in post-season success.

1986-87 NBA: A season featuring more all-time greats than perhaps any other (as noted here), we see Larry Bird and Magic Johnson at the height of their rivalry, Michael Jordan and Hakeem Olajuwon coming into their own, and too many other star players to even mention.

1971-72 NBA & ABA (combined): Classic Lakers and Celtics teams, a young Dr. J, Kareem’s greatest year, an almost-as-great year from Artis Gilmore, and countless other NBA past greats.

Sacramento Kings Franchise History: This storied franchise didn’t quite make the playoffs in a very competitive 2007-08 Western Conference, but its history is littered with greats such as Oscar Robertson and Chris Webber.

Categories: basketball · graphics · infovis · metrics · nba · sports · statistics · theory

The People’s Statistic Project

May 7, 2008 · 8 Comments

Thought I’d point you to the People’s Statistic Project: http://peoplesstatistic.googlepages.com. Go and make your voice heard; I’ll have some analysis of the project up later.

Categories: Uncategorized

Smackdown crashing

April 20, 2008 · 3 Comments

I wasn’t invited to the TrueHoop Stat Geek Smackdown (see also), but I figure I’m just as capable of making wild, semi-empirically based predictions as anyone else, so I have done so. I’ll try to keep this up, round-by-round, and we’ll see how I do against more well-known Stat Geeks. Perhaps if I do well, someday I will be a TrueHoop-acknowledged geek…

Using just True Winning Percentages and Bernoulli probabilities, I’ve calculated the probabilities of each possible series outcome, and then normalized to sum to one. (See my spreadsheet) For the first round, I have:

BOS in 4
CLE in 7
ORL in 6
DET in 5
LAL in 6
HOU in 7
SAN in 7
NOR in 6
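For anyone curious how picks like these can be generated, here is a sketch of the series arithmetic. It assumes a single, constant per-game win probability `p` for team A (no home-court adjustment); how `p` is derived from the two teams' True Winning Percentages is not shown here, and the function name and the "A"/"B" labels are hypothetical.

```python
from math import comb

def series_outcomes(p, wins_needed=4):
    """Probability of each (winner, games) outcome in a best-of-7 series.

    p is team A's chance of winning any single game, treated as constant.
    """
    outcomes = {}
    for games in range(wins_needed, 2 * wins_needed):
        # The winner takes the final game; the earlier games can fall in any order.
        paths = comb(games - 1, wins_needed - 1)
        losses = games - wins_needed
        outcomes[("A", games)] = paths * p**wins_needed * (1 - p)**losses
        outcomes[("B", games)] = paths * (1 - p)**wins_needed * p**losses
    return outcomes

probs = series_outcomes(0.7)
pick = max(probs, key=probs.get)  # the single most likely outcome
print(pick, probs[pick])
```

Under these assumptions the eight outcomes already sum to one, so a separate normalization step is only needed if the per-outcome probabilities are computed some other way.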

Probabilities, as sparklines:

Categories: Uncategorized

The MVP debate, part I

March 10, 2008 · 1 Comment

The zeitgeist (MSM, blogs) seems to suggest that this year’s NBA MVP debate is a particularly interesting one, centering around Kobe Bryant and LeBron James. Who is more deserving? Who is better? Who’s made their team better? Who has been snubbed for too long? Who has more support? Etc, etc. In the next few weeks, I will be presenting several different ways of comparing these two, often in contrast with the other two players most often mentioned as potential MVPs, Kevin Garnett and Chris Paul. I plan on using traditional/modern, conventional/unconventional, statistical/subjective means of comparison, and I hope that you readers will help me arbitrate between them. To help in our collective decision making, I have set up a Yahoo! Versus vote, where anyone can make an argument for either player, and then everyone gets to vote for the arguments they find most convincing, leading to a reasonably good collective choice outcome. In fact, in the hopes of web-wide participation, I have created a button linking directly to the vote, and I’m providing code for anyone to add this same button to their web page or blog:

mvpdebate.png

To put this button on your site, just copy and paste this code into your html editor:

<a href="http://versus.bix.yahoo.com/vs/LeBron_James-vs-Kobe_Bryant">
<img src="http://arbitrarian.files.wordpress.com/2008/03/mvpdebate.png"> </a>

I encourage you to add your own arguments and vote, and check back often, as more arguments will be added over time, and you can reallocate your votes as many times as you want.

Categories: Uncategorized

NBA Players in their prime

March 4, 2008 · 1 Comment

It has been suggested that I look at players’ statistics from only the primes of their careers. This is a good idea, given that both very inexperienced and very old players will “regress to the mean” in terms of their performance and possibly, playing style. As such, I generated a sum of each player’s boxscore statistics during the modern era across only their best seasons. My definition of “best” was simple: not their worst. For each player, I found their mean seasonal winshr, as well as the standard deviation of their seasonal winshr. Any season for which a player’s winshr was greater than the mean less one standard deviation was included in this analysis. This way, I excluded seasons in which a player was injured or relatively underused because of age or because of a minor role on their team. Chris Webber’s current and previous seasons, for example, would not be included. In this way, I hope to get at the “pure” essence of each player for an even better comparison. You will probably not be surprised to see that the diagram looks very similar to the non-peak-performance versions:

nbaprimethumb.png  NBA players at peak performance [pdf]

A few interesting things to note, however: at their peak, Michael Jordan and Larry Bird are now among each other’s closest matches. Also, taking a macro view of the whole network, it is now easy to identify several different clusters: In bluish purple at top left, we can see defensive-minded, “dirty work” bigs, while at the bottom in blue are more scoring bigs. To their right is a reddish group of primarily scorers, while going north from there in green we see “pure point guards” and then more scoring point guards. Etc, etc. Let me know if you notice any other interesting connections or clusters in the comments.
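The prime-season filter described at the top of this post can be sketched as follows. It assumes one winshr value per season and the sample standard deviation; the function name and the example career are hypothetical.

```python
from statistics import mean, stdev

def prime_seasons(season_winshr):
    """Keep seasons whose winshr exceeds the player's mean minus one sd."""
    values = list(season_winshr.values())
    if len(values) < 2:
        return dict(season_winshr)  # too few seasons to measure spread
    cutoff = mean(values) - stdev(values)
    return {s: w for s, w in season_winshr.items() if w > cutoff}

# Hypothetical career: one injury-shortened year among four solid ones.
career = {"y1": 2.0, "y2": 10.0, "y3": 11.0, "y4": 12.0, "y5": 10.0}
kept = prime_seasons(career)
print(sorted(kept))
```

In this toy career the mean is 9 and the standard deviation is 4, so only seasons above 5 survive and the injury year drops out.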

Categories: Uncategorized

Dimesworth of difference?

March 2, 2008 · 3 Comments

Using roll call votes from the 110th Senate through the end of last year, I have constructed a network diagram based on maximum similarities between Senators’ voting records. Essentially, distances were calculated by assigning a 1 to yes votes and a 0 to no votes, and summing the absolute differences between each pair of Senators across all roll call votes. Thus, two Senators who vote identically have a distance of 0, while two Senators who vote completely opposite ways have a distance equal to the total number of roll calls. Based on these distances, I constructed a network diagram linking each Senator to their two most-similarly-voting counterparts. I also colored each vertex according to how similar each Senator is to “all Republicans” and “all Democrats” collectively. The result revealed the highly polarized nature of the Senate: there is only a single strand linking Republicans to Democrats:

110thnetthumb.png 110th Senate Roll Call Network Diagram [pdf]

I then decided to reduce the number of connections to only the single closest match for each Senator, and found something interesting that you will hear only rarely from the media: Senators Clinton and Obama are each other’s closest match, based, at least, on roll call votes in the 110th Senate through the end of 2007. This would seem to indicate that the wide disparities perceived between them in the eyes of the media and the public have little to do with actual policy/ideological divides, but rather that personality and framing (and possibly demographics) are making up the bulk of voting preferences in many Americans’ minds.

110thisothumb.png 110th Senate Roll Call Isolated Networks [pdf]

I was aware, to some extent, of the constructed, rather than actual, nature of the differences between the two Democratic competitors, but to see the roll call evidence fall out so starkly was surprising.
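The distance and linking steps described at the start of this post can be sketched like this. It assumes every Senator has a recorded 1/0 vote on every roll call (absences and abstentions are not handled), and the function names, Senator labels, and votes are all made up for illustration.

```python
def vote_distance(a, b):
    """Number of roll calls on which two voting records differ (1 = yes, 0 = no)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_matches(votes, k=2):
    """Map each Senator to the k Senators with the most similar records."""
    links = {}
    for name, record in votes.items():
        others = sorted(
            (other for other in votes if other != name),
            # Ties are broken alphabetically so the result is deterministic.
            key=lambda other: (vote_distance(record, votes[other]), other),
        )
        links[name] = others[:k]
    return links

# Four hypothetical Senators voting on four roll calls.
votes = {
    "A": [1, 1, 0, 1],
    "B": [1, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "D": [0, 0, 1, 1],
}
print(closest_matches(votes, k=2))
print(closest_matches(votes, k=1))
```

Setting k to 2 corresponds to the first diagram and k to 1 to the isolated-networks version, where only each Senator's single closest match is drawn.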

Categories: Uncategorized

MLB Batter network diagram by statistical proximity

March 2, 2008 · No Comments

The next in a series consists of batters in the MLB from 1955-2007 (because the modern set of statistics has not changed since the 1955 season). I think these statistics lend themselves less well to this sort of analysis, but it may be interesting to you baseball enthusiasts out there.

batnetthumb.png Batter Statistical Proximity [pdf]

Categories: Uncategorized

Network diagram example code

February 29, 2008 · 2 Comments

I’ve had requests for my data and for the code I used to make these plots. So, in the spirit of openness, I’m posting them. If you would like to use them, please adhere to the Creative Commons license I’ve chosen, and let me know what you come up with. The .csv is the top 1000 careers over the last quarter century-or-so, determined by a playing-time-based statistic, and the .R file will run in R, and requires you to install the package sna. The sna package is awesome; it makes network diagramming essentially idiot-proof. Note that I currently have this code writing to a PDF, and that it cannot write to the PDF if a PDF with the same filename is open. Also, remember to change the csv’s file directory in the R code, or it won’t work. Please let me know if it’s not working for you, or if you know of a more efficient way of doing the same thing.

          1000 Top Careers [csv]

          Network Diagram Example [R]

          Categories: Uncategorized

          NBA season network diagram

          February 29, 2008 · No Comments

          It was suggested that I compare players on single-season data, rather than career-sum data, both as a validity test and to gain other insight. It goes without saying that players’ styles change over their careers–often, scorers become less effective and try to do other things well. Sometimes (as with Jordan, for example), we see players add dimensions to their game over time. So, I present yet another network diagram, one which illustrates the changing nature of each player. A few notes: this set is somewhat scorer-heavy, because of the way I generated the list of best seasons (using a Euclidean distance metric). Also, when looking at this, it helps to keep in mind that this is a two-dimensional rendering of a hyperdimensional network–unless players are actually connected, visual proximity doesn’t necessarily mean anything, although it may not mean nothing. It would appear, given the degree to which players’ seasons cluster together, that the proximity algorithm functions fairly well.

          NBA Seasons Proximity Network [PDF]

          Categories: Uncategorized

          Senate partisanship history timeline discussion

          February 28, 2008 · No Comments

          A few days ago, I posted a history of partisanship in the Senate, which you should check out if you haven’t; it’s pretty fancy. Nathan Yau, author of the blog FlowingData, posted a helpful critique on his blog. I responded in the comments, and reproduce those comments below. If you have any comments or suggestions, please let me know. I am trying to optimize it, and any feedback is useful.

          Thanks a lot for soliciting comments. You raise a lot of good questions here, and I thought I might try to respond to some of them. My answers aren’t the final answer; mostly I’d like to try an initial justification of some of my design choices:

          * I wasn’t immediately sure what each visual cue represented e.g. size of state abbrev. until I reached the bottom. It might be worth making the annotation more prominent either by position, size, or color or all three.

          This is a pretty good point. It may help to move the key. Mostly, I put it at the bottom to minimize its obtrusiveness.

          * To me, the congress numbers don’t matter so much, but that just might be I don’t have a lot of learning on the history of American government.

          The congress numbers and years are in some ways redundant, but congress scholars often refer to congresses by their number. In fact, the years are only there for those less familiar with the congress number, to give a sense of where you are in history.

          * I’m wondering if there’s some way to make the labeling of the years more concise? If you just labeled with the first year of the two-year term, would it be obvious that you’re describing a two-year term? What if you took away the alternating gray background and just made it all white and then had a bar timeline-type thing on top (and bottom)?

          I may be able to do without both years, since it is known that there are always two years to each congress. The gray and white bars are somewhat useful because, although it’s not labeled (it should be), the points within each session all have a certain left-right jitter. This jitter makes it easier to read, and actually conveys in a very subtle way the second dimension of the ideological scale on which each Senator is plotted. If you read more about DW-NOMINATE, you will find that the primary dimension dominates, but for certain time periods, a second dimension becomes important. I thought I would include it subtly, because it also helped with readability.

          * What if you tried to use a color scheme? I mean, you have the red and blue for the reps and dems (which I think is right), but the gradient for the senate counts turns very bright pink and purple which doesn’t go too well. Then there’s the cyan, yellow, and green which doesn’t seem to have any specific significance other than each color represents something. What I mean is… is there a reason you chose those colors?

          The colors chose themselves: red and blue have come to be identified with each of the parties. Green was my remaining option out of the RGB set, and I made all Southerners’ green value equal to 255. Then, every Democrat’s B value varies as a function of their party unity (the degree to which they voted with the party). The same for Republicans and Red. Thus you can read members’ party loyalty into their color. The interesting thing is, disloyal northerners look dark, even blackish, but disloyal southerners’ lack of R and B makes them increasingly only green. Thus, for example, the very disloyal Southern Democrats of the mid-20th Century can obviously stick out as very green, where other Southern dems are various shades of teal and greenish-blue. This reflects a very important shift in the history of Congress, and it’s all indicated right there, just as a function of geography and loyalty transformed to color values.
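
          As a rough illustration of the color mapping described above, here is a minimal Python sketch. It is not the code behind the poster: the function name, the linear scaling of party unity to 0-255, and the choice of zero green for non-Southerners are all assumptions made for the example.

```python
def senator_color(party, southern, unity):
    """Map a Senator to an (R, G, B) tuple.

    party: "D" or "R"; southern: bool; unity: party-unity score in [0, 1].
    """
    level = round(255 * unity)            # assumption: loyalty scales linearly to 0-255
    red = level if party == "R" else 0    # Republicans' loyalty drives the red channel
    blue = level if party == "D" else 0   # Democrats' loyalty drives the blue channel
    green = 255 if southern else 0        # Southerners get full green (0 otherwise is assumed)
    return (red, green, blue)

# Hypothetical examples of the four cases the text describes.
disloyal_southern_dem = senator_color("D", True, 0.0)
loyal_southern_dem = senator_color("D", True, 1.0)
disloyal_northern_rep = senator_color("R", False, 0.0)
loyal_northern_dem = senator_color("D", False, 1.0)
```

          Under these assumptions a completely disloyal Southern Democrat comes out pure green, a loyal one comes out cyan, and a disloyal Northerner of either party comes out black, which matches the pattern described in the text.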

          * It might be worth making the annotations bigger so that you don’t have to “zoom in” to read.

          Also possibly valid, although part of the reason I made them small is that my original intent was to design for print, where the poster will be about 24×36 inches, and the labels will be fairly legible.

          * I think I would make the median lines a bit more prominent, but that’s just me.

          Not a bad idea, but I a) don’t want them to completely dominate, and b) want to maintain legibility of the overlaid state names as much as possible. I may be able to make the medians wider, but then in a sense, one loses accuracy.

          * There’s a lot of cool stuff getting represented here, and I wonder if anything might benefit as a separate graph. Would this benefit at all as a series of graphs instead of one large graphic?

          Possible, except one of the things I like most about it is that it tells almost the entire story of partisanship and something called conditional party government (which relates to the density graphs at the bottom), all in one place. So it’s a very comprehensive and relatively quick way to get all of it “at a glance” if you know what to look for.

          Categories: Uncategorized

          Toward a basketball taxonomy

          February 25, 2008 · 5 Comments

          I hope this isn’t getting repetitive, because I’ve got a diagram that will blow your mind: it’s like the entire NBA in a petri dish, with all different phyla and genera of player types represented. I used the same methodology I’ve been using (with the per-minute, rather than ratio statistics), but generated the graph with fewer connections (just the single closest match) per player. As a result, there are a whole lot of isolated clusters instead of one completely interconnected network. Also, I went ahead and did 1,000 players at once, instead of the standard 250. What I got astounded me–they look like microorganisms swimming around on the microscope slide that is the NBA. I apologize for the tiny font–if you zoom in to 125%, it should be readable–but had I made the names any larger, they would have overlapped to an illegible degree.
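
          To see why a single closest match per player produces isolated clusters, here is a small Python sketch (not the R/sna code used for the diagram). It links each player to his one nearest neighbor and then groups the linked players with a union-find; the player names and distances are hypothetical.

```python
def closest_match_clusters(distances):
    """Group players into the isolated clusters formed by 1-nearest-neighbor links.

    `distances` maps each player to a dict of {other_player: distance}.
    """
    parent = {name: name for name in distances}

    def find(x):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for player, row in distances.items():
        closest = min((name for name in row if name != player), key=row.get)
        parent[find(player)] = find(closest)   # merge the two players' clusters

    clusters = {}
    for name in distances:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())

# Hypothetical distances: A and B are close, C and D are close, the pairs are far apart.
distances = {
    "A": {"B": 1.0, "C": 9.0, "D": 10.0},
    "B": {"A": 1.0, "C": 8.0, "D": 9.0},
    "C": {"A": 9.0, "B": 8.0, "D": 2.0},
    "D": {"A": 10.0, "B": 9.0, "C": 2.0},
}
clusters = closest_match_clusters(distances)
```

          With more links per player the clusters start to merge, which is why the two- or three-neighbor versions form one interconnected network.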

          nbapetrithumb.png The NBA “petri dish” diagram [pdf]

          I would be very interested in collectively coming up with a sort of “baller’s taxonomy,” wherein we try and identify the different clusters using some more subjective terms. I think we could come up with a better vocabulary to describe players and define playing styles. If you have any ideas, please put them in the comments, and if there is sufficient interest, I may come up with a more formalized process, in the hopes of putting together a follow-up diagram with labels.

           

          Since I had already run the algorithm anyway (it takes a lot of cycles to do 1,000 players), I went ahead and made a completely connected version of the 1,000 player diagram. Warning: this one is pretty hard to parse.

          nba1000thumb.png 1000 player network diagram [pdf]

          Keep in mind that the search function (ctrl-f) will be really useful for these.

          Categories: Uncategorized

          NBA player similarities matrix revisited

          February 25, 2008 · 3 Comments

          In response to some questions at the APBRmetrics forum, I’ve put together a new NBA similarities network (Top 250 players version), wherein I use per-minute statistics, instead of my “patented” ratios method, just to see how it looks. In a lot of ways, this looks just as good or even better than the ratios version… I’m still somewhat torn, though: The ratios method, by ignoring time statistics completely, attempts to match players who, given a possession (or given an opponent with a possession), will do similar things with it, while the per-minute method does a better job of representing “substitutability.” I suppose I will let history be the judge, but I don’t think anyone loses when more pretty graphs are made:

          nbaaltthumb.png NBA player similarities [pdf]

          Another version with Extremely High Contrast Labels for Easy Reading: [pdf]

          Categories: Uncategorized

          Quarterback network diagram by statistical proximity

          February 25, 2008 · No Comments

          For more information, just refer back to my last post, which dealt with this same methodology applied to the NBA. I’ve done the same thing here with some of the top quarterbacks in NFL history. This time, I used propensity to rush vs. pass, and yds/att and yds/rush to color-code them. I’ll let you make your own analysis, football’s not my specialty:

          qbnetthumb.png

          Quarterback Network Diagram [pdf]

          Categories: Uncategorized

          Dynamic logo

          February 23, 2008 · No Comments

          This is my entry to the DataPortability.org Logo Competition, it’s a dynamically changing graph that represents the pursuit of universal Data Portability. Here’s the description I submitted:

          Both the plot and the colors in this logo change dynamically as a function of time, and a slightly different graph is generated with every pageload. This may be seen to represent our pursuit of universal Data Portability. As time passes, the chart in the logo shows our current status (colored line) slowly coming to match our eventual goal (gray line), representing both the accomplishments and setbacks we face in this endeavor. To the left are several snapshots taken throughout the day. You can see how the tail end of the graph is changing in each, and over time, the entire line shifts and realigns itself, creating a dynamic yet consistent visual identity.

          Additionally, this logo can be implemented anywhere by anyone, and the data it conveys will be personalized to each individual’s time zone, further reinforcing the data portability metaphor.

          To see a working version of the logo (WordPress.com won’t allow the JavaScript), please go here.

          Categories: Uncategorized

          NBA similarity networks

          February 22, 2008 · 10 Comments

          Having covered my operationalization of statistical similarity, and offered some evidence of its usefulness, I’d like to share what I perceive as the best part of the whole endeavor, the pictures. Using R and the sna package, along with the distances I’d previously computed [zip], I’ve put together a network diagram of player similarity. Basically, each player has two or three arrows coming out of him, pointing to the players that are most similar to him. Then, using some brilliant algorithm I don’t fully grasp, each player is plotted so that they all cluster together in groups, by similarity. I’ve then colored each player/node according to the usual formula, meaning that each is colored according, basically, to how their contributions are distributed. Past analysis has indicated that propensity to take shots, post-area stats, and perimeter-area stats (to apply somewhat arbitrary characterizations), are a good way of determining colors. See other posts for more on this. Anyway, I have two versions each of two different networks: Both .png and .pdf versions of the Top 250 players of the modern era, and then .png and .pdf versions of players 251-500, the second tier. (I recommend looking at the .pdfs first, because they’re higher-resolution, and easier to scroll around. Note that the .png and .pdf versions are different because of the way the plotting algorithm works… it’s the same data, shown in a somewhat different way.) I hope you find this interesting and/or useful, and please feel free to comment on the validity of this approach.
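
          The "two or three arrows coming out of each player" step can be sketched in Python as follows. This is only an illustration of building the directed edge list from a distance matrix, not the R/sna code used for the diagrams; the function name, the value of k, and the sample distances are hypothetical, and the layout algorithm itself is not reproduced here.

```python
def nearest_neighbor_edges(distances, k=2):
    """Return directed edges (player, neighbor) to each player's k closest others.

    `distances` maps each player to a dict of {other_player: distance}.
    """
    edges = []
    for player, row in distances.items():
        # Sort the other players by distance and keep the k nearest.
        others = sorted((d, name) for name, d in row.items() if name != player)
        edges.extend((player, name) for _, name in others[:k])
    return edges

# Hypothetical distances between four players.
distances = {
    "A": {"A": 0.0, "B": 1.0, "C": 2.0, "D": 5.0},
    "B": {"A": 1.0, "B": 0.0, "C": 1.5, "D": 4.0},
    "C": {"A": 2.0, "B": 1.5, "C": 0.0, "D": 3.0},
    "D": {"A": 5.0, "B": 4.0, "C": 3.0, "D": 0.0},
}
edges = nearest_neighbor_edges(distances, k=2)
```

          Note that the edges are directed: D points to C and B here, but nobody points back to D, because D is not among anyone's two closest matches.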

          netthumb1.png Tier One [pdf] [png]

          netthumb2.png Tier Two [pdf] [png]

          Update: If you like those graphs, you will really really like these:

          nbapetrithumb.png

          nbaaltthumb.png
          In fact, why don’t you look around, or subscribe?

          Categories: Uncategorized

          The modern two-party system in the US Senate

          February 21, 2008 · 2 Comments

          I’ve created a timeline of the ebb-and-flow of party politics in the US Senate since the beginning of the modern (Democrat & Republican-dominated) two-party era. Beginning with the antebellum 35th Congress, and progressing through to the 109th, this timeline tells the story of the evolution of politics in America as played out on the floor of the Senate.

          Political Scientists, Historians, and even casual observers of political history have long noted the shift in ideological nature of the two major parties since the Civil War, and within the 20th Century alone–this timeline conveys a sense of that shift by depicting the scaled left-right ideological positions of each Senator along the vertical axis: a macro view of the entire time period illustrates the great distance between the parties up until the mid-20th Century, at which time Civil Rights-related issues began to create crosscutting cleavages within the parties, especially in the South. The bright green of Southern Democrats, voting with the Republicans in a Conservative Coalition, is readily apparent, as the distance between party medians converges.

          Just as apparent is the realignment beginning in the late 1970s-early 1980s, where, as Political Scientists have documented, the polarization of the electorate and of elected officials became a dominant trend. This is illustrated both in the main timeline, and also by the series of density plots just below the main frame. At various times, the ideological distribution is obviously unimodal, or obviously bimodal: from the 79th to the 109th Congress we can witness the polarizing of the Senate, which according to the theory of Conditional Party Government, has led to skewed policy outcomes.

          This visualization rewards careful inspection–there are stories to be found everywhere: follow the positions of the Presidents, relative to their own party members; note how party leaders are typically close to their own party median, find patterns in individual states’ ideology over time–at any level of inspection, the data offer up a rational yet compelling history.

          The ideological dimensions are determined based on Senate Roll-Call votes, and scaled to be historically consistent, such that comparisons may be made across historical eras. The measure is called DW-NOMINATE, and was developed by Keith Poole and Howard Rosenthal. Please leave any comments or questions, as this work is always in progress.

          Categories: Uncategorized

          Chris Mullin is the next Michael Jordan

          February 19, 2008 · 7 Comments

          I have in this space previously discussed how to find how similar any two players are, based solely on their boxscore statistics, and attempted, to some extent, to justify myself theoretically. Now, to unveil the results: For my dataset of all modern (1979-2007) NBA players, I subsetted the top 500 according to the formula (min^(10/9))/gp, which is a kind-of weighted minutes-per game statistic that values both playing time and longevity. Thus, I could extract some of the best (admittedly measured poorly, by playing time) younger players, and a good number of veterans at the same time. I summed their career statistics across the entire time period, and ran them through the distance finding algorithm discussed in the previous post. This resulted in a matrix of distances, which I offer to you here as a 501 x 501 cell .csv file, which I’ve zipped to about 1.3 MB:
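
          For readers who want to try the selection formula, here is a minimal Python sketch. Since (min^(10/9))/gp equals minutes per game multiplied by min^(1/9), the formula is minutes per game with a mild bonus for total minutes. The career totals below are hypothetical, not taken from the dataset.

```python
def playing_time_score(minutes, games):
    """The selection formula from the text: (min^(10/9)) / gp."""
    return minutes ** (10 / 9) / games

# Hypothetical career totals: (total minutes, games played).
careers = {
    "long_career_veteran": (40000, 1200),   # about 33.3 minutes per game
    "young_star": (9000, 240),              # 37.5 minutes per game
    "bench_player": (6000, 600),            # 10 minutes per game
}

# Rank careers from highest score to lowest.
ranked = sorted(careers, key=lambda name: playing_time_score(*careers[name]), reverse=True)
```

          In this made-up example the veteran edges out the young star despite playing fewer minutes per game, because the extra exponent rewards longevity, while the young star still ranks far above the long-serving bench player.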

          Top 500 distance matrix

          However, I’ve also got a selected subset (due to size considerations) of comparisons posted to Google Docs, and it should be sortable, but not editable:

          Selected distances Google Spreadsheet

          Now, for the punchline: a method such as this can be used to give us new insights. If we accept that the comparisons it makes are valid in general, then we may be able to accept the comparisons that surprise us. For example, if the matching algorithm tells us that the players most statistically similar to Michael Jordan are Kobe Bryant, LeBron James, Tracy McGrady, Dwyane Wade, Vince Carter, Clyde Drexler, and Paul Pierce, I would be tempted to accept the validity of such comparisons. Thus, I would argue that I should be willing to accept the conclusion that the player most similar to Jordan is none of these, but rather, Chris Mullin (who is of course frequently compared to Larry Bird, seeing as they are both Caucasian, but to whom I have never heard Jordan compared).

          To conclude, I urge you to play around with both the Google Spreadsheet and the entire .csv matrix on your own. Please let me know if you find the comparisons to ring generally true, and if so, whether there were any that surprised you.

          Categories: Uncategorized

          Objective statistical player matching

          February 18, 2008 · 2 Comments

          You may have seen elsewhere sites that allow you to see, for any given player, or for any given player-season, the other players or seasons which most closely match the one you’re looking at. I think this is neat, because it is a fundamental sports fan drive to compare players to one another — not only questions of who is better than whom, but also, To whom is this player most similar? We use these sorts of matching questions when forecasting how a collegiate draftee will fare in the NBA — Greg Oden is supposed to be like Patrick Ewing, which is a good thing, and Kevin Durant is supposed to turn out like a Tracy McGrady/Kevin Garnett hybrid, which sounds very good. We inevitably compare almost any high-scoring shooting guard/small forward (McGrady, James, Bryant, Wade, etc. Here’s an article with a long list.) to Michael Jordan, and almost every well-rounded, sweet-shooting Caucasian gets compared to Larry Bird. I believe that this eternal comparative endeavor is an important and interesting one, that can tell us something not only about individual players, but about the structure of the league as a whole.

          Thus, I set out to make my own comparisons. I seem to recall that for certain other player comparators I had seen online, only a small set of statistics were chosen on which to make the comparisons. While I am sure there were good reasons for choosing each included metric, I find such an approach unavoidably arbitrary and incomplete. Rather, I thought it imperative to use every available box-score statistic, so as to not “unfairly” skew the results. However, when I just threw in season (or career, or per-game, or per-minute) totals, the output merely put those with high total numbers close to others with high total numbers, and low with low, and so on (e.g. Michael Jordan similar to Karl Malone, because they both scored a bazillion points), without regard to their playing style, position, or skillset. To solve this problem, I hit upon the idea of converting each player’s boxscore statline into a set of ratios.

          This set of ratios would be exhaustive, including every counting stat over every other counting stat: pts/as, pts/st, pts/ftm… such that where n is the number of counting statistics for each player, n^2 is the number of ratios, including ratios==1, such as to/to, fga/fga, etc. This, I hypothesized, would facilitate legitimate comparisons: any player with identical ratios, though their counting statistic totals may differ, played a fundamentally identical game in terms of statistical output. I liked this idea for a number of reasons, including the fact that it allowed for comparisons across players of all experience levels and eras, and (I would argue) that it completely eliminated subjective concerns such as race, background, and especially, hype. Having decided this, I generated comparisons according to the following basic steps:

          1. Generate set of names and all boxscore counting statistics.
          2. For each player in the set, generate a set of ratios for each boxscore stat over every other.
          3. Percentile these, so that ratios with typically low values (e.g. pts/fga) are not outweighed by those with typically high values (e.g. min/bk).
          4. Find the Euclidean distance between each pair of players in n^2-space, by finding the square root of the sum of squared differences between each player’s ratios.

          And that’s it. Thus, for any set of players (or teams, or really, anything in the world), I can compute distances, and thus make objective comparisons. This idea is not new in any sense, but I think my contribution may be in refining the inputs a little bit, to make the outcome that much more useful. There are about 100 billion things that can be done with this, and in the next few days, I plan on releasing some of them into the wild.
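
          The four steps above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the original implementation: the function names, the handling of zero denominators, the particular percentile definition (share of players at or below a value), and the three sample stat lines are all hypothetical.

```python
import math
from itertools import product

def ratio_vector(stats):
    """Every counting stat over every other one: n stats give n^2 ratios."""
    keys = sorted(stats)
    ratios = []
    for a, b in product(keys, repeat=2):
        if a == b:
            ratios.append(1.0)            # to/to, fga/fga, etc.
        elif stats[b] == 0:
            ratios.append(math.inf)       # assumption: zero denominators rank highest
        else:
            ratios.append(stats[a] / stats[b])
    return ratios

def percentile_columns(rows):
    """Replace each ratio by the share of players at or below it in that column."""
    n = len(rows)
    columns = [[sum(other <= v for other in col) / n for v in col]
               for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

def distance_matrix(players):
    """Euclidean distance between every pair of players' percentiled ratios."""
    names = list(players)
    pct = percentile_columns([ratio_vector(players[name]) for name in names])
    return {(a, b): math.dist(pct[i], pct[j])
            for i, a in enumerate(names) for j, b in enumerate(names)}

# Hypothetical career totals; "scorer_x2" is "scorer" with every stat doubled.
players = {
    "scorer":    {"pts": 2000, "as": 300, "tr": 400},
    "scorer_x2": {"pts": 4000, "as": 600, "tr": 800},
    "passer":    {"pts": 800,  "as": 900, "tr": 250},
}
distances = distance_matrix(players)
```

          Because only ratios enter the calculation, the two scorers end up at distance zero from each other even though one has twice the totals, while the passer sits some distance away from both.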

          Categories: Uncategorized

          Player head-to-head comparison charts

          February 13, 2008 · No Comments

          In the vein of the previous line of charts I’ve put together, I present Player head-to-head comparison charts, wherein each of the top 25 players in the league is compared to each other according to scoring, rebounding, passing, and defensive ability, as well as an overall metric of performance. The charts look like this:

          And are available here:

          Categories: Uncategorized

          Embeddable league leaders and yesterday’s best

          February 12, 2008 · No Comments

          Here I present a freely embeddable, daily-updated tabulation of the NBA’s League Leading producers, and Yesterday’s Best performers:

           

          As the caption suggests, if there were no games played yesterday, or if the data just hasn’t updated yet, the chart will present the top 10 cumulative League Leaders instead of just the top five. The rankings are determined by a Euclidean distance metric (described in more detail in this graphic), but one can think of it as taking a player’s points per game, and adding value based on their other production, so that a one-dimensional high scorer may be rated worse than a somewhat lower-scoring multi-dimensional player. The best thing about all this is that it’s freely embeddable by anyone! Just take the following code:

          <img src="http://spreadsheets.google.com/pub?key=pjtolzxemBV5oAS03cibvMA&oid=4&output=image"/>

          And paste it wherever you would any regular image. (Make sure the whole thing fits on one line, and that there are no spaces between the quotation marks) That’s it! Now you have your own copy of the chart, and it will update every day for you, too! Please let me know in the comments if this is useful, or if you have ideas for even more useful charts.

          Categories: Uncategorized

          Basic player charts for everybody! (880 free charts for your next post)

          February 11, 2008 · 2 Comments

          I thought that as a public service, I would put together a source for simple statistical charts of NBA player production. The first one has really basic per-game and per-48 statistics, but there are more planned for this series. I’ve got it set up as a Google Spreadsheet which will update daily, and which generates Google Charts which anyone can copy/paste/steal/leech/whatever — in fact, I hope that someone finds them useful. The first set of charts features Per-Game Output breakdowns, like:

          S Marion Breakdown

          And Per-48 Positional Comparisons, like this:

          S O'Neal Position Comparison

           

          Without further ado, I present the Basic Player Chart Generator Spreadsheet.

          Feel free to use any of the charts in any application you wish, just a) copy the image itself, or b) use the URL as an image link. Also, you can embed the entire spreadsheet in any page using the following code, just copy and paste it in its entirety:

          <iframe width='500' height='300' frameborder='0' src='//spreadsheets.google.com/pub?key=pjtolzxemBV6yam3DIQjTfA&output=html&gid=4&single=true&widget=true'></iframe>

          Anyway, I hope you find it useful. Please don’t hesitate to offer any comments, questions, criticisms, and I am especially interested in hearing what kind of charts would be useful for the future–things that people might make use of in their own blogs.

           

           

          Categories: Uncategorized

          New (to me) sites that I find interesting:

          February 10, 2008 · No Comments

          Alltop, which aggregates various news sources by category, and presents them all in a single, cleanly-designed place. It may replace my feedreader as my first stop for news.

          Versus and Predictify, both could be characterized as social voting sites, but with their own interesting spin.

          Pixish, which someday may ask for something I could do.

          Categories: Uncategorized

          The rise of the three-pointer

          February 1, 2008 · 2 Comments

          The other day I was wondering about whether or not teams act rationally when making their shooting decisions, and I have some ideas about how to determine this, but I’ll post on that later. For now, I’d like to present something really interesting I discovered as I was going through the data.

          First of all, over the Modern era (1979-80 through today), points per three-point attempt have almost exactly equaled points per two-point attempt, at approximately 0.973 (which is, interestingly, almost exactly one point per fga). This led me to believe, at first, that it would be a good measure of shooting-decision rationality to compare a team’s or a player’s points-per-shot from inside and outside the arc: if pts/3fga, for example, is substantially higher than pts/2fga, it would seem to indicate that too many bad two-pointers are being attempted, and that some of those should be passed out for three-point attempts (but more on this in another post).
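
          The comparison is simple arithmetic, sketched here in Python. The function name and the shooting totals are hypothetical; free throws are left out, as in the text. A useful reference point is that shooting one-third on threes yields the same one point per attempt as shooting one-half on twos.

```python
def points_per_attempt(fgm2, fga2, fgm3, fga3):
    """Return (pp2, pp3, pp3/pp2) from two- and three-point makes and attempts."""
    pp2 = 2 * fgm2 / fga2      # points per two-point attempt
    pp3 = 3 * fgm3 / fga3      # points per three-point attempt
    return pp2, pp3, pp3 / pp2

# Hypothetical team totals: 48% on twos, about 34% on threes.
pp2, pp3, ratio = points_per_attempt(fgm2=480, fga2=1000, fgm3=120, fga3=350)
```

          A ratio above 1 means three-point attempts are returning more points per shot than two-point attempts, which is the situation the rationality argument would read as "too many bad two-pointers."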

          What I found in doing a little EDA, however, indicated that if this is how we conceive of rationality, the league as a whole has not been rational on a year-by-year basis (which sort of undermines my claim). What I did find, instead, was that pts/3fga are now higher than pts/2fga, leaguewide. However, this has not always been the case:

          Adoption of the three-pointer


          It appears as though when the three-pointer was first introduced (the season which I mark as the beginning of the “modern” era) defenses were at first unprepared to handle it, but then they quickly adapted, and the ratio of pp3/pp2 was fairly low for a while. However, from the mid-80s onward, that ratio steadily increased, until today, when the league seems to have achieved some sort of equilibrium at which pp3 is noticeably greater than pp2. I am not a basketball historian, so this is just the best explanation I have come up with, and I would be very interested in hearing someone more knowledgeable give me a better story to go with the data.

          Categories: Uncategorized

          Valpct

          January 31, 2008 · 1 Comment

          Valpct is derived from val, or “valuable contributions,” which began as a straightforward, unweighted sum of player and team statistics: (pts+tr+as+st+bk-to). However, it has since been modified to weigh assists and blocks somewhat differently:

          (pts+as/(tfgm-fgm)/(tfga-fga)+tr+st+bk/(tdr/(tdr+oor))-to)

          Similarly for teams.

          valpct for each player is that player’s val divided by his team’s total tval; the idea is to somewhat crudely capture the player’s contribution to total team statistical output.
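
          A minimal Python sketch of the calculation follows. To stay on safe ground it uses the original unweighted val, (pts+tr+as+st+bk-to), rather than the modified weighting, and it assumes tval is simply the sum of the players’ vals. The function names and stat lines are hypothetical.

```python
def val(line):
    """Original unweighted 'valuable contributions': pts + tr + as + st + bk - to."""
    return line["pts"] + line["tr"] + line["as"] + line["st"] + line["bk"] - line["to"]

def valpct(team):
    """Each player's share of total team val. `team` maps player name -> stat line."""
    vals = {name: val(line) for name, line in team.items()}
    tval = sum(vals.values())     # assumption: team val is the sum of player vals
    return {name: v / tval for name, v in vals.items()}

# Hypothetical two-player "team" with season totals.
team = {
    "starter": {"pts": 1500, "tr": 500, "as": 300, "st": 100, "bk": 50, "to": 200},
    "reserve": {"pts": 400, "tr": 250, "as": 100, "st": 40, "bk": 20, "to": 60},
}
shares = valpct(team)
```

          By construction the shares across a team sum to one, so valpct reads directly as a fraction of team output.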

          Categories: Uncategorized

          Winshares

          January 31, 2008 · 2 Comments

          Often shortened to winshr, winshares are a measure of basketball productivity. Essentially, the measure credits each player for their contribution to team performance, allocating team wins to each player according to their valpct: (valpct*team wins). The idea is that each player is responsible for some of his team’s production, and that team’s production results in some number of wins. If a bad player is responsible for a large valpct, team wins will be lower than if a good player were in his place, and this difference is reflected in winshr. This measure is useful in attempting to determine value, as it is a more flexible way of determining the “best player on the best team”, leveling the playing field for good players on great teams and great players on good teams, for example. See also this background.
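
          The allocation itself is a single multiplication, sketched here in Python with hypothetical valpct values and a hypothetical win total.

```python
def winshares(valpcts, team_wins):
    """Allocate team wins to players in proportion to valpct: winshr = valpct * wins."""
    return {name: share * team_wins for name, share in valpcts.items()}

# Hypothetical team: two players splitting output 75/25 on a 48-win team.
winshr = winshares({"starter": 0.75, "reserve": 0.25}, team_wins=48)
```

          Because the valpct values on a team sum to one, the winshares on that team sum to its win total.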

          I am aware that this measure is imperfect, especially since it has no way of taking into account whether a team’s wins and success come while the player is on the floor or off (which is especially significant for injured or traded players). For this, a better statistic is probably plus/minus, but here at the Arbitrarian, we usually stick with box score stats, which are usually easier to get hold of.

          Categories: Uncategorized

          NBA Game Theory

          January 28, 2008 · 11 Comments

          It is common basketball folk wisdom that teams which share the ball unselfishly, passing around until someone has an open shot, should be successful. Such a concept would appear to make sense, if we accept that (a) increased passing leads to increased likelihood of a relatively uncontested shot for any given possession, (b) relatively uncontested shots have a higher likelihood of going in than contested shots, and (c) teams with higher field goal percentages, all else equal, are more successful than teams with low shooting percentages. First, I would like to illustrate (although it’s not proof) the likely validity of these three claims, initially with correlation coefficients and then with a scatterplot. Since I don’t have statistics for the number of uncontested vs. contested shots, we’ll combine (a) and (b) and assume that higher field goal percentages result from more uncontested shooting. For lack of an easier way to display them, the following is a list of relevant correlations.

          • cor( as/fga , fg% ) = 0.687
          • cor( fg% , win% ) = 0.429
          • cor( as/fga , win% ) = 0.462
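For readers who want to reproduce figures like these, here is a minimal sketch of the Pearson correlation coefficient, which is assumed (not confirmed by the post) to be what cor( ) denotes above. The four team rows are made-up illustrative numbers, not the league data behind the correlations listed, and the function name `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-team values, for illustration only.
as_per_fga = [0.32, 0.28, 0.25, 0.22]
fg_pct = [0.49, 0.46, 0.45, 0.43]
win_pct = [0.70, 0.55, 0.40, 0.30]

print("cor(as/fga, fg%)  =", round(pearson(as_per_fga, fg_pct), 3))
print("cor(fg%, win%)    =", round(pearson(fg_pct, win_pct), 3))
print("cor(as/fga, win%) =", round(pearson(as_per_fga, win_pct), 3))
```

With one row per team (assists per field goal attempt, field goal percentage, win percentage), the same three calls would yield the three coefficients in the list.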

          These are all positive and reasonably strong, and the correlation between assist ratio and shooting percentage is the strongest of the three. At this point, I should note that we cannot draw any firm causal inferences here. Correlation is not causation, and neither are the easily seen trends in the scatterplot below. For example, even if better passing to the open man leads to more wins, teams with more wins likely have better players who are better passers and shooters in the first place, so we have an endogeneity problem, at the very least.

          Nevertheless, I present visual evidence of the relationship. Assist ratio is on the x-axis, increasing from left to right; win percentage is on the y-axis, increasing from bottom to top; and team field goal percentage is coded by color, increasing in reverse-rainbow order (red is best, purple is worst). Note that the trend is clearly up and to the right, indicating a positive association between the two statistics. Also notice that the top right is more red-orange-yellow than the bottom left.

          Assist Ratio by Win% by FG%


          Finally, I’d like to develop some theory about why, if passing to the open man leads to wins (which we have some evidence for, but haven’t technically proved), some teams just don’t pass as much as others. To answer this, we may turn to game theory for a classic game built on a similar idea.

          I argue that the question each player faces about whether to take a contested shot or pass to another man is in some sense a Prisoner’s Dilemma. If we assume that players are financially motivated, and that contracts are based more on individual statistics than on overall team success (evidence for which assumption awaits another article), the payoffs to being a high scorer on a bad team are probably better than the payoffs to being a low scorer on a better team.

          The Prisoner’s Dilemma for a two-player team thus looks like this:

                                      Look to pass to open man    Be a chucker
          Look to pass to open man    win / win                   lose much / win much
          Be a chucker                win much / lose much        lose / lose
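The matrix above can be checked mechanically for pure-strategy equilibria. The following is a minimal sketch that assumes ordinal payoffs (win much = 3, win = 2, lose = 1, lose much = 0); these numbers are my own illustrative stand-ins for the verbal outcomes, and any values with the same ordering would give the same result. The names `PAYOFFS` and `pure_nash_equilibria` are hypothetical.

```python
from itertools import product

PASS, CHUCK = "pass", "chuck"
STRATEGIES = (PASS, CHUCK)

# PAYOFFS[(row, col)] = (row player's payoff, column player's payoff).
# Assumed ordinal values: win much = 3, win = 2, lose = 1, lose much = 0.
PAYOFFS = {
    (PASS, PASS): (2, 2),    # win / win
    (PASS, CHUCK): (0, 3),   # lose much / win much
    (CHUCK, PASS): (3, 0),   # win much / lose much
    (CHUCK, CHUCK): (1, 1),  # lose / lose
}


def pure_nash_equilibria(payoffs, strategies):
    """Return every cell where neither player gains by deviating alone."""
    equilibria = []
    for row, col in product(strategies, repeat=2):
        row_payoff, col_payoff = payoffs[(row, col)]
        # Row player cannot do better by switching rows, column held fixed.
        row_ok = all(payoffs[(alt, col)][0] <= row_payoff for alt in strategies)
        # Column player cannot do better by switching columns, row held fixed.
        col_ok = all(payoffs[(row, alt)][1] <= col_payoff for alt in strategies)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria


equilibria = pure_nash_equilibria(PAYOFFS, STRATEGIES)
print(equilibria)
```

Under these assumed payoffs, chucking is the better reply to either choice by the other player, so the only cell that survives the check is the one where both players chuck.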

          (Note that we’re also assuming that both players decide simultaneously, before the game or season, which kind of player they’ll be.) The Nash equilibrium for this game is that both players will choose to shoot first, pass second. This can be extended to apply to five-player teams as well, with similar results. In fact, the more players there are, the more likely it is that at least one will defect, and since the win-win outcome requires complete cooperation, increasing the number of players increases the likelihood of all defecting to Chucker. To the outside observer, this is disheartening, because the win-win outcome is a team with a good record, the lose much/win much outcomes at least might result in a scoring title, and the lose-lose outcome results in the New York Knicks. Yet all of the individual incentives pull toward the lose-lose outcome. I can offer no better evidence for my thesis than the following table of teams thus far in the ‘07-08 season, ordered by assist ratio:

          Team as/fga fg% win%
          PhoenixSuns 0.320 0.490 0.703
          UtahJazz 0.317 0.490<