You may have seen sites that let you look up, for any given player or player-season, the other players or seasons that most closely match it. I think this is neat, because comparing players to one another is a fundamental sports fan drive: not only questions of who is better than whom, but also, To whom is this player most similar? We use these sorts of matching questions when forecasting how a collegiate draftee will fare in the NBA. Greg Oden is supposed to be like Patrick Ewing, which is a good thing, and Kevin Durant is supposed to turn out like a Tracy McGrady/Kevin Garnett hybrid, which sounds very good. We inevitably compare almost any high-scoring shooting guard/small forward (McGrady, James, Bryant, Wade, etc. Here’s an article with a long list.) to Michael Jordan, and almost every well-rounded, sweet-shooting Caucasian gets compared to Larry Bird. I believe that this eternal comparative endeavor is an important and interesting one that can tell us something not only about individual players, but about the structure of the league as a whole.
Thus, I set out to make my own comparisons. I seem to recall that certain other player comparators I had seen online chose only a small set of statistics on which to make the comparisons. While I am sure there were good reasons for choosing each included metric, I find such an approach unavoidably arbitrary and incomplete. Rather, I thought it imperative to use every available box-score statistic, so as not to “unfairly” skew the results. However, when I just threw in season (or career, or per-game, or per-minute) totals, the output merely put those with high total numbers close to others with high total numbers, and low with low, and so on (i.e. Michael Jordan similar to Karl Malone, because they both scored a bazillion points), without regard to their playing style, position, or skillset. To solve this problem, I hit upon the idea of converting each player’s box-score statline into a set of ratios.
This set of ratios would be exhaustive, including every counting stat over every other counting stat: pts/as, pts/st, pts/ftm… So if n is the number of counting statistics for each player, there are n^2 ratios, including those that always equal 1, such as to/to, fga/fga, etc. This, I hypothesized, would facilitate legitimate comparisons: any two players with identical ratios, though their counting statistic totals may differ, played a fundamentally identical game in terms of statistical output. I liked this idea for a number of reasons, including the fact that it allowed for comparisons across players of all experience levels and eras, and (I would argue) that it completely eliminated subjective concerns such as race, background, and especially, hype. Having decided this, I generated comparisons according to the following basic steps:
- Generate set of names and all boxscore counting statistics.
- For each player in the set, generate a set of ratios for each boxscore stat over every other.
- Percentile these, so that ratios with typically low values (e.g. pts/fga) are not outweighed by those with typically high values (e.g. min/bk).
- Find the Euclidean distance between each pair of players in n^2-space, by finding the square root of the sum of squared differences between each player’s ratios.
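For the curious, here is a rough sketch of how those four steps might look in Python. It is not the code I actually ran, just an illustration under some stated assumptions: the three players and their totals are made up, a ratio with a zero denominator is set to 0.0 (the steps above don't say how to handle that case), and "percentile" is taken to mean the fraction of players whose value for that ratio is less than or equal to the player's own.

```python
from itertools import combinations
from math import sqrt

# Hypothetical counting stats for three made-up players (not real data).
PLAYERS = {
    "Guard A":  {"pts": 2000, "as": 600, "rb": 300, "bk": 20},
    "Guard B":  {"pts": 1000, "as": 300, "rb": 150, "bk": 10},
    "Center C": {"pts": 1200, "as": 100, "rb": 900, "bk": 200},
}

def ratio_vector(stats):
    """Every stat over every stat: n^2 ratios, including x/x == 1."""
    keys = sorted(stats)
    # Assumption: a zero denominator yields 0.0 rather than an error.
    return [stats[a] / stats[b] if stats[b] else 0.0
            for a in keys for b in keys]

def percentile_columns(vectors):
    """Replace each ratio with the fraction of players at or below it."""
    n = len(vectors)
    columns = list(zip(*vectors))  # one column per ratio
    ranked = [[sum(v <= x for v in col) / n for x in col] for col in columns]
    return [list(row) for row in zip(*ranked)]  # back to one row per player

def distances(players):
    """Euclidean distance between every pair of percentiled ratio vectors."""
    names = list(players)
    vecs = percentile_columns([ratio_vector(players[p]) for p in names])
    by_name = dict(zip(names, vecs))
    return {
        (a, b): sqrt(sum((x - y) ** 2 for x, y in zip(by_name[a], by_name[b])))
        for a, b in combinations(names, 2)
    }

DIST = distances(PLAYERS)
for pair, d in sorted(DIST.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
```

Guard B has exactly half of Guard A's totals, so every one of their ratios matches and the distance between them is zero, while both sit equally far from Center C. That is the whole point of using ratios instead of raw totals.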
And that’s it. Thus, for any set of players (or teams, or really, anything in the world), I can compute distances, and thus make objective comparisons. This idea is not new in any sense, but I think my contribution may be in refining the inputs a little bit, to make the outcome that much more useful. There are about 100 billion things that can be done with this, and in the next few days, I plan on releasing some of them into the wild.