Tag Archives: matching

Plotting the colors

Here is my contribution to a growing “literature” on Mechanical Turk color-naming: Dolores Labs paid MechaTurks to apply labels to 10,000 color swatches, and offers a cool color explorer (hat tip to Infosthetics). They generously made the data available, and some public-minded soul cleaned up the data. A Mr. Wattenberg took the first shot at aesthetic presentation, and as FlowingData has noted, his version is an improvement. Neoformix produced an even more kicked-up version. Thus concludes my lit review…

I took the cleaned-up data and narrowed it down to those color names applied at least twice. For each unique label, I found the mean RGB values, and estimated a distance matrix based on each colorname’s characteristics. This distance matrix I then fed into the two newest ways of visualizing the Dolores Labs Mechanical Turk Color Data (DLMTCD): network and cluster diagrams!
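For the curious, here is a rough sketch of that preprocessing in Python. It is not the code I actually ran: the sample rows are made up, the function names are hypothetical, and I am assuming plain Euclidean distance between mean RGB triples, which is only one of several reasonable choices.

```python
from collections import defaultdict
from math import sqrt

# Made-up (label, r, g, b) rows standing in for the cleaned-up data.
rows = [
    ("red", 250, 10, 10),
    ("red", 230, 30, 20),
    ("blue", 10, 20, 240),
    ("blue", 30, 40, 220),
    ("teal", 0, 128, 128),  # applied only once, so it gets dropped
]

def mean_rgb_by_label(rows, min_count=2):
    """Mean (r, g, b) per label, keeping labels applied at least min_count times."""
    groups = defaultdict(list)
    for label, r, g, b in rows:
        groups[label].append((r, g, b))
    return {
        label: tuple(sum(channel) / len(values) for channel in zip(*values))
        for label, values in groups.items()
        if len(values) >= min_count
    }

def distance_matrix(means):
    """Pairwise Euclidean distance between mean colors (an assumed metric)."""
    labels = sorted(means)
    matrix = [
        [sqrt(sum((a - b) ** 2 for a, b in zip(means[p], means[q]))) for q in labels]
        for p in labels
    ]
    return labels, matrix

means = mean_rgb_by_label(rows)
labels, matrix = distance_matrix(means)
```

The resulting matrix is what a clustering or network-layout routine would take as input.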

colorisothumb.png
DLMTCD cluster diagram [pdf]

smdolthumb.png
Network with white background, different algorithm [pdf]

colornetthumb.png
DLMTCD network diagram [pdf]

The size of the vertices is a function of the number of times that color label is applied. Note that the network diagram employs transparency for easy reading, so the color you see is not exactly the color presented to the MTs. The nice thing about the pdf format is that you can zoom in and out and pan around as much as you want, and ctrl-f allows you to find any term in the network. Let me know what you think in the comments, and many thanks to Dolores Labs.

Chris Mullin is the next Michael Jordan

I have previously discussed in this space how to measure how similar any two players are, based solely on their boxscore statistics, and attempted, to some extent, to justify myself theoretically. Now, to unveil the results: from my dataset of all modern (1979-2007) NBA players, I took the top 500 according to the formula (min^(10/9))/gp, a kind of weighted minutes-per-game statistic that values both playing time and longevity. Thus, I could extract some of the best (admittedly measured poorly, by playing time) younger players, and a good number of veterans at the same time. I summed their career statistics across the entire time period, and ran them through the distance-finding algorithm discussed in the previous post. This resulted in a matrix of distances, which I offer to you here as a 501 x 501 cell .csv file, which I’ve zipped to about 1.3 MB:

Top 500 distance matrix
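A quick aside on the selection formula. Since min^(10/9) = min * min^(1/9), the score (min^(10/9))/gp works out to minutes per game multiplied by the ninth root of total minutes, which is why it rewards longevity as well as playing time. A minimal sketch, with hypothetical function names and made-up totals rather than real players:

```python
def playing_time_score(minutes, games):
    """Selection score from the post: (min^(10/9)) / gp."""
    return minutes ** (10 / 9) / games

def top_n(totals, n):
    """Names of the n highest-scoring players; totals maps name -> (minutes, games)."""
    ranked = sorted(totals, key=lambda name: playing_time_score(*totals[name]), reverse=True)
    return ranked[:n]

# Made-up career totals: (total minutes, games played).
totals = {
    "Veteran": (30000, 1000),    # 30 minutes per game over a long career
    "Youngster": (3000, 100),    # 30 minutes per game over a short career
    "Benchwarmer": (5000, 500),  # 10 minutes per game
}

ranking = top_n(totals, 2)
```

With equal minutes per game, the longer career scores higher, but only by the ratio of the ninth roots of total minutes, so a young player with heavy minutes can still make the cut.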

However, I’ve also got a selected subset (due to size considerations) of comparisons posted to Google Docs, and it should be sortable, but not editable:

Selected distances Google Spreadsheet

Now, for the punchline: a method such as this can be used to give us new insights. If we accept that the comparisons it makes are valid in general, then we may be able to accept the comparisons that surprise us. For example, if the matching algorithm tells us that the players most statistically similar to Michael Jordan are Kobe Bryant, LeBron James, Tracy McGrady, Dwyane Wade, Vince Carter, Clyde Drexler, and Paul Pierce, I would be tempted to accept the validity of such comparisons. Thus, I would argue that I should be willing to accept the conclusion that the player most similar to Jordan is none of these, but rather, Chris Mullin (who is of course frequently compared to Larry Bird, seeing as they are both Caucasian, but to whom I have never heard Jordan compared).

To conclude, I urge you to play around with both the Google Spreadsheet and the entire .csv matrix on your own. Please let me know if you find the comparisons to ring generally true, and if so, whether there were any that surprised you.

Objective statistical player matching

You may have seen sites elsewhere that show, for any given player or player-season, the other players or seasons that most closely match it. I think this is neat, because comparing players to one another is a fundamental sports-fan drive: not only questions of who is better than whom, but also, to whom is this player most similar? We use these sorts of matching questions when forecasting how a collegiate draftee will fare in the NBA: Greg Oden is supposed to be like Patrick Ewing, which is a good thing, and Kevin Durant is supposed to turn out like a Tracy McGrady/Kevin Garnett hybrid, which sounds very good. We inevitably compare almost any high-scoring shooting guard/small forward (McGrady, James, Bryant, Wade, etc. Here’s an article with a long list.) to Michael Jordan, and almost every well-rounded, sweet-shooting Caucasian gets compared to Larry Bird. I believe that this eternal comparative endeavor is an important and interesting one that can tell us something not only about individual players, but about the structure of the league as a whole.

Thus, I set out to make my own comparisons. I seem to recall that for certain other player comparators I had seen online, only a small set of statistics was chosen on which to make the comparisons. While I am sure there were good reasons for choosing each included metric, I find such an approach unavoidably arbitrary and incomplete. Rather, I thought it imperative to use every available box-score statistic, so as to not “unfairly” skew the results. However, when I just threw in season (or career, or per-game, or per-minute) totals, the output merely put those with high total numbers close to others with high total numbers, and low with low, and so on (i.e. Michael Jordan similar to Karl Malone, because they both scored a bazillion points), without regard to their playing style, position, or skillset. To solve this problem, I hit upon the idea of converting each player’s boxscore statline into a set of ratios.

This set of ratios would be exhaustive, including every counting stat over every other counting stat: pts/as, pts/st, pts/ftm… So if n is the number of counting statistics for each player, there are n^2 ratios, including those that always equal 1, such as to/to, fga/fga, etc. This, I hypothesized, would facilitate legitimate comparisons: any two players with identical ratios, though their counting statistic totals may differ, played a fundamentally identical game in terms of statistical output. I liked this idea for a number of reasons, including the fact that it allowed for comparisons across players of all experience levels and eras, and (I would argue) that it completely eliminated subjective concerns such as race, background, and especially, hype. Having decided this, I generated comparisons according to the following basic steps:

  1. Generate set of names and all boxscore counting statistics.
  2. For each player in the set, generate a set of ratios for each boxscore stat over every other.
  3. Percentile these, so that ratios with typically low values (e.g. pts/fga) are not outweighed by those with typically high values (e.g. min/bk).
  4. Find the Euclidean distance between each pair of players in n^2-space, by finding the square root of the sum of squared differences between each player’s ratios.
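The four steps above can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions rather than the code I actually used: the players and totals are invented, the function names are hypothetical, ties in the percentile step get a mid-rank (my choice here, not something the steps specify), and every total is assumed to be nonzero so that no ratio divides by zero.

```python
from itertools import product
from math import sqrt

# Step 1: made-up career totals (not real players). All totals are assumed
# nonzero; a real dataset would need a rule for zero denominators.
players = {
    "Scorer A": {"pts": 20000, "as": 3000, "rb": 4000, "st": 1200, "bk": 400},
    "Scorer B": {"pts": 10000, "as": 1500, "rb": 2000, "st": 600, "bk": 200},
    "Big C": {"pts": 12000, "as": 1000, "rb": 9000, "st": 500, "bk": 1800},
    "Guard D": {"pts": 9000, "as": 6000, "rb": 2000, "st": 1300, "bk": 100},
}

def ratio_vector(stats, keys):
    """Step 2: every stat over every stat, n^2 ratios including x/x == 1."""
    return [stats[a] / stats[b] for a, b in product(keys, repeat=2)]

def percentile_ranks(values):
    """Step 3: rank each value among its peers on a 0-1 scale (ties get mid-rank)."""
    n = len(values)
    if n == 1:
        return [0.5]
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v) - 1  # exclude v itself
        ranks.append((below + 0.5 * ties) / (n - 1))
    return ranks

def player_distances(players):
    names = list(players)
    keys = sorted(next(iter(players.values())))
    raw = [ratio_vector(players[name], keys) for name in names]
    # Percentile each ratio (column) across players, then transpose back to rows.
    columns = [percentile_ranks(list(column)) for column in zip(*raw)]
    rows = [list(row) for row in zip(*columns)]
    # Step 4: Euclidean distance between each pair of players in n^2-space.
    dist = [
        [sqrt(sum((x - y) ** 2 for x, y in zip(p, q))) for q in rows]
        for p in rows
    ]
    return names, dist

names, dist = player_distances(players)
```

Scorer B has exactly half of Scorer A's totals, so their ratios match and the distance between them comes out to zero, even though their raw numbers differ by a factor of two. The x/x ratios are tied for everyone and contribute nothing to any distance.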

And that’s it. Thus, for any set of players (or teams, or really, anything in the world), I can compute distances, and thus make objective comparisons. This idea is not new in any sense, but I think my contribution may be in refining the inputs a little bit, to make the outcome that much more useful. There are about 100 billion things that can be done with this, and in the next few days, I plan on releasing some of them into the wild.