Here is my contribution to a growing “literature” on Mechanical Turk color-naming: Dolores Labs paid MechaTurks to apply labels to 10,000 color swatches, and offers a cool color explorer (hat tip to Infosthetics). They generously made the data available, and some public-minded soul cleaned up the data. A Mr. Wattenberg took the first shot at aesthetic presentation, and as FlowingData has noted, his version is an improvement. Neoformix produced an even more kicked-up version. Thus concludes my lit review…
I took the cleaned-up data and narrowed it down to those color names applied at least twice. For each unique label, I found the mean RGB values, and estimated a distance matrix based on each colorname’s characteristics. This distance matrix I then fed into the two newest ways of visualizing the Dolores Labs Mechanical Turk Color Data (DLMTCD): network and cluster diagrams!
DLMTCD cluster diagram [pdf]
Network with white background, different algorithm [pdf]
DLMTCD network diagram [pdf]
The size of the vertices is a function of the number of times that color label is applied. Note that the network diagram employs transparency for easy reading, so the color you see is not exactly the color presented to the MTs. The nice thing about the pdf format is that you can zoom in and out and pan around as much as you want, and ctrl-f allows you to find any term in the network. Let me know what you think in the comments, and many thanks to Dolores Labs.
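For readers who want to try something similar, here is a minimal sketch of the aggregation step described above: keep labels applied at least twice, average their RGB values, and build a pairwise distance matrix. The input layout (one `(label, r, g, b)` tuple per swatch) and the use of plain Euclidean distance in RGB space are assumptions for illustration, not necessarily what went into the diagrams.

```python
from collections import defaultdict
from math import dist


def mean_rgb_by_label(rows, min_count=2):
    """Average RGB per label, keeping labels applied at least min_count times.

    rows: iterable of (label, r, g, b) tuples (assumed layout).
    Returns {label: (mean_r, mean_g, mean_b, count)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for label, r, g, b in rows:
        acc = sums[label]
        acc[0] += r
        acc[1] += g
        acc[2] += b
        acc[3] += 1
    return {
        label: (acc[0] / acc[3], acc[1] / acc[3], acc[2] / acc[3], acc[3])
        for label, acc in sums.items()
        if acc[3] >= min_count
    }


def distance_matrix(means):
    """Pairwise Euclidean distance between mean RGB values (an assumed metric)."""
    labels = sorted(means)
    matrix = [
        [dist(means[a][:3], means[b][:3]) for b in labels]
        for a in labels
    ]
    return labels, matrix
```

The count returned alongside each mean is the number of times the label was applied, which is the quantity the vertex sizes are based on.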
Which better applies to your favorite team? Using NCAA basketball data collected by Facebook, I’ve thrown together a scatterplot of the teams that elicit the most passion (measured by number of opinions expressed), contrasted with the favorability with which each team is viewed. Unsurprisingly, several of the larger state schools rank among the top in terms of number of opinions expressed, and just as obviously, Duke elicits the greatest number of opinions. Princeton, Yale and Harvard all rank toward the bottom in terms of favorability, although this is likely not due to their fearsome basketball reputations. I have to feel sorry for the Bethune-Cookman Wildcats, who appear to have a small, but hateful, following. The most beloved team appears to be the Wake Forest Demon Deacons, followed closely by St. John’s Red Storm. Between Wake, NC State (also well-liked), UNC and Duke, North Carolina is well represented at the extremes. Enough prologue; here’s the graphic:
NCAA Men’s Basketball Fans and Haters [pdf]
And, for those interested, here is a listing of teams by percent favorable opinions:
NCAA Men’s Basketball Favorability
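The two axes of the scatterplot boil down to a count and a percentage. A rough sketch, assuming the raw data arrives as favorable and unfavorable counts per team (the dictionary layout and the function name are made up for this example):

```python
def favorability(counts):
    """Compute total opinions and percent favorable for each team.

    counts: {team: (favorable, unfavorable)} (assumed layout).
    Returns a list of (team, total_opinions, pct_favorable),
    sorted from most to least favorable.
    """
    rows = []
    for team, (favorable, unfavorable) in counts.items():
        total = favorable + unfavorable
        if total == 0:
            continue  # no opinions expressed, so no percentage to report
        rows.append((team, total, 100.0 * favorable / total))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```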
I would love to see crosstabs for the fans/haters. If one wanted to operationalize “greatest rivalry” I think this would be an excellent way to do so.
It has been suggested that I look at players’ statistics from only the primes of their careers. This is a good idea, given that both very inexperienced and very old players will “regress to the mean” in terms of their performance and, possibly, playing style. As such, I generated a sum of each player’s boxscore statistics during the modern era across only their best seasons. My definition of “best” was simple: not their worst. For each player, I found the mean and standard deviation of their seasonal winshr. Any season in which a player’s winshr was greater than the mean less one standard deviation was included in this analysis. This way, I excluded seasons in which a player was injured or relatively underused because of age or because of a minor role on their team. Chris Webber’s current and previous seasons, for example, would not be included. In this way, I hope to get at the “pure” essence of each player for an even better comparison. You will probably not be surprised to see that the diagram looks very similar to the non-peak-performance versions:
NBA players at peak performance [pdf]
A few interesting things to note, however: at their peak, Michael Jordan and Larry Bird are now among each other’s closest matches. Also, taking a macro view of the whole network, it is now easy to identify several different clusters: In bluish purple at top left, we can see defensive-minded, “dirty work” bigs, while at the bottom in blue are more scoring bigs. To their right is a reddish group of primarily scorers, while going north from there in green we see “pure point guards” and then more scoring point guards. Etc., etc. Let me know if you notice any other interesting connections or clusters in the comments.
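The “not their worst” filter is easy to sketch in code. The version below assumes each season is a dictionary with a `winshr` entry plus boxscore fields, and it uses the sample standard deviation; whether the original analysis used the sample or population version is not stated, so treat that choice, and the field names, as assumptions.

```python
from statistics import mean, stdev


def peak_seasons(seasons):
    """Keep seasons whose winshr exceeds the player's mean less one standard deviation.

    seasons: list of dicts, each with a 'winshr' key (assumed layout).
    """
    if len(seasons) < 2:
        return list(seasons)  # a spread cannot be computed from one season
    values = [season["winshr"] for season in seasons]
    spread = stdev(values)  # sample standard deviation (an assumption)
    if spread == 0:
        return list(seasons)  # identical seasons: nothing counts as "worst"
    cutoff = mean(values) - spread
    return [season for season in seasons if season["winshr"] > cutoff]


def sum_boxscore(seasons, fields):
    """Sum the named boxscore fields across the given seasons."""
    return {field: sum(season[field] for season in seasons) for field in fields}
```

The two guard clauses are a choice made for this sketch, so that players with one season, or with identical seasons, are not dropped entirely.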
I hope this isn’t getting repetitive, because I’ve got a diagram that will blow your mind: it’s like the entire NBA in a petri dish, with all different phyla and genera of player types represented. I used the same methodology I’ve been using (with the per-minute, rather than ratio, statistics), but generated the graph with fewer connections (just the single closest match) per player. As a result, there are a whole lot of isolated clusters instead of one completely interconnected network. Also, I went ahead and did 1,000 players at once, instead of the standard 250. What I got astounded me: they look like microorganisms swimming around on the microscope slide that is the NBA. I apologize for the tiny font (if you zoom in to 125%, it should be readable), but had I made the names any larger, they would have overlapped to an illegible degree.
The NBA “petri dish” diagram [pdf]
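To see why one connection per player produces isolated clusters, here is a small sketch: link each player to their single closest match, then group players by connected component. The feature vectors, the Euclidean distance, and the function names are illustrative assumptions rather than the exact procedure behind the diagram.

```python
from math import dist


def nearest_neighbor_edges(features):
    """Link each player to their single closest match.

    features: {name: tuple of numbers} (assumed layout, e.g. per-minute stats).
    """
    edges = []
    for name, vector in features.items():
        others = [
            (dist(vector, other_vector), other)
            for other, other_vector in features.items()
            if other != name
        ]
        if others:
            edges.append((name, min(others)[1]))
    return edges


def clusters(names, edges):
    """Group names into connected components using a simple union-find."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

Because every player contributes exactly one edge, components stay small unless chains of closest matches happen to link up.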
I would be very interested in collectively coming up with a sort of “baller’s taxonomy,” wherein we try and identify the different clusters using some more subjective terms. I think we could come up with a better vocabulary to describe players and define playing styles. If you have any ideas, please put them in the comments, and if there is sufficient interest, I may come up with a more formalized process, in the hopes of putting together a follow-up diagram with labels.
Since I had already run the algorithm anyway (it takes a lot of cycles to do 1,000 players), I went ahead and made a completely connected version of the 1,000 player diagram. Warning: this one is pretty hard to parse.
1000 player network diagram [pdf]
Keep in mind that the search function (ctrl-f) will be really useful for these.
In response to some questions at the APBRmetrics forum, I’ve put together a new NBA similarities network (Top 250 players version), wherein I use per-minute statistics, instead of my “patented” ratios method, just to see how it looks. In a lot of ways, this looks just as good as, or even better than, the ratios version… I’m still somewhat torn, though: The ratios method, by ignoring time statistics completely, attempts to match players who, given a possession (or given an opponent with a possession), will do similar things with it, while the per-minute method does a better job of representing “substitutability.” I suppose I will let history be the judge, but I don’t think anyone loses when more pretty graphs are made:
NBA player similarities [pdf]
Another version with Extremely High Contrast Labels for Easy Reading: [pdf]
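As a small illustration of the per-minute idea, the sketch below divides each boxscore total by minutes played, so that two players with the same rates look identical regardless of how much they play. The field names are hypothetical, and the ratios method is not shown because its details are not spelled out here.

```python
def per_minute(totals, minutes_field="min"):
    """Convert boxscore totals into per-minute rates.

    totals: dict of stat totals that includes minutes played (assumed layout).
    """
    minutes = totals[minutes_field]
    if minutes <= 0:
        raise ValueError("minutes played must be positive")
    return {
        stat: value / minutes
        for stat, value in totals.items()
        if stat != minutes_field
    }
```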