Here is my contribution to a growing “literature” on Mechanical Turk color-naming: Dolores Labs paid MechaTurks to apply labels to 10,000 color swatches, and offer a cool color explorer (hat tip to Infosthetics). They generously made the data available, and some public-minded soul cleaned up the data. A Mr. Wattenburg took the first shot at aesthetic presentation, and as FlowingData has noted, his version is an improvement. Neoformix produced an even more kicked-up version. Thus concludes my lit review…
I took the cleaned-up data and narrowed it down to those color names applied at least twice. For each unique label, I found the mean RGB values, and estimated a distance matrix based on each colorname’s characteristics. This distance matrix I then fed into the two newest ways of visualizing the Dolores Labs Mechanical Turk Color Data (DLMTCD): network and cluster diagrams!
DLMTCD cluster diagram [pdf]
Network with white background, different algorithm [pdf]
DLMTCD network diagram [pdf]
The size of the vertices is a function of the number of times that color label is applied. Note that the network diagram employs transparency for easy reading, so the color you see is not exactly the color presented to the MTs. The nice thing about the pdf format is that you can zoom in and out and pan around as much as you want, and ctrl-f allows you to find any term in the network. Let me know what you think in the comments, and many thanks to Dolores Labs.
It has been suggested that I look at players’ statistics from only the primes of their careers. This is a good idea, given that both very inexperienced and very old players will “regress to the mean” in terms of their performance and possibly, playing style. As such, I generated a sum of each player’s boxscore statistics during the modern area across only their best seasons. My definition of “best” was simple: not their worst. For each player, I found their mean seasonal winshr, as well as their winshr standard deviations. Any seasons for which a player’s winshr was greater than the mean less one standard deviation was included in this analysis. This way, I excluded seasons in which a player was injured or relatively underused because of age or because of a minor role on their team. Chris Webber’s current and previous seasons, for example, would not be included. In this way, I hope to get at the “pure” essence of each player for an even better comparison. You will probably not be surprised to see that the diagram looks very similar to the non-peak-performance versions:
NBA players at peak performance [pdf]
A few interesting things to note, however: at their peak, Michael Jordan and Larry Bird are now among each others’ closest matches. Also, taking a macro view of the whole network, it is now easy to identify several different nodes: In bluish purple at top left, we can see defensive-minded, “dirty work” bigs, while at the bottom in blue are more scoring bigs. To their right is a reddish group of primarily scorers, while going north from there in green we see “pure point guards” and then more scoring point guards. Etc, etc. Let me know if you notice any other interesting connections or clusters in the comments.
Using roll call votes from the 110th Senate through the end of last year, I have constructed a network diagram based on maximum similarities between Senators’ voting records. Essentially, distances were calculated by assigning a 1 to yes votes, and a 0 to no votes, and finding the difference between each pair of Senators on each possible roll call vote. Thus, two Senators who vote identically have a distance of 0, while two Senators who vote completely opposite ways have a distance equal to the total number of roll calls. Based on these distances, I constructed a network diagram linking each Senator to their two most-similarly-voting counterparts. I also colored each vertex according to how similar each Senator is to “all Republicans” and “all Democrats” collectively. The result revealed the highly polarized nature of the Senate: there is only a single strand linking Republicans to Democrats:
11oth Senate Roll Call Network Diagram [pdf]
I then decided to reduce the number of connections to only the single closest match for each Senator, and found something interesting that you will hear only rarely from the media: Senators Clinton and Obama are each others’ closest match, based, at least, on roll call votes in the 110th Senate through the end of 2007. This would seem to indicate that the wide disparities perceived between them in the eyes of the media and the public have little to do with actual policy/ideological divides, but rather that personality and framing (and possibly demographics) are making up the bulk of voting preferences in many Americans’ minds.
110th Senate Roll Call Isolated Networks [pdf]
I was aware, to some extent, of the constructed, rather than actual, nature of the differences between the two Democratic competitors, but to see the roll call evidence fall out so starkly was surprising.
The next in a series consists of batters in the MLB from 1955-2007 (because the modern set of statistics has not changed since the 1955 season). I think these statistics lend themselves less well to this sort of analysis, but it may be interesting to you baseball enthusiasts out there.
Batter Statistical Proximity [pdf]
I’ve had requests for my data and for the code I used to make these plots. So, in the spirit of openness, I’m posting them. If you would like to use them, please adhere to the Creative Commons license I’ve chosen, and let me know what you come up with. The .csv is the top 1000 careers over the last quarter century-or-so, determined by a playing-time-based statistic, and the .R file will run in R, and requires you to install the package sna. The sna package is awesome, it makes network diagramming essentially idiot-proof. Note that I currently have this code writing to a PDF, and that it cannot write to the pdf if a pdf with the same filename is open. Also, remember to make sure you change the csv’s file directory in the R code, or it won’t ever work. Please let me know if it’s not working for you, or if you know of a more efficient way of doing the same thing.
1000 Top Careers [csv]
Network Diagram Example [R]