The dataset I’ve been using doesn’t have player position data in it, so the other day I was playing around with cluster and factor analysis, which I don’t really know how to do yet, and trying to come up with a way to estimate players’ positions from the data. I did come up with a pretty novel method, which works about 70% of the time (for another post, someday), but I also had the following idea:
How would one arbitrarily divide basketball skills a priori? In the course of my cluster and factor analyses, a few things fell out (I know, not exactly a priori): there are distinctions between people who take a lot of shots, small men (guard-types), and big men (centers/PFs). I came up with the following rough estimators of the degree to which every player aligns with each invented archetype:
“shooteR” = fga/(fga+tr+as+st+bk)
“Guard” = (as+st)/(fga+tr+as+st+bk)
“Big” = (tr+bk)/(fga+tr+as+st+bk)
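For anyone who wants to try this at home, here is a minimal Python sketch of the three ratios above. The function name, the zero-total fallback, and the sample stat line are my own placeholders (not real player data), and `as_` stands in for the `as` column because `as` is a reserved word in Python.

```python
def archetype_shares(fga, tr, as_, st, bk):
    """Return the (shooteR, Guard, Big) shares of a player's stat sum."""
    total = fga + tr + as_ + st + bk
    if total == 0:
        # Assumption: a player with no recorded stats gets zeros everywhere.
        return (0.0, 0.0, 0.0)
    shooter = fga / total
    guard = (as_ + st) / total
    big = (tr + bk) / total
    return (shooter, guard, big)


# Made-up season totals, purely for illustration.
r, g, b = archetype_shares(fga=1000, tr=300, as_=400, st=100, bk=50)
print(round(r, 3), round(g, 3), round(b, 3))
```

Because all three ratios share the same denominator, the three shares always add up to 1 for any player with a nonzero stat sum.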
With these definitions, the individual with the highest shooteR rating is the one for whom field goal attempts make up the highest percentage of his stat sum. These ratings don’t really mean very much, except to roughly suggest certain tendencies, but when you generate percentiles for each player, you get a nicely ordered rating (which is something I’ll use several times in future posts) on each aspect that falls between 0 and 1. This is useful, because the means of each aspect in the population as a whole are very different:
stat : mean
R : 0.504
G : 0.189
B : 0.307
So, the “percentalization” sends a player with an average distribution to roughly (0.5, 0.5, 0.5).
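The percentile step can be sketched roughly like this. There are several ways to define a percentile rank, and this sketch assumes a mid-rank convention (ties split the difference), which may not match exactly what a stats package would give you. The function name and the sample Guard shares are invented for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(values):
    """Map each value to a mid-rank percentile between 0 and 1."""
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)           # values strictly smaller
        equal = bisect_right(ordered, v) - below  # values tied with v (including v)
        ranks.append((below + 0.5 * equal) / n)
    return ranks


# Made-up Guard shares for five hypothetical players.
guard_shares = [0.10, 0.43, 0.19, 0.25, 0.15]
guard_pct = percentile_ranks(guard_shares)
print(guard_pct)
```

Under this convention the player in the middle of the pack lands at 0.5 on each aspect, no matter how different the raw means are.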
The great thing about these three aspects is that they really lend themselves to comparison: in 2007, for example, the top players in each category were as follows:
R: Willie Green (0.725), Michael Redd (0.723), Adam Morrison (0.689)
G: Steve Nash (0.430), Brevin Knight (0.430), Eric Snow (0.422)
B: Tyson Chandler (0.644), Jeff Foster (0.630), Erick Dampier (0.617)
The numbers in parentheses are proportions of the player’s total stats, not their percentiles. Notice how the highest Guard values are much lower than the highest shooteR and Big values. Converting these to percentiles adjusts for this somewhat. It also means that, while the R, G and B percentages must add to one for each player, the sum of a player’s three percentiles may be greater than 1.5.
Rather than just listing some players with their values here, I have created a visualization, in which each player’s percentiles for R, G, and B have been converted to RGB color values. For the axes, I have used y = points/gp and x = other/gp, or (tr+as+st+bk-to)/gp. “Other” per game is interesting to see and at least somewhat useful to measure a player’s nonscoring contributions, but it works best for now because it is simple, and I am only trying to get the players spaced out in two dimensions for display. Without further ado, here is the scatterplot (you’re going to want to click on it, which will open a 2048 x 1536 .png, perfect for your desktop background).
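Here is a rough sketch of how a player’s percentiles and box-score totals could be turned into a color and a position on the plot. I am assuming a straight linear scaling of each percentile onto the 0 to 255 channel range; the function names, the `pts` column name, and the sample numbers are placeholders (`as_` again stands in for `as`).

```python
def to_rgb_hex(r_pct, g_pct, b_pct):
    """Convert three percentiles in [0, 1] to a hex color string."""
    channels = []
    for p in (r_pct, g_pct, b_pct):
        clamped = min(max(p, 0.0), 1.0)       # guard against out-of-range input
        channels.append(round(255 * clamped))  # assumption: linear scaling
    return "#{:02x}{:02x}{:02x}".format(*channels)


def plot_coordinates(pts, tr, as_, st, bk, to, gp):
    """Return (x, y) = ("other" per game, points per game); gp must be > 0."""
    x = (tr + as_ + st + bk - to) / gp
    y = pts / gp
    return (x, y)


# A hypothetical player high on shooteR and Guard, low on Big: pure yellow.
color = to_rgb_hex(1.0, 1.0, 0.0)
# Made-up season totals over 80 games.
x, y = plot_coordinates(pts=1600, tr=400, as_=300, st=100, bk=50, to=150, gp=80)
print(color, x, y)
```

A player high on all three percentiles comes out close to white, and one low on all three comes out close to black.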
I have already listed the reddest, greenest and bluest players, but here’s another list of high achievers:
Yellowest (R&G): Allen Iverson, Earl Boykins, Tyronn Lue, Leandro Barbosa
Cyanmost (G&B): Ben Wallace, Andrei Kirilenko, Marcus Camby, Jason Kidd
Magentalikes (B&R): Eddy Curry, Andrea Bargnani, Rasual Butler, Andres Nocioni
Closest to white (R&G&B): LeBron James, Hedo Turkoglu, Bobby Jackson, Cuttino Mobley
Let me know if you notice anything interesting in the plot (for example, Carlos Boozer, Pau Gasol and Elton Brand appear to be pretty similar players), and enjoy!