You like to window shop, I like to Windows shop; and my curiosity-driven gaze through the virtual panes shone its beam on yet one more data set devoted to the ranking of athletes, this time of the football/soccer genus.
Wheeled into view by the Guardian in Google spreadsheet mode and attired anew here for you in Excel:
the data sort the results of the aggregated, columned (P through EI) judgements of 124 sportswriters, each of whom nominated up to 40 players assessing them a point evaluation ranging from a maximum of 40 downwards (you may want to consult Sky Sports’ differently-ordered top 100 here).
Heading the list – which highlights the top 100, though 380 players were presented to the judges – and unsurprisingly so, is Portugal’s. and Real Madrid’s. Cristiano Ronaldo, putting 68 points worth of distance between himself and the no-less-unsurprising Lionel Messi, he of Argentina and Barcelona. That three of the rankings’ top five players populate the same Barcelona front line. even as that team merely holds down second place in its league. is the kind of sporting discrepancy sure to give the sages something to think about, but I’m a spreadsheet guy. And the fact that only Ronaldo and Messi appeared on every sportswriter’s ballot – and that, as a consequence, some rather formidable players found no place at all among soe writers’ top 40 – is perhaps equally extraordinary. Indeed – the Algerian and Leicester City star Riyad Mahrez won one first-place vote – even as he was completely shunned by 16 other writers.
But what of the spreadsheet? Start at the beginning, with its header data walking on the sheet’s ceiling at A1:O1. Those identifiers need to be lowered into row 4, hard by their attendant field data; and by extension, the hyperlink to the judges’ names and rules for assigning rankings stretching across the merged cell P3 must be moved, or its contiguity to the dataset will commit an offside against any pivot table building from the records against the perpetrator.
I’m not sure why the sheet’s numbers were left aligned, either, though I’m not pressing any charges for that formatting decision. And because it’s clear that the Guardian has player birth dates – otherwise, the Age at 20 Dec 2016 data could not have been furnished – exactitude might have been slightly better served, and pretty easily at that, via the YEARFRAC function, and an age calculation extended to a couple of decimal points, e.g. 27.83.
But if all that qualifies as a nit-pick, the rankings themselves in column A could be more justifiably questioned. Players with equivalent ratings, e.g. Alex Teixeira and Anthony Modeste at 156, were enumerated thusly: 156=, an expression that strips the number of its quantitative standing. Why not assign a 156 to each player instead, as do the women’s and men’s tennis rankings we reviewed in our immediately previous posts?
Moreover, some apparently identical scores appear in fact to have been differentiated. Raphaël Varane and Serge Aurier both check in with 34 points and five sportswriter votes cast, the latter a rankings disambiguator for the Guardian; yet Varane earns a ranking of 118 to Aurier’s 119. The same could be said about Omar Abdul Rahman and William Carvalho, invidiously niched at 133 and 134. In any event, if you do want to numerate all the rankings, say by figuring average ranking by country and/or team, then point a find-and-replace at the data, finding every = and replacing these with nothing.
I’d also withdraw the blank, colored row 105 that hems the top 100 from the lesser-rated crew beneath; while the row means to delimit and frame the footballing elite, per the Guardian’s story, any analysis of the larger cohort of course would need to unify the data set – and that means dismissing its blank rows. In that spirit, then, you’d also want to disavow row 129; whatever informational service its text may perform, it isn’t a ranked player’s record. I also can’t explain why some of the country and club entries, e.g. cells G207 and H219, appear in blue, or why Antonio Candreva in row 213 is described as Italian when his countrymen are identified with Italy. In addition, Nani’s (row 83) nationality is ascribed to Portugal, but with a superfluous space, as is Paul Pogba’s France (row 21).
Also, the formulas cascading down the Up/down field in B that meter a player’s current movement through the rankings from his 2015 score could have simply read =C5-A5, for example, sans the SUM function and its parenthetical braces.
But note as well that players who went unranked in the preceding year received a hard-coded NEW classification for 2016 that in fact could have been subsumed formulaically, e.g.
You’ll also note that the formulas in O compiling the number of a player’s first-place votes look like this:
The quotes are superfluous, and I confess to surprise that the formula works. The entries in P5:EI5 are values, after all, not labels. But work it does.
And for another matter that warrants our scrutiny, consider the Highest Score Removed field in L. The Guardian determined that any player’s highest rank – or at least one instance thereof – be stricken from his final score as an outlier. That sort of decision rule isn’t unprecedented – figure skating and gymnastics judging protocols often drop highest and lowest scores – but in those sports the extremes at both the high and low end are ignored; the Guardian only points its thumb down at the high – again, just one high, even if others have been issued to the player. Ronaldo’s 63 first-place votes are thus contracted to 62, but those 62 of course exhibit precisely the same score as the ostensible outlier.
Moreover, and unlike other juried events, the number of judges who decided to score a given player here is very much a variable. Thus we need to account for the 76 players whose final score of 0 belies their receipt of an actual, if solitary, vote. That vote, of course, was barred as an outlier, leaving the player with nothing, so to speak. Along these exclusionary lines, it follows then that players named by exactly two sportswriters incurred the loss of their higher score – even though one could challenge the insistence that these are somehow more outlying than their other, lower score.
The matter of how to dispose of score extremes has been disputed (see this mathematical exploration) – it would be difficult, for example, to imagine an even halfway-well-intentioned teacher dropping a student’s highest test score (though the classroom scenario features one judge of many test performances; in figure skating many judges arbitrate one performance) – but in the interests of pressing on we could, for example, learn something about the larger aggregate picture by approving a data set comprising the 254 players who received at least one Raw Total point, , i.e. prior to the removal of their highest score. If we’re provisionally satisfied with the makeup of that cadre, we could for starters simply pivot table a count of players by country and country raw point total:
Values: Nationality (by Count, of course; the data are textual).
RAW TOTAL (Sum).
RAW TOTAL (again, by Average, formatted to two decimals and with a comma)
I get, in excerpt:
We see that Spanish players win appearance honors, but among the more productive countries Argentina claims the highest average player score, and by quite a margin.
And if you’re wondering, there’s one American player in there, even as he didn’t make the screen shot cut above – Christian Pulisic, who plays for the German Borussia Dortmund squad and is ranked 138th, with 25 raw-totalled points. But he’s only 18 years old – and if we rank the players by age, he comes in at number 2.