London Borough Data, Part 2: Another Look At Some LOOKUPS

1 Sep

Last week’s run through the London borough data workbook directed its eponymous scrutiny at a fleet of VLOOKUP formulas in the book’s Profile sheet, ones that retrieved a trove of per-borough attributes, e.g. its migrant population and employment numbers. We proceeded to ask after the decision to draw the LOOKUPs’ index number (the function’s third argument, the one that points to the column returning the “answer” to the LOOKUP’s query) from a range wholly external to its table array. And here’s case in point number two on the problem.
The book’s Chart-Map sheet, chief among its elements a smartly responsive cartographing of the boroughs by its user-selected attributes, binds its color-coded breakout to the VLOOKUPs in the adjoining, borough-populated range A5:B42, its second column dialing into the data array set down in the heretofore-hidden sheet, the no-less-eponymous Sheet1. The problem – again – is with their index numbers, supplied in the sheet by the invisible gray-on-gray entries in Chart-Map’s Z5:Z42 (I’ve recolored the values below in a more forthcoming white):


(The inessential decimals, nowhere to be found in the referenced cells whence they came, likewise remain to be explained; but that isn’t make-or-break, we’ll agree.)

Again, precisely the same question could be asked of the formulas here that we asked last week; they’ve keyed their operation to a range of proxy index numbers that could have, and really should have, been drawn from the table array itself. The reasons why were recounted in last week’s post, in which I filed my brief for the array’s independent, self-capable knack for returning index numbers, provided you’re on speaking terms with MATCH and like functions. (In addition, the Chart-Map’s T5:Z61 region – all of whose cells wear the gray-on-gray mask – hides all kinds of referential data, much of which undergirds the workings of the map itself.)

All of which tilts the slope towards the next question. As with the Profiles sheet we tangoed with last week, Chart-Map hangs its mapped results atop a drop-down menu comprising the seventy-or-so borough attributes on which the user can click. But it’s pipelining its core data from the Sheet1 sheet, whereas Profiles quarried its numbers from the Data tab instead, and it may be proper to ask why.

I’m asking why because the Chart-Map VLOOKUP for Barnet, the borough that exemplified my formulaic review last week, reads


But why couldn’t it read

=VLOOKUP(A12,Data!C:BW,MATCH(A3,Data!C1:BW1,0)) ?

To be sure, the latter expression is the more distended, but it

a) Again closes off and bats away any recourse to an external source for those VLOOKUP index numbers, and
b) Returns to the Data sheet for its lookup array – the same array Profiles used, rather than the one offered up by newcomer Sheet 1.
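The MATCH-driven alternative can be sketched outside Excel too. Here’s a small Python analogue of =VLOOKUP(A12,Data!C:BW,MATCH(A3,Data!C1:BW1,0)) – the headers and attribute values are invented, apart from Barnet’s 380,800 – in which the index number is derived from the table’s own header row rather than from any external range:

```python
# A Python analogue of deriving the VLOOKUP index number from the table itself.
headers = ["Area name", "GLA Population Estimate 2015", "Inland Area (Hectares)"]  # invented, bar the population caption
table = {
    "Barnet": ["Barnet", 380800, 8675],   # invented values, except the 380,800
    "Camden": ["Camden", 234800, 2179],
}

def vlookup(key, table, headers, wanted_header):
    col = headers.index(wanted_header)    # the MATCH step: position of the header in the array itself
    return table[key][col]                # the VLOOKUP step: read that column from the keyed row

print(vlookup("Barnet", table, headers, "GLA Population Estimate 2015"))  # 380800
```

The point survives translation: no column of proxy index numbers is needed anywhere, because the header row already knows where each attribute lives.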

The question, rephrased, then: why perform lookups upon two distinct data sources when one – either one – could execute both jobs? And by corollary, we could dare to ask if two data sets – that is, Data (and we won’t make much here of its blank row 2) and Sheet 1 – are needed at all as a result.

Click back and forth between the two. I am, as ever, happy to be brought into line, but I would allow that the two data sets are in effect equivalent, their difference a matter of orientation, with the boroughs beamed horizontally in Sheet 1 and vertically in Data, and with the borough attributes by extension pulling along the respective perpendiculars in the two sheets.

In short, the data in the two sheets seem basically, if not completely, identical, and if so – even as I can’t claim a line-by-line vetting of all the numbers – then it’s clear that one is expendable. We’re left to wonder in turn why both have been allowed to maintain their position in the workbook. I’m just asking.

I’m also asking about the formulaic involutions in lookup that restate the same borough population totals elsewhere. The totals in column B have routed their data from the same-lettered column in the Chart-Map sheet, which in turn secured its numbers via the VLOOKUPs debated above. But lookup’s (probably an infelicitous tab name, at least for our purposes) D column taps into that B data but two fields to its left, deciding in addition to test the fittedness of those values with an IF statement, e.g. for Barnet:


Because all the data appear to be in place, I’m not certain what misgiving provoked the above formula’s search for the problematic “.” in lieu of a borough’s actual population, but the misgiving seems to have been expressed; and while I’m all for leaving a spreadsheet’s formulas intact for the next person, I don’t know the contingency for which these formulas are preparing. But perhaps more to the point, couldn’t all the formulaic toing-and-froing have been overruled by simple hard-coded, typed references to the borough population data in the Data sheet?

And another question, as long as I’m asking: what do we make of Sheet 1’s borough population totals in row 5 – e.g. the expression for Barnet in E5?


First, E79 points to a data set of which I was simply unaware: a near reproduction, starting in row 75, of the data in the upper half of Sheet1. That upper half – the half I had assumed served as the one and only data set in the sheet – simply appears to apply its ubiquitous formulas to invoke the parallel data below – plus a curiously diminutive value shadowing the data in row 73:


Those entries in the lowest row above are sequenced downwards, each succeeding borough in its column receiving a decrement of .0001 – and I don’t know why. Nor do I know what the formulas above in Sheet1 bring to the analysis that the hard-coded values below don’t.

But didn’t Mr. Dylan instruct us not to criticize what we can’t understand? I’m not criticizing.

London Borough Data, Part 1: A Look At Some LOOKUPs

25 Aug

When the back end gets moved to the front things start getting interesting. One needn’t look too far into the London-borough-profiles workbook inlaid into the London Data Store site before winding up shaking hands with the contributory data that primes the crisply-drawn map of the boroughs and their demographic traits in the Chart-Map sheet, as well as the ingredients batching the Profiles sheet into something comestible.

The data, featuring around 70 fields’ worth of borough-specific breakouts, are transparent to a fault, their formulas unpaved by hard-coded values and most of them available for viewing and emending, and with their two hidden worksheets exposable at the touch of an Unhide command.

As a result, there’s a lot going on in this book. Start with Profiles, a collection of takes on a selected borough’s facts and figures, set into motion by a Data Validation drop-down selection menu (enabled by the List option) that reaps its list entries from the borough names fitted in the Data sheet’s column C (and range-named Boroughs). Click on the borough of Barnet for example and click again, this time on C7, and stare discreetly at this most curious formula:


The formula, meaning (and succeeding, after all) to convey Barnet’s population total, presses its lookup across three different sheets, and I think we need to ask if all that looking is necessary. The formula grabs the drop-down item in C6, which submits itself to a lookup table (or lookup array, in Excel-speak) awaiting in the Data sheet (and note the row-less C:BW reference, which in effect introduces every cell between those columns into the lookup array, and I’m not sure why that overreach was necessary). But it learns its column number – VLOOKUP’s third argument – from cell A1 of yet a third sheet you need to Unhide, entitled lookup (again, you can reveal the sheet without any obstruction).
In excerpt, lookup’s column A starts out looking like this:


Then scroll down Barnet’s borough attributes in Profile’s C column and observe the VLOOKUP column number references stationed in their formulas, each upped by an increment of one from the previous formula one row up, and each receiving that value from lookup’s column A as above – a column that has no concerted role to play in the workbook apart from its 76 numbers, sorted lowest to highest.

Thus the VLOOKUPs in Profiles sight their column, or index, number in a vertically stacked tower of consecutive values that names the horizontal columnar position of the field in which the formula is interested. Barnet’s population of 380,800, then, emanates from the third column in the data!C:BW table array:


But our formula is ordering up that 3 not from data!C:BW – the actual data source – but from an improvised list of numbers quite external to the VLOOKUP fray, in the sheet called lookup.

It’s a rather inorganic and makeshift route to that 380,800, particularly when a known, standard means toward that end avails, something like this:


Here the VLOOKUP index number writes itself after having solicited the good offices of a neighbour in row 7, none other than the GLA Population Estimate 2015 caption in B7 of the Profiles sheet. We can use MATCH to scout the position of that header in the selfsame Data!C:BW table array – or 3, where Barnet’s 380,800 awaits.

The point is that as a veritable matter of definition a lookup’s table array should be self-sufficient; that is, any VLOOKUP’s intent should be anticipated by the array, without any need to look elsewhere for the enabling data. The spire of values in lookup’s A3:A74 is an extra-table array artifice, whose placement suggests that the sheet designer didn’t quite know how to apply MATCH to the task.

Ok, so that’s not a federal offense, and of course the formulas in Profiles do work; but we need to call attention to the ultimately dispensable outside assistance furnished by that range in lookup’s A3:A74. There’s no need for it.
Now there’s at least one other remarkable feature of the Profiles sheet, this time of the formatting kind. I had advanced the guess to myself that the variously colored, merged header cells in the A column had been treated to some manner of conditional formatting, but it turns out they hadn’t been. By way of verification, tap F5 and click Special… on the resulting Go To dialog box and you’re admitted to its Go To Special box, wherein if you click Conditional formats:


Click OK and all sheet cells subject to any such conditional format (by default) will be simultaneously highlighted for your visual inspection. And to my surprise, it was rather cells C7:G77 that had received the formats, one of which looks like this:


That formula tells us that any cell whose row number yields a remainder of 1 when it’s divided by 2 as per the MOD function should be formatted in the above gray. The other condition:


requires that any cell situated in a row number divisible by 2 without a remainder turn white. In short, the formats team up to paint a row-banding effect across the impacted range, in fact a default feature of any data set that decides to turn itself into a table. But it doesn’t seem possible to have recast C7:G77 into table mode, because when I tried it I was told:


And C6 in the data set’s header row comprises a reference to D3, the cell in which the Borough-selection drop-down menu holds court. So the banding-format idea is a smart one.
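The two rules reduce to a single parity test on the row number; a minimal Python sketch of the banding logic the MOD formulas implement:

```python
# A row is shaded gray when MOD(row, 2) = 1, white when the remainder is 0 --
# the two conditional-format rules in miniature.
def band_color(row_number):
    return "gray" if row_number % 2 == 1 else "white"

print([band_color(r) for r in range(7, 13)])  # rows 7..12 alternate gray/white
```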
But then I had a thought. It appears as if the conditional format that daubs even-numbered rows – the white cell fill color – looks an awful lot like Excel’s default cell format, which, after all, offers a black font color atop a white cell. So do we really need this conditional format?

With my lips pursed with that semi-rhetorical question I deleted that format – and saw


And I have no idea what that yellow’s doing there.
I told you there’s a lot going on in this workbook. I didn’t say I could explain it all.

Getting There: Australian Politician Housing Expense Data, Part 2

18 Aug

We’re following the money down under, via the conduit of expenses-plotting spreadsheets flowing our way via the Guardian and its piece on the spending activity of Australian politicians. Having spent last week counting the beans doled out for politicians’ flight outlays, we’ve disembarked and booked a look at the hotels at which they’ve pulled in for the night, or nights. In short, time to take a look at quarters for quarters, in the Travel Allowance 2014 workbook wrapped up in the zip file we downloaded last week.

You’ll recall that last week’s encounter with the Australia data flew into a cloud cover of data quality issues, conspicuous among these: spending figures suited up in textual format. That turbulence has cleared here; all the rate amounts in column G and all but 24 of the amount entries in J are properly numerated. Those 24 appear to be simultaneously encumbered by footnoting asterisks and HTML-code snippets, e.g.

$1,072.00 <i>**</i>

It seems the following value-restorative tack works, which I’ve penned in U8888 (note that before broaching the formula below I sorted the amounts Smallest to Largest, a directionality that ground-floors text entries to the bottom of the stack, crowding them into the J8888:J8910 range). Thus my first formula writes


That expression flags the position of the first instance of the “<” via FIND and comes away with all the characters in J8888 stationed to the left of that symbol, less 1. It finally metamorphoses J8888’s textual tenant into an upstanding value, while interestingly spurning the $ currency sign in its course. Once done, a standard Copy > Paste Values addressed to J8888:J8910 and a deletion of the formulas from U will complete the repair.
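Since the repair formula itself appears only as a screenshot above, here’s a Python rendering of what the description says it does – find the first “<”, keep what sits to its left, and coerce the remainder to a number (Excel’s VALUE sheds the $ sign and commas on its own; Python needs that spelled out):

```python
# FIND("<") + LEFT(..., position - 1) + VALUE, in Python dress.
def repair(amount_text):
    cut = amount_text.find("<")          # position of the first "<", or -1 if absent
    if cut != -1:
        amount_text = amount_text[:cut]  # LEFT(..., cut - 1) in Excel's one-based terms
    return float(amount_text.strip().lstrip("$").replace(",", ""))

print(repair("$1,072.00 <i>**</i>"))  # 1072.0
```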

I can’t explain, however, the values’ several formatting incarnations, e.g. -85, -$104.00, and the reddened -$328.00 (the latter not hued by any conditional format, but rather by a Format Cells option). As a functional proposition the diversity is meaningless; as Ms. Stein put it, a number is a number is a number, particularly if the plan is to feed them all to a pivot table in which a new, all-of-a-piece format is made to devolve upon them. But the inconsistencies stoke questions about the data quality that shouldn’t have to be asked.

And a question or two could likewise be asked about the yes/no data marking the withinelectorate field in the T column. Its header, I’m assuming, wants you to know which lodgings were contracted within and without the office-holder’s district. But a question needs to be entered about the spate of blanks washing across the field. Since the respective answers to the within/without question should be irrefutably binary, it becomes fair to say that an absence of answer means nothing but an absence of information. But blanks aren’t pivot-table friendly; break out the yes/nos this way:

Row Labels: withinelectorate

Values: withinelectorate (necessarily by Count)

and you get


That is, the blanks remain uncounted, and given their unmissable prevalence I think you’d want the associated number of blanks right in there. The standard workaround: just select T2:T8911 and run a Find and Replace through the cells: Find (leave blank) and Replace with NA, or something like it. But that routine flung my laptop into something like an infinite loop, and after having put the Esc to its endless fulminations I bowed to Plan B, whereupon I wrote


to U2, copied down the column, and plugged a Copy > Paste Values of the results back into the T column and deleted the U formulas. Something about discretion and the better part of valor.
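The U2 formula, which survives only as a screenshot, was presumably something on the order of =IF(T2="","NA",T2); in Python terms:

```python
# Presumed equivalent of =IF(T2="", "NA", T2): blanks become the literal text "NA".
def fill_blank(cell):
    return "NA" if cell == "" else cell

column = ["yes", "", "no", ""]          # invented sample values
print([fill_blank(c) for c in column])  # ['yes', 'NA', 'no', 'NA']
```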

Then refresh the pivot table and jigger the Values into % of Column Total mode and:


I think all those NAs matter, standing in the way of a definitive understanding of these home-and-away travels.

Now back to those amounts, 24 of which are negative. We learned last week that these two dozen sub-zero totals attest to reimbursements, presumably for trips not taken or ones errantly billed to the reimburser – that is, the Australian government. Reasons aside, their numbers are indeed few, but they need to be dealt with just the same. I’d allow that they be sorted to the bottom of the field and pushed away from the operative data set via our storied blank-row interpolation, because these negatives report what are in effect non-bookings, for which the politician can be neither credited nor blamed.

Hewing to my own advice, I’m left with 8885 records worth of expenses, which we can, for example, pivot table by party, number of bookings, and amount:

Row Labels: party

Values: party (Count; in effect, any field all of whose cells are occupied could be counted)

Amount (Sum, formatted with commas and to two decimals). I get


Jogging both sets of values into % of Column Totals we see:


(You don’t need the redundant 100% Grand Totals here.) The party-amount proportions line up, more or less, but they need to be proportioned in turn to the numbers of sectarians per party. We’ve discussed this problem elsewhere (trust me, because I can’t yet track down the post) – in which politician names and their parties visit the data set repeatedly, even as we want to count the records of each party member and his/her party uniquely. Again, the most limpid instruction set would advise us to construct this pivot table:

Row Labels: politician

Party (Tabular design without subtotals and grand totals)

That understated table could be selected (via PivotTable Tools > Options > Select > Entire PivotTable in the Options button group), copied, and subjected to a Paste Values atop itself, in effect replacing the pivot table – thus carving a new, lean data set which in turn could itself be pivot tabled:

Row Labels: party

Values: party (Count, along with % of Column Total)


Thus we see that with 43.70% of the recorded politicians and 47.00% of the declared expenses the Liberals seem slightly more indulgent than their contingent size would project, while Labor’s 37.39%/31.80% enrollment/expenditure split speaks to their apparent frugality. But of course neither inference can serve as anything resembling the last word on the subject. Any impugning of legislator or party intents would require a far finer sifting of the nature and geographical plotting of the work to be done. But it’s a start.
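What the two-step pivot routine accomplishes is a de-duplication; a Python sketch of the same idea, with invented sample records (each politician appearing many times in the raw data, but counted once per party):

```python
# Collapse repeated politician records to one row each, then count by party --
# the Python analogue of the politician-level pivot followed by a party-level pivot.
from collections import Counter

records = [  # invented sample rows: (politician, party)
    ("A. Smith", "Liberal"), ("A. Smith", "Liberal"),
    ("B. Jones", "Labor"), ("B. Jones", "Labor"), ("C. Wu", "Liberal"),
]
unique_members = dict(records)            # one party per politician
by_party = Counter(unique_members.values())
print(by_party)                           # Counter({'Liberal': 2, 'Labor': 1})
```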

A raft of other analytical permutations are in there to be spun, of course, but here’s just one more:

Row Labels: details

Values: details (Count)

amount (Sum)

I get, in excerpt:


Scroll down and you’ll see 3760 charges totalling over $4,000,000 for Sittings of Parliament – whatever that might mean, though it likely refers to housing costs accumulated during the body’s sessions – or about 60% of all expenses. But don’t get excited over the Abbott Government’s Budget Cuts and Commission of Audit; they mean Prime Minister Tony Abbott – not me.

Getting There: Australian Politician Flight Expense Data, Part 1

11 Aug

Citizen journalism or free lunch? Admirable voluntarism or crowdfunded advantage-taking? It’s your call, but the Guardian wants you to donate your – and your, and your – analytical smarts to the data on nine spreadsheets itemizing Australian politicos’ expenses, in a collective scanfest to see what sort of discrepancies and pause-givers might be worthy of investigative pursuit.

The sheets, which can be downloaded en masse in zipped form by clicking here on the page linked above:


report travel allowance information, the locations of and costs attending politicians’ sorties away from their district, and data about trips taken by tandems of elected officials who happened to belong to the same party, that latter compendium edging towards the question as to whether they all had to come along for the ride.

The two largest workbooks uncover flight and associate cost data for 2012-2013 (combined) and 2014; and the nearly 45,000 trips filling the two books pump out enough raw materials to perhaps substantiate a story by themselves. After all, the earlier sheet’s 22,044 flights (770 of which predate 2012, by the way) stand atop a tottering denominator of only 244 discrete office-holders, beating out an average of 91.47 flights per politician – or one flight every four days. That’s 90 movies, too. And the 2014 average of 88.63 is also very much up in the air.

Because the data across the two workbooks in effect mean to tell us the same sorts of things, a consolidation might be in order, in the interests of treating the data all of a piece. I’ve used the term “consolidation” irresponsibly, though, because it implies a turning toward Excel’s long-standing Consolidate feature, which doesn’t seem to be the tool of choice here; Consolidate acts mathematically upon the data to be consolidated (i.e., by adding like items), but all we want to do is effect a merger of sorts, such that the 45,000 records in the two files line up obediently in their appointed fields in a unified workbook. It seems to me that a workaday copy-and-paste of the flight data from one book (probably from 2012-2013 to 2014) to the other will work, once you understand – and you will – two disparities: First, that the first five fields comprise comparable information that nevertheless find themselves in different columns, because the Date field waves at us from columns B and D in the 2012-13 and 2014 books respectively. And second, the electorate, state, party, partygroup, and house fields are quite peculiar to the latter file. There’s nothing there in 2012-2013.

Moreover, the 2012-13 workbook enrolls names that don’t appear in the 2014 edition, e.g. Patrick Secker, and as such no state/party/partygroup information about these one-timers could avail at all. That means of course that assorted blanks and fellow-traveling error messages will sully any attempt to composite cross-workbook breakouts along party and state lines and the like. My first assessment of the discrepancies, worked out after redeploying the 2014 data as a de facto lookup table, turns up about 6% worth of unusables, or about 11.5% of the 2012-13 records – again, meaning that these names just can’t be looked up in 2014, because they’re not in that workbook. That disconnect might be prohibitive, particularly because party expense correlations call so much reportorial attention to themselves. The simple, if only partially satisfactory way out, then, would be to confine some analyses to the 2013-14 information.

What are universally available across the sheets are the date data. You’ll be pleased to know that every one of the 44,911 two-workbook records is in fact a date; but you’ll be a little less pleased to discover that the amount data a) aren’t numeric at all, but rather have been straitened into text mode, and b) about 9,000 of the amounts shed any pretence to numeracy at all, substituting hyphens or absolutely nothing for the impostor values. And that is quite beside the 16 entries whose incipient values are freighted with asterisks, e.g. 528.81**.

Now, ok – the asterisks, and all the other text-presenting values in the field, could be righted by couching the data in the VALUE function, e.g.


Or, in the case of the asterisk-ridden:


That expression imparts value to its referenced cell after shearing its pair of asterisks and rescuing the numeric remainder via a LEFT. (Note by the way the dollar signs, reflective of the indigenous Australian currency. And note as well the negative amount entries, apparent reimbursement figures. Thanks to the Guardian’s Nick Evershed for that clarification.)
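Again the formula survives only as a screenshot, but per the description it shears the trailing pair of asterisks with a LEFT before imparting value; a Python approximation:

```python
# Shear the trailing "**" (Excel's LEFT(cell, LEN(cell) - 2)) and coerce
# the numeric remainder, shedding any $ sign and commas along the way.
def devalue_asterisks(cell):
    if cell.endswith("**"):
        cell = cell[:-2]                 # LEFT(cell, LEN(cell) - 2)
    return float(cell.lstrip("$").replace(",", ""))

print(devalue_asterisks("528.81**"))  # 528.81
```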

But with 20% of its flight-charge amount records calling in absent, confident conclusions about who’s done what won’t scamper in your direction, though of course, some findings do remain to be plausibly plumbed, e.g. this pivot-table matrix (I’ve assumed you’ve combined the two workbooks to the extent possible):

Row Labels: Location-from

Column Labels: Location-to (in fact there are 792 blanks in there, and two in Location-From. I’ve renamed the Row and Column Label headers From and To respectively.)

Values: Date (Count; the field was chosen because it is accounted for in every record)

Slicer: politician

You’ll also note this Location-from misfit:


 Thus Australian Prime Minister Tony Abbott’s itinerary looks like this:


 That sort of tableau may prove instructive, but with so much important data simply not there the limitations here are real.

But we have seven other zipped spreadsheets to look at.

Daunting Date Data: UK Unemployment Claims

4 Aug

Here’s a droplet from the torrent of spreadsheets loosed weekly by the UK Office of National Statistics, one that embodies an interesting, if messy, set of problems for the analyst:

Claimant count data

The data historicize British monthly unemployment claims data from a June 1881 point of inception, and drill down to May of this year, across two different sheets.

There are, first of all, significant comparability issues across the years and the sheets, some of which are expounded in the linked document on the Historic Data sheet as referenced in cell A5. If your intentions here are investigatory, those assuredly beg your attention. Indeed, the qualifier in cell A7 of that sheet warns against a melding of the UK and Historic Data – the latter only counting claims from 1881 through 1970 – a recommendation that presumably justifies the very fact of two sheets.

That’s one issue and a real one, albeit of the extra-spreadsheet kind. But there yet might be something to learn from each sheet regarded singly – namely, something about unemployment claims by year.

And to advance that prospect, all good pivot tablers will doubtless move to group the Historic Data (let’s start there) by year(s) and proceed to average their claim rates (which may roughly proxy for unemployment percentages).

And it’s right there where you have to start paying attention. The dates tumbling down the curiously untitled field (we’ll call it Date) in A don’t seem to be dates at all, but rather indolent text; but for once, as we hope to see, these impersonators will serve the analysis well – because Excel can’t evaluate putative dates that were calendared before its baseline day of January 1, 1900. That day receives the date number 1, with every succeeding day counting off its relation to the foundational January 1 by an increment of 1. Thus December 17, 1983 quantifies itself as 30667, for example, the number of days distancing itself from the January 1, 1900 touchstone.

Thus Excel can’t ratify Jun 1881 in native date mode (though a VBA-macro-driven workaround is described here) – and indeed, even if all the entries in the Historic Data sheet postdated January 1, 1900 they’d still resist grouping, because they’re text.
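That serial arithmetic can be double-checked outside Excel. A Python sketch, using the well-known December 30, 1899 effective epoch (which absorbs Excel’s phantom February 29, 1900 for any later date):

```python
# Excel's date serials, reproduced: serial 1 is January 1, 1900, and because
# Excel wrongly honours a February 29, 1900, the effective epoch for dates
# after February 28, 1900 is December 30, 1899.
from datetime import date

def excel_serial(d):
    return (d - date(1899, 12, 30)).days  # valid for dates after Feb 28, 1900

print(excel_serial(date(1983, 12, 17)))  # 30667
```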

But for our purposes, that insufficiency shouldn’t matter. To commence what could be billed as faux date grouping, all we do is check into the D column, head it Year, and enter, in D12:


That expression drags the 1899 out of A12’s Jun 1899 and upgrades it to full-fledged quantitative standing. Copy it down D and you’re in business, priming this sort of pivot table:

Row Labels: Year

Values: Rate (%) (Average, honed to two decimal points. But note that 31 months report NA in lieu of an actual rate).
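The D12 formula, visible only as a screenshot, is presumably on the order of =VALUE(RIGHT(A12,4)); a Python rendering of that faux-date year extraction:

```python
# Take the last four characters of the text date and promote them to a number --
# a presumed VALUE(RIGHT(A12, 4)) in Python terms.
def faux_year(text_date):
    return int(text_date[-4:])

print(faux_year("Jun 1899"))  # 1899
```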

Because the Years are duly numeric, we can group these into say, five-year bins, yielding:


You’ll note the Depression-driven 19.10% in the 1931-1935 bin.

Now there is, apropos the discussion above, a formatting curiosity that’s wormed its way into the dates. You’ll see that the A12:A234 range bears the General format, befitting a non-committal receptivity to any and all date/numeric types, and these precisely dovetail the pre-1900, text data. But A235 – entertaining the Jan 1900 date – asserts a Custom format, consonant with a numeric entry – even as we’ve allowed that all the field’s cells are text. And sure enough – click A235 and jab at the Number Format drop-down menu in the Home > Number button group:


You’ll see that every formatting variant therein fails to vary, each one reporting the same Jan 1900 expression; and that very constancy, Custom format notwithstanding, means – text.

But ok – again, that discovery does nothing but rubber-stamp our earlier averral, to the effect that all of the entries set down in the Date field are merely, and only, text. But what’s interesting about Jan 1900 – and by that I mean ‘Jan 1900 – is that, some wordplay intended, it isn’t completely valueless.

To explain: given the overweening centrality of numbers to the spreadsheet remit, Excel accords certain quantitative properties to apparent numbers costumed as text. Enter ’17 in A3, for example (and the apostrophe textualizes the entry) and go on to enter =A3*2 in A4, and you’ll get…34, a really, truly, usable number. And less obviously, enter ‘Jan 1900 in A3 and type =A3*2 in A4, and you’ll get 2 – and that’s because Excel takes the liberty of treating ‘Jan 1900 as January 1, 1900 – numerical value 1. Multiply it by 2, then, and you get 2.

Thus even as ‘Jan 1900 owns up to its text-label lineage – and as such, can’t be grouped by a pivot table, for example – it’s happy to consort with numeric operators, and pass itself off as a number when asked to assent to the deception.

And if you want to work with the presumptive years in the UK sheet, you’ll want


because the years in the sheet’s A column position themselves before the month in the UK date cells. And because they do, they won’t return the same kind of de facto numeric values that surprised me in the Historic Data sheet. If you want those, you’ll have to restructure the data thusly, e.g., the 1971 Jan entry in A6:


The RIGHT/LEFT extraction reverses the placements of the 1971 and the Jan, after which an authentic date/number should emerge.

Should. But scroll A column downwards, and you’ll drive past entries such as this one, in A230:

 1989 Sep#

That footnoting pound sign, or hashmark (its name depends on the side of the Atlantic on which you’re reading this), inflicts one more dollop of complication into the mix, requiring this formulaic rewrite:


The nested MID rescues the Sep from its entanglement with that sign, as that month name begins in character position 6 in A230.
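Putting the two variants together, a Python sketch of the restructuring – the plain year/month reversal plus the hashmark-dodging MID – might read:

```python
# Swap "1971 Jan" into "Jan 1971"; for footnoted entries such as "1989 Sep#",
# the MID(..., 6, 3) step plucks exactly the month name, sidestepping the "#".
def restructure(text_date):
    year = text_date[:4]     # LEFT(..., 4)
    month = text_date[5:8]   # MID(..., 6, 3) in Excel's one-based terms
    return month + " " + year

print(restructure("1971 Jan"))   # Jan 1971
print(restructure("1989 Sep#"))  # Sep 1989
```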

Ho hum; all in a date’s work.

ATP Tennis Rankings, Part 2: Some Baseline Data

27 Jul

Now that we’ve established the fact of the relative French hegemony over the men’s rankings a breakout of said parameter would appear to be in order, that is, something like this:

Row Labels: Rank (Grouped in bins of 100; needless to say you can demarcate these as you wish)

Values: Rank (Count)

Slicer (or Report Filter): Country

Slicing for France I get:


It must be the wine. With 40 players in the top 400 – and the US has 30, by way of comparison – France is doing something right, and by turning the values over into Show Values As > Running Total mode (Rank serves as the Base Field) you’ll see


That is, the top-ranked half reaching through 1100 counts 96 Frenchmen, or 69% of their total, a favorably top-heavy distribution. The equivalent US proportion stands at 53%.

Now of course there’s another variable we’ve neglected till now – Age, which should have some notable things to say about who’s ranked where. First, I’d calculate age by posting the TODAY() function in some unoccupied cell, say L1, and range-name it, with all due sparseness, t. Then name E1 Age and enter, in E2, =(t-d2)/365.25.

That expression, which we’ve seen before, subtracts the player’s birthdate – a number, after all, when all the formatting is said and done – from today’s date, whenever that day may fall, and divides the difference by 365.25, or the near-average length of a year.
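In Python terms, and taking this post’s date as a stand-in for TODAY(), the age arithmetic looks like this (Djokovic’s May 22, 1987 birthdate reproduces the 28.18 figure cited below):

```python
# The =(t - D2)/365.25 age formula: subtract the birthdate from "today"
# and divide the day count by the near-average year length.
from datetime import date

def age(birthdate, today=date(2015, 7, 27)):  # fixed "today" for reproducibility
    return (today - birthdate).days / 365.25

print(round(age(date(1987, 5, 22)), 2))  # 28.18
```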

But note that 17 players report no birthdate, and so by sorting the ages Oldest to Newest we’ll drop those empty readings to the bottom of the dataset, wherein we can ask our perennial blank row to shove the vacancies from the date-bearing records.

Rounding the results to two decimals, we can begin by pivot-tabling a breakout of average age by grouped rankings, here with the presently usable range (through row 2150) e.g.:

Row Labels: Rank (again, Grouped by 100)

Values: Age (Average). I get


And that, I would contend, is interesting. Top 100 rankers are the oldest, their 28.79 years exceeding by 1.6 years the next-highest average of 27.19, which happens to figure in the very next tranche of 101-200. Indeed, that 28.79 betters the overall men’s average by 4.74 years, not a trifling disparity, and doubtless a meaningful one.

Next, a simple frequency distribution of ages across the rankings would be correlatively valuable, but it isn’t that simple. Start here:

Row Labels: Age

Values: Age (Count)

It’s when we try to group the ages – and I want to organize them in units of one year – that the problems fly off our racket:


Now we ran into the extended-decimal problem some time ago, having alleviated that unsightly tableau courtesy of a consultation with Debra Dalgleish (sorry for misspelling your name last week, Debra). All we need do is delete the number’s decimal portion


and proceed, or so we’d like to think. But when we click OK we see


That is, the “By 1” entry in the above dialog box is all-too literal, squeezing the ages into bins that straddle two year-values even as they describe an interval of one year. But I want each year to stand alone, decoupled from any companion age. And that is a perfectly standard stratagem, one in fact that could be engineered quite swimmingly without any grouping at all under “normal” circumstances – that is, were the data-set ages decimal-free at the outset. But because our field’s ages are now precise to a fault, we need to rewrite E2 thusly:


thereby flattening Novak Djokovic’s 28.18 years into an uncomplicated, but real 28.00 (i.e., that is a meaningful 28, not a format-tinctured 28.18). Copy down the column, and refresh the table. Now I get (ages sorted youngest to oldest):
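The screenshot of the rewritten E2 isn't reproduced here, but an INT wrap – something like =INT((t-D2)/365.25) – is the natural candidate, and the truncation can be sketched in Python as:

```python
import math
from datetime import date

def whole_age(birthdate, today):
    """Hedged sketch of the rewritten E2 (presumably an INT wrap,
    e.g. =INT((t-D2)/365.25)): truncate the fractional years so each
    age stands alone in the pivot table, no grouping required."""
    return math.floor((today - birthdate).days / 365.25)

# Djokovic's 28.18 years (born May 22, 1987, as of the post's date)
# flattens to a genuine, meaningful 28.
dj = whole_age(date(1987, 5, 22), date(2015, 7, 27))
```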

Flip the numbers over into % of Column Total and throw in Rank (Average, two decimals) in Values and


21 is the modal age, but with an average ranking of 1265.73 that cohort has yet to hit its stride. Note 28’s 719.32, but check out the even lower rankings for 30 and 31, numbering 68 and 40 players respectively. And the 34 34s score even better.

You may also be intrigued by the one 48-year-old urging himself upon the rankings, instated at 1716. That demographic belongs to the Israeli/American Oren Montevassel, who as of this writing sits at 1727 (remember our data take us back to December of last year). His Twitter account states its owner is “…considered best player in the world for 40+”, and that judgment may be wholly justified.

In fact, 125 players shared the 1716 ranking, presumably because it amounts to just a few tournament rating points. And to that stratum also belongs the rankings’ youngest entrant, the Italian Nicolas Merzetti, who with a birth certificate (it’s been translated for me) dated October 29, 1999 is less than a third of Montevassel’s age.

Hey ATP – where’s your publicity department? Why aren’t these two guys paired in a doubles team?

ATP Tennis Rankings, Part 1: Some Baseline Data

20 Jul

You may not need or want to know that you’re the world’s 1572nd best plyer of your favorite trade, but tennis plyers, er, players, can’t help but be reminded of precisely where they stand among their peers, and all the time at that. It’s the rankings, those invidious markers of professional worth that want you to know just how many of your fellow practitioners out there do your job better than you do; and the Tennis Abstracts site ranks 2100+ male players on the Association of Tennis Professionals (ATP) tour for your interest, consideration, and analysis (I don’t see an equivalent compendium for the women’s tour, by the way, though other sorts of listings for them are on the site). You can check it out in standard spreadsheet form courtesy of yours truly here:

 ATP Men’s Tennis Rankings

The data set comprises exactly four fields, one of which – the player names – doesn’t really have that much to add to the inquiry, though I wouldn’t delete it. But those three other fields nevertheless suffice to speak instructively about the rankings.

First, a necessary qualification: we need to recognize that, with its December, 2014 timestamp, the rankings – which naturally oscillate with every tournament – aren’t absolutely to-the-minute, an imperfection that is real and should be rued. You’ll happen upon a timelier arraying here at the ATP web site, but its currency is frittered away by the site’s poorer data organization: the players come to screen in blocks of only 100 at a time, and their country codes have been supplanted by text-less national flag icons, which don’t copy-and-paste very well at all, if at all. And that means you have to know your flags.

Shortcomings duly recorded, we can return to our data and attempt a breakout of ranked players by country, an obvious but need-to-know kind of metric. And here too another problem of sorts sprays molasses into the gears: those aren’t flags in there but three-letter codes, and I can’t always tell which country is referenced by which. It took me a round of Googling, for example, before I persuaded myself that MNE means Montenegro, BIH stands for Bosnia-Herzegovina, and MAR is, naturally, Morocco (ok – it’s Maroc in French; look here for some global clarifications). And CHI is Chile, not China; that latter country is signified by CHN.

But once you get past that international complication this straightforward pivot table beckons:

Row Labels: Country

Values: Country (Count)

Country (again, here % of Column Total). Sort by either of the Count of Country values, largest to smallest. I get in excerpt:


Quelle surprise. I would not have anticipated France’s superior representation, or Italy’s third-place showing, but I’m a duffer; followers in the know may find their expectations wholly fulfilled.

But what about the countries’ numbers and their players’ average rankings (whereby lower is better)? Let’s slide Country into Values, summarizing by Average and formatting to two decimals, e.g.:


Leaving the existing sort order in place, we again find the French among the higher achievers. Remember that with 2166 players populating the ratings, the average across that total comes to 1033.5; and so an evaluation of 844.46 – 189 places better than the average – spread across 139 Frenchmen, is no small attainment (note that some players with equivalent, overall tournament point totals will share a ranking. Thus if two players are ranked 83, the next player down the pecking order receives an 85).

The problem of course is that the average reported above won’t directly reconcile absolute country totals with the calculation. We see, for example, that Czechoslovakia’s average rank of 799.78 outpaces France’s appreciably, but it’s gotten there with but 40 players – and that disparity seems to shortchange the latter country. Thus it seems to me that dividing a country’s number of ranked players by its average rating – again, in which lower is better – would realize a truer measure of the national standings. Not perfectly true, perhaps, but truer.

That alternative would appear to call for a calculated pivot table field, something like this:


But that won’t work at all, in part because, to quote Excel boffin Debra Dalgleish:

“A calculated field always uses the SUM of other values, even if those values are displayed with another function, such as COUNT.”

You may want to think about that caution – I know I need to – but either way, you can’t build a calculated field atop a set of counted values – and country names can only be counted. The way in to a solution, then, would return us to the data set itself, to which we need to strap in another field, which I’ll call Country Count, presumably in column E. I’d then select all the country data in C2:C2167, name that range country, and in E2 enter:


and copy down E, even as you blunt the values’ two decimals if you see them. As advertised, the COUNTIF totals respective incidences of each country name, and don’t be bothered by the fact that the total won’t vary among countrymen down the column. For example, the appropriate E cell for each French player will return 139, but that redundancy will be attended to by the ensuing pivot table, which should assume this shape:

Row Labels: Country
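The COUNTIF column feeding that pivot table can be sketched outside Excel as well; here is a minimal Python stand-in (the mini-column of country codes is hypothetical, standing in for C2:C2167):

```python
from collections import Counter

def country_counts(countries):
    """Sketch of =COUNTIF(country,C2) copied down column E: every row
    receives the total number of ranked players from its country,
    redundantly repeated down the column, just as the post describes."""
    totals = Counter(countries)
    return [totals[c] for c in countries]

# Hypothetical mini-column: each of the three French rows gets 3.
counts = country_counts(["FRA", "USA", "FRA", "ITA", "FRA"])
```

As in the worksheet, the repetition down the column is harmless; the pivot table attends to the redundancy.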

And it’s here where we devise the calculated field, which I’m calling AdjustedRank, e.g.:


Click OK and sort by AdjustedRank, Largest to Smallest (here I’ve formatted the results to six decimals):


That’s a bit more like it, I think. Compare France’s adjusted rank with Czechoslovakia’s; now, in spite of the latter’s more impressive average rank, France’s greater player numbers overwhelm that advantage commensurately, at least more or less.

I’m hedging because, taken to its limiting case, my new field begins to cause itself problems. 21 countries contributed exactly one player to the rankings; if one of those players, for illustration’s sake, were rated number one, my little equation:

=Country count/Average ranking

would yield 1, and so would embarrass France’s real-world, table-topping .164602. That scenario isn’t unimaginable, either, and so the above fraction might have to be returned to the shop for a rethink, the better to properly mathematize all contingencies.
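For what it's worth, the field and its limiting case can be played out in a few lines of Python (a minimal sketch; the France figures are the ones quoted above):

```python
def adjusted_rank(player_count, average_rank):
    """The post's calculated field: a country's ranked-player count
    divided by its average ranking (in which lower is better)."""
    return player_count / average_rank

france = adjusted_rank(139, 844.46)    # figures quoted in the post
solo_number_one = adjusted_rank(1, 1)  # the limiting case: a lone #1
```

A lone number-one player scores 1.0 and tops France's .164602 handily, which is exactly the distortion the fraction would need to be rethought to handle.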

OK, so my calculated field isn’t irreproachable. But I ask you: can 139 Frenchmen be wrong?

