NHS Patient Waiting Times, Part 1: Weighting the Formulas

22 Jan

Divide finite resources by a spiking demand and the quotient shrinks, a mathematical truism all too pertinent to the UK’s National Health Service nowadays. Evidence of a system in crisis commands the country’s media and political attention, and of course data contribute integrally to the debate. One testimony – an NHS cataloguing of hospital treatment waiting times distributed by hospital and medical specialty – is here:


The workbook’s front end comprises a Report tab, whose dropdown menus synopsize patient waiting time data for any user-selected hospital and treatment area. You may want to trace the contributory formulas, and note the lookups performed by the OFFSET function, about which we hope to say a bit more a bit later. Note as well the section topped by the Patients Waiting to Start Treatment reference data, vested in the IncompProv sheet; information about patients who completed what the NHS terms their pathway and started treatment is displayed by the two recaps that follow, the first keyed to the AdmProv sheet, the second to NonAdmProv.

Those sheets – that is, the ones bearing the Prov term – are all comparably organized, each earmarking and cross-tabbing its first 19 records by medical specialty aggregates and waiting times, these expressed in one-week intervals. The 20th record then vertically sums the per-week totals; the remaining records break out the data by individual hospitals.

Those 20 aggregating rows loose a familiar complication upon the data sets, at least for the pivot tablers among you. As per previous discussions, the rows in effect double-count the patient figures, by duplicating the hospital-by-hospital totals. And again, for the pivot-table minded, the weekly wait fields could have been more serviceably sown in a single field, for reasons we’ve recalled numerous times.

But those impediments don’t daunt all efforts at analysis. You’ll observe the GT 00 to 18 Weeks SUM fields posted to the far reaches of the PROV sheets, these tabulating the numbers of patients obliged to wait up to 18 weeks for the respective attentions described in the sheets. I began to wonder, then, if an efficient formulaic way to measure patient waits for any number of weeks could be drawn up, and I came up with two (and there are others).

That is, I want to be able to learn for example how many General Surgery patients had to wait up to 14 weeks for treatment, by empowering the user to enter any medical specialty and week-wait duration (I’m assuming here that the wait is to be invariably measured from the inception of the wait – 0 weeks).

Assuming I’m looking at the IncompProv sheet – though what we do there should transpose pretty effortlessly to the other Prov data – I’d begin by instituting a dropdown menu of medical specialties, say in E1. I’d continue with another dropdown, this one unrolling the wait headers set down in row 3 (that is, the dropdown data – and not the menu itself – are to be found in row 3).

I’d next select F3:BH22, thereby spanning all the specialty aggregate names along with the week-wait values data. While the selection is in force I can muster an old but valuable option, Formulas > Create from Selection in the Defined Names button group. That run-through takes me here:


Click OK and each row and column in the selection receives a range name coined from the label to its left or immediately above its column (note that the labels themselves are not enrolled in the named ranges, and Excel’s reference to values in the dialog above may mislead you; the name-supplying labels here are all textual). And by way of a final, preparatory touch, I’d consecutively number the weeks right above their columns in row 2, assigning 1 to 0-1 Weeks, 2 to 1-2 Weeks, and so on.

Remember we’re attempting to tally the aggregate number of patients awaiting treatment in a selected specialty for from 0 to a specified number of weeks. Per that agenda, I’ll enter the number of weeks – that is, the maximum specified week wait – in F2. Let’s say I enter 15, and then proceed to click on Ophthalmology in cell F8. I’m thus instructing the budding formula to calculate the number of patients who awaited treatment in that specialty for a period of up to 15 weeks. Next I’ll enter the formula, say in E2:


I sense an explanation is in order, so here goes. In the final analysis, of course, the above expression sums patient numbers for the given specialty across a variable number of weeks – in the example, 15; and the formulaic potential to confront those variables is realized here by the OFFSET function.

OFFSET offers itself in two syntactical varieties, one requiring three arguments, the other five. The lengthier version, applied here, serves to identify the coordinates of a range, which will in turn submit itself to the SUM. In our case, the first of OFFSET’s arguments gives itself over to the INDIRECT function, itself performing a cell-evaluational role upon the nested CELL(“address”) expression. CELL(“address”) will, upon a refresh of the spreadsheet (triggered by the F9 key or some data entry elsewhere on the worksheet), return the address of the cell in which the cellpointer currently finds itself. Thus, for example, if you enter =CELL(“address”) in A12 and proceed to enter a value in C32, the function back in A12 will return $C$32 – the cell reference, not the cell’s value. But $C$32 is a textual outcome, i.e., not =$C$32. INDIRECT then restores a bona fide cell-referring status to $C$32, and reports the value currently stored in that cell.

So what does this have to do with our task? This: click on any medical specialty in F3:F22 and refresh the sheet. The address of that specialty populates OFFSET. For example, click on the name Ophthalmology, refresh, and OFFSET will turn up F8, for our purposes the first cell in the range we’re striving to define. The two zeroes that follow tell OFFSET to define the range precisely at that F8 inaugural point – that is, not to move any rows or any columns away from the F8 range anchor. The 1 next instructs the formula that the imminent range comprises but one row – the row on which the Ophthalmology data are resting; and the F2+1 measures the range’s width in number of columns – in our case 16, or 15 plus 1, because the data for week 15 stand in the 16th column from the medical specialty cell in which we clicked to initiate the formula.

The answer in this particular case, then, is 330,472, a count of the ophthalmology patients who had to wait up to 15 weeks to receive treatment; and you could divide that figure by the Ophthalmology patient total – VLOOKUP(CELL(“contents”),F4:BH22,55,FALSE) – to learn that about 86% of the patients were waiting up to 15 weeks for treatment. So again, by clicking on Ophthalmology in F8 and refreshing the spreadsheet, the formula is put to work.
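For readers who’d like to sanity-check the arithmetic outside Excel, here’s a minimal Python sketch of what the formula computes – the sum of one specialty’s weekly counts from week 0 through a chosen maximum week, and the share of that specialty’s total the sum represents. The counts below are invented for illustration, not the NHS figures:

```python
# Sketch of the OFFSET-style cumulative sum: given one specialty's row of
# per-week patient counts, sum the bins from week 0 through max_week.
# The slice width of max_week + 1 mirrors the F2+1 width argument.

def patients_waiting_up_to(weekly_counts, max_week):
    """Sum counts from the first bin through the max_week bin inclusive."""
    return sum(weekly_counts[:max_week + 1])

# Invented weekly counts for a hypothetical specialty (20 one-week bins)
ophthalmology = [120, 95, 80, 60, 50, 40, 30, 25, 20, 15,
                 12, 10, 8, 6, 5, 4, 3, 2, 2, 1]

up_to_15 = patients_waiting_up_to(ophthalmology, 15)
total = sum(ophthalmology)          # the analogue of the VLOOKUP total
share = up_to_15 / total            # proportion waiting up to 15 weeks
print(up_to_15, round(share, 2))
```

The slice stands in for the OFFSET range; varying max_week, as the F2 entry does, simply widens or narrows the summed span.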

But there are other formulas that will work, too.

The Guardian’s Top 100 Footballers: Rating the Rankings

8 Jan

You like to window shop, I like to Windows shop; and my curiosity-driven gaze through the virtual panes shone its beam on yet one more data set devoted to the ranking of athletes, this time of the football/soccer genus.

Wheeled into view by the Guardian in Google spreadsheet mode and attired anew here for you in Excel:


the data sort the results of the aggregated, columned (P through EI) judgements of 124 sportswriters, each of whom nominated up to 40 players, assigning each a point evaluation ranging downward from a maximum of 40 (you may want to consult Sky Sports’ differently-ordered top 100 here).

Heading the list – which highlights the top 100, though 380 players were presented to the judges – and unsurprisingly so, is Portugal’s, and Real Madrid’s, Cristiano Ronaldo, putting 68 points worth of distance between himself and the no-less-unsurprising Lionel Messi, he of Argentina and Barcelona. That three of the rankings’ top five players populate the same Barcelona front line, even as that team merely holds down second place in its league, is the kind of sporting discrepancy sure to give the sages something to think about, but I’m a spreadsheet guy. And the fact that only Ronaldo and Messi appeared on every sportswriter’s ballot – and that, as a consequence, some rather formidable players found no place at all among some writers’ top 40 – is perhaps equally extraordinary. Indeed – the Algerian and Leicester City star Riyad Mahrez won one first-place vote – even as he was completely shunned by 16 other writers.

But what of the spreadsheet? Start at the beginning, with its header data walking on the sheet’s ceiling at A1:O1. Those identifiers need to be lowered into row 4, hard by their attendant field data; and by extension, the hyperlink to the judges’ names and rules for assigning rankings stretching across the merged cell P3 must be moved, or its contiguity to the dataset will commit an offside against any pivot table built from the records.

I’m not sure why the sheet’s numbers were left aligned, either, though I’m not pressing any charges for that formatting decision. And because it’s clear that the Guardian has player birth dates – otherwise, the Age at 20 Dec 2016 data could not have been furnished – exactitude might have been slightly better served, and pretty easily at that, via the YEARFRAC function, and an age calculation extended to a couple of decimal points, e.g. 27.83.
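That finer-grained age reckoning can be prototyped outside the worksheet, too. A minimal sketch, with a hypothetical birth date; note that the 365.25-day average-year divisor is my assumption, a rough stand-in for one of the day-count bases YEARFRAC offers:

```python
from datetime import date

def age_in_years(birth, as_of):
    """Approximate a YEARFRAC-style fractional age: whole days elapsed
    divided by the 365.25-day average year length (an assumed basis)."""
    return (as_of - birth).days / 365.25

# Hypothetical player birth date, aged at the Guardian's 20 Dec 2016 mark
print(round(age_in_years(date(1989, 2, 5), date(2016, 12, 20)), 2))
```

The point isn’t the divisor; it’s that two decimal places of age come nearly for free once the birth dates are in hand.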

But if all that qualifies as a nit-pick, the rankings themselves in column A could be more justifiably questioned. Players with equivalent ratings, e.g. Alex Teixeira and Anthony Modeste at 156, were enumerated thusly: 156=, an expression that strips the number of its quantitative standing. Why not assign a 156 to each player instead, as do the women’s and men’s tennis rankings we reviewed in our immediately previous posts?

Moreover, some apparently identical scores appear in fact to have been differentiated. Raphaël Varane and Serge Aurier both check in with 34 points and five sportswriter votes cast, the latter a rankings disambiguator for the Guardian; yet Varane earns a ranking of 118 to Aurier’s 119. The same could be said about Omar Abdul Rahman and William Carvalho, invidiously niched at 133 and 134. In any event, if you do want to numerate all the rankings, say by figuring average ranking by country and/or team, then point a find-and-replace at the data, finding every = and replacing these with nothing.
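If you’d rather script that find-and-replace than point Excel’s dialog at it, the operation is a one-liner; the sample entries below echo the tied and neighbouring rankings discussed above:

```python
# Rankings as they appear in column A, with "=" marking ties;
# stripping the sign restores the numbers' quantitative standing.
rankings = ["118", "119", "133", "134", "156=", "156="]
numeric = [int(r.rstrip("=")) for r in rankings]
print(numeric)
```

With the equals signs gone, averages by country or team become straightforward.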

I’d also withdraw the blank, colored row 105 that hems the top 100 from the lesser-rated crew beneath; while the row means to delimit and frame the footballing elite, per the Guardian’s story, any analysis of the larger cohort of course would need to unify the data set – and that means dismissing its blank rows. In that spirit, then, you’d also want to disavow row 129; whatever informational service its text may perform, it isn’t a ranked player’s record. I also can’t explain why some of the country and club entries, e.g. cells G207 and H219, appear in blue, or why Antonio Candreva in row 213 is described as Italian when his countrymen are identified with Italy. In addition, Nani’s (row 83) nationality is ascribed to Portugal, but with a superfluous space, as is Paul Pogba’s France (row 21).

Also, the formulas cascading down the Up/down field in B that meter a player’s current movement through the rankings from his 2015 score could have simply read =C5-A5, for example, sans the SUM function and its parentheses.

But note as well that players who went unranked in the preceding year received a hard-coded NEW classification for 2016 that in fact could have been subsumed formulaically, e.g.


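Either way, the combined logic – compute the movement, or flag a newcomer – can be sketched in a few lines (the ranks below are invented, and the NEW test simply checks for a missing 2015 rank):

```python
def up_down(prev_rank, current_rank):
    """Movement through the rankings: positive means the player climbed.
    A player unranked last year (prev_rank is None) is flagged NEW,
    mirroring a formulaic alternative to the hard-coded entries."""
    if prev_rank is None:
        return "NEW"
    return prev_rank - current_rank

print(up_down(12, 5))    # a climb of seven places
print(up_down(None, 88))
```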
You’ll also note that the formulas in O compiling the number of a player’s first-place votes look like this:

=COUNTIF(P5:EI5,“40”)

The quotes are superfluous, and I confess to surprise that the formula works. The entries in P5:EI5 are values, after all, not labels. But work it does.
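Excel, it seems, coerces the text criterion “40” back to a number before comparing, which is why the quotes do no harm. In a stricter language the comparison stays numeric throughout; a sketch with invented ballots:

```python
# Tallying first-place (40-point) votes for one player across judges.
# None marks a judge who left the player off his ballot entirely.
votes = [40, 38, 40, None, 35, 40, 29, None]

first_place = sum(1 for v in votes if v == 40)
print(first_place)
```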

And for another matter that warrants our scrutiny, consider the Highest Score Removed field in L. The Guardian determined that any player’s highest rank – or at least one instance thereof – be stricken from his final score as an outlier. That sort of decision rule isn’t unprecedented – figure skating and gymnastics judging protocols often drop highest and lowest scores – but in those sports the extremes at both the high and low end are ignored; the Guardian only points its thumb down at the high – again, just one high, even if others have been issued to the player. Ronaldo’s 63 first-place votes are thus contracted to 62, but those 62 of course exhibit precisely the same score as the ostensible outlier.

Moreover, and unlike other juried events, the number of judges who decided to score a given player here is very much a variable. Thus we need to account for the 76 players whose final score of 0 belies their receipt of an actual, if solitary, vote. That vote, of course, was barred as an outlier, leaving the player with nothing, so to speak. Along these exclusionary lines, it follows then that players named by exactly two sportswriters incurred the loss of their higher score – even though one could challenge the insistence that these are somehow more outlying than their other, lower score.
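The rule as described removes exactly one instance of the maximum – a point worth encoding precisely, since a tied top score survives. A sketch with invented scores:

```python
def score_with_highest_removed(scores):
    """The Guardian's rule: strike exactly one instance of the player's
    highest score, even if other judges awarded the same mark."""
    if not scores:
        return 0
    trimmed = sorted(scores)[:-1]  # drop one copy of the maximum only
    return sum(trimmed)

print(score_with_highest_removed([40, 40, 38]))  # one 40 removed, not both
print(score_with_highest_removed([17]))          # a single vote leaves 0
```

The second call illustrates the 76 players whose lone vote was barred as its own outlier.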

The matter of how to dispose of score extremes has been disputed (see this mathematical exploration) – it would be difficult, for example, to imagine an even halfway-well-intentioned teacher dropping a student’s highest test score (though the classroom scenario features one judge of many test performances; in figure skating many judges arbitrate one performance) – but in the interests of pressing on we could, for example, learn something about the larger aggregate picture by approving a data set comprising the 254 players who received at least one Raw Total point, i.e. prior to the removal of their highest score. If we’re provisionally satisfied with the makeup of that cadre, we could for starters simply pivot table a count of players by country and country raw point total:

Row: Nationality

Values: Nationality (by Count, of course; the data are textual).


RAW TOTAL (again, by Average, formatted to two decimals and with a comma)

I get, in excerpt:


We see that Spanish players win appearance honors, but among the more productive countries Argentina claims the highest average player score, and by quite a margin.

And if you’re wondering, there’s one American player in there, even as he didn’t make the screen shot cut above – Christian Pulisic, who plays for the German Borussia Dortmund squad and is ranked 138th, with 25 raw-totalled points. But he’s only 18 years old – and if we rank the players by age, he comes in at number 2.




Birth Month and Tennis Rankings: Part 2

30 Dec

We could precede any look at the birth-month data for women tennis players with a couple of variously obvious questions. The first asks, most evidently, how these data will compare with those of the men’s cohort. The second asks about those very suspicions; that is, why we’d bother to promote the sense that the women’s results might depart from the men’s. Why should they?

But we can’t begin to suspect without seeing the data, and those offer themselves up to us on the Women’s Tennis Association site here; but I’ve prepared a neat pre-packaged version here:


Those 1300 or so rankings (1313 in fact, that less-than-round number presumably reflective of equally-ranked players) come complete with player country of origin and (real) birth dates, just what we need (note, however, that the player names have been freighted with a superfluous space that you’ll need to trim should you work with those data).

But I digress. Why, after all, might the women’s birth months vary from the men’s? A popular surmise maintains that women players are typically the younger, a nugget of popular wisdom worth mining, as it turns out; ranked women exhibit an average age of 22.52, while the men figure to 24.27. But a birth-month gender divergence thesis would leave popular wisdom out in the cold.

So let’s see. Paralleling the men’s inquiry, we could pivot table the women’s data thusly, for starters:

Rows: DOB

Values: DOB (Count, then % of Running Total against the DOB baseline). Turn Grand Totals off.

I get:


Again, a first-half-of-the-year imbalance emerges, albeit somewhat less pronouncedly than the men’s 55.90%. If we pitch Rankings into the Columns area and group these by bins of 100, we get in excerpt:


Here the first-half predominance is striking, though again the universe’s 100 cases might throw up some interpretive cautions. If, per the men’s survey, we next group the rankings by tranches of 500:


(Remember, and unlike the men’s rankings, the women’s data comprise only 1313 players; and as such the 1001-1500 bin contains 312 records). We see a slow increment in first-half percentages across this coarser grouping, but the edge holds in each case.
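For anyone who’d like to replicate those running-total percentages outside a pivot table, here’s a pure-Python sketch; the birth months below are invented stand-ins for the DOB field:

```python
from collections import Counter
from itertools import accumulate

# Invented birth months (1-12) standing in for the players' DOB column
birth_months = [1, 1, 2, 3, 3, 3, 5, 6, 6, 7, 8, 10, 11, 12, 12, 4]

counts = Counter(birth_months)
ordered = [counts.get(m, 0) for m in range(1, 13)]   # Jan through Dec
running = list(accumulate(ordered))                  # cumulative counts
total = running[-1]

# Running-total percentage by month, the pivot's "% Running Total In"
running_pct = [round(100 * r / total, 1) for r in running]
first_half_share = running_pct[5]   # cumulative share through June
print(first_half_share)
```

The June entry of the running percentages is precisely the first-half share the posts keep returning to.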

For the American contingent, a country Slicer can again be put to work, to recreate last week’s result here for women:


Unlike the men, a US first-half effect does register for the 114 women from the States.

I’m not sure what, if anything, that means –  particularly given the modal birth month for the American women of February – but we’re left to consider the import of the larger findings (while remembering of course that the first half of any year comprises fewer days, too); and again, the notable persistence of the first-half birth-month margin sets its explanatory challenge before us, and toughened by the data’s cosmopolitan demographics. 84 countries have provided the ranked women, and that variety doubtless bespeaks diverse recruitment and instructional programs, all aggregating to the above distributions. And the rough likeness of the men’s and women’s birth-month distributions may simply affirm a gender-invariant character to those programs these days. In any case, if you’ve been looking for some journalistic marching orders, perhaps you’ve found them here.

Now it was during these speculations that another means for assaying the birth-month phenomena came to mind. Instead of breaking out births by months – a wholly sensible recourse, to be sure – it occurred to me that a birth-month index of sorts could be developed by dividing the number of days elapsed from January 1 of a player’s birth year through her actual birth date by the number of days in that year. Thus, a player born around July 1 – more-or-less the year’s halfway point (there are leap years in the chronology, of course) – would exhibit a birth fraction, as it were, of .5.

The idea in turn would be to average all the players’ birth fractions, with the intention of learning how near or far from .5 the average might veer. A relatively low average – e.g. .45 – would signal a cohort average birth date prior to July 1, and thus offer another, finer reading of the birth-month data. By way of contrast, if one breaks out births by month – as we have to date – then births on June 1 and June 30 are to be understood as equivalently June-occurring – even as the former date holder is of course older.

With that program in mind I can move into column H, title it YearPercentage or something like it, and enter in H2:


Then of course you’ll copy down the H column.

(Your formula labors here and elsewhere may profit by subjecting them to a durable, onscreen look in a free cell, by referencing the formula with the FORMULATEXT function.)

What is this formula doing? Something like this: it calculates the number of days a player’s birth date is distanced from January 1 of her birth year, and divides that number by the number of days appropriate to that year. In the case of the highest-ranked Angelique Kerber, born on January 18 (remember to send her a card) – the 18th day of the year: if we divide 18 by 366 (the day count of the leap year 1988), we get .049, the proportion of the elapsed year.

So let’s try to detail the workings of the formula. The DAYS function counts the number of days spanning two dates, beginning here with E3, or Kerber’s birthday. The DATE(YEAR(E3),1,1) segment returns January 1, 1988, by grabbing the year from E3 and then posting 1 and 1, or the first day of the first month. Equipped with those three identifying bits, DATE then realizes the specified date, with a +1 tacked onto the DAYS result to see to it that, for example, a January 1 birth date returns a 1, and not a 0.

Kerber’s numerator, then, should read 18, a figure divided in turn by either 365 or 366, the two possible year day counts. The formula asks – with the intercession of the MOD function, which appraises the remainder of a number divided by a second argument, in this case 4 – if the year drawn from E3 is precisely divisible by 4. If it is – that is, if the formula discovers a leap year – we use 366; otherwise, the formula supplies a denominator of 365.
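Putting the pieces together, the worksheet logic can be sketched in Python (note that, like the formula, the divisibility-by-4 test is a simplification of the full leap-year rule – a harmless one for these birth years):

```python
from datetime import date

def birth_fraction(dob):
    """Fraction of the birth year elapsed at the birth date:
    day-of-year divided by the year's day count. Like the worksheet
    formula, any year divisible by 4 is treated as a leap year."""
    day_of_year = (dob - date(dob.year, 1, 1)).days + 1   # Jan 1 -> 1
    year_length = 366 if dob.year % 4 == 0 else 365
    return day_of_year / year_length

# Angelique Kerber, born 18 January 1988: day 18 of a 366-day year
print(round(birth_fraction(date(1988, 1, 18)), 3))
```

Averaging the fractions over every player then reproduces the worksheet’s AVERAGE step.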

Once you copy the expression down the H column you can simply enter a standard AVERAGE somewhere:


I get .4732, suggesting a player birth-month “average” appreciably in advance of the June 30/July 1 yearly midpoint.

And while of course that result appears to merely corroborate that which we already divulged through the earlier pivot tables, our finding here is advantaged by a greater precision.

And precision, as any player whose serve bounces a half-inch outside the lines will tell you, matters.

Birth Month and Tennis Rankings: Part 1

23 Dec

We’ve batted this ball around before, but those hacks were taken on other fields. Still, a recent (UK) Times piece by Daniel Finkelstein on birth order and its association with soccer players’ ascent to the British Premiership league returned the analytical ball to me on a different court – in this case the one earmarked for tennis.

We’ve looked at tennis, too, but with a consideration of country and age-driven breakouts of men’s tennis players – not their birth months. So I booked some time on the tennisabstract site and its current, online-sortable rankings of the male of the species, which you can copy and paste from here.

The rankings seem current indeed, by the way; an ascendant Andy Murray in the pole position attests to their recency. In search of some deep background on the matter, I Googled my way into the menstennisforums site, and its precedent discussion of the birth-month-rankings relationship (you need to join the forum, by the way; a free enrollment entitles you to limited access to its holdings). In this connection a Taiwanese contributor screen-shot this birth-month-rankings distribution for 2014 player-rankings data:


We see that the birth months of all ranked players skew heavily toward the first half of the year, and rather discernibly, though occupants of the top 100 exhibit a far more even natal distribution, among that far smaller sample (if in fact the cohort can be permissibly understood as a sample. A sample of what, after all?). Yet 54% of the top 500 present a first-half birth certificate, as do 55% of top-1000 position holders. The proportion for all 2221 ranked players: 56%. Something, then, seems to be at work. So what about 2016 data?

That sounds like a question we could answer. But before we give it a try, a pre-question of sorts could be posed at the activity: does it pay to bother? If the 2014 data above have been faithfully compiled – and they probably have – would much interpretational gain be realized by another look at the men’s rankings, but two years’ later? With a player cohort exceeding 2000, would statistical sense be served by recounting the birth month distributions?

Well, they said Clinton would win, too. Distributions change, and testing the data anew – which after all are not wholly coterminous with 2014’s player pool – is worth the try, especially since we’ve budgeted for the project (a bit of blog humor, that was).

So let’s see, starting with this pivot table (note: 13 players have no birth dates to report, and are to be filtered away throughout):

Rows: DOB (grouped by Months only)

Values: DOB (Count, then % of Running Total In, this against the DOB baseline, the only one undergirding the pivot table). Turn Grand Totals off, too.

I get:


The running totals’ month-by-month accumulation indeed emulates the 2014 56-44 first/second-half yearly breakout, along with the respective monthly contributions to the whole. No surprises, then – but replication does have its place.

And how do our month distributions compare with the 2014 top 100, 500, and 1000? We can start by dragging Rank into the Columns area and grouping these into bins of 100, retaining the running total effect. Isolating the first bin in the screen shot, I get:


Here, and unlike the 2014 figures, the first/second-half differential breaks 59-41%, comporting with the rankings’ overarching tendency, although again, of course, the universe of 100 players will not mollify a statistician.

For the birth-month distribution for the top 500, group the rankings by that interval:


Pretty much more of the same. Then group by 1000:


The approximate 56-44 weighting runs through the data and its several granularities; and remember that the third, 2001-3000 bin, comprises only 65 players.

Now what if we isolate the contingent from the US? We’ve learned in a previous post about the August birth-month effect that seems to prefigure the career prospects of baseball players from that country. First, in view of the likely diminished US-specific aggregate that’ll sprinkle just a few numbers across the rankings, I’ll remove Rank from the table, introduce a Slicer for Country and click USA, and restore Grand Totals. I’ll also tap DOB a second time for Values duty, one instance to convey the straight sums, the other to record that running column percentage. Here I get:


Note first of all that only 164 Americans appear among the 2087 ranked players, around 7.9% of them all, even as that proportion leads all nations. Second we see that no Jan-Jun differential obtains for the US, though the 23 Americans born in October could perhaps be wondered about.

But the global birth-month disparity holds, and as such calls for an accounting. Tennis players, after all, are among the most international of sporting populations, the rankings admitting players from 98 countries. The simple, but yet-to-be-substantiated, hypothesis would maintain that January 1 cut-off dates for age-specific tennis youth programs advantage older players, but that’s an early surmise. (Note by the way that UN birth data by month across the 1967-2015 period reveal no January-June skew.)

First conclusion: more work needs to be done here. And while we’re at it, think about Michael Grant, an American ranked 836 and born in 1956, having earned his highest rank of 96 in…1979. Well done, Mr. Grant, I’d say – and he was born in February.

But what about women players? Good question.

L’autre election: Budget Participatif, 2016

12 Dec

Trust me; there have been other elections contested across the planet of late that do not involve candidates with big hair and/or trademark pants suits. Consider, by way of example, that now-annual attestation of Parisian fiscal democracy, the Budget Participatif, for which we budgeted a couple of thousand words last year in a pair of posts.

The budget referendum, you may recall, asks Parisians to point their collective thumbs up or down at several hundred project proposals for the city, some specific to one of Paris’ arrondissements (districts), the others citywide. What’s not pinpointedly specific is the definition of a Parisian, understood here as a resident of the city – that’s all. My Google translator imparts some additional slack to the eligibility requirements: “All Parisians may vote without age or nationality (Parisians who live in Paris are deemed Parisians).” We’ll have to call the translation a free one.

Locutions aside, the Paris Open Data site again brings 2016 referendum results to our attention, right here:


Just click on the Excel link; then take a look at what you’ve downloaded.

Surprise. If you think back to last year’s resultats spreadsheet – and if your recollection fails, observe this excerpt:


Seven useful fields in there, naming and counting the information any interested party would seek to know: the sums of the budgets earmarked for the project (in euros), project arrondissement (75000 points to a citywide proposal), internet and in-person vote numbers, and their joined totals (though I’m not sure what the decimals bring to the party), the fate of each vote (GAGNANT flags a winning project, NON RETENU a losing one), and project description. Now unwrap this year’s workbook. It can’t be manageably screen-shot; its 72 columns won’t miniaturize intelligibly, so you’ll have to unwrap it yourself and endeavour to survey its mighty expanse.

In a year’s time the Budget Participatif worksheet has mushroomed its field count by an order of magnitude – even as it sets forth what is in effect the same information, with perhaps one exception we hope to acknowledge later. Where in the 2015 rendition but one superordinate field properly subsumes all the voting information about each and every arrondissement (i.e., Localisation, in column D), the current sheet grants three fields to each – one each for its internet and in-person votes, the third totalling the former two.

And if nothing else, new navigational privations burden the sheet. If you want to view the voting numbers for the 20th arrondissement, then, you’ll be in for a long scroll. And the arrondissements are only intermittently sorted in the Localisation field, too.

I’ve belaboured the point in the past, but I’ll belabour anew: the data set reformation instituted by the 2016 version discourages the kind of analysis to which one would be inclined to subject the data.

For but one example: if in the 2015 rendition I wanted to pivot table election results via a Slicer featuring arrondissement numbers, I’d try

Row: Projets

Projets Gagnants/Non retenus (I’ve worked with the Tabular Form layout, and eliminated subtotals)

Slicer: Localisation

(You’ll note the unfilled Values area – our exclusive concern here with text enumeration entitles us to the omission. And you’ll probably want to turn Grand Totals off.)

And I’d wind up with something like:


 But you won’t be able to replicate the above on the 2016 sheet – because again, each arrondissement has been gifted with a set of fields all its own, and you can’t filter or slice across fields; you slice the items populating a unitary field. And I’m not so sure how a standard filtering of these rows would work, either.
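To make the 2016 layout pivot-friendly again you’d want to unpivot it – melt each arrondissement’s column set back into rows sharing a single, sliceable field. A rough Python sketch of the reshaping, with invented field names and vote counts:

```python
# Sketch of unpivoting the 2016 layout: one column-set per arrondissement
# becomes one row per (project, arrondissement), restoring a single
# Localisation-style field. Field names and counts are invented.
wide = [
    {"projet": "Plus d'arbres", "1_total": 1, "10_total": 412},
    {"projet": "Pistes cyclables", "1_total": 88, "10_total": 57},
]

long_rows = []
for record in wide:
    for key, votes in record.items():
        if key.endswith("_total"):
            arrondissement = key.split("_")[0]
            long_rows.append({"projet": record["projet"],
                              "arrondissement": arrondissement,
                              "votes": votes})

print(len(long_rows))  # one row per project-arrondissement pair
```

Once the data stand in that long form, a single arrondissement field can drive a Slicer or filter, just as the 2015 Localisation column did.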

Indeed – given the wholesale reimagining of the data, ask yourself what pivot tables the current Budget sheet could facilitate. There’s also the matter of row 541 and its queue of what appears to be totals of columnar figures, but these don’t add their respective rows 2 through 540 precisely. Those imprecisions aside, I’d allow that the row needs to be deleted from the data set, or at least resettled elsewhere.

And because of the arrondissement-specific nature of much of the Budget voting – in which district residents decide on indigenous projects – a spate of zeros floods the sheet. This excerpt:


selects a project vote sampling from the 9th through 11th arrondissements, and the corresponding vote for these in the 1st. What you’re seeing makes near-perfect sense; residents of the 1st aren’t supposed to vote for the projects above (though indeed, the fugitive single vote for the Plus d’arbres dans les rues du 10e begs for scrutiny); and that Paris-wide eligibility stricture has the effect of loosing more than 29,000 overwhelmingly extraneous, zero-bearing cells into the data set, or nearly three-quarters of all the cells.

But the 2016 sheet does widen at least one new vista on the vote numbers: it breaks out the vote for citywide projects (the ones denoted 75000) by arrondissement, an insight that the 2015 iteration doesn’t afford. Does that gain, then, offset the inconveniences wrought by the new data organization?

Bonne question; and if Paris Open Data is happy to foot my Eurostar bill I’ll be happy to ask it for you in person.

The Vote, 2016: A Fait Accompli in Progress

29 Nov

When is an election over – when the winner is declared, or after the votes have been counted? The alternatives are neither mutually exclusive nor mutually determinative; while the declared winner readies his regime change, the ballot count proceeds, eerily distanced from an outcome that has already been affirmed and conceded. Mr. Trump plunges ahead toward his office, while the shadows umbrate some other figure on the wall.

Such is election 2016. Both presidential candidates seem to have won – something – but only one gets to hold a party on January 20th, courtesy of the misshapen, superstructural interposition we call the Electoral College. (Note that Donald Trump’s popular vote percentage falls beneath that of both George W. Bush – himself a minority-vote winner – and Mitt Romney, who outpolled Trump but lost. Of course the large third- and fourth-party votes this time have something to do with that disparity.)

But the vote continues to be tallied, and a most useful, near-real-time spreadsheet of the numbers as they stand right now is yours to download, this time courtesy of David Wasserman of the Cook Political Report site. Indeed – on a couple of occasions I’ve clicked its on-site refresh button and watched the numbers change right then and there. To get the sheet, look here (note that the sheet inhabits its space in Google Spreadsheet form, and its File command – the one you’d click in order to download the data in Excel mode – isn’t always there. It may take a couple of accesses before you see it. That’s been my experience).

Spreadsheets of this kind and organization again raise a question I’ve posed here more than once – namely, the one about the intentions you bring to the data. If they’re purely self-educational – i.e., you simply want to learn what the sheet has to say about the current count – then it’s perfectly fine as is, and by definition there appears to be nothing more to do. Read, then, and be edified.

But if you want to act upon the data – that is, try your analytical hand at learning something more than what you’re seeing – you’ll need to decide if the worksheet calls for some manner of restructuring. If for example you decide you want to treat the numbers to a round of pivot tabling, then restructure you must: you must, for example, strip away blank rows 6, 10, and 25, along with the textual captions slid into 11 and 26. You’ll need to vacate rows 7 through 9 as well; leaving them alone will unleash a triple count of the vote totals upon any pivot table, as the U.S. Total in 7 doubles the individual state vote count, and the Swing/Non-Swing State data in 8 and 9 duplicate the count yet again.
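The de-duplication step can be sketched in Python – the labels and totals below are invented stand-ins, not the sheet’s actual figures – to show why the aggregate rows must go before any pivoting:

```python
# Invented miniature of the vote sheet: three aggregate rows that would
# triple-count the state vote in a pivot table, followed by state rows.
rows = [
    ("U.S. Total", 128_000_000),
    ("Swing States", 40_000_000),
    ("Non-Swing States", 88_000_000),
    ("Alabama", 2_078_165),
    ("Alaska", 246_588),
]
aggregates = {"U.S. Total", "Swing States", "Non-Swing States"}
state_rows = [(name, votes) for name, votes in rows if name not in aggregates]
print([name for name, _ in state_rows])  # → ['Alabama', 'Alaska']
```

Only the filtered state_rows belong in a pivot table’s source range; feeding the full list would count every vote three times over.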

But even if one opts against the pivot table strategy there’s still work that could and perhaps should be done, and things to be learned from the sheet. First, I’d restore all the numbers in the data set to their default right alignment; centering New York’s 4,153,119 votes for Clinton (as of this writing) immediately atop North Dakota’s 93,758 imposes a false presentational symmetry on the values. And the formulas in columns E, F, and G, e.g. in E7:


could have submitted to the more parsimonious


OK, that one’s a small point, and here’s another: the parentheses bracketing the formulas in the I column could be lopped off. But the sheet’s color-scheming raises another, weightier issue – namely, whether the collection of tints before us embodies a set of conditional formats, or merely a pastiche of fixed-color design decisions. The answer is all of the above; conditional formats range across some of the data cells, while the latter motif dyes others.

To discover exactly which cells have been subjected to which treatment, we can make our way back again to the agile F5, Go to > Special > Conditional Formats option (we’ve done this before):


Ticking the option button above instructs Excel to go to, or select, all the cells in the sheet that in fact bear some conditional format, and following through here I get, in excerpt:


Note therefore that a good many cells – e.g. those populating the first seven columns – sport static, manually-colored hues that represent party associations: blue for the Democrats, red for the Republicans, yellow for the generic Others. The “margin” columns of H:J, however, were conditionally formatted, for example:


That is, the numbers in the above cells exceeding zero revert to Democratic blue, bespeaking a win for that party; the less-than-zero values turn those cells Republican red.

But note that the M column has been likewise conditionally formatted, even as none of its values seem to have undergone any change of appearance. Click any cell in M, click Conditional Formatting > Manage Rules, and you’ll understand why:


While the greater/less than zero conditions have been entered here as well, the worksheet designer neglected to assign any formatting consequences to the M cells. (Cell A2, disclosing nothing but identifying information about the sheet, has been conditionally formatted as well – but I’m assuming that treatment is a simple, inadvertent mistake.)

But consider the state-name enumerations in the A column. Their respective colors reflect a win for the appropriate party, and these could have and should have been given over to conditional formats (in this case formulas) – and they weren’t. Absent that device, the spreadsheet designer apparently needs to inspect the present vote totals for each state and manually apply the relevant color, again state-by-state. The recommended formulas look something like this (after first selecting the state names in A12:A64):


That is, if the vote in the C (Republican column) exceeds that in the associated B column, color the cell red. Let B exceed C, and the cell reverts to blue. Of course if we had reason to suspect or anticipate a state win for Others, a third condition would have to be introduced.
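The three-condition rule can be sketched as a plain function. Here dem and rep stand in for the B- and C-column counts named above; the other parameter, and the idea of a third-party column at all, are my assumption rather than the sheet’s:

```python
# Sketch of the state-name coloring rule: red if the Republican count wins,
# blue if the Democratic count wins, yellow if Others beat both (the third
# condition the post says would have to be introduced).
def state_color(dem: int, rep: int, other: int = 0) -> str:
    """Return the fill color the state-name cell should take."""
    if other > dem and other > rep:
        return "yellow"
    return "red" if rep > dem else "blue"

print(state_color(dem=4_153_119, rep=2_819_557))  # → blue
```

In the worksheet this logic would live in three conditional-formatting formula rules applied to A12:A64, sparing the designer the state-by-state manual recoloring.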

Now there’s another non-pivot-table-driven finding of interest that the data, in their present incarnation, grant us. Note that states whose vote count has been officially finalized are asterisked, and so we might want to tally that count in turn. In effect, then, we want to look for the appearance of an asterisk in any given state’s name, a task amenable to the same sort of COUNTIF with which we developed key word searches in our Trump tweet anthology. But here you need to be careful. If I enter


we’ll realize a count of 52, because the formula ascribes an equivalent functionality to all three asterisks – that is, a wild-card property – even though the center asterisk is precisely the character for which we’re searching. To overthrow that trammel, you need to enter:


The tilde signals the formula to regard the middle asterisk as the search item (see a discussion of the tilde here; I had to check it out myself). My current total of 17 (one of which is the District of Columbia) tells us that 34 states have yet to complete their presidential vote count, and this three weeks after Election Day.
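The same pitfall can be replayed in Python with the standard library’s fnmatch, whose patterns behave much like COUNTIF’s wildcards: three bare asterisks match every name, while escaping the middle asterisk (with a character class here, with a tilde in Excel) matches only the names that actually contain one. The state names below are placeholders:

```python
from fnmatch import fnmatch

# Placeholder state names; the trailing * marks a finalized count.
states = ["Alabama*", "Alaska", "Arizona", "Arkansas*"]

naive = sum(fnmatch(s, "***") for s in states)      # wildcards all the way down
literal = sum(fnmatch(s, "*[*]*") for s in states)  # middle * taken literally
print(naive, literal)  # → 4 2
```

The naive pattern counts every state, just as the unescaped COUNTIF counted all 52 rows; the escaped pattern isolates the finalized ones.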

It looks as if Mr. Wasserman still has a lot of work to do.

The President-Elect’s Tweets

21 Nov

Donald Trump’s Twitter account describes its holder as President-elect of the United States; so the reports, then, must be true. The deed has been done, the unthinkable has been thought, the reality checks have been written and distributed to the disbelieving. Or is it all another case of fake news?

Call me the naïf – but on the assumption that it really did happen, it next occurred to me that a reeling nation might be restored to equilibrium by taking yet another look at the latest tweets streaming from the curious mind of the chief-executive-in-waiting.

And so it was back to the web site of record, twdocs.com and its burgeoning trove of planetary tweets, for yet another audit of Mr. Trump’s now-presidential ruminations. My spreadsheet haul comprises the victor’s last 3019 tweets as of the afternoon of November 21, dating back to February 19 and moving me to build a first pivot table breaking out his tweet total by month (note that those 3019 exclude replies and retweets, possibly a procedural error on my part). I get:


We’ll note the quicksilver wax-wane of the October and November tweet totals (remembering that the latter sum counts about two-thirds of the month’s transmissions), the numbers perhaps correlates of pre-election frenzy and Mr. Trump’s current preoccupation with other things. Since (and including) the November 8 election day, 51 tweets (appear to) have issued from the @realDonaldTrump signature, these continuing to exhibit the curious, perhaps even trademark ebullience of its eponymous subscriber. There shall be no Marlowe-Shakespeare authorial controversies here; the prose is surely Trump’s – even as his book’s contents may have other claimants. (And by the way – if you can’t get enough of our fearless leader’s literary output, visit the compendious Trump Twitter Archive, a repository of just about every tweet ever fired off by the commander-in-chief.)

So what is there to be learned about November’s 127? I once again ran these tweets through a battery of key-word searches, as per previous Trump posts and via the same COUNTIF routine (sorting the tweets in latest-to-oldest order means rows 7 through 133 will offer up the relevant range to be counted, and will spare you from all array formula concerns). I then subjected the 597 October-1-and-beyond tweets to the same searches, with these joint results, sorted in order of the November tweet key-word appearances:


I did say something about ebullience; and with nearly two-thirds of the November tweets studded with decisive exclamations I think I’m on to something there. And with his flurry of thank yous the ever-courteous Mr. Trump is nothing if not grateful to his minority of supporters. You’ll also note the references to crooked Hillary holding steady, though to be fair the incidence of that sobriquet across all 3019 tweets stands at 6.82%. The man is clearly mellowing, exclamation points notwithstanding. (A technical aside here: the search term @nytimes needs to be preceded by a text-format-bestowing apostrophe. Overlook that punctuation, and Excel will read the @ sign as a vestigial Lotus 1-2-3 formula code.) The slightly odd downturn in Pence-bearing tweets and the slightly-odder-still dip in references to Trump probably reflect the fact that the gentlemen have since gotten their jobs, and no longer need to tug your sleeve as insistently.
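For the curious, the COUNTIF battery has a rough Python stand-in: count how many tweets contain each search term, case-insensitively, one hit per tweet (which is what COUNTIF does, as opposed to counting repeated occurrences). The tweets and terms below are invented miniatures of the real download:

```python
# Invented sample tweets and search terms; counts record, per term, how
# many tweets contain it at least once (case-insensitive), COUNTIF-style.
tweets = [
    "Thank you Ohio!",
    "Crooked Hillary at it again",
    "MAKE AMERICA GREAT AGAIN!",
]
terms = ["thank", "hillary", "!"]
counts = {t: sum(t in tweet.lower() for tweet in tweets) for t in terms}
print(counts)  # → {'thank': 1, 'hillary': 1, '!': 2}
```

No tilde gymnastics needed here, since Python’s in operator treats every character literally.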

If you’re doing your own downloading (that’ll be $7.80, tax included) you’ll doubtless find the tweets make for some interesting, and entertaining, reading. The November 20 encomium for General James “Mad Dog” Mattis forces one to wonder why Trump wanted you to know his nickname; and his view, voiced likewise on the 20th, of the post-production preachment aimed by some of the cast of the show Hamilton at playgoer Mike Pence – “The cast and producers of Hamilton, which I hear is highly overrated, should immediately apologize to Mike Pence for their terrible behavior” – looks past Pence’s own recommendation that people see the musical. And Trump’s November 15 ascription of “genius” to the Electoral College won’t square with his 2012 tweet to the effect that the institution is a “disaster” (that allusion to the College, by the way, is the one and only among the 3019 tweets in my dataset.) And for what it’s worth, this pivot table:

Rows: Source

Values: Source

Drums up this distribution:


And it tells me that Mr. Trump’s phones are a lot smarter than mine, and apparently more numerous.
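That Rows/Values: Source pivot table amounts to a simple frequency count, which Python’s Counter reproduces in a line or two; the source strings below are invented placeholders, not figures from the download:

```python
from collections import Counter

# Placeholder Source-field values; Counter tallies each distinct source,
# just as the pivot table's count of Source does.
sources = [
    "Twitter for Android", "Twitter for iPhone",
    "Twitter for Android", "Twitter Web Client",
]
distribution = Counter(sources)
print(distribution.most_common(1))  # → [('Twitter for Android', 2)]
```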

And if you are in fact downloading and analyzing, there’s one other spreadsheet-specific matter about which you’ll want to know: In past posts I made something of an issue about the uncertain time zones which the data in the Created At field record. I had speculated that the times keyed themselves to the zone in which the downloader resided, but a helpful note from Joel of twdocs set me straight. The tweets are in reality French-timed; that is, set to the time in that country (in which twdocs’ server is stationed) – or generally six hours later than Trump’s native New York. And indeed – my question to Joel about my hourly puzzlements spurred him to rename the Created At field to Created At (UTC+1hr).

And if – if – we assume that Trump’s 51 November-8-and-beyond tweets sprang from New York – a halfway plausible proposition, as the commander-in-chief likes to hunker down in his namesake towers – we could insert a column to the immediate right of Created At, call it NY Time, and enter, in what is now B7:


.25 is Excel’s way of expressing six hours – that is, one-quarter of a 24-hour day. Copy that little formulation down B and you’ve established New York-zoned tweet times.

And what that does, among other things, is trim the number of November-8-plus dates to 46, because our six-hour recalibration has dragged five erstwhile November 8 times back into November 7.
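Excel’s -.25 of a day translates directly to a timedelta of six hours. A sketch with invented timestamps (not the real Created At column) shows the rollback at work:

```python
from datetime import datetime, timedelta

# Invented French-clock timestamps; subtracting six hours converts them to
# New York time, and an early-morning Nov 8 tweet rolls back into Nov 7.
created_at = [
    datetime(2016, 11, 8, 3, 30),   # rolls back into November 7
    datetime(2016, 11, 8, 15, 0),   # stays on November 8
]
ny_time = [t - timedelta(hours=6) for t in created_at]
post_election = sum(t.date() >= datetime(2016, 11, 8).date() for t in ny_time)
print(post_election)  # → 1
```

This is precisely how five of the 51 apparent election-day-and-after tweets slipped back across midnight.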

If we then tread the path of least resistance and muscle in a blank row beneath 52, we can pivot table for the hours during which Trump’s most recent tweets were blurted to his 15.6 million followers, by corralling the A6:Z52 range:

Rows: NY Time (grouped for hours only)

Values: NY Time (Count)

I get:


The president-elect seems to like his tweets in the morning, or at least he does now. But again, an hour-driven scrutiny of all 3019 tweets can’t reckon as confidently with his whereabouts across the last nine months – he was campaigning, after all – and the corollary uncertainty about exactly when he dispatched his tweets.
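The hour-grouping the pivot table performs can likewise be sketched in miniature; the timestamps below are placeholders rather than the real NY Time column:

```python
from collections import Counter
from datetime import datetime

# Placeholder NY-time tweet timestamps; Counter buckets them by hour, as
# grouping the pivot table's NY Time rows by hours does.
ny_times = [
    datetime(2016, 11, 10, 7, 12),
    datetime(2016, 11, 10, 7, 58),
    datetime(2016, 11, 12, 21, 4),
]
by_hour = Counter(t.hour for t in ny_times)
print(sorted(by_hour.items()))  # → [(7, 2), (21, 1)]
```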

But hasn’t Mr. Trump said he wants to be unpredictable?