Archive | April, 2017

The French Presidential Vote, Part 1: The Numbers from Paris

27 Apr

The French electorate has spoken, even as it has neglected to finish the conversation. Round one of the country’s double-staged presidential vote has left two candidates – Emmanuel Macron and Marine Le Pen – standing, and stage two’s denouement, calendered for May 7, will halve the count.

While several media have plotted enlightening cartographic breakouts of the votes, e.g. the New York Times, Bloomberg, and the Guardian – my query about spreadsheet-available data for the election was answered in part by the www.opendataparis.fr site, whose team kindly pointed me to a workbook detailing the vote in Paris, at least. I’m pointing it to you here:

 Presidential Results 2017 Paris

And detailed it is, perhaps to a fault. After you’ve clicked past the necessary column auto-fits you’ll probably determine that the relentlessly identical election date reference in column B can be ignored or deleted, for example, along with no-less-invariant “Présidentielle 2017 – 1er tour” legend in A (note: the data here were extracted from a far larger vote compendium archiving many elections, in which the data in A and B are importantly differentiating). It seems in addition that the Numéro de la circonscription field in E is blank, save its header, though again this field is populated with other election data in the larger workbook parenthesized above. Moreover, I can’t see the clarificatory virtues of the fields in F and G; they seem to merely re-present Paris’ arrondissement (district) numbers in elaborations that restate the plainer entries in the numero d’arrondissement 01 a 20 field in C. If I’m missing something, I’m in the book.

Of course even if you do avail yourself of the arrondissement data in F don’t fret about all those Number Stored as Text messages signaled by the pesky green rectangles; you’re not going to add or subtract arrondissments. But you may have a right, however, to be bothered by the text-number conflation in some other fields, namely O and P, which appear to record blank and otherwise voided ballots, if my French can be relied on. For those, you can select the columns (even simultaneously) and tick the Convert to Number option draped immediately beneath Number Stored as Text.

Once we’re provisionally happy with the data as we’ve reconstructed them, we can ask the preeminent question: how did each candidate do in Paris? This pivot table should conduce toward an answer:

Rows: nom du candidat ou liste

Values: nombre de voix du candidat ou liste….

Apply it again, % of Column Total

You may want to rename those stupendous Values titles, Anglicizing them perhaps to Number of Votes and Percent of All Votes.

I get:

Paris1

What shouts its headline at us, of course, is Paris’ extraordinary inattention to the candidacy of Marine Le Pen. The national runner-up shambled into a barely-perceptible fifth place in the city, outvoted seven-fold by Emmanuel Macron, whose 321,000 margin over Le Pen in Paris contributed one-third of his country-wide superiority (for the national numbers see this recapitulation).

But the data also support a closer look at the numbers, a look worth taking. We can, after all, size up the vote by Paris’ 20 arrondissements. :

Row Labels: nom du candidat ou liste

Columns: numero d’arrondissement 01 a 20

Values: Percent of all votes (by % of Column Total; remove Grand Totals, as they all add to 100%)

I get:

Paris2

The fact of inter-district variation is instructive and unsurprising, but in need of interpretation just the same. Note, for example, the domination of the posh 16th arrondissement by center-right aspirant François Fillon, and his fallback to 12.78% in the far more proletarian 20th, the latter (and neighboring 19th) captured in turn by the “hard-left’s” Jean-Luc Mélenchon, who at same time came away with but 5.40% in the 16th. Le Pen maxes out at 6.48% in the Left Bank’s 13th, but her weakest showing plays our in the not-far-away 6th, also on the Rive Gauche.

The vote considerations could also be scaled to the very different arrondissement sizes and turnouts, e.g.

Rows: numero d’arrondissement 01 a 20

Values: Percent of All Votes (here a simple Sum; yes, the heading could stand a rewrite. You may or may not want to keep the Grand Totals off-screen here).

I get:

Paris3

Note the rough association between arrondissement number and vote totals, remembering that arrondissements spiral outwards and clockwise, with the 1st and 2nd mapped in Paris’ center.

And in view of all the  intra-arrondissement, politico-demographic variation, a comparison of arrondissement turnout rates might be correspondingly informative; but that remit comes with a challenge – this one of spreadsheet, not political provenance. That challenge owes to the way in which the vote records are organized, because in addition to its candidate-respective total every record names the pertinent arrondissement, a polling-place id – numero de bureau de vote 000 a 999 – as well as the number of registrants enrolled at that place – nombre d’inscrits du bureau de vote (and yes, you’ll have to convert those data into numbers, too).

The problem, then, is to stem the repeated counting of the registrant total alongside the individual candidate votes per record. If, for example, ten candidates have won votes at a given ballot place, the registrant sum for that place will appear ten times in the data set, even as we want it to appear but once. Comprendez?

Here’s what I think qualifies as workaround. First, I ranged-named the arrondissement numbers in C arr, and the ballot-place ids in D bur. I next jumped into next-free-column R, named it Registrants, and entered, in R2:

=H2/COUNTIFS(arr,C2,bur,D2)

What’s this formula doing? It’s grabbing the registrant total for each record in H (once you copy the formula down that column) and in effect dividing it by the number of ballot places (bureaus) in the arrondissement, a count realized by the COUNTIFS. Thus, for example, if a given record reports 1300 registrants in a ballot place in which ten candidates competed, the formula divides 1300 by 10 in each record. Add these ten, and we’re back to the 1300 registrants – but appearing only once.

If we then proceed to this pivot-table:

Rows: numero d’arrondissement 01 a 20

Values: Registrants

Sum of Number of Votes

paris4

The registrants totals should properly sum all the unique voters enrolled in each arrondissement, as per the explanation above.

I next devised a calculated field, which I called arrtotal:

Paris5

Thrown into the pivot table, and percentage-formatted accordingly, I got:

Paris6

 The turnout proportions are impressively similar but not perfectly equivalent, ranging from the high of 86.63% in the tony 7th to the low of 77.49% in the 19th. There are surely correlations in there to be studied, but either way, the percentages far transcend the rates for US presidential contests.

Any thoughts about that, Monsieur Trump?

The World’s Airports: Checking In With the Data

9 Apr

Looking for a sure-fire ice-breaker to rustle that awkward silence? Try asking this one: how many airports roll their runways out for flyers worldwide?

I mean, if that question doesn’t electrify your partner into a conversational frenzy then…well I do go on.

Anyway, the answer as of this writing, is 50,863. That’s a pretty big number, gushing over and above a Quora’s respondent’s count of 43,983, tallied in 2016. My source, David Megginson’s Ottowa-based ourairports site, reports the larger figure, and had actually added 20 airports to the count by the time I accessed the data anew (the airports.csv workbook), and after my having performed the analyses below.  By the time you get to the site, who knows?.

The data for the most part seem ready to go, and as such are answerable to a number of standard but useful questions, starting with a pivot-tabled distribution of the types of airports:

Rows: Type

Values: Type

Type (again, % of Column Totals)

I get:

air1

We see that nearly 65% of all airports qualify as small, the sorts at which you’ve likely not disembarked (I’ve discovered a definition of “small” minted by the United States’ Federal Aviation Administration, denoting a facility receiving .05% and .25% of the nation’s boardings, but of course that understanding is US-specific). You also probably haven’t touched down at one of the 20 world’s balloonports, 17 of which get airplay in the US. And I for one am slightly surprised by the substantial representation of heliports in the census. But copters, remember, started hovering long before your mouse did.

You’ve also doubtless duly recorded the count of closed airports, begging the analytical/presentational question whether that not-insubstantial cohort should be tolerated by the dataset, or otherwise grounded. By running a standard data filter on the set, filtering both for “closed” and “CA” in ISO_country, it can be seen that 629, or more than a quarter, of shuttered airports are based in Canada. Whether that figure perpetrates a recording bias by the workbook’s Canadian authorship I don’t know, though I suspect something else is at work here. I suppose that’s where you come in. (ISO country codes are issued by the International Organization for Standardization, and are set forth here.) My inclination is to remove these dormant locales, but only impermanently, via the storied fall-back: run a Find and Replace at “closed” in the C column, replacing every such instance with ZZ, sorting C by A to Z, and threading a blank row between 48541 and 48542, thus ruling the erstwhile closed entries out of the data, but keeping then on call for retrieval if you need them.

We could then break out airport totals by country, for starters:

Rows: ISO_Country

Values: ISO_Country (sorted largest to smallest)

ISO_Country (again, % of Column Totals)

I get:

air2

The United States’ remarkable numerical dominance of the airport count (and that’s Brazil at number 2, a perhaps no less remarkable standing as well) requires another look, or set of looks. If we introduce a Slicer to the table and filter there for medium and large airports the picture changes:

air3

The American proportion slims greatly, reflecting the weeding of small airports from the set. In fact according to Wikipedia only 16 of the world’s 50 busiest airports are US-sited, and if we slice for small_airport instead we get:

air4

America’s rule is restored, trailed again by Brazil.

Now if you wanted to break out American airports only, and by state, you’ll have to direct your attention – and formula-writing acumen – the iso_region field in J. We see that two-character state codes sidle the right side of the region-keying hyphen, e.g. US-NJ. To extract state codes, then, we could try this expression in S2, the first pertinent cell in the nearest free column (you’ll also want to name this budding field, say US State):

=IF(I2=”US”,RIGHT(J2,2),”ZZ”)

The formula inspects the country-bearing column J for US; if it finds it, it tugs two character from the right of the kindred cell in J. Once put in place try, for starters:

Rows: US State

Columns: US State (Count, of necessity)

I get, in excerpt:

air5

Two loose ends among the data need to be tied. First, the -A item that’s sorted itself to the head of the labels clearly promulgates an error, or given its count, a pair of errors, likely of the data-entry kind. By double-clicking the count cell of 2 an insurgent sheet adduces the two wayward records, including the airports’ latitudes and longitudes. By copying and pasting those coordinates into this site, I was told that the airports in fact operate in Indiana and Georgia.

That realization in turns calls for a find-and-replace of the errant data, but it seems that feature just won’t work with formula results, presumably because in our case you’d wind up replacing a formula with a hard-coded textual substitute and Excel wants to protect you from the inconsistency. You thus could run a copy > paste values atop S and then replace -A with IN for the record listing a 39-degree latitude, and exchanging GA for the -A lined up with the 33-degree measure (you could also find the airports’ respective Idents in column B and replace the state names in the records themselves).

The second restorative is a simple one, a filtering out of the ZZ, non-American airports. You can then sort the airport totals, largest to smallest:

air6

Ten percent of all US airports, then, serve flyers in Texas, with runner-up California counting half as many venues. Appoint type to the Columns area and you’ll see, in part:

air7

Note the proportion of heliports to all airports in California, though in fact a scan down the numbers puts the ratios, if not the absolute numbers, in Massachusetts, Connecticut, and New Hampshire even higher, for example.

Guess those flyers are in it for the short haul.