The Supreme Court Database, Part 2: Doing Justice to the Numbers

17 Jul

With 70 years and 61 fields’ worth of skinny on the 78,233 records chronicling the Supreme Court rulings spread across 1946 through 2015, there should be a lot to learn from all those data – once we decide what questions could profitably be asked of them.

What about seasonality? That is, do Court decisions, for whatever reason, issue from the hands of opinion writers in equal measure across session months?

They don’t. Try this pivot table:

Rows: dateDecision (grouped by Months only)

Values: dateDecision (count)

dateDecision (again, by % of Column Total)

I get:

scourt1

Note that yearly court sessions commence on the first Monday in October, and typically proceed through late June or early July (for more, look here). The October inception point means in turn that the term years recorded in column K actually subtend only the later months of that year, and extend well into the succeeding one. For example, the earliest decision disseminated in the 1946 term was time-stamped November 18 of that year.

We see that more than half of all Court decisions were ratified in the April-June quarter, notably swelled by the aggregate June efflorescence. One could conjecture that a need to dispose of a case backlog pressing hard against the impending summer recess explains the June flurries.

And what about what the database calls its issue areas, i.e. the categorical domains of the cases pled before this court of last resort, and their yearly distributions? The issue area field in column AO assigns the litigations to 14 broad classificatory swaths defined here; these can be copied and pasted to an empty range in the workbook and regarded as a lookup table (which we’ll name caseIssue).

Next, enter the field title caseIssue in BK1, the free column alongside our improvised uniquecount field (see last week’s post), and type in BK2:

=VLOOKUP(AO2,caseissue,2)

And copy down. (You may also want to drop a Copy > Paste Special atop the results, a modest austerity measure that stamps hard-coded equivalents of the outcomes; that move condensed my file by about 600K.)
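
One small hedge about the lookup: because the VLOOKUP above omits its fourth argument, it performs an approximate match, which behaves reliably only if the caseIssue table is sorted ascending by its codes. If you’d rather not bank on the sort, the exact-match variant does the same job:

=VLOOKUP(AO2,caseIssue,2,FALSE)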

But now we are made to return to the question imposed upon us in the previous post: because our data set remembers the votes of individual Justices for each case, the case ids naturally appear recurrently and, for our purposes here, redundantly – up to nine times each. In the last post we applied a formula to isolate single instances of each case in order to realize our count of cases heard by year. Here we need to conduct another count, of sorts, of unique entries, but one that at the same time preserves the categorical identities of the cases.

So I banged together this pivot table, for starters:

Rows: Term

Caseid

Caseissue

In addition, click Design > Report Layout > Repeat All Item Labels, a decision that reiterates the term (year) data for each record, thus overriding the standard pivot table default posting of each year but once. Turn off Grand Totals, too, and make sure to select the Tabular Form report layout.

Doing what pivot tables are trained to do, the table singles out each caseid, per this screen shot excerpt:

scourt2

Then filter out the #VALUE item from caseissue; these represent records (485 of them) for which no issuearea data avail. Note that the Values area remains unpopulated, because we’re not concerned with any aggregated data breakout – at least not here. Rather, we want to treat the table as a de facto data set, itself to be subjected to a second pivot table.

And is that prospect thinkable? It is, and a scrounge around the internet turns up a few slightly variant means toward that end. One strategy would first have the user select the entire table and superimpose a Copy > Paste Values overwrite upon it, downgrading the table, as it were, into a mere standard set that lends itself to a perfectly standard pivot tabling.

But that rewrite isn’t necessary; in fact, you can pivot table the pivot table straight away: first click outside the pivot table and, by way of one alternative, click Insert > Pivot Table. At the familiar ensuing prompt:

scourt3

Select the entire pivot table (I was unable to enter the default table’s name, e.g. PivotTable1, in the field, however). Note in addition that while conventional pivot tables default their location to New Worksheet, Existing Worksheet has been radio-buttoned here; but I’ll revert to type and opt for New Worksheet, because I want the table to begin to unfold as far to the left of the worksheet as possible – that is, in column A. Click OK, and the fledgling second pivot table and Field List check in onscreen.

Then the standard operating procedure resumes:

Rows: Term (grouped in tranches of 5 years)

Columns: caseissue

Values: caseissue (Count, necessarily, by % of Row Totals; as we’re now working with percentages, turn off the Grand Totals that will invariably figure to 100%.)

I get:

scourt4

I hope the shot is legible. If not, the table is yours for the clicking.

I’m not sure the outcomes elaborate “trends”, but that hardly discommends them from analysis. You’ll note a rather decided drop in Federal Taxation and Union-related cases, for example, and a general upswing in Criminal Procedure hearings, even as other categories (Due Process and even Civil Rights) fail to slope straightforwardly; in any event the percentages need to be read against the absolute numbers contributory toward them, because they aren’t always large. With what is in effect a matrix comprising 14 year groupings by as many case categories, those 196 potential intersections map themselves to 8683 cases; do the math, and some of those quotients are likely to be small.
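
(Do that math, by the way, and 8683 divided by 196 averages out to about 44 cases per intersection – with plenty of cells necessarily falling well beneath that mean.)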

And there’s a postscript to all this, of course. Last week’s post resorted to a moderately fancy formula that managed to seek out but one instance of each Court decision, preparatory to our count of caseloads by year. Now I realize that we – I – could have done the pivot-table-of-pivot-table thing to bring about the same result, and with only one field. That is, for the first pivot table-cum-data set:

Rows: Term (in conjunction with Repeat All Item Labels)

Case ID (and Tabular Form report layout)

Then pivot table the above:

Rows: Term

Values: Caseid (Count)

Voila.

 

Hey, it’s my blog – I’m allowed to learn something, too.

The Supreme Court Database, Part 1: Doing Justice to the Numbers

4 Jul

It stands to reason as a matter of near-definitional necessity that open data in a democracy would crank open a window on its judicial system; and the United States Supreme Court seems to have obliged. The Supreme Court database, a project of Washington University that has adapted the foundational labors of the late Michigan State University Professor Harold J. Spaeth to the spreadsheet medium, affords researchers an educative profusion of information about Court doings – once you understand what you’re looking at.

That qualification broadcasts a caution to unlettered laymen such as your faithful correspondent, who still can’t understand why chewing food with one’s mouth open hasn’t been declared unconstitutional.

But enough about me. In fact, the database comprises a collection of datasets that breaks the Court information out along a pair of variables, Case Centered and Justice Centered Data, each of which supplies four variably-formatted sets tied to the above parameters. Here I’ve downloaded Cases Organized by Supreme Court Citation:

court1

The workbook details yearly vote activity of Court justices dating from 1946 through 2015 (and you’ll note the parallel compendium for votes cast between 1791 and 1945); but again, you’ll need to appreciate what the workbook’s 61 fields want to tell you. (You can also conduct detailed data searches on the site’s elaborate analysis page.)

For guidance, then, turn to the site’s very useful Documentation page, wherein the meanings behind the headers/fields are defined via the links holding down the page’s right pane (one quirk: the pane arrays the header/variables by category, and not by the sequence in which they actually appear in the workbook).

But we needn’t think too deeply about the fields’ yields in order to issue a telling first read on the Court’s caseloads. We can move to break out the number of cases the Court has considered by year, by counting the number of unique case ids posted to column A. But there’s a hitch, resembling the one that has stalled us in a number of previous posts: because the records identify each justice’s vote per case, the case ids naturally appear in quantity – in the great majority of cases nine times, registering the number of sitting justices at any given time. (The number isn’t invariably nine, however, because occasional vacancies depress the total.)

But whatever the justice count, we want to reckon the number of unique cases the Court heard during a given year. We’ve entertained a number of takes on the matter in the past, and here’s another one, this a reasonably elegant formulation adapted from a solution advanced here. Slide over to column BJ, the one adjoining the dataset’s last occupied field, name it UniqueCount, and enter in BJ2:

=IF(COUNTIF(A$2:A2,A2)=1,1,0)

Copy the above down BJ (note that, owing to processing issues, the copy-down may take some time).

What is this expression doing? It’s conducting a COUNTIF for the appearance of each entry in the A column (positioned in the formula as the COUNTIF criterion), the frozen A$2 serving to progressively expand the criterion range as the copy-down proceeds. If any instance of the formula returns an aggregate count exceeding one, the IF statement instructs it to post a zero. Thus we’re left in UniqueCount with a collection of 1’s and 0’s, the 1’s in effect counting each case id once and ignoring any additional appearances of that id. (Note as well that here we need to conduct a row-by-row count of unique entries, because the 1’s need to be summed inside the pivot table. Other formulas will deliver the total number of unique elements in a solitary formula; look here, for example, for a pretty well-known array-formulaic solution.)
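
For the record, that array-style approach runs along these lines – a sketch, assuming the case ids fill A2:A78234 with no blanks among them:

=SUMPRODUCT(1/COUNTIF(A2:A78234,A2:A78234))

Because each id’s reciprocal counts sum to exactly 1 across its occurrences, the grand total returns the number of distinct ids – though across 78,000-plus rows the recalculation won’t be speedy.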

Once we’ve harnessed the UniqueCount values, i.e. we’ve winnowed the data for but one citation of each case, we can put these immediately to work in this pivot table:

Rows: Term

Values: uniquecount

I get (in excerpt, after having frozen the title row):

court2

Beginning with the early 90s, the Court appears to have, for whatever reason, imposed a dramatic scale-down in cases heard (see this analysis of the phenomenon). The Court adjudged 197 cases in 1967; by 2014 the number had contracted to 70.

For another, most interesting gauge of the court’s collective disposition, we could turn to the Direction parameter in column BF. Its entries are tidily binary; a 1 attests a justice’s conservative orientation to the vote, a 2 a liberal tilt. (Of course those conclusions require something of a judgement call; for a detailed accounting of the Database’s coding determinations look here.)

Some basic math should help ground the analysis. If the Court rules conservatively through a 5-to-4 vote, its Direction aggregate will figure to 13 – 5 1’s and 4 2’s, or a Direction average of 1.44 – 13 divided by 9. An equivalent liberal majority comes to 14, or a 1.56 average. A theoretical ideological midpoint of 1.5, then, centers the averages.
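
To verify the arithmetic in-cell, if you like: =(5*1+4*2)/9 evaluates to 1.44, and =(4*1+5*2)/9 to 1.56, each rounded to two decimals.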

With those understandings in tow we can develop what could be called a directional index of Court votes, e.g.

Rows: Term

Values: Direction (average, rounded to two decimals).

I get, again in excerpt:

court3

Of course, you’ll want to peruse all the averages, but the data seem to affirm the Court’s measurable movement to the right. From a relative liberal high of 1.70 in 1963, the averages begin to descend, first falling through the 1.5 divide in 1975 and bottoming out in 1998 and 2002 at 1.40. The 2015 average of 1.49, however, plunks down right in the middle of the road – and a grouping of the terms in five-year tranches situates the votes for the latest available interval right atop the center stripe (or 1.501996, if you insist).

court4

A logical follow-on, then, would have us average the voting propensities of individual justices, names nicely paired with their votes in the justiceName field in BC. For good measure we can easily tally the number of decisions on which each Justice opined, by simply counting the number of times his or her name appears in the field (of course a Justice votes only once per decision):

Rows: justiceName

Values: Direction (average, to two decimals)

justiceName (count)

After sorting the averages smallest to largest and pasting together two screen shot excerpts I get:

court5

Now those tabulations are instructive, albeit not wholly unpredictable. Justices Lewis Powell, Clarence Thomas, Warren Burger, and the late Antonin Scalia hold down the farthest-right positions, with William O. Douglas – a justice publicly renowned for his juridical liberality – pushing hardest towards the left, as it were. Chief Justice Earl Warren – famously lauded and/or scored for comparable takes on the law – isn’t far behind. And for the record it was William Brennan, among the post-1945 justices at least, whose 5325 votes mark him as the most participative; but at the same time remember that the 1946 divide artificially stunts the vote totals of justices whose appointments predated that year.

But time to recess for lunch. And I know – you’ll be watching closely to see how I chew my food.


Texas Child Abuse Data: The Grim Count

18 Jun

Credit the Austin (Texas) American-Statesman with assuming the most disagreeable, but perhaps necessary, task of recording and expounding instances of child abuse perpetrated in the state – in this case, abuse taken to lethal extremes.

The paper’s Child abuse and neglect fatality database maintains data about 779 such instances, and makes them available in spreadsheet form as well:

cps-reports

(You’ll need to auto-fit the C column.)

Because the above-referenced web page features an array of graphical synopses of the abuse data:

state1

We could properly ask if a spreadsheet-driven consideration could abet the information already put into place above. Perhaps it could.

Note the uppermost graphic, a gender-coded block of icons of the victims. Looking past the pink-blue gender associations, any click on an icon triggers a caption naming the victim, his/her age, and cause of death. The icons are not rigorously ordered chronologically, and I am not sure why. Moreover, and perhaps curiously, the graphic does not supplement its content with a simple numeric count of victims by gender; and while a scan of the icons inclines the view toward a male predominance, precision can be served via this simple pivot table:

Rows: Gender

Values: Gender (count; the data are textual)

Gender (again, % of Column Total, and turn off Grand Total)

I get:

state2

I for one am surprised by the disparity; why boys account for 60% of the reported victims poses a real and important question, though the finding would perhaps need to be cross-checked against data from studies compiled for other regions. Could it be that boys are seen as somehow sturdier and more accustomed to roughhousing treatment, and hence putatively better able to weather abusive assaults? I don’t know, but again, the gender question is worth asking.

Indeed, if one reconstructs the pivot table:

Columns: Gender

Rows: Cause of Death

Values: Gender (count)

One sees in excerpt:

state3

We learn here that the boy-girl differential persists steadily across abuse categories.

Calculating the average age of victims could be understood as a grim but significant task, one complicated by the structure of the worksheet. Column E presents age data tragically denominated variously in the months, years, and weeks categories in F; as such we need to unify the parameters. Perhaps the simplest means of reconciliation is to pry open a column between F and G, and enter in what is now G2:

=IF(F2="Months",E2/12,IF(F2="weeks",E2/52,E2))

The nested IF expression identifies the time categories stored in the F column, and calibrates these into fractions of years, or whole years. Once we copy the formula down G we can execute a Copy > Paste Values into E, delete the temporary G column, and at the same time delete the Age Type field, containing as it does the now-inaccurate variable age categories. We could then, in the interests of clarity, retitle the field in E Age in Years.
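
A slightly more defensive rewrite – a sketch only, and one that assumes Months, Weeks, and Years are the data’s sole labels – would flag any stray entry for inspection rather than silently reading it as years:

=IF(F2="Months",E2/12,IF(F2="Weeks",E2/52,IF(F2="Years",E2,"check unit")))

(Excel’s = comparison is case-insensitive, by the way, so the lower-case weeks in the original formula matches all the same.)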

We need at the same time to appreciate the aggregate understatement of ages the data convey. Because ages denominated in years were clearly rounded down, we cannot know if a three-year-old victim was in actuality three years and eleven months old, for example. An important measure of precision could have been introduced with a date of birth column, and one assumes those data are available.

In any event, we can proceed to group the victim ages, perhaps in bins of one year:

Rows: Age in Years (grouped by a value of 1)

Values: Age in Years (count)

Age in Years (again, by % of Column Total)

I get:

state4

We see that more than half the victims were two years old or less, qualified by the caution entered above. A first, if distressingly obvious, interpretation would associate vulnerability negatively with age: the younger the child, the more susceptible to deadly abuse.

Another point, apropos the Fault assignment field: the pertinent Statesman graphic – a pie chart captioned Paramour at fault – implicates a relationship partner (Paramour) in 137 of the 779 deaths recorded. That total, which can be swiftly realized by turning on a standard filter and simply entering Paramour in the vacant search field, indeed counts 137 such instances. But you’ll observe that many of these simultaneously arraign a parent; enter that term in the search field and you’ll turn up 581 records (remember that the filter will dredge any record that features the search term anywhere within a cell). Search again for Relative, for example, and 85 more cases populate the filter. In other words, a totalling of all perpetrator types will exceed the 779 records – even as the actual, lower total of 779 is 779 too many.
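
Those filter tallies could likewise be formularized with wildcard COUNTIFs – a sketch, assuming (hypothetically) that the Fault assignment entries occupy G2:G780:

=COUNTIF(G2:G780,"*Paramour*")

The flanking asterisks mirror the filter’s anywhere-in-the-cell search, and swapping in *Parent* or *Relative* should reproduce the other counts.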

New York City Park Clean-up Data: Sweeping Through the Numbers

30 May

Thinking twice about cleaning the bedroom – you know, that bespoke chamber of horrors heaving in entropic fury, the one housing the bed on which you sleep – sometime? If you are, then how many times would you think about cleaning New York’s playgrounds and parks?

How about 880,000, the number of tidy-ups the city’s Sanitation Department performed on its recreational spaces for the 2015-16 fiscal year? Mighty big number – but even so, they don’t do windows.

They do just about everything else, though, and a canvass of those 880,000 (more or less – please read on) contributions to New York’s civic betterment that have been carted into the data set on the city’s open data site may inspire you to consider haulage as a next-career move. (You’ll understand I’m much too nice a guy to force WordPress to absorb 130 MB worth of download from a greedy subscriber. You’ll need to get this one yourself, I’m afraid; just click the Download button on the above link, opting for CSV for Excel or CSV for Excel (Europe), the latter presumably flipping day-month entries in order to comport with regional defaults.)

But even if you put your CV on hold, the data still have something to say, though their fields call for a spell of reflection before you proceed, apart from the necessary auto-fitting of the several date fields. Indeed, your reflection may need to be rather sustained.

If, for example, you want to break out the task activity by New York’s five boroughs – and you probably will – you may want to extract the borough codes from the gispropnum entries in column A: B=Brooklyn, M=Manhattan, Q=Queens, R=Richmond (the official name for Staten Island), and X=Bronx (you may want to consult the helpful field descriptions here). It seems to me that convenience could be served by deleting the propid data in B (these code the city properties on which work was performed, but may not contribute substantively to your analysis, though you may decide otherwise), renaming the field Borough, and entering in B2

=LEFT(A2,1)

Copy the expression down B and the borough information readies itself for grouping, after which you could presumably run a Copy > Paste Values on the B results and delete A, again in the interests of economizing. You’ll also of course observe the Null entries, which presumably report on activities not carried out on site, e.g., Staff Pickup and Lunch.

You may opt for additional field deletions as well in light of the workbook’s size, and the corollary understanding that each such deletion will relieve the book of about 880,000 data-bearing cells. You may have little use for the vehicle_number information, for example, and perhaps route_id, though I suppose both dislodgings may have to be thought through.

But there are deeper concerns. One might want to begin simply, by pivot-tabling task totals by month of the fiscal year:

Rows: date_worked (grouped by months and years. Remember that this fiscal year was booted up on July 1, 2015)
Values: date_worked (% of Column total. You may want to turn off subtotals, and certainly Grand Totals, as the latter will ineluctably add to 100%).

I get:

Park1

Note the apparent, pronounced pullback of activity in September – though I don’t know why – and the orderly step-down in work from March through September, though of course that trajectory plots a seasonality course that runs backwards against the actual yearly grain.

But that read isn’t necessarily that simple, or that accurate. That’s because a good many of the records seem to cite tasks that don’t quite qualify as an actual cleaning chore, e.g., lunch, De-Mobilization. So here’s a plan B: install the activity field in both the Columns and Values area, and in the latter, filter all activities save the one termed Work, which seems to serve as a catch-all item for…actual work on the parks. Continue to impose % of Column total on the values, and I get:

park2

Once we’ve so trimmed the activity data – a retrenchment that pares about 174,000 records from the whole – the monthly distribution redraws its curve. The conspicuous disparities bulge around July and August, each month here grabbing nearly two additional percentage points from the other months, even as the two September figures hold fast. Thus the obvious task-distributional suggestion – that non-work activity somehow features more numerously in July and August – remains obvious; but accountings for the numbers are not. I’m open to suggestions.

In any case, once we’ve put the work item in its place we could key the tasks to their boroughs, by exchanging the Borough field we fashioned earlier for date_worked. I get:

park3

(Remember that N signifies the NULL entry.) It is perhaps notable that the assignments seem to have been similarly distributed across four of the city’s boroughs, with the far-less-populous Richmond well back. But these likenesses call for a deeper think, because they need to reconcile population (and of course Manhattan’s effective daytime census enormously exceeds its residential cohort) to aggregate park/playground counts and areas.

And what about the time expended on the tasks, again proceeding with the work-item filter as above? Here a New York open data field legend enters an important qualification about the nhours field: “A count of the adjusted number of hours a given task was performed [my emphasis].” You’ll see that for some instances – say row 43, a task whose start-to-end times evaluate to 35 minutes but which nhours records as .292 hours, or 17.52 minutes – the lower estimate may reflect the simultaneous efforts of two workers, even as compensation of course would require the higher figure.
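
The arithmetic, for confirmation: =35/60/2 evaluates to the .292 hours recorded, and =0.292*60 restores the 17.52 minutes.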

Going with nhours, then, we could replace activity in Values with nhours and format the resulting sums accordingly:

park4

You’ll note that even as the task numbers stood as near-equivalents for Queens and the Bronx (the latter again represented by the letter X), work in the former borough seemed considerably more time-intensive.

Again, I don’t know why, but I’m just a spreadsheet guy, after all; if you want that Pulitzer you’ll have to take it from here.

The French Presidential Vote, Part 2: The Numbers from Paris

12 May

Ok – you think you know who won the election, but in the interests of confirmation and deep background, you want to know how the French presidential runoff vote broke in Paris, don’t you?

Well, if you’ve made it this far, one supposes you do. The May 7 mano a mano – kind of – between Emmanuel Macron and Marine Le Pen culminated predictably in a flat-out triumph of the former, in both country and city, but even the obvious is properly reducible to its details.

You don’t need my pivot tables to learn that Macron won 89.68% of the determinate Paris vote (i.e., excluding voters who threw their support to no one via an empty form, along with ballots that were voided), sustaining his hegemony in the city over his right-inclining opponent. But for the deeper background (and again thanks to the folks at Open Data Paris) the data are here:

Election Results – Paris May 7

You can also download a far larger compendium of Paris election votes dating back to 2007 here.

Our workbook is organized identically to – really as an extension of – the document at which we looked in the previous post; but since of course the candidate count in our present iteration comes to two, the ensuing sheets look correspondingly sparer. Thus we can rerun this pivot table, after naming the prolix header in N “Candidate Vote”:

Rows: numero d’arrondissement 01 a 20

Columns: nom du candidat ou liste

Values: Candidate Vote (by % of Row Total; turn Grand Totals off).

I get, this time:

run1

The table reports the overwhelming but irregular Paris mandate for Macron; considered in reverse, we see a comparative cresting of support for Le Pen in the well-heeled 16th arrondissement and the adjoining but Left-Banked 15th, and her bottoming out in the relatively hip, gentrifying 3rd.

How do these outcomes square with the election’s first phase, run through on April 23rd? Of course that competition distilled eleven candidacies into May 7th’s two, and so a like-for-like retrospection need be conducted with some care. Still, we can play a useful trick on the April 23rd data, by compiling a pivot table identically conceived to the one above and filtering the numbers for Macron and Le Pen alone. Thus slimmed, the table proceeds to regard the Macron-Le Pen vote as the operative whole, and computes its percentages accordingly.

That is, we can return to the previous (April 23) worksheet data, reconstruct the above table, and isolate Macron and Le Pen. I get:

run2

Remember that Macron won 34.83% of the Paris April 23 vote overall, with Le Pen’s haul figuring to 4.99%; here their head-to-head relationship restores a couple of percentage points to Le Pen relative to the second vote; presumably her loss of ground in the finale speaks to the greater support thrown to Macron by the sectarians of the nine also-rans in the first go-round, and the decrement is fairly consistent across arrondissements. One glaring, provocative exception: Le Pen’s fall-back from 16.33% to 9.85% and a similar differential in the “ethnic” 19th and 20th respectively, these heavily Jewish and Muslim quarters. In light of those demographics, it’s the April 23rd figures that give cause for surprise.

As for turnout, the analysis starts with a decision, or set of decisions, about definition. Does turnout emerge by simply dividing registered voters by all candidate votes, or do you toss in blank submissions (Nombre de blancs (uniquement) du bureau de vote), as well as ballots that, for whatever reason, were voided (Nombre de nuls (uniquement) du bureau de vote) along with proxy (presumably absentee, or Nombre de procurations du bureau de vote) ballots?

We broached this question in part in the previous post, but didn’t think the matter completely through. There we applied a formula that took into account the multiple counting of most of the parameters above, because each entry among the data recorded an individual candidate vote for a particular polling place, but at the same time repeatedly listed the number of blank votes, etc. – one listing for each of the eleven candidates on April 23.

In fact a simpler work-through appears to be available. On the May 7 sheet we can simply scoot to the bottom of the data set and enter a SUM function in cell N1794, thereby totalling all candidate votes. We can then enter SUMs in row 1794 for the other pertinent parameters and divide these results by 2, the number of times each polling place appears in the data set – that is, once for each remaining candidate (and yes, we could do the same for the April 23 data, dividing instead by eleven). But remember that the candidate-vote numbers in N are not subject to the divide-by-two restorative, because each record in N details a unique vote figure.
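
That is – a sketch, assuming the May 7 records indeed occupy rows 2 through 1793, per the ranges cited below – the candidate-vote total in N1794 comes to simply:

=SUM(N2:N1793)

with companion SUMs for H, O, and the other pertinent columns alongside it in row 1794.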

If, then, we want to quantify turnout via the number of candidate votes/number of voters registered (nombre d’inscrits du bureau de vote – and yes, these are text entries that need to be converted back into values), we simply enter (somewhere) =N1794/H1794. That figure: 72.76%. The corresponding proportion for April 23: 82.71% – clearly, and notably, higher. The percentage of blank votes turned in this past Sunday: 4.44%, five times the .82% accumulated on the 23rd. A statement was thereby made.

Now if we want to break out turnout by arrondissement – say by dividing actual candidate votes by registrants per district – I named the arrondissement-bearing range C2:C1793 arr, jumped into the free cell S2 and entered 1, notched a 2 in S3, and filled down to S21, yielding the numbers 1 through 20. In T2, then, I tried this formula:

=SUMIF(arr,S2,N$2:N$1793)/(SUMIF(arr,S2,H$2:H$1793)/2)

The expression sums all candidate votes (in N) for the 1st arrondissement, per the entry in S2. It divides that total by all polling-place registrants in the 1st, a sum itself divided by 2 to offset the double-counting of each place (the votes in N don’t require that adjustment, because each record reports a unique vote figure). After copying the formula down, I realized these turnout percentages:

run3

In line with the April 23 data, the 19th and 20th exhibit lowest turnouts, even as we’ve operationalized them here.

For an index of sorts of disaffection, we can then calculate the blank-vote percentages, e.g.

=SUMIF(arr,S2,O$2:O$1793)/2/(SUMIF(arr,S2,H$2:H$1793)/2)

Note that here both numerator and denominator divide themselves by 2, because both elements need to fend off the double-count spectre (and note the placement of parentheses). I get:

run4

Correlation between candidate-vote turnout and blank-vote proportions: -.54 – pretty substantial, and not shocking. After all, a preference for him, or her, bespeaks by definition a non-preference for no one.
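
That coefficient can issue from a single expression – a sketch, assuming the turnout figures landed in T2:T21 and the blank-vote proportions alongside in, say, U2:U21:

=CORREL(T2:T21,U2:U21)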

So just to confirm, then: he won.

The French Presidential Vote, Part 1: The Numbers from Paris

27 Apr

The French electorate has spoken, even as it has neglected to finish the conversation. Round one of the country’s double-staged presidential vote has left two candidates – Emmanuel Macron and Marine Le Pen – standing, and stage two’s denouement, calendared for May 7, will halve the count.

While several media have plotted enlightening cartographic breakouts of the votes – e.g. the New York Times, Bloomberg, and the Guardian – my query about spreadsheet-available data for the election was answered in part by the www.opendataparis.fr site, whose team kindly pointed me to a workbook detailing the vote in Paris, at least. I’m pointing it to you here:

Presidential Results 2017 Paris

And detailed it is, perhaps to a fault. After you’ve clicked past the necessary column auto-fits you’ll probably determine that the relentlessly identical election date reference in column B can be ignored or deleted, for example, along with the no-less-invariant “Présidentielle 2017 – 1er tour” legend in A (note: the data here were extracted from a far larger vote compendium archiving many elections, in which the data in A and B are importantly differentiating). It seems in addition that the Numéro de la circonscription field in E is blank, save its header, though again this field is populated with other election data in the larger workbook parenthesized above. Moreover, I can’t see the clarificatory virtues of the fields in F and G; they seem to merely re-present Paris’ arrondissement (district) numbers in elaborations that restate the plainer entries in the numero d’arrondissement 01 a 20 field in C. If I’m missing something, I’m in the book.

Of course, even if you do avail yourself of the arrondissement data in F, don’t fret about all those Number Stored as Text messages signaled by the pesky green rectangles; you’re not going to add or subtract arrondissements. But you may have a right to be bothered by the text-number conflation in some other fields, namely O and P, which appear to record blank and otherwise voided ballots, if my French can be relied on. For those, you can select the columns (even simultaneously) and tick the Convert to Number option draped immediately beneath Number Stored as Text.

Once we’re provisionally happy with the data as we’ve reconstructed them, we can ask the preeminent question: how did each candidate do in Paris? This pivot table should conduce toward an answer:

Rows: nom du candidat ou liste

Values: nombre de voix du candidat ou liste….

nombre de voix du candidat ou liste (again, by % of Column Total)

You may want to rename those stupendous Values titles, Anglicizing them perhaps to Number of Votes and Percent of All Votes.

I get:

Paris1

What shouts its headline at us, of course, is Paris’ extraordinary inattention to the candidacy of Marine Le Pen. The national runner-up shambled into a barely-perceptible fifth place in the city, outvoted seven-fold by Emmanuel Macron, whose 321,000 margin over Le Pen in Paris contributed one-third of his country-wide superiority (for the national numbers see this recapitulation).

But the data also support a closer look at the numbers, a look worth taking. We can, after all, size up the vote by Paris’ 20 arrondissements:

Row Labels: nom du candidat ou liste

Columns: numero d’arrondissement 01 a 20

Values: Percent of all votes (by % of Column Total; remove Grand Totals, as they all add to 100%)

I get:

Paris2

The fact of inter-district variation is instructive and unsurprising, but in need of interpretation just the same. Note, for example, the domination of the posh 16th arrondissement by center-right aspirant François Fillon, and his fallback to 12.78% in the far more proletarian 20th, the latter (and neighboring 19th) captured in turn by the “hard-left’s” Jean-Luc Mélenchon, who at the same time came away with but 5.40% in the 16th. Le Pen maxes out at 6.48% in the Left Bank’s 13th, but her weakest showing plays out in the not-far-away 6th, also on the Rive Gauche.

The vote considerations could also be scaled to the very different arrondissement sizes and turnouts, e.g.

Rows: numero d’arrondissement 01 a 20

Values: Percent of All Votes (here a simple Sum; yes, the heading could stand a rewrite. You may or may not want to keep the Grand Totals off-screen here).

I get:

Paris3

Note the rough association between arrondissement number and vote totals, remembering that arrondissements spiral outwards and clockwise, with the 1st and 2nd mapped in Paris’ center.

And in view of all the intra-arrondissement, politico-demographic variation, a comparison of arrondissement turnout rates might be correspondingly informative; but that remit comes with a challenge – this one of spreadsheet, not political, provenance. That challenge owes to the way in which the vote records are organized, because in addition to its candidate-respective total every record names the pertinent arrondissement, a polling-place id – numero de bureau de vote 000 a 999 – as well as the number of registrants enrolled at that place – nombre d’inscrits du bureau de vote (and yes, you’ll have to convert those data into numbers, too).

The problem, then, is to stem the repeated counting of the registrant total alongside the individual candidate votes per record. If, for example, ten candidates have won votes at a given ballot place, the registrant sum for that place will appear ten times in the data set, even as we want it to appear but once. Comprendez?

Here’s what I think qualifies as a workaround. First, I range-named the arrondissement numbers in C arr, and the ballot-place ids in D bur. I next jumped into the next free column, R, named it Registrants, and entered, in R2:

=H2/COUNTIFS(arr,C2,bur,D2)

What’s this formula doing? It’s grabbing the registrant total for each record in H (once you copy the formula down that column) and in effect dividing it by the number of records citing that same arrondissement-and-ballot-place pairing – that is, the number of candidates who drew votes there – a count realized by the COUNTIFS. Thus, for example, if a given record reports 1300 registrants in a ballot place in which ten candidates competed, the formula divides 1300 by 10 in each record. Add these ten, and we’re back to the 1300 registrants – but appearing only once.

If we then proceed to this pivot-table:

Rows: numero d’arrondissement 01 a 20

Values: Registrants

Sum of Number of Votes

paris4

The Registrants totals should properly sum all the unique voters enrolled in each arrondissement, as per the explanation above.

I next devised a calculated field, which I called arrtotal:

Paris5

With that field thrown into the pivot table, and percentage-formatted accordingly, I got:

Paris6

The turnout proportions are impressively similar but not perfectly equivalent, ranging from a high of 86.63% in the tony 7th to a low of 77.49% in the 19th. There are surely correlations in there to be studied, but either way, the percentages far transcend the rates for US presidential contests.

Any thoughts about that, Monsieur Trump?

The World’s Airports: Checking In With the Data

9 Apr

Looking for a sure-fire ice-breaker to rustle that awkward silence? Try asking this one: how many airports roll their runways out for flyers worldwide?

I mean, if that question doesn’t electrify your partner into a conversational frenzy then…well I do go on.

Anyway, the answer as of this writing is 50,863. That’s a pretty big number, gushing over and above a Quora respondent’s count of 43,983, tallied in 2016. My source, David Megginson’s Ottawa-based ourairports site, reports the larger figure, and had actually added 20 airports to the count by the time I accessed the data anew (the airports.csv workbook), after my having performed the analyses below. By the time you get to the site, who knows?

The data for the most part seem ready to go, and as such are answerable to a number of standard but useful questions, starting with a pivot-tabled distribution of the types of airports:

Rows: Type

Values: Type

Type (again, % of Column Totals)

I get:

air1

We see that nearly 65% of all airports qualify as small, the sorts at which you’ve likely not disembarked (I’ve discovered a definition of “small” minted by the United States’ Federal Aviation Administration, denoting a facility receiving between .05% and .25% of the nation’s boardings, but of course that understanding is US-specific). You also probably haven’t touched down at one of the world’s 20 balloonports, 17 of which get airplay in the US. And I for one am slightly surprised by the substantial representation of heliports in the census. But copters, remember, started hovering long before your mouse did.

You’ve also doubtless duly recorded the count of closed airports, begging the analytical/presentational question whether that not-insubstantial cohort should be tolerated by the dataset, or otherwise grounded. By running a standard data filter on the set, filtering both for “closed” and “CA” in ISO_country, it can be seen that 629 – more than a quarter – of the shuttered airports are based in Canada. Whether that figure reflects a recording bias owing to the workbook’s Canadian authorship I don’t know, though I suspect something else is at work here. I suppose that’s where you come in. (ISO country codes are issued by the International Organization for Standardization, and are set forth here.) My inclination is to remove these dormant locales, but only impermanently, via the storied fall-back: run a Find and Replace at “closed” in the C column, replacing every such instance with ZZ, sorting C by A to Z, and threading a blank row between rows 48541 and 48542, thus ruling the erstwhile closed entries out of the data, but keeping them on call for retrieval if you need them.
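
That 629, by the way, could be corroborated formulaically before the ZZ rewrite – a sketch, assuming the type entries hold down column C and the country codes column I, per the state-extraction formula below:

=COUNTIFS(C2:C50864,"closed",I2:I50864,"CA")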

We could then break out airport totals by country, for starters:

Rows: ISO_Country

Values: ISO_Country (sorted largest to smallest)

ISO_Country (again, % of Column Totals)

I get:

air2

The United States’ remarkable numerical dominance of the airport count (and that’s Brazil at number 2, a perhaps no less remarkable standing as well) requires another look, or set of looks. If we introduce a Slicer to the table and filter there for medium and large airports the picture changes:

air3

The American proportion slims greatly, reflecting the weeding of small airports from the set. In fact according to Wikipedia only 16 of the world’s 50 busiest airports are US-sited, and if we slice for small_airport instead we get:

air4

America’s rule is restored, trailed again by Brazil.

Now if you want to break out American airports only, and by state, you’ll have to direct your attention – and formula-writing acumen – to the iso_region field in J. We see that two-character state codes sidle the right side of the region-keying hyphen, e.g. US-NJ. To extract state codes, then, we could try this expression in S2, the first pertinent cell in the nearest free column (you’ll also want to name this budding field, say US State):

=IF(I2="US",RIGHT(J2,2),"ZZ")

The formula inspects the country-bearing column I for US; if it finds it, it tugs two characters from the right of the kindred cell in J, and otherwise posts ZZ. Once it’s put in place try, for starters:

Rows: US State

Values: US State (Count, of necessity)

I get, in excerpt:

air5

Two loose ends among the data need to be tied. First, the -A item that’s sorted itself to the head of the labels clearly promulgates an error – or, given its count, a pair of errors – likely of the data-entry kind. By double-clicking the count cell of 2, an insurgent sheet adduces the two wayward records, including the airports’ latitudes and longitudes. By copying and pasting those coordinates into this site, I was told that the airports in fact operate in Indiana and Georgia.

That realization in turn calls for a find-and-replace of the errant data, but it seems that feature just won’t work with formula results, presumably because in our case you’d wind up replacing a formula with a hard-coded textual substitute, and Excel wants to protect you from the inconsistency. You thus could run a Copy > Paste Values atop S and then replace -A with IN for the record listing a 39-degree latitude, exchanging GA for the -A lined up with the 33-degree measure (you could also find the airports’ respective idents in column B and replace the state names in the records themselves).

The second restorative is a simple one, a filtering out of the ZZ, non-American airports. You can then sort the airport totals, largest to smallest:

air6

Ten percent of all US airports, then, serve flyers in Texas, with runner-up California counting half as many venues. Appoint type to the Columns area and you’ll see, in part:

air7

Note the proportion of heliports to all airports in California – though in fact a scan down the numbers puts the ratios, if not the absolute numbers, even higher in Massachusetts, Connecticut, and New Hampshire, for example.

Guess those flyers are in it for the short haul.