Texas Child Abuse Data: The Grim Count

18 Jun

Credit the Austin (Texas) American-Statesman with assuming the most disagreeable, but perhaps necessary, task of recording and expounding instances of child abuse perpetrated in the state – in this case, abuse taken to lethal extremes.

The paper’s Child abuse and neglect fatality database maintains data about 779 such instances, and makes them available in spreadsheet form as well:

cps-reports

(You’ll need to auto-fit the C column.)

The above-referenced web page features an array of graphical synopses of the abuse data:

state1

We could properly ask, then, if a spreadsheet-driven consideration could abet the information already put into place above. Perhaps it could.

Note the uppermost graphic, a gender-coded block of icons of the victims. Looking past the pink-blue gender associations, any click on an icon triggers a caption naming the victim, his/her age, and cause of death. The icons are not rigorously ordered chronologically, and I am not sure why. Moreover, and perhaps curiously, the graphic does not supplement its content with a simple numeric count of victims by gender; and while a scan of the icons inclines the view toward a male predominance, precision can be served via this simple pivot table:

Rows: Gender

Values: Gender (count; the data are textual)

Gender (again, % of Column Total, and turn off Grand Total)

I get:

state2

I for one am surprised by the disparity; the question of why boys account for 60% of the reported victims is a real and important one, though the finding would perhaps need to be cross-checked against data from studies compiled for other regions. Could it be that boys are seen as somehow sturdier and more accustomed to roughhousing treatment, and hence putatively better able to weather abusive assaults? I don't know, but again, the gender question is worth asking.
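A formulaic cross-check of the pivot table's counts is available, too – a minimal sketch, assuming the gender entries read Male and Female and that you've range-named the field's data Gender (the name is mine):

=COUNTIF(Gender,"Male")

=COUNTIF(Gender,"Female")

Divide either result by COUNTA(Gender) and the percentages reappear as well.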

Indeed, if one reconstructs the pivot table:

Columns: Gender

Rows: Cause of Death

Values: Gender (count)

One sees in excerpt:

state3

We learn here that the boy-girl differential persists steadily across abuse categories.

Calculating the average age of victims could be understood as a grim but significant task, one complicated by the structure of the worksheet. Column E presents age data variously denominated – tragically – in months, years, and weeks, per the categories in F; and as such we need to unify the parameters. Perhaps the simplest means of reconciliation is to pry open a column between F and G, and enter in what is now G2:

=IF(F2="Months",E2/12,IF(F2="weeks",E2/52,E2))

The nested IF expression identifies the time categories stored in the F column, and calibrates these into fractions of years, or whole years. Once we copy the formula down G we can execute a Copy > Paste Values into E, delete the temporary G column, and at the same time delete the Age Type field, containing as it does the now-inaccurate age categories. We could then, in the interests of clarity, retitle the field in E Age in Years.
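To see the conversion at work: a victim recorded as 6 with Months in F resolves to 6/12, or .5 years; one recorded as 26 with weeks computes likewise to 26/52, or .5; and whole-year entries pass through untouched.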

We need at the same time to appreciate the aggregate understatement of ages the data convey. Because the ages-in-years data were clearly rounded off, we cannot know if a three-year-old victim was in actuality three years and eleven months old, for example. An important measure of precision could have been introduced with a date-of-birth column, and one assumes those data are available.

In any event, we can proceed to group the victim ages, perhaps in bins of one year:

Rows: Age in Years (grouped by a value of 1)

Values: Age in Years (count)

Age in Years (again, % of Column Total)

I get:

state4

We see that more than half the victims were two years old or less, qualified by the caution entered above. A first, if distressingly obvious, interpretation would associate vulnerability negatively with age: the younger the child, the more susceptible to deadly abuse.

Another point, apropos the Fault assignment field: the pertinent Statesman graphic – a pie chart captioned Paramour at fault – implicates a relationship partner (Paramour) in 137 of the 779 deaths recorded. That total can be swiftly confirmed by turning on a standard filter and simply entering Paramour in the vacant search field; the count indeed comes to 137 such instances. But you'll observe that many of these simultaneously arraign a parent; enter that term in the search field and you'll turn up 581 records (remember that the filter will dredge any record that features the search term anywhere within a cell). Search again for Relative, for example, and 85 more cases populate the filter. In other words, a totalling of all perpetrator types will exceed the 779 records – even as the actual, lower total of 779 is 779 too many.
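If you'd rather corroborate those filter tallies with formulas, wildcard COUNTIFs should replicate them – a sketch, assuming you've range-named the Fault assignment data Fault (again, the name is mine):

=COUNTIF(Fault,"*Paramour*")

=COUNTIF(Fault,"*Parent*")

The flanking asterisks instruct COUNTIF to match the term anywhere within a cell, precisely as the filter's search field does.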

New York City Park Clean-up Data: Sweeping Through the Numbers

30 May

Thinking twice about cleaning the bedroom – you know, that bespoke chamber of horrors heaving in entropic fury, the one housing the bed on which you sleep – somewhere? If you are, then how many times would you think about cleaning New York playgrounds and parks?

How about 880,000, the number of tidy-ups the city’s Sanitation Department performed on its recreational spaces for the 2015-16 fiscal year? Mighty big number – but even so, they don’t do windows.

They do just about everything else, though, and a canvass of those 880,000 (more or less – please read on) contributions to New York’s civic betterment that have been carted into the data set on the city’s open data site may inspire you to consider haulage as a next-career move. (You’ll understand I’m much too nice a guy to force WordPress to absorb 130 MB worth of download from a greedy subscriber. You’ll need to get this one yourself, I’m afraid; just click the Download button on the above link, opting for CSV for Excel or CSV for Excel (Europe), the latter presumably flipping day-month entries in order to comport with regional defaults.)

But even if you put your CV on hold, the data still have something to say, though their fields call for a spell of reflection before you proceed, apart from the necessary auto-fitting of the several date fields. Indeed – your reflection may need to be rather sustained.

If, for example, you want to break out the task activity by New York’s five boroughs – and you probably will – you may want to extract the borough codes from the gispropnum entries in column A: B=Brooklyn, M=Manhattan, Q=Queens, R=Richmond (the official name for Staten Island), and X=Bronx (you may want to consult the helpful field descriptions here). It seems to me that convenience could be served by deleting the propid data in B (these code the city properties on which work was performed, but may not contribute substantively to your analysis, though you may decide otherwise), renaming the field Borough, and entering in B2

=LEFT(A2,1)

Copy the expression down B and the borough information readies itself for grouping, after which you could presumably run a Copy > Paste Values on the B results and delete A, again in the interests of economizing. You’ll also of course observe the Null entries, which presumably report on activities not carried out on site, e.g., Staff Pickup and Lunch.

You may opt for additional field deletions as well in light of the workbook’s size, and the corollary understanding that each such deletion will relieve the book of about 880,000 data-bearing cells. You may have little use for the vehicle_number information, for example, and perhaps route_id, though I suppose both dislodgings may have to be thought through.

But there are deeper concerns. One might want to begin simply, by pivot-tabling task totals by month of the fiscal year:

Rows: date_worked (grouped by months and years. Remember that this fiscal year was booted up on July 1, 2015)
Values: date_worked (% of Column total. You may want to turn off subtotals, and certainly Grand Totals, as the latter will ineluctably add to 100%).

I get:

Park1

Note the apparent, pronounced pullback of activity in September – though I don’t know why – and the orderly step-down in work from March through September, though of course that trajectory plots a seasonality course that runs backwards against the actual yearly grain.

But that read isn't necessarily that simple, or that accurate. That's because a good many of the records seem to cite tasks that don't quite qualify as actual cleaning chores, e.g., Lunch, De-Mobilization. So here's a plan B: install the activity field in both the Columns and Values areas, and in the latter, filter away all activities save the one termed Work, which seems to serve as a catch-all item for…actual work on the parks. Continue to impose % of Column total on the values, and I get:

park2

Once we've so trimmed the activity data – a retrenchment that excises about 174,000 records from the whole – the monthly distribution redraws its curve. The conspicuous disparities bulge around July and August, each month here grabbing nearly two additional percentage points from the other months, even as the two September figures hold fast. Thus the obvious task-distributional suggestion – that non-work activity somehow features more numerously in July and August – remains obvious; but accountings for the numbers are not. I'm open to suggestions.

In any case, once we’ve put the work item in its place we could key the tasks to their boroughs, by exchanging the Borough field we fashioned earlier for date_worked. I get:

park3

(Remember that N signifies the NULL entry.) It is perhaps notable that the assignments seem to have been similarly distributed across four of the city's boroughs, with the far-less-populous Richmond well back. But these likenesses call for a deeper think, because they need to reconcile population (and of course Manhattan's effective daytime census enormously exceeds its residential cohort) to aggregate park/playground counts and areas.

And what about the time expended on the tasks, again proceeding with the work-item filter as above? Here a New York open data field legend enters an important qualification about the nhours field: "A count of the adjusted number of hours a given task was performed [my emphasis]." You'll see that for some instances – say row 43, a task whose start-to-end times evaluate to 35 minutes but which is yet recorded by nhours as .292 hours, or 17.52 minutes – the lower figure may reflect the simultaneous efforts of two workers, even as compensation of course would require the higher figure.
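
(The arithmetic supports the two-worker reading: .292 hours multiplied by 60 comes to 17.52 minutes, almost exactly half of the 35 clock minutes.)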

Going with nhours, then, we could replace activity in Values with nhours and format the resulting sums accordingly:

park4

You’ll note that even as the task numbers stood as near-equivalents for Queens and the Bronx (the latter again represented by the letter X), work in the former borough seemed considerably more time-intensive.

Again, I don’t know why, but I’m just a spreadsheet guy, after all; if you want that Pulitzer you’ll have to take it from here.

The French Presidential Vote, Part 2: The Numbers from Paris

12 May

Ok – you think you know who won the election, but in the interests of confirmation and deep background, you want to know how the French presidential runoff vote broke in Paris, don’t you?

Well, if you’ve made it this far, one supposes you do. The May 7 mano a mano – kind of – between Emmanuel Macron and Marine Le Pen culminated predictably in a flat-out triumph of the former, in both country and city, but even the obvious is properly reducible to its details.

You don't need my pivot tables to learn that Macron won 89.68% of the determinate Paris vote (i.e., excluding voters who turned in an empty ballot, throwing their support to no one, and ballots that were voided), sustaining his hegemony in the city over his right-inclining opponent. But for the deeper background (and again thanks to the folks at Open Data Paris) the data are here:

Election Results – Paris May 7

You can also download a far larger compendium of Paris election votes dating back to 2007 here.

Our workbook is organized identically to – really as an extension of – the document at which we looked in the previous post; but since of course the candidate count in our present iteration comes to two, the ensuing sheets look correspondingly sparer. Thus we can rerun this pivot table, after I named the prolix header in N "Candidate Vote":

Rows: numero d’arrondissement 01 a 20

Columns: nom du candidat ou liste

Values: Candidate Vote (by % of Row Total; turn Grand Totals off).

I get, this time:

run1

The table reports the overwhelming but irregular Paris mandate for Macron; considered in reverse, we see a comparative cresting of support for Le Pen in the well-heeled 16th arrondissement and the adjoining but Left-Banked 15th, and her bottoming out in the relatively hip, gentrifying 3rd.

How do these outcomes square with the election's first phase, run through on April 23rd? Of course that competition distilled eleven candidacies into May 7th's two, and so a like-for-like retrospection needs to be conducted with some care. Still, we can play a useful trick on the April 23rd data, by compiling a pivot table identically conceived to the one above, filtering the numbers for Macron and Le Pen alone. Thus slimmed, the table proceeds to regard the Macron-Le Pen vote as the operative whole, and computes its percentages accordingly.

That is, we can return to the previous (April 23) worksheet data, reconstruct the above table, and isolate Macron and Le Pen. I get:

run2

Remember that Macron won 34.83% of the Paris April 23 vote overall, with Le Pen's haul figuring to 4.99%; here their head-to-head relationship restores a couple of percentage points to Le Pen relative to the second vote. Presumably her loss of ground in the finale speaks to the greater support thrown to Macron by the sectarians of the nine also-rans in the first go-round, and the decrement is fairly consistent across arrondissements. One glaring, provocative exception: Le Pen's fall-back from 16.33% to 9.85% in the "ethnic" 19th, and a similar differential in the 20th – these heavily Jewish and Muslim quarters. In light of those demographics, it's the April 23rd figures that give cause for surprise.

As for turnout, the analysis starts with a decision, or set of decisions, about definition. Does turnout emerge by simply dividing all candidate votes by registered voters, or do you toss in blank submissions (Nombre de blancs (uniquement) du bureau de vote), as well as ballots that, for whatever reason, were voided (Nombre de nuls (uniquement) du bureau de vote), along with proxy (presumably absentee, or Nombre de procurations du bureau de vote) ballots?

We broached this question in part in the previous post, but didn’t think the matter completely through. There we applied a formula that took into account the multiple counting of most of the parameters above, because each entry among the data recorded an individual candidate vote for a particular polling place, but at the same time repeatedly listed the number of blank votes, etc. – one listing for each of the eleven candidates on April 23.

In fact a simpler work-through appears to be available. On the May 7 sheet we can simply scoot to the bottom of the data set and enter a SUM function in cell N1794, thereby totalling all candidate votes. We can then enter SUMs in row 1794 for the other pertinent parameters and divide these results by 2, the number of times each polling place appears in the data set – that is, once for each remaining candidate (and yes, we could do the same for the April 23 data, dividing instead by eleven). But remember that the candidate-vote numbers in N are not subject to the divide-by-two restorative, because each record in N details a unique vote figure.
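In cell terms – a sketch, assuming per the ranges used below that the data occupy rows 2 through 1793 – the entries would read something like:

=SUM(N2:N1793)

in N1794, totalling all candidate votes, and

=SUM(H2:H1793)/2

in H1794, summing the registrant counts and halving away their double-counting.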

If, then, we want to quantify turnout via the number of candidate votes/number of voters registered (nombre d'inscrits du bureau de vote – and yes, these are text entries that need to be converted back into values), we simply enter (somewhere) =N1794/H1794. That figure: 72.76%. The corresponding proportion for April 23: 82.71% – clearly, and notably, higher. The percentage of blank votes turned in this past Sunday: 4.44%, five times the .82% accumulated on the 23rd. A statement was thereby made.

Now if we want to break out turnout by arrondissement – say by dividing actual candidate votes by registrants per district – I named the arrondissement-bearing range C2:C1793 arr, jumped into the free cell S2 and entered 1, notched a 2 in S3, and filled down to S21, yielding the numbers 1 through 20. In T2, then, I tried this formula:

=SUMIF(arr,S2,N$2:N$1793)/(SUMIF(arr,S2,H$2:H$1793)/2)

The expression sums all candidate votes (in N) for the 1st arrondissement, per the entry in S2. It divides that total by all polling-place registrants in the 1st, a sum itself divided by 2 to offset the double-counting of each place (the votes in N don't require that adjustment, because each record reports a unique vote figure). After copying the formula down, I realized these turnout percentages:

run3

In line with the April 23 data, the 19th and 20th exhibit the lowest turnouts, at least as we've operationalized them here.

For an index of sorts of disaffection, we can then calculate the blank-vote percentages, e.g.

=SUMIF(arr,S2,O$2:O$1793)/2/(SUMIF(arr,S2,H$2:H$1793)/2)

Note that here both numerator and denominator divide themselves by 2, because both elements need to fend off the double-count spectre (and note the placement of parentheses). I get:

run4

Correlation between candidate-vote turnout and blank-vote proportions: -.54 – pretty substantial, and not shocking. After all, a preference for him, or her, bespeaks by definition a non-preference for no one.
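That coefficient is one CORREL away – a sketch, assuming the turnout percentages landed in T2:T21 per the fill above, and the blank-vote percentages in the adjoining U2:U21 (the U placement is my assumption):

=CORREL(T2:T21,U2:U21)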

So just to confirm, then: he won.

The French Presidential Vote, Part 1: The Numbers from Paris

27 Apr

The French electorate has spoken, even as it has neglected to finish the conversation. Round one of the country's double-staged presidential vote has left two candidates – Emmanuel Macron and Marine Le Pen – standing, and stage two's denouement, calendared for May 7, will halve the count.

While several media have plotted enlightening cartographic breakouts of the votes – e.g. the New York Times, Bloomberg, and the Guardian – my query about spreadsheet-available data for the election was answered in part by the www.opendataparis.fr site, whose team kindly pointed me to a workbook detailing the vote in Paris, at least. I'm pointing you to it here:

 Presidential Results 2017 Paris

And detailed it is, perhaps to a fault. After you've clicked past the necessary column auto-fits you'll probably determine that the relentlessly identical election date reference in column B can be ignored or deleted, for example, along with the no-less-invariant "Présidentielle 2017 – 1er tour" legend in A (note: the data here were extracted from a far larger vote compendium archiving many elections, in which the data in A and B are importantly differentiating). It seems in addition that the Numéro de la circonscription field in E is blank, save its header, though again this field is populated with other election data in the larger workbook parenthesized above. Moreover, I can't see the clarificatory virtues of the fields in F and G; they seem merely to re-present Paris' arrondissement (district) numbers in elaborations that restate the plainer entries in the numero d'arrondissement 01 a 20 field in C. If I'm missing something, I'm in the book.

Of course even if you do avail yourself of the arrondissement data in F, don't fret about all those Number Stored as Text messages signaled by the pesky green rectangles; you're not going to add or subtract arrondissements. But you may have a right to be bothered by the text-number conflation in some other fields, namely O and P, which appear to record blank and otherwise voided ballots, if my French can be relied on. For those, you can select the columns (even simultaneously) and tick the Convert to Number option draped immediately beneath Number Stored as Text.

Once we’re provisionally happy with the data as we’ve reconstructed them, we can ask the preeminent question: how did each candidate do in Paris? This pivot table should conduce toward an answer:

Rows: nom du candidat ou liste

Values: nombre de voix du candidat ou liste….

The same field again, % of Column Total

You may want to rename those stupendous Values titles, Anglicizing them perhaps to Number of Votes and Percent of All Votes.

I get:

Paris1

What shouts its headline at us, of course, is Paris’ extraordinary inattention to the candidacy of Marine Le Pen. The national runner-up shambled into a barely-perceptible fifth place in the city, outvoted seven-fold by Emmanuel Macron, whose 321,000 margin over Le Pen in Paris contributed one-third of his country-wide superiority (for the national numbers see this recapitulation).

But the data also support a closer look at the numbers, a look worth taking. We can, after all, size up the vote by Paris' 20 arrondissements:

Row Labels: nom du candidat ou liste

Columns: numero d’arrondissement 01 a 20

Values: Percent of all votes (by % of Column Total; remove Grand Totals, as they all add to 100%)

I get:

Paris2

The fact of inter-district variation is instructive and unsurprising, but in need of interpretation just the same. Note, for example, the domination of the posh 16th arrondissement by center-right aspirant François Fillon, and his fallback to 12.78% in the far more proletarian 20th, the latter (and neighboring 19th) captured in turn by the "hard-left's" Jean-Luc Mélenchon, who at the same time came away with but 5.40% in the 16th. Le Pen maxes out at 6.48% in the Left Bank's 13th, but her weakest showing plays out in the not-far-away 6th, also on the Rive Gauche.

The vote considerations could also be scaled to the very different arrondissement sizes and turnouts, e.g.

Rows: numero d’arrondissement 01 a 20

Values: Percent of All Votes (here a simple Sum; yes, the heading could stand a rewrite. You may or may not want to keep the Grand Totals off-screen here).

I get:

Paris3

Note the rough association between arrondissement number and vote totals, remembering that arrondissements spiral outwards and clockwise, with the 1st and 2nd mapped in Paris’ center.

And in view of all the intra-arrondissement, politico-demographic variation, a comparison of arrondissement turnout rates might be correspondingly informative; but that remit comes with a challenge – this one of spreadsheet, not political, provenance. The challenge owes to the way in which the vote records are organized: in addition to its candidate-respective total, every record names the pertinent arrondissement, a polling-place id – numero de bureau de vote 000 a 999 – as well as the number of registrants enrolled at that place – nombre d'inscrits du bureau de vote (and yes, you'll have to convert those data into numbers, too).

The problem, then, is to stem the repeated counting of the registrant total alongside the individual candidate votes per record. If, for example, ten candidates have won votes at a given ballot place, the registrant sum for that place will appear ten times in the data set, even as we want it to appear but once. Comprendez?

Here's what I think qualifies as a workaround. First, I range-named the arrondissement numbers in C arr, and the ballot-place ids in D bur. I next jumped into the next free column, R, named it Registrants, and entered, in R2:

=H2/COUNTIFS(arr,C2,bur,D2)

What's this formula doing? It's grabbing the registrant total for each record in H (once you copy the formula down the column) and in effect dividing it by the number of records citing that same ballot place – that is, the number of candidates who drew votes there, the count realized by the COUNTIFS. Thus, for example, if a given record reports 1300 registrants in a ballot place in which ten candidates competed, the formula divides 1300 by 10 in each record. Add these ten, and we're back to the 1300 registrants – but appearing only once.

If we then proceed to this pivot-table:

Rows: numero d’arrondissement 01 a 20

Values: Registrants

Sum of Number of Votes

paris4

The registrant totals should properly sum all the unique registrants enrolled in each arrondissement, as per the explanation above.

I next devised a calculated field, which I called arrtotal:

Paris5
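(If you're rebuilding that calculated field yourself, its formula presumably divides the vote field by Registrants – something on the order of ='nombre de voix du candidat ou liste…'/Registrants, with the exact field names as your workbook has them; the construction here is my surmise, not a transcription of the screenshot.)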

Thrown into the pivot table, and percentage-formatted accordingly, I got:

Paris6

The turnout proportions are impressively similar but not perfectly equivalent, ranging from the high of 86.63% in the tony 7th to the low of 77.49% in the 19th. There are surely correlations in there to be studied, but either way, the percentages far transcend the rates for US presidential contests.

Any thoughts about that, Monsieur Trump?

The World’s Airports: Checking In With the Data

9 Apr

Looking for a sure-fire ice-breaker to rustle that awkward silence? Try asking this one: how many airports roll their runways out for flyers worldwide?

I mean, if that question doesn’t electrify your partner into a conversational frenzy then…well I do go on.

Anyway, the answer as of this writing is 50,863. That's a pretty big number, gushing over and above a Quora respondent's count of 43,983, tallied in 2016. My source, David Megginson's Ottawa-based ourairports site, reports the larger figure, and had actually added 20 airports to the count by the time I accessed the data anew (the airports.csv workbook), after my having performed the analyses below. By the time you get to the site, who knows?

The data for the most part seem ready to go, and as such are answerable to a number of standard but useful questions, starting with a pivot-tabled distribution of the types of airports:

Rows: Type

Values: Type

Type (again, % of Column Totals)

I get:

air1

We see that nearly 65% of all airports qualify as small, the sorts at which you've likely not disembarked (I've discovered a definition of "small" minted by the United States' Federal Aviation Administration, denoting a facility receiving between .05% and .25% of the nation's boardings, but of course that understanding is US-specific). You also probably haven't touched down at one of the world's 20 balloonports, 17 of which get airplay in the US. And I for one am slightly surprised by the substantial representation of heliports in the census. But copters, remember, started hovering long before your mouse did.

You've also doubtless duly recorded the count of closed airports, raising the analytical/presentational question of whether that not-insubstantial cohort should be tolerated by the dataset, or otherwise grounded. By running a standard data filter on the set, filtering both for "closed" in type and "CA" in ISO_country, it can be seen that 629 – more than a quarter – of the shuttered airports are based in Canada. Whether that figure reflects a recording bias owing to the workbook's Canadian authorship I don't know, though I suspect something else is at work here. I suppose that's where you come in. (ISO country codes are issued by the International Organization for Standardization, and are set forth here.) My inclination is to remove these dormant locales, but only impermanently, via the storied fall-back: run a Find and Replace at "closed" in the C column, replacing every such instance with ZZ, sorting C by A to Z, and threading a blank row between 48541 and 48542, thus ruling the erstwhile closed entries out of the data, but keeping them on call for retrieval if you need them.

We could then break out airport totals by country, for starters:

Rows: ISO_Country

Values: ISO_Country (sorted largest to smallest)

ISO_Country (again, % of Column Totals)

I get:

air2

The United States’ remarkable numerical dominance of the airport count (and that’s Brazil at number 2, a perhaps no less remarkable standing as well) requires another look, or set of looks. If we introduce a Slicer to the table and filter there for medium and large airports the picture changes:

air3

The American proportion slims greatly, reflecting the weeding of small airports from the set. In fact according to Wikipedia only 16 of the world’s 50 busiest airports are US-sited, and if we slice for small_airport instead we get:

air4

America’s rule is restored, trailed again by Brazil.

Now if you want to break out American airports only, and by state, you'll have to direct your attention – and formula-writing acumen – to the iso_region field in J. We see that two-character state codes sidle the right side of the region-keying hyphen, e.g. US-NJ. To extract state codes, then, we could try this expression in S2, the first pertinent cell in the nearest free column (you'll also want to name this budding field, say US State):

=IF(I2="US",RIGHT(J2,2),"ZZ")

The formula inspects the country-bearing column I for US; if it finds it, it tugs two characters from the right of the kindred cell in J. Once the results are in place, try, for starters:

Rows: US State

Values: US State (Count, of necessity)

I get, in excerpt:

air5

Two loose ends among the data need to be tied. First, the -A item that’s sorted itself to the head of the labels clearly promulgates an error, or given its count, a pair of errors, likely of the data-entry kind. By double-clicking the count cell of 2 an insurgent sheet adduces the two wayward records, including the airports’ latitudes and longitudes. By copying and pasting those coordinates into this site, I was told that the airports in fact operate in Indiana and Georgia.

That realization in turn calls for a find-and-replace of the errant data, but it seems that feature just won't work with formula results, presumably because in our case you'd wind up replacing a formula with a hard-coded textual substitute, and Excel wants to protect you from the inconsistency. You thus could run a Copy > Paste Values atop S and then replace -A with IN for the record listing a 39-degree latitude, and exchange GA for the -A lined up with the 33-degree measure (you could also find the airports' respective Idents in column B and replace the state names in the records themselves).

The second restorative is a simple one, a filtering out of the ZZ, non-American airports. You can then sort the airport totals, largest to smallest:

air6

Ten percent of all US airports, then, serve flyers in Texas, with runner-up California counting half as many venues. Appoint type to the Columns area and you’ll see, in part:

air7

Note the proportion of heliports to all airports in California – though in fact a scan down the numbers turns up even higher ratios, if not absolute numbers, in Massachusetts, Connecticut, and New Hampshire, for example.

Guess those flyers are in it for the short haul.

NHS Prescription Costs: Big Money, Startling Consistency

30 Mar

Huge organizations, huge budgets, huge pressures, huge controversy. That chain of indissoluble links clenches around the latest pronouncement of the UK’s National Health Service (NHS), heralding a cutback on “low priority” items that are otherwise available, and affordable, at local pharmacies – e.g. gluten-free products and hay-fever drugs.

Of course that is wherein the controversy lies, and if you're interested in deeper contextual information about NHS prescription practices you'll find it among a set of NHS data sources, including this one. That site provides what it terms Commissioning Group Prescribing Data across four quarterly spreadsheets for 2016, and for England (but not Northern Ireland, Scotland, and Wales); by opening each of the four and copying and pasting each into a solitary workbook, you can consolidate the year's prescription activity into 154,000 rows' worth of data.

A large data set, but perhaps not optimally so. Each record is summative, after a fashion, totalling the number of prescriptions – or really, the number of products or items prescribed – for a medical speciality (Chapter Name), by a particular UK area team, e.g. (note the Items field):

pre1

Thus the first record counts 36 product-item prescriptions for gastro-intestinal treatments; and as such one might want to see that entry drilled down into 36 rows of information, resolving to prescription/product-level granularity. The analytical advantage there: a date-stamping of each and every item, and a corollary enabling of useful time breakouts, e.g., items by week and/or month. (In fact some of those data are here, but for 2015, apparently; their billion or so records would overwhelm a spreadsheet, of course. And while these could be routed into Power Pivot – an add-in we haven't explored here – the process might also exact a prohibitive download time, and so perhaps could be considered at another point.)

Note as a consequence that the 154,000 rows don't total the numbers of items; that figure emerges only after a simple summing of the data in the Items field, returning 1,091,518,463 items (or again, prescribed products) for the year. Note in addition that because the quarterly demarcations in the A column are geared to the UK fiscal year – April to March – and not to calendar 2016, what positions itself here as 4th Quarter data in fact recalls prescription activity for the actual first quarter of 2016, with 3rd Quarter numbers keyed to October through December of the year. In the interests of clarification, then, I'd run a series of Find and Replaces at the field data, something like this:

pre2h

And so on, so as to map the data fitly to calendar 2016.
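As for that 1,091,518,463 figure: range-name the Items data Items, say (the name is mine), and a one-liner confirms it:

=SUM(Items)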

Once that bit of administration is concluded a number of substantive breakouts suggest themselves. You may want to inaugurate the analysis, for example, by breaking out item totals by the broad Chapter Names in G:

Rows:  Chapter Names

Values: Items

Items (again, % of Column Totals)

I get:

pre3

You may want to comma-format the values, but the percentages convey the message in any case. We see that Cardiovascular and Central Nervous System account for nearly 42% of all product items. By way of follow-up we could refer Chapter Names to a Slicer and roll Section Names – categorical subsets of each Chapter Name – into Rows. If I tick the Slicer for Cardiovascular System I get:

pre4

Not surprisingly Hypertension and Heart Failure products head the list, though only very slightly ahead of Lipid-Regulating Drugs, which include statins.

Slicing for Central Nervous System products I get:

pre5

Analgesics (pain killers) and antidepressants contribute about 65% of the product total; the paucity of Obesity-treatment items is perhaps striking, however, given the incidence of the condition across the UK.

For another instructive measure, albeit a crude one, we could distribute item totals by England Area Teams. First, I'd tick the Slicer's Clear Filter button, and then delete the Slicer. Because I may want to return Chapter Name to the pivot table, clearing the Slicer's filter selection sees to it that all the Chapter Name items will reappear in the table when requested. Otherwise, the Slicer's filter will remain in force.

Then we can assemble this pivot table:

Rows: Area Team Name

Values: Items (Sum, % of Column Total, sorted Highest to Lowest)

I get:

pre15

The percentages are clear but crude, in view of missing population data that would serve as a necessary corrective. Drag Chapter Name into Columns and, in part, you’ll see:

pre8

You’d doubtless need to take a series of long hard looks at the percentages, which don’t always reflect the overall proportions throughout.

And as for a simple, slightly coarse assessment of item totals by quarter:

Rows: Quarter

Values: Items

Items (again, by % of column totals)

I get:

pre9

The numbers are notably constant, and hence just a little provocative. Are these comparable aggregates an expectable consequence of the outlier-smoothing might of very large numbers, or are doctors somehow phasing prescriptions evenly across the year? I suspect the former, but there’s an investigative tack for you.

And yes, the Row Labels haven’t filed themselves into chronological order, have they? They’ve rather assumed a purely alphabetical progression, simply because they’re labels. Excel can’t construe Apr-Jun as a unit of time, and so resorts instead to the standard sort protocol. If you want the entries to adopt a faux chronology you’ll need to right-click the Jan-Mar label (and take pains to click only that cell) and click Move > “Jan-Mar” Up.

And what of the item costs? That, after all, is what’s impelling the NHS to rethink its prescription allowances. As long as we’ve installed the quarterly labels we can substitute Actual Cost for Items in the Values area, formatting the results into British pound currency:

pre10

(Note the distinction in the dataset between Actual Cost and NIC, or Net Ingredient Cost. For a definitional clarification look here. Actual Costs nearly always tend to fall a bit below NICs.)

Again, the totals evince a remarkable sameness, no less provocatively. In any event, that's 8 billion pounds' worth of prescribed items.

Next we could substitute Chapter Name for Quarter, and drag Actual Cost into Values a second time and recondition via % of Column Total:

pre12

The totals are lucidly apparent, but of course don't break down into the kinds of items the NHS wants to exclude from its reimbursements. If you again cast Chapter Name into a Slicer and filter for the largest Central Nervous System category, you'll get:

pre11

More detail needed, perhaps. But for cost accountants, it’s a start.

New York City Film Permits – Coming to a Neighborhood Near You

19 Mar

Want to call a bit of New York real estate all your own? Just request a film permit from the city and, once granted, you'll be able to arrogate, as the city puts it, "the exclusive use of city property, like a sidewalk, a street, or a park" – at least for a day or two.

Quite a bit of this sort of short-terming goes on in Manhattan these days, and in the other four boroughs too; and you can arrogate the dataset devoted to all that activity, and for free, on the New York City Open Data site via this link:

https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p/data

Once you’ve gotten there, just click the blue Export button holding down the center-right of the screen, and select CSV for Excel once the Download As menu appears.

per1

The data commit to cell form (as opposed to celluloid) nearly 40,000 permits secured between January 1, 2012 and December 16, 2016 (as of this writing; the data are ongoingly updated) somewhere around the city, and not just of the cinematic oeuvre. Television, commercial shootings, and what are termed theatrical Load in and Load Outs – presumably scenery and equipment installation work for on-stage productions – are likewise inventoried.

As per so many open data renderings, a quick autofit of the sheet's columns is in order, and you'd also probably want to let the EventAgency and Country fields in F and M respectively fall to the cutting room floor; their cell entries are unvarying and as such, unneeded.

A first, obvious plan for the data would have us break out the number of permits by New York’s five boroughs. This straightforward pivot table should serve that end:

Rows: Borough

Values: Borough

Borough (again, show values as % of Column Total)

I get:

per2

Ok – the field headers need retitling, but the data speak clearly to us. What's slightly surprising, at least to me, is the smallish majority accruing to Manhattan. This native New Yorker would have thought that the storied island's mystique would have magnetized still more filmic attention to itself; but perhaps Brooklyn's latter-day repute for hipness packs a measure of ballast into the weightings. A very quick read of the Brooklyn zip (or postal) codes (column N) attracting the cameras turns up neighborhoods in the borough that border the East River, e.g., the perennially photogenic Brooklyn Heights (i.e. lower Manhattan skyline looming large in the background) and trending Williamsburg.

We could then proceed with a natural follow-on, this a distribution of permits by category:

Rows: Category

Values: Category (Count)

Category (again, by % of Column Total)

I get:

per3

(Note: for information on New York’s permit categories, look here.) Television permits rule decisively; presumably the repeated filming regimen demanded by weekly series explains the disparity.

Now let’s break out category by borough, something like this:

Row: Category

Column: Borough

Values: Category (% of Row Total)

And we should turn off grand totals for rows; they all add to 100%.

I get:

per4

Remember that the percentages read across. I’m particularly surprised by Manhattan’s relatively small plurality of movie and television shoots, and by extension Brooklyn’s relative appeal. What’s needed here, however, are comparative data from previous years; for all I know, nothing’s changed among the borough distributions. (Remember as well that the above table plots percentages, not absolute figures. Exactly one Red Carpet/Premiere was shot during the five years recorded here.) Note at the same time Manhattan’s huge differential among Theater permits, a predictable concomitant of its concentration of Broadway and off-Broadway venues.

And what of seasonality? We can answer that question easily enough by installing StartDateTime into Rows – grouping by Months and Years – and pitching the same field (or really any field, provided all of its cells are populated with some sort of data) into values via Count.

I get (the screen shots have been segmented in the interests of concision):

per5

Among other gleanings, October seems to present a particularly attractive month for filmmakers, though the reasons why could, and probably should, be searched. Note in addition the spike in permit activity in 2015, and the fairly pronounced retraction in permits last year, at least through December 16.

But permit counts don't tell us about the duration of the shoots, which naturally vary. Those data are here, though, and are welcomingly precise. To calculate permit lengths, all we need do is fashion a new field in the next free column (its position depends on whether you've deleted the superfluous fields I identified above), call it Duration, and enter in row 2:

=D2-C2

That paragon of simplicity yields a decimal result, quantifying the proportion of the day awarded the permit holder. Copy it down the Duration column and then try this, for starters:

Row: Category

Values: Duration (Average, rounded to two decimals).

I get:

per6

Remember we're working with fractions of days; if you want results expressed in hourly terms you'd need to supplement the formulas in Duration with a *24 multiplier.
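That is, the row 2 entry would instead read:

=(D2-C2)*24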

We see a notably consistent range of average permit time allotments across categories with the obvious exception of Theater, whose set-up needs appear to require substantial permit times. Remember that a duration of .63, for example, signifies about 15 hours.

And if you simply add the durations (at least through December 16), the aggregate permit-day count evaluates to 32,843.97. Divide the result by an average 365.25-day year, and 89.92 years' worth of permit time report themselves, across a jot less than five years. That's a lot of New York that's been declared temporarily off-limits to the public.

Now you may also want to distribute permit prevalence by New York’s zip codes, but here a pothole obstructs the process. Because the areas requisitioned by the permits often straddle multiple codes, that plurality is duly recorded by the ZipCode(s) field, e.g. 11101,11222,11378.

But a workaround is at hand, though not the one I'd first assumed would have done the deal. What does seem to work is this: First, range-name the ZipCode(s) field Zip, and in an empty cell – say R1, or somewhere on a blank sheet – enter a sample zip code, say 10001. Then, say in R2, enter

={COUNT(IF(FIND(R1,Zip)>0,1))}

That's an array formula, its entry requiring Ctrl-Shift-Enter, a keystroke that programs into the result those telltale curly braces (you don't type the braces yourself). The formula conducts a FIND of every cell in Zip for evidence of 10001; when it finds it – somewhere in the cell – it posts a 1 to the formula, after which all the 1's are counted. In this case I get 1950 permits having freed up a smidgen of acreage somewhere in the 10001 code, in the neighbourhood of the Empire State Building.
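If the Ctrl-Shift-Enter ceremony irks, a SUMPRODUCT rendition – my variant, not strictly necessary – should return the same count with an ordinary Enter:

=SUMPRODUCT(--ISNUMBER(FIND(R1,Zip)))

Here ISNUMBER flags the cells in which FIND succeeds, the double negative coerces those TRUEs into 1's, and SUMPRODUCT adds them up.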

You can next copy and paste all of New York’s zip code listings into a blank spreadsheet area from here, entering New York City in the search field. Enter the above formula with the appropriate cell reference in the first of the listed codes and copy all the way down. If you sort the results largest to smallest, these codes return more than 1000 hits (note I’ve retained the web site’s neighbourhood field):

per7

That generic “Brooklyn” 11222 code topping the list points to the borough’s Greenpoint neighborhood, one of those districts hard by the East River, as is Queens’ Long Island City.

The formula that doesn't work, contrary to my original surmise, is

=COUNTIF(Zip,"*"&R1&"*")

That's in effect the one that had served us so well in my various key-word assays of Donald Trump's tweets. It doesn't work here because, for example, the zip code 11101 entered singly is vested with a numeric format; 11101,11222,11378, on the other hand, stands as a textual datum, and COUNTIF accordingly regards the two entries as different; and because it does, a single COUNTIF won't find both instances of the 11101s. But FIND does.

Got that? I think that’s a wrap.