English School Academies: A Data Examination

21 Apr

To be an American is to be clueless about cricket. That’s a given, but my cross-cultural obtuseness extends to the British educational system too, whose program I just can’t get with.

Ok, so I’ve just dangled a preposition, but I told you – I’m confused. For one thing, when I hear the word “course”, I’m thinking of a three-credit offering in which you get to sit in the back of the lecture hall and text for 50 minutes. But over here a course is the whole shebang – the three (no, not four) year university stint keyed to one, and only one, subject (no majors here, either).  Colleges here aren’t degree-granting institutions – they’re sort of post-high-school vocational trainers, so if you want a degree you go to university.  The public schools are private, and you don’t proctor exams, you invigilate them. And anyway, they call it “programme”.

You’re getting the idea. And that’s why all this talk nowadays about English school academies and the government’s espousal thereof has me thinking dreamy, intelligible thoughts about baseball instead. But I think it goes something like this: academies are schools that have formally freed themselves from the financial aegis of local governing authorities, receiving their monies instead from the national government, and as such they need not hew to England’s national curriculum (you don’t get that in the States either). Still, they’re subject to inspections by the government’s Ofsted agency (we’ve written about them before), and proponents say that by vesting more control in on-site head teachers and snapping the link to the national curriculum the learning experience will be better served. Naysayers contend that the academy idea (to which the government states all schools in England must commit by 2022) will in effect privatize a sector, with large, profit-making providers biting off more than they can chew (for a straightforward résumé of the pros and cons see this piece).

Either way, there are spreadsheets out there that will teach you more about the academy experience than any American can, including this workbook enrolled here on the data.gov.uk site:

The Open_Academies sheet (which the Cover_sheet tab generically calls Sheet 1) catalogues, and with impressive recency, the 5,272 academies in force in England (and that identity attaches to the UK country called England, and not Wales, Scotland, or Northern Ireland). But if you plan on pivot tabling some of the data in it you’ll first need to put some space between header row 6 and the caption text sitting atop it in 5.

Among other things, the book defines what it calls an academy route (column M), its two alternatives denoting a Converter or Sponsored status. The former references schools that in effect opted for academy status on their own, while members of the latter cohort were bidden, by what Wikipedia terms a “government intervention strategy”, to submit themselves to the guiding organizational hand of a government-ratified sponsor (you’ll want to read the notes above the data for some edification here).

And once you’ve attended to the row 5-6 necessity, the data seem more-or-less ready to go. You could start by simply breaking out academy numbers by local authority via a pivot table, and sorting these from Largest to Smallest:

Row Labels: Local Authority

Values: Local Authority

Once sorted, I get this in excerpt:


It’s Essex and Kent, two local authorities close to London, that lead the academy pack. Next maneuver Academy Route into Column Labels and treat the Values to the % of Row Totals makeover (and turn off Grand Totals). You’ll pick up considerable variation in the Converter-Sponsored ratios across the authorities.
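For readers who’d rather script the same maneuvers, here’s a minimal pandas sketch of both tables, the count by authority and the % of Row Total makeover. The rows are invented stand-ins, not values from the actual workbook:

```python
import pandas as pd

# Hypothetical rows standing in for the Open_Academies data
df = pd.DataFrame({
    "Local Authority": ["Essex", "Essex", "Kent", "Kent"],
    "Academy Route":   ["Converter", "Sponsored", "Converter", "Converter"],
})

# The first pivot table: a count of academies per local authority
counts = df["Local Authority"].value_counts()

# The % of Row Total makeover: Converter/Sponsored shares per authority
ratios = pd.crosstab(df["Local Authority"], df["Academy Route"],
                     normalize="index")
print(ratios)
```

The `normalize="index"` argument is what turns raw counts into along-the-row proportions, the scripted cousin of Excel’s % of Row Total.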

You may also want to learn something about the average wait time between academies’ application submissions and their approval/opening dates, with the understanding that not every school need formally apply (again, check out the notes topping the sheet, and note as well that every opening is dated on the first of a month). But because many of the schools feature no application date, representing that absence instead with an asterisk, things start to get slightly messy. In fact, we broached this issue, and one workaround, just a few posts ago, with an array formula that averaged paired values – provided, of course, that the values are there to be paired. The problem peculiar to these data is the asterisks in the Application Date field; subtract an asterisk from a value and you get an error that undermines the formula. But run a find-and-replace along these well-intended lines:


launched in the hopes of substituting nothing (i.e., a blank that the subsequent arithmetic treats as zero) for the asterisks, and discover that it won’t work – because the asterisk behaves like a wildcard here, and thus the find-and-replace above will replace the contents of every selected cell with nothing. What you want, then, is this:


(Thanks to this thread for the tip.) Here, the tilde orders Excel to treat the asterisks as a character to be searched in its own right, and not as a surrogate for any and all characters.
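The same escaping principle crops up in any regex-driven replace; here’s a small Python illustration (the sample values are invented), in which `\*` plays the role of Excel’s `~*`:

```python
import re

# Application-date entries as the sheet presents them: real dates plus
# asterisks standing in for missing values (hypothetical sample)
raw = ["01/09/2012", "*", "01/04/2013", "*"]

# "*" is a metacharacter in a regex too, so it must be escaped to be
# matched literally; regex \* does the work of Excel's ~*
cleaned = [re.sub(r"\*", "", value) for value in raw]
print(cleaned)  # ['01/09/2012', '', '01/04/2013', '']
```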

Once you’ve supplanted the asterisks among the application dates, this array formula awaits:


I get 222.65 days – the average wait time for an academy application to secure its approval. Again, remember that this formula really reprises the expression we detailed in the April 6 post.
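The logic of that array formula, average the date differences only where both dates exist, translates into a few lines of Python (the dates below are invented for the sketch):

```python
from datetime import date

# (application_date, opening_date) pairs; None marks a school that
# never formally applied, i.e. the asterisk cases (invented sample)
records = [
    (date(2012, 3, 1), date(2012, 9, 1)),
    (None,             date(2013, 4, 1)),
    (date(2014, 1, 1), date(2014, 9, 1)),
]

# Average the waits only where an application date exists
waits = [(opened - applied).days
         for applied, opened in records if applied is not None]
average_wait = sum(waits) / len(waits)
print(average_wait)  # 213.5 for this sample
```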

And because I suspect the academization process might encourage a touch of school rebranding, here’s another array formula – a considerably simpler one, you’ll be pleased to hear – that counts the number of schools that underwent a name change when they won academy status:


The formula compares pairs of values in column C (Academy Name) with those in column L (Predecessor School Name(s)). If the entries are disparate, the comparison assigns a 1 to that outcome, whereupon the 1’s are summed, yielding 2599. That is, a bit less than half the schools came away with a new name after having earned academy status.
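In script form that comparison is nearly a one-liner: pair the two name columns and sum the mismatches (the names below are invented):

```python
# Hypothetical academy/predecessor name pairs from columns C and L
academy     = ["Oak Wood Academy", "Elm Park School", "Birch Academy"]
predecessor = ["Oak Wood School",  "Elm Park School", "Birch Academy"]

# True/False comparisons sum as 1/0, just like the array formula's 1s
renamed = sum(a != p for a, p in zip(academy, predecessor))
print(renamed)  # 1 name change in this sample
```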

And now I’m thinking about opening my own academy – The Expat Institute for the Study of Cricket and Other Conundrums.

My branding expert is cool to the idea.

The Speed Museum’s Recruitment Matrix: Exhibiting Now

14 Apr

When a New York Times story sports a headline with the word “spreadsheet” in its banner, it’s nothing less than man bites dog, I tell you; and so when that electrifying, once-in-an-eon comet streaked across my disbelieving screen…well, I just had to read on.

And so I did. True enough, the text beneath reported the diversity-driven recruitment strivings of the Speed Art Museum in Louisville, Kentucky spearheaded by its director Ghislain d’Humières. In that connection the museum has fashioned a spreadsheet-driven matrix for check-listing board of governor and trustees candidates; and a couple of emails later the museum kindly dispatched the sheet to me, and thus to you, here:

 Speed Matrix (1)

(Needless to say, candidate names have been displaced from their niches in the E and F columns.) I should add that the accompanying note authored by museum Executive Assistant Lauren Baldwin allows that “Many of you have asked if we can share the matrix…”, a numerical aside that surprised me. I suspect that demand emanates chiefly from museum officials interested in how the Speed paddles across its application pool, and not from spreadsheet fanciers like your correspondent.

Be that as it may, I was given more than a little pause by a first sighting of the sheet, particularly the wide berth of white spread to the left of the A column. How, I wondered, does white space get beyond the sheet’s leftmost edge? At last I understood that the workbook opens to us in Page Layout view; click View > Normal in the Workbook Views button group and convention is restored. As for the precipitously-angled headers in row 1, they’ve been subject to the Format > Cells > Alignment treatment:


You’ll also perhaps note that cells E1:AR1 have been told to wrap their text, even as that instruction really only impacts E1:F1.

Moving along with matters presentational, the category headings atop the sheet – e.g., AGE, ETHNICITY, GENDER – are text-bearing shapes that have simply superimposed themselves upon the cells beneath them. One assumes in addition that the burgundy header color codes in M1, N1, AJ1, and AK1 communicate some special standing imputed to those fields, but my conjecture has to wind down right there. And the extension of the all-borders cell motif to row 1048576 seems to overstate the case. I’m also not sure what the OLE_LINKS range name ascribed to F94 is presently doing.

But what is the workbook meaning to teach us about these board of governor candidates? The matrix appears to simply record a 1 for any attribute an aspirant possesses, leaving any inapplicable qualities blank; but if you think the follow-up intent is simply to add up each candidate’s numbers and compile a highest-total-wins table, I’m not so sure. The answer depends in part on the plans the Museum has for the data.

For one thing, you’d be right to be struck by the absence of SUM calculations in the empty AS column that would or could have drummed up a candidate’s aggregate score, and by the presence of SUMs in row 70 that realize column field totals (e.g., numbers of women and Hispanic Americans in the pool; row 106 sums a second bracket of applicants, those placing themselves in consideration for board of trustee positions, and row 107 sums the columns in their entirety). So what are the 1’s doing?

For one thing, every position seeker must receive a 1 somewhere among each of the demographically basic parameters of Age, Ethnicity, and Gender, and so that ineluctably constant sum of 3 could be disregarded. At the same time, however, that very constancy does the vetting process no harm – again, because everyone scores 3. But if the Speed Museum wants to differentiate the attractiveness of men and women, for example (legal considerations aside), the undifferentiated gender 1 teaches the Museum nothing – be you an M or a W, you get one point either way.

That’s one issue. But I’m also bothered by the fission of gender codes in columns P and Q into Female and Male, a data-organizational issue we’ve taken up many times previously. Had Mr. d’Humières asked me – and he most assuredly didn’t – I’d have made room for M and W in the same, unitary Gender field, the better to pivot table the distribution of the two items. And I’d have done the same for ethnicity, e.g.


And as for Age, I don’t see why its four fields couldn’t have likewise been concentrated into but one, and ready to receive a candidate’s actual age. Why not enter 37 instead of a 1 beneath Age 36 to 50? Better pivot tabling awaits.

A different suggestion, this too decidedly unsolicited. The fields could be weighted, in the event some of them are deemed more contributory to hiring decisions than others. A new row could be jammed between the current rows 1 and 2 and, assuming we cling to the sheet’s existing field structure, we could assign a weighted value to each of the matrix fields in what is now G2:AR2 (and we’ll call that range Weight).

For example – if the Museum is in fact keen to amplify women’s board representation, a 3 could be socked into P2, the Female field of the Weight range, even as Q2 – the men’s parameter – could inherit a 1. In other words, enter the generic 1 in a female candidate’s field and she gets 3 points. That device sounds politically crude, and while an organization might not want to commit that sort of maldistribution to document form, it’ll work operationally. By the same token, a 1 in a particular candidate’s Philanthropist field (in column AG) might evaluate to a 5 via the weight in AG2, with other relatively more or less desirable qualities weighted accordingly.

Each candidate’s score would then be entrusted to a SUMPRODUCT tabulation. Thus the candidate in row 3 would be portrayed in AS3 (the next available column) in these terms:


Copy the formula down AS and each candidate’s field evaluation is multiplied by that field’s weighted score in row 2 (the weight range), and then totaled (of course the blank candidate cells multiply the respective Weight cells by 0). That’s what SUMPRODUCT does (yes, it’s an array formula, albeit an off-the-shelf one), promoting a finer-grained portrait of each hopeful.
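SUMPRODUCT’s pairwise multiply-then-total is easy to mimic in script; a sketch, with the weights and the candidate’s attribute row both invented for illustration:

```python
# Field weights (the hypothetical row 2) and one candidate's 1/0
# attribute row -- both invented for illustration
weights   = {"Female": 3, "Male": 1, "Philanthropist": 5, "Age 36 to 50": 1}
candidate = {"Female": 1, "Male": 0, "Philanthropist": 1, "Age 36 to 50": 1}

# SUMPRODUCT: multiply each attribute flag by its weight, then total
score = sum(candidate[field] * weight for field, weight in weights.items())
print(score)  # 3 + 0 + 5 + 1 = 9
```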

But remain with the matrix as it stands now and the sheet doesn’t do that much heavy lifting for the decision makers, because they’ll still need to winnow candidates with the same number of 1’s. But of course, that may be precisely what the decision makers want to do.

Either way, I’ll suggest another field for the Experience area – how about Spreadsheet Guy?




Presidential Appointments: Confirming the Data

6 Apr

As of this writing 2633 days have elapsed since Barack Obama at last received the go-ahead to bring about the change he thought we needed, and toward that end he’s proposed 1607 appointments to his administration, or one every 1.64 days.

That’s a lot of change, and some bright young things are doing a whole lot of background checks besides. And if you want to see if someone has snuck your name into the nomination pool you can make sure by downloading and vetting the


workbook sourced on the https://open.whitehouse.gov/ site, a none-too-vast holding area for 27 datasets that’ll doubtless, eventually, have more to offer.

But what the appointment data has to offer is pretty interesting all by itself, once you’ve instituted the change you need – namely an autofit of the columns. For one obvious but enlightening take, we could pivot table a count of nominations by agency (note that nominations among the data run through March 3 of this year):

Row Labels: Agency Name
Values: Agency Name

Sort the results, and you’ll learn that more than 22% of the appointments were earmarked for the Federal Judiciary:


Dotting that roster are the two Obama Supreme Court nominees, Elena Kagan and Sonia Sotomayor (but not current nominee Merrick Garland).

And an equally unassuming table:

Row Labels: Name
Values: Name (Sort these highest to lowest)

will shuffle those 1607 into 1542 actual individuals, 56 of whom were thus nominated at least twice. The most peripatetic, current Secretary of the Treasury Jacob Lew, turns up in the data set five times, though four of the mentions, for United States Governor (not to be confused with the elected head of one of the US’ fifty states), appear to have been simultaneous.

As for the timing of appointments, a simple consideration of the numbers by year should tell us something about the pace of nomination activity, along with perhaps the trajectory of position availability:

Row Labels: Nomination Date (the three blanks could be filtered)
Values: Nomination Date (grouped by Years)

(Note: Excel 2016 – which I’m currently using – will automatically group these data by Months, Quarters, and Years, whether you want it to or not. Clicking the ungroup button will return the data to the old default, by which each date staking the field will be uniquely cited.)
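The by-Years grouping has a compact pandas analogue; the nomination dates below are invented stand-ins:

```python
import pandas as pd

# A handful of invented nomination dates
dates = pd.to_datetime(pd.Series(["2009-03-02", "2009-07-15",
                                  "2014-01-09", "2015-06-30"]))

# Count nominations per year -- the pivot table's group-by-Years
by_year = dates.dt.year.value_counts().sort_index()
print(by_year)
```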

I get


Even given its rough distributional curve the numbers nevertheless skew indisputably across the second half of Obama’s tenure, that yet-to-be-completed interval accounting for more than 58% of all of the president’s appointments. Explanations, anyone? Again, those ups and downs may merely transcribe a record of as-needed, emerging position vacancies, but that one’s a call for the pundits, or maybe the statisticians. (And don’t ask me to explain why Excel saw fit to discharge its Show Values As button from its PivotTable Tools > Analyze ribbon. A space-saver? Not a time saver. Show Values As is available on the shortcut menu that a right-click atop the data will summon, though.)

And if you were thinking of grouping the years above into 2009-2012/2013-2016 shares, start rethinking. Clicking the standard pivot-tabling Group Selection option yields, not unexpectedly:


(Again, Excel 2016 defaults to that triple-parameter selection.)

What’s missing of course is the By field that makes itself available to conventional number grouping, e.g.:


Work with dates, however, and you don’t seem to be able to get to that screen shot from here. The simplest (though not necessarily the only) way out: entitle the free, adjacent column J Year and enter in J2:


Copy down J and four digits’ worth of years resolve into their cells. But those yearly representations – 2009, 2010, etc. – are not bearers of the date-formatted quantities you may think they are. Format 2009 in date mode, for example, and what you get is 1 July 1905 – because that date stands 2009 days away from Excel’s date baseline of January 1, 1900, the point of embarkation against which all subsequent dates are counted. January 1, 2009, on the other hand, is the number-formatted 39814 – its distance in days from the baseline. Bottom line, then: the YEAR function returns a useful number – but one that doesn’t situate the year in its real, sequential space.
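Excel’s serial arithmetic is easy to verify in Python. Day 1 is 1 January 1900, plus one extra day for the phantom 29 February 1900 that Excel believes in:

```python
from datetime import date

def excel_serial(d):
    """Serial number Excel assigns to a date (1900 date system)."""
    serial = (d - date(1899, 12, 31)).days
    # Excel wrongly treats 1900 as a leap year, so every date after
    # 28 Feb 1900 sits one day further along the serial line
    return serial + 1 if d > date(1900, 2, 28) else serial

print(excel_serial(date(2009, 1, 1)))  # 39814, as in the post
print(excel_serial(date(1905, 7, 1)))  # 2009, the year masquerading as a date
```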

In any case, once the stand-in dates take their place in the J column you can move to this pivot table:

Row Labels: Year (filter the 1900 entries, which signify blank cells, and group thusly, overwriting the 1900 in the Starting at field with 2009):


Values: Year

I get


That’s what I was looking for.

And given the data’s bookended Nomination Date and Confirmation Vote fields, we should be able to say something about the average wait time between a nominee’s submission and disposition, once we acknowledge the latter field’s 179 blank cells (one assumes at least some of these denote pending candidacies – and you can get 179 by trying =COUNTIF(G2:G1608,"")). Of course the AVERAGE function ignores blanks, and so these absences won’t impact the calculation.

The by-the-book answer to our wait-time question would have us mark out a new field in K, call it something like Vote Wait in Days, and enter in K2


While one could entertain alternative logical tests, the one above occurred to me first. The expression asks in effect if both a record’s Nomination Date and Confirmation Vote fields are filled; if so, a simple subtraction of the earlier from the later date follows. If not, the cell receives nothing (and yes, it would be possible to simply omit the if-false condition entirely from the formula, whereby any cell’s failure of the logical test would return the Boolean FALSE in the cell; but remember I’m trying to go by the book).

Copy that expression down K, and simply write, somewhere,


And you should realize an average nominee wait time of 143.35 days. You could then arrange this pivot table:

Row Labels: Agency Name
Values: Vote Wait in Days (Average)

Among other things you’ll meet up with ten agencies whose blank cells have inflicted a #DIV/0! alert upon their breakout, but if you sort the averages Highest to Lowest you’ll restore some visual order to the table; and by doing so you’ll observe considerable variation in the wait times, while granting that the agency numbers can be very small.

But I had another idea – to devise an array formula, a one-celled power expression that would yield that same 143.35. Now because array formulas often model themselves after the syntax of the standard, record-by-record expressions such as we composed above, I originally advanced this follow-on:


In fact this formula looks a good deal like the record-specific


that we drew up above – a formula that worked. Here in the array context all the counts and subtractions operate on ranges, or arrays, but these are placed in the same relation as in the earlier conventional formula, with the AVERAGE function bracing the whole thing.

But my array formula didn’t work, apparently because this segment


subverts the array dynamic between the E and the G arrays, by dint of the trifling comma that seems to have split the two apart. That sabotaging detail dragged me to a Plan B:


This one works, and it works by restoring an operational relationship between E2:E1608 and G2:G1608, one that satisfies the array formula’s peculiar processing needs. The formula in effect asks to average the wait times of rows in which the summed E and G entries exceed the entry in E – another way of asking if there’s anything in G.
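In Python terms, the Plan B formula averages the waits of rows whose E-plus-G sum exceeds E alone, i.e. rows where G holds something. The serial numbers below are invented for the sketch:

```python
# Invented nomination (E) and confirmation (G) date serials; 0 stands
# in for a blank Confirmation Vote cell
nominated = [100, 200, 300]
confirmed = [250, 0, 420]

# e + g > e is just a roundabout test that g is non-blank
waits = [g - e for e, g in zip(nominated, confirmed) if e + g > e]
average = sum(waits) / len(waits)
print(average)  # (150 + 120) / 2 = 135.0
```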

It works, and I’m floating my name for that open Ambassador to Redmond position.

Ebola, After the Fact: Liberia Survey Data

30 Mar

You probably haven’t heard terribly much about Ebola lately, and that is all to the good. The disease’s disappearance from the headlines has everything to do with the extraordinary, near-total defeat of the African scourge, a triumph leaving the fourth estate with nothing to report.

But data about Ebola remain on hand, notably on the Humanitarian Data Exchange (HDX) site, a venue to which we’ve turned in the past for information about the epidemic when the situation was considerably more parlous. One of the holdings of epidemiological note comprises a spreadsheet-driven compendium of survey data elicited from respondents in the Liberian capital of Monrovia between December 2014 and January 2015:


With 75 fields the data set is decidedly multi-variate, and you’ll want to turn to this clarification of its fields for some deep background:


You’ll also probably want to autofit the sheet’s columns and delete the empty SURVEY_INFO field in A, its heading serving in effect to merely title the actual survey. And in fact the BACKGROUND name in F appears to do nothing more than partition two phases of the survey’s questionnaire from each other, corresponding to the darkened row 4 in the Survey sheet. Indeed – that color code of sorts on the survey sheet references similarly blank phase-separator fields in columns P, AJ, AX, and BR; as such they could all be escorted away from the data set.

But I said could, not must; and that’s because, data lacunae notwithstanding, none of these fields do any appreciable harm to the data set and its pivot tabling prospects. It suffices for a field to merely declare its heading, even as its cells beneath remain closed to entry. Type a heading and you have a field. The fields are purposeless, to be sure, but a pivot tabler need simply exercise his/her discretionary powers to ignore them.

And once you’ve finalized that design decision you can move on to some more conventional and substantive considerations. I’m generally a pushover for date/time fields whose contents are truly dates and times, and the cell holdings in the esdate parameter in the B column qualify, in each and every one of its 1,572 rows. But the informational gain in drilling down or grouping interview dates and times is uncertain; what, for example, might we learn from the available fact that more interviews were conducted between 11 and 12 in the morning than during any other hourly stretch? Not a great deal, I suspect.

But there are of course other findings to be found. A simple pivot table assay of gender:

Row Labels: Gender (filter the 10 blanks)

Values: Gender (% of Column Total)

tells us that 55.38% of respondents were women, a disproportion that raises the usual sampling questions (some of which are considered by HDX here; the Liberian sex ratio verges on 1-to-1, by the way – see this page).

Then direct the religion field to Column Labels (and turn off Grand Totals for columns. Note the percentages read down the columns):


Why male Muslims predominate among that faith’s sub-sample presents itself as a question – a good question – that I can’t answer, though the absolute Muslim total – 153 – is overwhelmed by the precisely 1,400 Christian respondents (the Other field comprises 9 respondents).

You also may want to do some thinking about the ehowget field, once you understand what it’s saying. Its associated survey question asks respondents to identify the ways in which they believe Ebola could be contracted – but it doesn’t assume the respondent had incurred Ebola, as I had originally supposed. Put that clarification in place and we see 126 causal permutations, along with 25 Others and 144 Don’t knows. (Code 3 appears truncated among the data, and reads in full Body fluids of infected person. And code 6 has lost the e in “surface”.) Because so many of the replies cite multiple causes, the data scan less than tidily; but even if we were to number the permutation proxies 1 through 126 in the interests of brevity, that retouching would be presentational, and not substantive; those numbers would remain incomparably nominal.

What could perhaps be brought to these data is a count of the number of times each causal attribution appears among the responses, e.g., the frequency with which “6-Contact with infected surface” is alleged as at least partially contributory. Here good old COUNTIF can be pulled off the shelf and put into useful service.

First, I named the ehowget range in Q2:Q1573 causes, and then settled into a blank cell, say BZ1. Here I’d enter 1- , the reference for the Vomiting symptom, including the hyphen in order to ward off any other 1 that might, for whatever reason, have found its way elsewhere in the cell. I’d then enter, in CA1:


We’ve been here before, and not so long ago, where I explained the workings of the above expression. The divisor realizes the proportion of all responses featuring coded symptom 1, 34.92%. I then entered 2-, 3-, etc. down BZ and copied the formula accordingly down CA. I get:


We see that codes 3 and 5, bodily fluids of infected person and touching dead bodies respectively, were most widely held by respondents to be implicated in Ebola transmission, at least in part. Code 7, Mother breastfeeding her child (that phrase also truncated in the data set) received the fewest attributions. Now one could go ahead and apply the same formulaic treatment to the next field, esignsymptom in R – that is, as detailed in A17 of the survey workbook, How can you know that someone has Ebola?
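The COUNTIF tally amounts to a substring test across the responses; a Python sketch, with the multi-cause answer strings invented in the style of the ehowget field:

```python
# Invented multi-cause responses in the style of the ehowget field
responses = [
    "1-Vomiting 3-Body fluids of infected person",
    "3-Body fluids of infected person 5-Touching dead bodies",
    "1-Vomiting",
    "2-Coughing 6-Contact with infected surface",
]

# Share of responses citing cause 1 -- the hyphen wards off stray 1s,
# just as in the worksheet version
share = sum("1-" in r for r in responses) / len(responses)
print(share)  # 0.5 in this sample
```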

But take heart – it’s all past tense.

Canadian Bilingualism Data: Comme ci, comme ça

16 Mar

They’ve opened their data in Canada, too, right here; and with many of the holdings committed to XLS/CSV mode you’ll have plenty to do if your investigative interests take you to the northerly latitudes (but don’t be fooled by my facile directional allusions; here in London we’re hundreds of miles closer to the Arctic Circle than Montreal).

In the course of a preliminary poke through the site’s shelves your trusty mystery shopper slid this item into his bag (not to worry – I paid for it):

English-French bilingualisms by different geographies

a census of French, English, and French/English speakers in that officially multi-lingual part of the world. (And you may want to think about the workbook title.)

The workbook exhibits a number of organizational curiosities, some of which we’ve seen before, more or less; but however familiar, they beg our attention.

First note the bipartite headings overlooking the data set’s fields, spurious tandems clearly forced upon the data by Canada’s determinedly bilingual stance; but politics aside, the stacked pairs cause problems for the data. You can’t have two field headings in two rows; the lower of the two will be treated as a record, for filtering, sorting, and pivot tabling. Couldn’t a unified heading have been agreed upon, e.g. nom géographique, anglais/geographic name, English for column A (and you’ll note the apparently misspelled “georaphie” in columns C and D)? Whatever the strategy, one of those rows needs to be deleted.

Now observe columns A and B, and follow with an additional look at C and D. It appears as if these twosomes comprise the same information, albeit in translation. That is, B furnishes a French rendition of the corresponding data in A, and nothing more – and D does the same for C. Again, politics have trumped design considerations here, because the data are overwhelmingly identical across the fields; indeed, scan the twinned cells in A and B and you’ll discover that only 37 of the location names are spelled disparately across the two languages – and that’s out of 4982 records.

And how do I know that, by the way? I’ve been so informed by the simple, neglected, and nifty Row Differences feature housed in the Go To dialog box, just an F5 away. Select columns A and B, tap F5, click the Special… button, and tick Row Differences:


That swift routine selects all the rows in which the A and B data don’t agree.
But at least these excesses perpetrate no substantive injury upon the data – because, for example, any pivot tabling of the records need simply call upon the fields expressed in your desired language, and just as simply ignore the others. Nevertheless, mark the all-but-redundant fields as one of those organizational curiosities.
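Row Differences has a ready scripted analogue: walk the paired columns and keep the indices where they disagree. The place names below are a tiny invented sample:

```python
# Paired English/French location names from columns A and B (sample)
english = ["Canada", "Quebec", "Nova Scotia", "Ontario"]
french  = ["Canada", "Québec", "Nouvelle-Écosse", "Ontario"]

# Rows where the two renditions differ -- the Row Differences selection
differs = [i for i, (a, b) in enumerate(zip(english, french)) if a != b]
print(differs)  # [1, 2]
```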

But the next curiosity is injurious indeed, and is visited upon the selfsame fields above. The data set has inflicted a massive, repeated double count of the numbers, by recurrently conflating concentric territorial levels among the records. That is, by insisting on aggregating the data by different geopolitical strata – all in the same data set – the same numbers get counted over and over as a consequence, and we can’t really have that, can we?

For example, the totals associated with the very first data row – identifying Canada nationwide – post a collection of what are in effect grand totals, broken out into the various lingual distributions. Thus the census reports that 28,360,245 Canadians all told speak English – but of course that very number is variously subdivided among the subsequent records. As a matter of fact, if you simply sum the values in the L (speakers of English or French) field, the total – a nice round 283,695,000 – rather enormously overstates Canada’s demographics, by a factor of about eight. And the problem is bewilderingly compounded by row 315, which nearly precisely duplicates the data in that first row.

But a reasonably satisfactory way to sift the jurisdictional apples from the oranges is at hand. By initiating a pivot table and swinging the georaphie [sic] type, English field (I’m working with the English field titles) into a Slicer and ticking, by way of illustration, Canadian Heritage region, and loading the Values area with speakers of French and speakers of English, you’ll get


The resulting breakout captures the true respective populations, because we’ve confined the analysis to just one of those geopolitical levels. Click province this time in the Slicer and you should see


Note the equivalent grand totals. Again, by cordoning ourselves into only one demo-stratum we’ve isolated the proper national figures (although note that clicking the census metropolitan area item yields smaller numbers, presumably because those data devote themselves to larger Canadian urban areas only, and not all localities in the country).

A few other clarifications need be issued. First, don’t be confused by the relation between the speakers of English, speakers of French, and speakers of English or French fields. The first two don’t add up to the third because the intervening field, English-French bilinguals, counts Canadians who are enumerated in both of the single-language fields. Thus, for example, this formula:


should return the value in the L column.
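That relation is simple inclusion-exclusion: bilinguals sit inside both single-language counts, so they get subtracted once. A sketch using the nationwide English figure quoted above, with the French and bilingual counts invented as placeholders rather than drawn from the census:

```python
# Nationwide English figure from the post; the French and bilingual
# counts below are invented placeholders, not census values
speaks_english, speaks_french, bilingual = 28_360_245, 9_900_000, 5_800_000

# Bilinguals appear in both single-language counts; subtract them once
english_or_french = speaks_english + speaks_french - bilingual
print(english_or_french)
```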

But if you’re looking for a full-bodied denominator that would allow you to figure the percent of English or French speakers as a fraction of all the denizens populating each row, you need to add the respective L and M values, the latter field subsuming Canadians who speak neither English nor French. The problem – though granted, a far smaller one than the multi-strata complication we described above – attaches to the more than 3,000 hyphens jamming the far left of scads of cells in the M column. These may signify NA, data-not-available counts, but might very well, given the smallness of the row populations, simply stand for zero. Either way, a simple find-and-replace should palliate the problem – once you realize that the actual entries in the relevant cells are [space]-[space][space][space]. Nail that down, name column N All, enter


in N3, and you would appear to be in business. You could then advance to a series of pivot table aggregates by this or that level, e.g.

Slicer: georaphie type, english (select, say, province)

Row Labels: geographic name, english

Values: speakers of English


And in order to develop a province English-speaking aggregate, try this calculated field (which I’ve called Engsp):
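What a calculated field like Engsp computes can be sketched with invented figures (the province names are real; the numbers below are illustrative, not the census values):

```python
# Hypothetical (province, speakers of English, All) rows -- illustrative
# numbers only, not the census figures.
rows = [("Quebec", 3_500_000, 8_000_000), ("Ontario", 12_000_000, 13_000_000)]

# A pivot table calculated field like Engsp divides the summed fields
# within each row label: speakers of English over the All denominator.
for province, english, all_total in rows:
    print(f"{province}: {english / all_total:.1%}")
```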


I get


Note the smallness of the Quebec proportion. C’est tout.

Now does that make me a speaker of English and French?

Indianapolis Police Complaints: Doing Justice to the Data

9 Mar

Seekers after news about open data – and some of the data themselves – would do well to unstrap their backpacks at the OpenData-Aha site, a digest about what’s up in this multifarious, crackling, unbounded domain (transparency check: OpenData-Aha has cited Spreadsheet Journalism).

Follow some of the Aha story links and the odds are highly encouraging that you’ll find yourself inside some governmental entity’s spreadsheet, e.g., the record of civilian complaints lodged against members of the Indianapolis, Indiana police force dating from July 2012 through September of last year:

Indianapolis citizen complaints

(You’ll note that the 2012 data describes incidents for July of that year only, and the 2013 complement features six non-contiguous months of the year, for whatever reason).

The sheet, developed under the aegis of the law-enforcement data developer Project Comport, is informatively paired with a field-defining legend that should anticipate some of your questions about what the records are trying to tell us; but you’ll still need to perform a measure of house cleaning in order to neaten the dataset, e.g. the traditional column autofitting. And again – because we’ve seen this before – you may well want to cleanse the occuredDate (sic) entries of their superfluous times, all of which report midnight (the standard Short Date format dropping down from the Number Format menu in the Home ribbon would work to eradicate these). You’ll also have to contend with missing data here and there, but absent a compelling need-to-know submission to the Indianapolis police those blanks won’t go away (or at least not really, but read on). I also can’t account for the wholly unattended census tract field, which, pun unashamedly intended, remains unpopulated.

In addition, you’ll need to inspect some of the spokes in umbrella field categories such as district, which Project Comport understands as “…the District…or Branch, [my emphasis] such as Homicide or Robbery, that the officer was assigned to at the time of the incident”. Thus that field seems to subsume both locational and organizational demarcations, raising the question of how certain coding determinations in the field were finalized. For example – a complaint directed at the Covert Investigations section nevertheless and necessarily has a geographical referent – the contested event happened somewhere. By the same token, the “shift” field departs from chronology when it numbers the Crash Investigations Section among its items. We should be slightly bothered by the apples/oranges conflations here.

But a far broader conceptual concern need be confronted too, though we’ve met up with something like it in a previous post. Each complaint receives a 32-character id – but in a great many cases the complaint comprises multiple sub-complaints. Complaint 453f6ca045a7b7ea292cb3daebe83e7d, for example, evinces 12 individuated grievances filed by a 22-year-old black male against four different officers. Do we regard these multiple remonstrances in unitary or aggregated terms? In fact, the data register 407 different complaint ids among their 1,004 records, these pressing their allegations toward the 378 unique police officers represented in the officer identifier field. How are all these to be counted?

After all – if a particular officer is implicated in three offenses in the same incident, how does that plurality factor into any pivot-tabled summary? On the other hand, if the same officer is charged with the selfsame three delicts, but instead on three discrete occasions – and as such incurs three incident ids – how should that batch of events be reckoned in turn?

Of course it would be perfectly legal to try and play it both ways – to treat the data as they stand in their default, quasi-redundant id reportage for some analytical purposes, and at the same time to isolate a parameter in order to consider its entries singly.

For example – if we wanted to learn something about the demographics of officers against whom the complaints were preferred we could click anywhere in the data and conduct a Data > Remove Duplicates winnowing of the records, ticking only the last, officerIdentifier field. Go ahead, and those 378 officer records will have been pared from the source data set (I’d immediately run a Save As routine here, the better to preserve the slimmed set in a new workbook and thus leave the original data intact).
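In Python terms (the officer identifiers and races below are invented), Excel’s Remove Duplicates keeps the first record bearing each ticked value:

```python
# Hypothetical complaint records: (officerIdentifier, officerRace) --
# both columns invented for illustration.
records = [("off1", "White"), ("off2", "Black"), ("off1", "White"), ("off3", "White")]

# Excel's Remove Duplicates keeps the first record bearing each ticked
# value; the same winnowing by hand:
seen, unique_officers = set(), []
for rec in records:
    if rec[0] not in seen:
        seen.add(rec[0])
        unique_officers.append(rec)
print(len(unique_officers))  # 3 unique officers remain
```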

You could then try out this initial pivot table:

Row Labels: officerRace

Values: officerRace (% of Column Total; filter the four blank records; by definition they can’t help us. We also don’t need the Grand Total, and I’ve retitled the Values field name):
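That % of Column Total display amounts to dividing each race’s count by the overall count; a hand-rolled sketch with invented entries:

```python
from collections import Counter

# Hypothetical officerRace entries surviving the duplicate removal
# (categories echo the field; the mix is invented).
races = ["White", "White", "Black", "White", "Black", "Hispanic"]

counts = Counter(races)
total = sum(counts.values())
# The pivot table's % of Column Total display, by hand:
for race, n in counts.most_common():
    print(f"{race}: {n / total:.2%}")
```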

I get:


Of course, those proportions are like a stick with one end, absent comparative data of the general racial makeup of the Indianapolis police force. But those data do appear to be available, by clicking the Download Officer Demographic Information for the IMPD link here. Those police force percentages read thusly:


The black-white officer ratio is nearly identical – remarkably so – to the complaint breakouts, intimating that white officers are no more likely to be complained about (although you may want to complain about my dangled preposition).

Or by extension, one could run another Remove Duplicates on the original data, this time by checking incident id (and assuming that only one citizen is enumerated per id, a premise that appears to hold) in the interests of learning more about the age and racial identifications of complainants. But that intention is slightly weakened by the smallish but real possibility that some individuals had levelled more than one complaint; if that be the case, our duplicate removal might nevertheless count some persons multiple times. That’s because unlike the officer identifiers, each of which clearly points to a unique officer, incident ids distinguish unique events, and not the persons embroiled in them.

In any event, if we proceed with the above Remove Duplicates and indulge the assumption that each complaint id associates itself with a unique Indianapolis resident, we could try:

Row Labels: residentRace

Values: residentRace (% of Column Total; again, conceal the blanks and the Grand Total).

I get:


Yes, you’ve caught that superfluously-spaced White entry (the second one, totaling one record); again, you need to direct a find and replace at the delinquent datum, finding White[space] and replacing with White. One also has a right to wonder about the large Unknown bin, a proper matter for further investigation.
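The trailing-space trap, and its find-and-replace remedy, in miniature (the entries are invented):

```python
# Hypothetical residentRace entries; "White " carries the stray trailing
# space that splits the White bin in two (values invented).
entries = ["White", "Black", "White ", "Unknown"]

# The find-and-replace remedy: find "White " (with the space), replace
# with "White".
fixed = [e.replace("White ", "White") for e in entries]
print(fixed.count("White"))  # 2
```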

In any event, map the above numbers to Indianapolis’ black-resident proportion of around 28% (another source puts the number at around 24.5%) and we see the maldistribution is real but not quite overwhelming; not shocking, but surely notable.

Still, a look at all 1,004 incident records, irrespective of the doubling up of incident id, could develop other useful conclusions. After all, each of the 1,004 sub-complaints requires some legal disposition or other (as identified in the “finding” field; again, the Project Comport field legend explicates the four finding possibilities).

Now in this case, however, the blanks stand in a different relation to the empty cells in other fields. These signify complaints on which a ruling has yet to be issued, and as such might well merit counting. I’d thus select I2:I1005 (remember we’re working with all the records this time) and spin through a find and replace, finding the blank cells and replacing them with say, “Pending”.
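A sketch of that blank-to-Pending substitution, with empty strings standing in for the blank cells (the finding mix is invented):

```python
# Hypothetical finding entries, with empty strings standing in for the
# blank cells (the mix is invented).
findings = ["Sustained", "", "Not Sustained", "", "Sustained"]

# The blank-to-Pending substitution, so unresolved complaints are
# counted rather than dropped:
findings = [f if f else "Pending" for f in findings]
print(findings.count("Pending"))  # 2
```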

Then a simple pivot table awaits:

Row Labels: finding

Values: finding (% of Column Total, no Grand Total)

I get:


Note only 10.56% of all complaints were sustained, resulting in a definitive incrimination of an officer, though a Not Sustained judgment does require a reporting of the officer’s name to what is called the Indianapolis department’s Early Warning System.

Lots of bouncing back, then, between data sets. But I’m not complaining.

London Bus Injury Data: Be Careful

1 Mar

The real-time, tweet-borne reportage filed by New York Times reporter Lucas Petersen on his ride on a Megabus that caught fire somewhere between Chicago and Milwaukee got me to thinking about bus incidents in general, and the data that might impart a quantified perspective on such events.

In an interview after the fact Petersen allowed that he couldn’t find terribly much information on the matter, and truth be told neither could I – with the conspicuous exception of London’s Transport for London site, which makes a series of quarterly-compiled spreadsheets on bus accidents in its jurisdiction available. After copying-and-pasting the most detailed dataset from each of last year’s quarters into a solitary, all-year sheet and custom sorting the records by Date of Incident and then by Route and Operator, I’ve stored it here:

TFL bus incident data 2015

With 5,715 injuries – 14 of them fatal – or more than 15 per day, the numbers are substantial, though of course I don’t know how they compare with those of other systems.

Turning to the data, note the set’s slightly ambiguous title, recording incidents “…which required hospital attendance and treated on scene…” I would assume that “and” meant “or”, in view of the “Injuries treated on scene” item featuring in the Injury Result Description field and the attending explanation thereof posted in the Introduction tab.

But there are larger analytical concerns, apart from your interest in autofitting the columns and wondering why the text in the Injury Result Description field is variably sized. Note that the data on each accident victim receive a record all their own, as they should. But at the same time that necessary individuation drills the data down too far, de-coupling injuries which may have been incurred in the same accident – something an investigator might very well want to learn. For example, it’s entirely possible – even probable – that the two injuries detailed in rows 14 and 15, on route 11 in the borough of Westminster, were sustained in the same incident. But that surmise, however plausible, should not have to be entrusted to any third party; had an incident ID been assigned to each incident instead, all ambiguity would have been allayed, enabling an accurate incident count in addition to the injury count.
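Absent an incident ID, one rough workaround is to treat injuries sharing a date, route, and borough as a single candidate incident – a surmise, not a certainty. A sketch with invented records:

```python
from collections import defaultdict

# Hypothetical injury records lacking an incident ID:
# (date, route, borough) -- all invented.
injuries = [
    ("2015-01-01", "11", "Westminster"),
    ("2015-01-01", "11", "Westminster"),
    ("2015-01-01", "73", "Camden"),
]

# Rough workaround: treat injuries sharing date, route, and borough as
# one candidate incident -- a surmise, not a certainty.
incidents = defaultdict(int)
for key in injuries:
    incidents[key] += 1
print(len(incidents), "candidate incidents for", len(injuries), "injuries")
```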

On the other hand all the dates populating the Date of Incident field appear in effect to have been rounded to the first of every reporting month, thus grouping the data there with undue coarseness. I would assume that actual incident dates are available somewhere, and I’m not sure why these would have been withheld from the source worksheets. A correlation of incident counts by day of week would afford a standard, but instructive, chronology-driven finding; but it isn’t available here.

We could also ask about the very substantial number – about 32% – of victims whose sex is unidentified. One would have thought that so prominent and relatively unambiguous an attribute would have found its way into nearly every record, and I can’t explain why, in nearly a third of the cases, it didn’t.

Moreover the Victims Age field offers categorical, not numeric, entries, and again I’m not sure why. Apparently no accident site protocol here insists that investigators develop such information, though attending hospitals almost surely would have it.

In any case, and shortcomings granted, some simple but useful pivot tables are there to be set, e.g.

Row Labels: Victims Age

Values: Victims Age (% of Column Total)

I get:


Speaking ideally, the considerable Unknown population would need to be (somehow) plumbed for any skew that would demand that we not ignore it. Put otherwise, if it could be demonstrated that the demographics of the Unknowns emulate those of the knowns we would be able to subject them to a measure of disregard. But by definition, we can’t.

We could also try this:

Row Labels: Victim Casualty Category

Values: Victim Casualty Category (% of Column Total; and you don’t need a Grand Total)

I get:


(Note the rewritten Values title. Remember you can simply type in the Labels areas.)

You’ve noted the apparently redundant “Cyclist”; but again, Row Labels enumerate their items uniquely. In fact, the more populous of the two Cyclist entries is trailed by a superfluous space. The simple remedy, again: Run a find and replace down the Victim Casualty Category field, finding Cyclist[space] and replacing with Cyclist.

We see the great bulk of victims are passengers, their predominance prompting a corollary concern with victim category and Incident Event Type. Drag that field into Column Labels and:


No, I don’t know how either Activity Incident Event or Safety Critical Failure are defined, but passengers account for 96.22% of all slips/trips/falls – presumably in the course of getting on or off a bus, though anyone who’s ridden the London system knows that its double-deckers can lurch and bolt like bucking broncos in mid-journey, too. Pedestrians – 7.89% of all victims – account for 29.74% of collision victims, a not particularly surprising fraction. Nor, one supposes, is the 61 of 121 assaults directed at drivers, whose aggregate injury percentage stands at only 6.23%.
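That Column Labels crosstab amounts to counting category/event pairings; a minimal sketch with made-up records (the labels echo the field items, the counts are invented):

```python
from collections import Counter

# Hypothetical (Victim Casualty Category, Incident Event Type) pairs;
# the labels echo the fields, the records are invented.
rows = [
    ("Passenger", "Slips Trips and Falls"),
    ("Passenger", "Slips Trips and Falls"),
    ("Pedestrian", "Collision Incident"),
    ("Driver", "Assault"),
]

# The Column Labels crosstab, by hand: count each category/event pairing.
crosstab = Counter(rows)
for (category, event), n in sorted(crosstab.items()):
    print(f"{category} | {event}: {n}")
```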

And for another simple metric we could break out incidents by London borough:

Row Labels: Borough

Values: Borough (again, % of Column Total)

In excerpt I get:


Westminster’s clear accident margin suggests, but only suggests, that the teeming borough simply supports more bus routes, but pull Victim Casualty Category into Column Labels and you’ll also learn that 10.42% of all pedestrian injuries were Westminster-sited, suggesting in turn that the foot traffic there is profuse (and again, I’m not sure about the None London Borough item).

All of which conduces toward this recommendation – watch where you’re going – on and off the bus.

