NY Regents Exam Data, Part 1: Multiple Choices

28 Mar

We haven’t met, but I can say with a jot of confidence that I’ve likely done something you haven’t – taken a Regents exam. Exams.

I’m not clipping the achievement to my lapel as a badge of honor, you understand, just stating a biographical matter of fact. The Regents – a staple of the New York State educational system in which I spent more than a little time – comprise a series of what are termed exit exams; pass enough of them and you walk away with a high school diploma of the same name. Your correspondent took his share of Regents, his scores emobdying the “scatter” in scattergram, but no matter; I took them, and I’m a better person for having done so.

But before you file a Freedom of Information Act request to verify those abnormally-curved results you may want to review a larger, ultimately more interesting record of Regents attainments, the dataset supplied by the New York City open data site that summarizes in grand form the Regents scores of students statewide for the years 2014-17. It’s a big file, needless to say – so big you’ll need to download it yourself – and its 212,000 or so records have a lot to say about the testing profile of New York’s high schoolers.

It also has a lot to say about spreadsheet organization, more particularly the juxtaposition of column G, Demographic Category, to H, Demographic Variable. Those columns/fields in fact identify a series of putative fields and field items respectively; and as such, G’s contents could, at least in theory, have been more conventionally structured into discrete parameters, each owning a column all its own.

But that prescription calls for an elaboration. Consider this fledgling pivot table drawn from the Regents data:

Rows: Demographic Category

Demographic Variable

Values: Total Tested

I get:


Note that that all the Total Tested subtotals are equivalent, intimating that the five Demographic Categories cleave the same population into an assortment of cross-cutting attributes, and thus summing the same student count five times. The by-the-book alternative, again, would have assigned each Category to an independent field, such that an interest in test achievements by Gender, for example, would require the user to simply drag Gender into Rows. As it stands, however, a Slicer (or filter) would have to grease the process, e.g.:


Moreover, casting the potential field-bearers into mere item status beneath that singular Demographic Category banner appears to obviate a good many cross-tabulating possibilities, e.g.: a breakout of tests by both Ethnicity and Gender. How, for example, are we to learn how black female students score on the Regents when both attributes are lodged in the same field, and so must occupy the same label area?

But at the same time, the spreadsheet designers had to contend with a supplementary problem that overrides my challenges – namely, that the upgrading of say, Ethnicity and Gender to field standing would appear to require that data present themselves in individual record form, e.g. each student’s performance on each test; and those 2,000,000 scores/records would burgeon beyond Excel’s data-accommodation space.
In any case, there is indeed lots to learn structural complications notwithstanding, and we could begin by starting coarsely – by calculating the average overall Regents scores by year:

Row: Year

Values: Mean Score (Average, formatted to two decimal points)

I get:


(Note that one record, attaching to the Island School in Manhattan, exhibits a nonsensical entry for its year. By filtering and comparing the Island School data, it appears that the record belongs to 2017.)

The averages are remarkably similar, though I’d venture that, given the 2,000,000-pupil universe, the one-point differential distancing the 2015 and 2017 scores is significant. Remember that the Regents passing score is pegged at 65, suggesting that the test designers got their threshold right.

But those averages aren’t quite definitive, and for a couple of connected reasons, one subtler than the other. The first recognizes that the student double-count pinpointed above proceeds in effect to compute the average scores multiple times, because the records operate under the steam of different demographic categories and numbers per record. Thus the mean average for the Gender category alone – which nevertheless contains all students – is likely to depart at least slightly from the mean average for Ethnicity, which likewise contains all students. If, for example, we reintroduce the Slicer for Demographic Category for the current pivot table and tick All Students, we’ll get:


The differences from the initial pivot table are very small but evident, again because the Total Tested numbers per the All Students records don’t perfectly line up with the Total Tested per-record numbers for Ethnicity, for example – because each record receives an equal weight, irrespective of its Total Tested value.

And it is the matter of weighting that points its arrow directly at the second question, one we’ve seen elsewhere (here, for example). The per-record mean averages ascribe an identical mean score input to each record, even as the test taker numbers vary. And that bit of record democratization vests greater, relative influence to the smaller numbers. The result, again: a possible skewing of the averages.

But because Column I enlightens us about the actual test taker numbers, we should be able to derive a simple calculated field to impart a corrective to the weighting problem – once we deal with the very large number of records – about 75,000 – that just don’t report their score results. These are the entries dotted with the “s” code, and our formula needs to ignore them as well as their test taker numbers, which are stated, after all.
So here’s what I did. I headed next-available column S TotalPts and entered, in S2:


That expression means to assign 0 for any “s” datum, and otherwise multiply the record’s mean score by its number of students. (Important note: absent Means Scores almost always associate themselves either with a Total Tested number of five or less, or with entries possessing the SWD or Non-SWD values. (SWD stands for Students With Disabilities.) One assumes that both types exclusions are justified by reasons of confidentiality; but remember that the five-way count of students in the data set should subsume most of the SWD takers anyway, via their inclusion in the All Students item.)

After copying the formula down I titled the T column CountedStu and dropped down a row, wherein I entered:


The formula asks if the relevant row in J contains an s. If so, a zero is returned; otherwise, the number of test takers cited in I is returned.

I next devised a simple calculated field, ActAvg (for actual):


That field can now be made to work with any other field whose items are banked in Row Labels, e.g. by substituting ActAvg for Mean Score in Values (and leaving the All Students Slicer selection in place). I get:


It’s clear that our weighting refinements have uncovered a “true” higher set of averages, and that continue to sustain the near-point improvement from 2015 to 2017.

But again, there’s quite a bit more to learn from the scores. But I’m asking now to recess for lunch – and yes, I raised my hand.


U. S. Vaccine Data: A Different Treatment

14 Mar

It’s alarmingly philistine of me to say so, but I know what I like; and I like the heat maps that track the vaccine-driven advances in the United States wrought against infectious diseases, schematized by graphics boffins Tynan DeBold and Dov Friedman and featured in the Wall Street Journal in 2015. One of them looks like this:


These are of course heat maps in the broadened sense, plotting epidemiological movement across time instead of territory, and tiling their data into tightly-bound mosaics, whose colors blanche as the diseases recede. In short, a well-told story of stirring medical progress, the data for which is contributed by Project Tycho of the University of Pittsburgh and available for download there (you can sign into it for free if you’re affiliated with an educational institution). By contrast, think what a line chart bearing 50 data series would look like.

It should be added that the Tycho site itself hardwires a heat-mapping utility into its pages, in what could be construed as an ancestral precursor of the DeBold-Friedman outputs, e.g.


It is worth asking, then, about the refinements DeBold and Friedman commended to the Tycho charts, and the respective reasons why.

But in any event the map designers don’t need my everyman’s encomium; their depictions have won at least two awards – one conferred by the Global Editor’s Network (GEN), the other by the Kantar Information is Beautiful judges. The former’s site declares the maps were “…wildly popular on social media as well as with statisticians and graphics editors weighing in on how they would’ve approached this project”.

Sounds good to me, but note for the record that the Y axis above records 26 state names, pulling up 24 (really 25; the District of Columbia – the nation’s capital that possesses extra-state status – is likewise counted) short of the full American complement. Those names – at least as they’re represented above – have been invested with a font size that misaligns them with the 51 rows of data; the names are simply too large for the row heights that capture the data, a mismatch that conduces toward the follow-on question as to why the particular states above were earmarked for display, to the exclusion of the others.

You’ll also observe that the maps’ cell widths vary. Contrast the DeBold-Friedman above with this one:


Because the above map – and its 26 states – time-stamps its data from a later point of inception its cells are dilated, a refit that widens it equivalently to the other maps. Do these discrepancies matter (along with the ten-year intervals marked in the first map that have been halved in the second)? I’m not sure.

And something else about the maps provoked a thought or two. On his website Dov Friedman tells us that he used Excel “to aggregate over 100,000 data points. The data was then plotted on heatmaps using Highcharts. All sections were templatized with handlebars [sic].”

Now Highcharts and Handlebars are two applications with which I am not familiar (I told you I was a philistine); But in the course of perusing and admiring the maps a renegade idea gatecrashed my cerebellum: could the maps be emulated – at least more or less – with Excel alone?

I think the answer is yes, at least more or less; it seems to me that, with the disease data dropped into the Values area, the heat maps could be made to emerge from a pivot table that would crosstab state names and years against the data, whose numbers could be comparatively scaled through a series of conditional formats.

In receipt of that self-issued marching order, I proceeded to download the measles data for five states to see how the heat maps might be framed in Excel-only mode. The numbers were redirected into a pivot table (not a straightforward program; the download assigned each state’s figures to an independent field, and as such had to be harnessed inside a single State parameter via the Get & Transform utility. In addition the “-“ cell entries marking absent data had to be quantified, but I’ll spare you the particulars).

Once the above reconstructive work was carried out the resulting pivot table proceeded pretty straightforwardly, something like (depending on how you’d name the fields):

Rows: States

Columns: Year

Values: Incidence

And once the table was put in place, the cells would be subjected to the color-graded conditional formats described above (in tandem with white borders), and while I didn’t do a primo job of replicating the DeBold-Friedman scheme it’s all about the concept, after all.

Now of course the pivot table will line its upper border with the year (Column area) entries, even as the maps in question underscore the values with those data. The simplest route toward emulation would call for a blue-on-blue formatting of the years above, or a simple hiding of their row, after which the years could be typed below. The Vaccine introduced line was almost surely drawn, and that’s what I did. The color legend


Is a tricky one; it could perhaps be rendered though an equally simple coloring of consecutive cells per the conditional formats, though it appears that the DeBold-Friedman band comprises tints of varying width. By way of resolution, four cells or so could be allotted per 1k, to support multiple colors spanning the same interval (note that the data for the states I selected don’t report numbers in the 2-4k range.)
In light of the preceding, I’ve gotten this, by way of a demo:


The larger point has been made, one hopes. Proper equivocations aside, the map approximates the representations of the DeBold-Friedman efforts, but again without recourse to applications beyond Excel. Indeed – for me the phase of the task that threw up the most resistance was the foundational pivot table itself. But once you toss off that gauntlet, the map beckons – because the map is the pivot table (and there’s no need by the way to resize the numbers in values for visibility’s sake, because the conditional formats will obscure them anyway).

Thus it seems to me that my infographical proposal works, albeit on a slight lower end from the DeBold-Friedman portrayals. And while I’m not claiming it’s the stuff for which awards are bestowed, I’m working on my acceptance speech anyway.

US Politicians’ Worth: A Wealth of Disclosures

20 Feb

If the citizen’s right to know shades into a gossipy prurience, the right to know nevertheless prevails; thus if the disclosures of the estimated net worth of American lawmakers compiled by the Center for Responsive Politics and reposed in the Enigma public data site titillates or outrages, one assumes the greater good is being served just the same. Those estimates are here:


First, don’t be fooled by the 2014 date stamp, as I was at first blush. After all, the worth list, sorted as it is highest-to-lowest, is headed by the irrepressibly well-heeled Mr. Trump, who if memory serves only pulled into the White House in 2017, whether he wanted to or not. The 2014, then, must point to the date of the estimates instead (and as such, the year field and its relentlessly uniform 2014s could be deleted, though keep in mind that Enigma stores Center worth data for previous years as well), and if a good many of the numbers appear to you in scientific-notational form (e.g. 1.07E+.08) that’s just one more way by which Excel means to tell you that their columns are presently too narrow. Just auto-fit away, as usual. (Note that the origin field in H is likewise uninformative, comprising nothing but the Center for Responsive Politics entry, and the source field in I is even sparser, amounting to nothing but its header.)

Get past those preliminaries and you’ll need to remind yourself about the codes informing the chamber field: E – Executive, S – Senate, H – House of Representatives, and J – Judiciary. That latter code estimates the worth of the nine now-serving Supreme Court Justices, an interesting set of disclosures about a coterie one might deem slightly above the financial fray; the S and H signifiers apparently name and enumerate each and every current Senator and Congressperson, a nicely comprehensive complement. The twelve executive entries, however, stand in a different relation to the dataset; they identify presidential hopefuls for the 2016 election and not office holders, with the exception of the winner.

Perhaps the first attention-grabber in the data to seize your lapels is the trio of worth estimates lined up for each politician. The Center for Responsive Politics explains that the data spring from disclosure forms completed by office holders which “do not require exact values”, but rather “the range of value into which an asset falls”. The Center thus filled ranges with lower and upper estimates and averaged the two into the avgvalue field, those computations in turn exhibiting considerable plus-minus variation. Indeed, 77 of the politicians declared a negative lower-end asset value, with another 11 submitting negative upper estimates. Of course, our data can’t take that question further.

In any case, working with the averages we could broach a first obvious question, namely how worth associates with chamber:

Rows: chamber

Values: avgvalue (formatted to two decimals with a comma)

I get:


Remember of course that the Executive universe is small, numbering but 12 cases. As such the 1.482- billion-dollar average estimate for Donald Trump slingshots the mean skyward, in spite of negative-asset claimers Scott Walker, Rick Santorum, and Martin O’Malley. Whether that means erstwhile Maryland governor O’Malley truly has less money than you and I is perhaps a query better put to his accountant, but the data maintain as much.

But the larger, if ingenuous, conclusion urged upon us is the reality that American national politicians – including Supreme Court justices – are overwhelmingly multi-millionaires. If the averages of the averages are credible, Senators possess over $10,000,000 worth of assets, with Representatives checking in a $7,500,000, accumulations that of course can’t ascribed to their $174,000 salaries. At the same time, however, variation within the Congress is massive, including outlier Darrel Issa, a Republican Congressman from southern California worth, per his average estimate, over $436,000,000.

For a sense of the dispersion of wealth, then, you can drag avgvalue in to values again, setting it to Stddevp (for the operational difference between Stddev and Stddevp, look here):


When the standard deviations exceed the averages you’re dealing with a lot of variation, even after discounting the ten-figured Mr. Trump.

Now for perhaps the juicier question: Might net worth vary palpably by party affiliation? Conventional wisdom would respond in the affirmative, i.e. Republicans and their putative sympathies for the haves could be expected to have more as well than the latter-day Blues. But either way, the dataset should enable an answer of sorts to the question.

The marching order here is to isolate party, as it’s referenced by its initial in the name field. Because that initial appears (almost) in every case immediately after the open parenthesis alongside each surname, we can take over the empty source field, rename it Party, and write in I2 (assuming you haven’t deleted the extraneous year field):


That expression seeks the character position of the open parenthesis and adds one to it, enabling MID to grab the next single character, which denotes political party. Copy the formula down the field and that should work, with nine exceptions – the Supreme Court justices, whose party memberships have been withheld from the data, an exclusion that imposes a #VALUE message upon their formulas. The workaround: rewrite the formulas thusly:


The nifty IFERROR function returns the default, nested formula result unless it evaluates to a error, in which case it’ll interject the defined if-false result, in the above case J, for Judiciary.

Once that revamped expression is copied down the field the pivot table follows:

Rows: Party

Values: avgvalue (Average, with appropriate formatting)

I get:



The thesis is confirmed, apparently (the I represents the two independent legislators informing the list, by the way, one of whom is Bernie Sanders). Republicans are worth on average 47% more than the (very) relatively proletarian Dems; but toss Stdevp back in to the table and we see:


The mammoth standard deviation among GOP sectarians has a good deal to do with the Trump billion, and Issa’s 436,000,000, too. Thus, if one deletes the row bearing Trump’s data and refreshes the pivot table, you’ll see:


Check it out.  The Republicans, we see, in actuality average a smaller net worth and standard deviation than Democrats. Go further and delete all 12 Executive entries – none of whom, with the exception of the incumbent, are election officials, after all – and run the pivot table once more:


Here, Democrats possess 9.94% more worth on average than Republicans.

Now that’s unconventionally wise.

Tweets of a Stable Genius

11 Feb

Notch November 7, 2017 as a watershed tickpoint on your presidential timeline. It was the day on which Twitter doubled its message character allotment to 280, an allowance that induced shudders in both the twittersphere along with vast terrestrial swaths: it mightily enriched the opportunities for the President of the United States to tweet his mind to you and me.

I’ll leave the rest of the editorials to you and me, but the character enhancement begs a few novel questions we could instructively put to the downloadable spreadsheet of Mr. Trump’s communiques offered up (for a small fee) by twdocs.com, the go-to site to which I’ve gone on more than one occasion. My latest requisition of @realDonaldTrump tweets (accept no substitutes) pumped out the incumbent’s latest 2805 entries through February 8, affording us exactly three months’ worth of post-280 streams of consciousness, with the earliest of these bearing the time stamp October 24, 2016. Again, I think you’ll have to pay for the sheet yourself if you want to click along.

A first question, then, an obvious one: has the widened message space encouraged Trump to issue longer tweets? And a second, slightly subtler follow-on: might 280 characters inspire longer, but fewer of them? (Note that this submission considers features of the commander-in-chief’s twitter proclivities over and above his preferred key words, a concern which predominates in my previous posts on the matter).
An answer to the former hinges itself on the LEN function, with which I’ve overwritten the contents of the H column, declaring nothing but the name Donald J. Trump in each of its populated cells. Renaming the field LEN I entered in H7:



and copied down. (I’m not quite sure what the field name a in the otherwise empty G column is doing. On the one hand, the name suffices to enable the pivot table to enfold the fields following G, but I’m not sure why the field need be there at all.)
The next move is slightly trickier, because the defining November 7, 2017 breakpoint doesn’t lend itself to a graceful pivot table grouping, or at least I don’t think it does. That is, grouping the date data in the A column by Months and Years will naturally turn out conventionally-demarcated months, e.g. Nov and Dec, and won’t group the months beginning with the 7th. A more elegant realization would tap into a pair of array formulas, i.e.: name the LEN field range Length and name the data in Created At (in the A column) Date. Enter the 11/7/2017 boundary date anywhere, say H1, and enter, say in G1:


The formula seeks out all dates prior to the benchmark November 7 and averages their tweet lengths. Extended to two decimal points, I get 126.80. Then enter in G2:


This companion expression averages the character length of the President’s tweets dating from November 7, 2017 and evaluates to 193.29, a none-too electrifying differential, I’d allow; ask Twdocs.com to download your tweets, after all, and your stats might look much the same. Survey a few million post-November 7 tweets and then, and only then, might we be able to say something distinctive about Trump’s relative outputs.
But what about the second question posed above – namely, if the heightened loquacity promoted by the 280 characters thus curtails Trump’s need for as many tweets? That finding could again be essayed in a number of ways, including some messier array formulas that would count the available tweets on either side of November 7, 2017 and divide those respective tallies by the number of days elapsed prior to and following that date. In fact, however, a slightly more imprecise but swifter pivot table could do a representative job:

Rows: Created At (UTC + 1h) (The parenthetical notation points to the French provenance of twdocs, at its recording of tweet transmission times per that time in France, an important qualification. The User Created At field, however, merely reports and repeats the date/time at which Trump initiated his Twitter account).

Values: Created At (UTC + 1h) (Count)

I get:


True, the Nov 2017 above conflates the pre and post-November 7 numbers a bit, but the larger thrust of the counts pushes ahead. We see an inconclusive range of distributions: while tweets across the last few months have indeed downturned in relation to October of last year (and that month stands as the most profuse among the table’s holdings), tweet counts in the Feb-Jun 2017 span were decisively lower than the latter-day entries. It’s thus not clear if Trump’s tweeting inclinations have been stunted by the new opportunity to say more at any one time.

And what about Trump’s average retweet count, proposed here as a proxy popularity index? Just substitute Retweet Count for Created At (UTC + 1) in Values and tick Average (with suitable formatting). Here I’d also turn off any subtotals, thus uprooting one higher level of slightly confounding aggregation from the counts. I get:


Commit the above to a rudimentary line chart, and I get:


Is that something worth writing home about, or at least submitting on deadline? Maybe. There’s something at least statistically significant about the line’s fitful locus, and it might be worth investigating April’s retweet dip and/or the more recent spike in RT’s, which does roughly correlate with the opening of the 280-character gateway, though I’m hard-pressed to advance beyond a simple correlation – unless one wants to maintain that longer tweets bespeak deeper policy insights that warrant stepped-up circulation across the twittersphere. Sounds like a reach.

Then add Favorite Count to Values (again, Average):


Favorites correlate most significantly with retweets (I get .8997), an unsurprising resonance. One supposes the prior question, rather a generic one, would ask why favorites reliably outnumber rewteets so enormously (look here, for example, for one take on the matter).

And to fit a final, relatively novel frame atop the Trump tweets, what about the president’s means of transmission? That is:

Rows: Created At (UTC + 1) (again, grouped by Years and Months)

Columns: Source

Values: Source (count)

I get:


Note among other things Trump’s late abjuring of Twitter for Android, and a corresponding uptick in his resort to Twitter for iPhone., his clear source of choice.
Does that mean that Mr. Trump will be recruited to spearhead Apple’s next Think Different campaign?

UK Immigration Raids: Migrating the Data

16 Jan

Designing a dataset falls on the complexity band somewhere between child’s play and rocket science, though I’m not sure to which of the poles the enterprise is closer. Rows and columns, you say; what else do you need, after all? And your rhetorical smirk is right, so far as it goes.

But sometimes the rows and columns require a think-through, preferably somewhere in that design stage – and particularly, for example, if the data are farmed out to more than a single sheet for starters.

Case in point: the look, or pair of looks, at what the Bureau of Investigative Journalism calls Immigration encounters and arrests, a breakdown of city and nationality between January 2012 and January 2017 of individuals either questioned or arrested by law enforcement officials, presumably for a putative trespass of immigration law. The Bureau makes much of the large numbers of British citizens bunched with the foreign detainees, stoking the suggestion that these nationals, by definition entitled to be where they were when they were apprehended, were ethnically profiled, an inference the UK government discounts. The BIJ data are gathered here:

FOI 42354 – Annex A

(Observe the notes stationed beneath the data. If I understand the qualifications of note 2 on the ENCOUNTERED NOT ARRESTED sheet, the “total number of persons” seems to provide for the possible counting of the same person multiple times, should he/she be questioned more than once. Thus, the data appear to tabulate all encounters, not discrete individuals. And note 2 on ENCOUNTERED ARRESTED apparently makes analytical room for multiple arrests of the same person.)

And those data comprise two identically organized worksheets; and so, it seems to me, the pair should be pared. The likenesses among the two argue for a consolidation, but with a particular end in mind: to calculate the ratios marking the encountered and arrested figures, both by city of apprehension and origin of detainee (note: the misspelled city names Cardiff and Sheffield in the ENCOUNTERED ARRESTED sheet must be emended).

But before the consolidation comes the reconstitution. It’s something we’ve seen many times before, but because the worksheets assign field status to the several cities in the data set, these really need to be reined by the yoke of a single field; and to achieve that end we can review and apply Excel’s Power Query capability to the task, more particularly its doubtfully-named Unpivot feature.

First, and this script speaks to both data sets, delete the Grand Total column and row, the better to stem a double-count of the numbers. Then click anywhere in the data set and click Data > From Table/Range in the Get & Transform Data button group (remember I’m working in Excel 2016).

Next, select the fields to be incorporated into what Excel terms termed the attribute-value pairs (use the shift-arrow keyboard combination to select the fields) e.g. in excerpt:


Then click Transform > Unpivot Columns:


And follow with Home > Close & Load. The outcome should look like this, more or less:


But that doesn’t conclude the process, even when you’d do much the same as above for the second data set. We need to differentiate the records per their determination -that is, we need to clarify whether or not the stop culminated in an arrest. Thus we need to append a Status field to the results, coding and copying non-arrest outcomes Enc (Encounter, or something like it), and arrest dispositions, e.g. Arr, down the tables that require this or that status.

And once all that work has been carried out we need to merge the data, by simply copying one set immediately beneath the other – but that amalgamation stirs a small problem, too.

Note that when Power Query succeeds in unpivoting the data (and that verb won’t concede that the data it unpivots was never a pivot table to begin with), it refashions them into a table; and so when we perform the copy described above the copied table continues to retain its now-irrelevant header row, now threaded among the genuine records. But attempt a Delete> Table Rows of the header and Excel resists; the delete option is grayed out, because the second table – the one we’ve pasted beneath the first – remains a table very much in its own right, its positioning notwithstanding, and a table requires a header row, apparently. A workaround is at hand, though: click successively anywhere within the two tables and click Table Tools > Design > Convert to Range, and reply Yes to the prompt. Apart from gifting you with the answer to a question that has recurrently rustled your sleep – namely, when would you ever want to downgrade a table back to what Excel judgmentally calls a normal range – we’ve just learned the answer to a companion question – why? Here, converting a table, or tables, into ranges enables the user to delete the second, now superfluous header row, and agglomerate the two erstwhile, contiguous tables into a single unitary data set. And there are pivot tables in there for the gleaning. (And once these deeds have been done we could rename the default Attribute field City, and Value could be identified anew as Number.­)

And one such table, per our stated intention about 500 words ago, looks to discover the proportion of detainees actually arrested, broken out here by city:

Rows: City

Columns: Number

Values: Count (% of Row Total) Turn off Grand Totals.

I get:


We see, for reasons that remain to be expounded, some notable differentials. It appears that Bristol’s cohort of suspects – 1143 in absolute terms – was by far most likely to experience arrest, more than 120% more vulnerable to arraignment than Sheffield’s 4687.  But accountings for those disparities await, of course. It is at least possible, for example, to imagine that Bristol’s immigrant authorities proceeded with a greater precision, perhaps being guided by more trusted information. Far larger London exhibits a detention rate of 24.24%, a touch lower than the aggregate 25.30%.

Substituting Nationality for City in the table and restoring Grand Totals yields a dossier of mixed findings, owing in part to the broad variation in the numbers of citizens for the respective countries. Sort by Grand Totals (the sort will work, even though they all display 100%; Excel sorts the values skulking behind the percentages), and the uppermost finding will give extended pause:

British citizens are practically never arrested when after being stopped for questioning, returning us to the proper question of profiling, and the kindred fact that the native British are stopped far more often – over 19,000 times that anyone else. On the other hand, we’re left to similarly explain Italy’s low arrest-to-encounter rate, or the fact the Norwegian nationals were stopped 32 times and never arrested.

Indeed, the mighty variation in rates begs a new round of questions; but the spreadsheet develops the first round of answers.

US Police Shooting Data, Part 2: A Civilian’s Review

29 Dec

There’s more tweaking to do inside the Vice data set of big-city police-shooting incidents, though some tweaks are more concessionary than ameliorative. The abundance of NAs flecking the sheet’s fields is not a matter of slipshod data collection; Vice’s preface to the data recalls the issues the site experienced with respondent cooperativeness and data quality. But missing data are missing, and so the fact that shooting victim ages (SubjectAge, in column H) don’t appear for more than 53% of incident cases remains a fact; and while of course the available ages of the other approximately 2000 victims amounts to a sizeable cohort, one wonders if the absent data inflict a sampling skew on the ages. I suspect they don’t, but the question should be recorded.

Pivot-tabling the age data, then, will deliver a mixed yield, e.g. NA along with N/A, U and UNKNOWN, two instances of the label Juvenile, 29 hard-coded #REF! entries that seem to mark a formula gone wrong, and a few hundred records reporting estimated ages in spanned swaths that are no less textual – 0-19 and 20-29, for example. Again, the data sources are uncontrollably heterodox, but a decision about these data needs to be made – which might consist of a determination to simply ignore the age parameter, or alternatively of course to proceed with the usable information in hand.

And missing data likewise compromise the SubjectRace field. Submitting this pivot table for consideration:

Rows: SubjectRace

Values: SubjectRace

SubjectRace (again, % of Column Total)

Churns out these distributions:


The disproportion of Black victims is clear, but of course remains to be explained by a larger context framed by prior variables, e.g. number of overall incidents, number of alleged perpetrators alleged to have been carrying a weapon, etc., but again note the raft of NAs, Us, and blanks disqualifying more than 30% of the entries from the analysis (the blanks simply aren’t counted above at all). (I’m assuming, by the way, that A signifies Asian and O Oriental. But even if so, I don’t know if those categorizations denote distinct or problematically redundant ethnic niches.)

But in any case, there are things to be learned from the data, their shortcomings notwithstanding. For example, we could look at the relationship of victims’ race to shooting incidents resulting in a fataility. In that connection we could assemble this pivot table:

Rows: SubjectRace

Columns: Fatal

Values: SubjectRace (% of Row Total; filter out (blank) and turn off Grand Totals)

I get:


The results are fairly startling.  We have already learned that the absolute number of black victims far outpaces the other groups, but the ratio of fatal police shootings by race involving a white victim – 51.27% – is far higher than that for black or Latinos, a less-than-“intuitive”, literally provocative outcome.

How might these proportions be accounted for? If in fact police may make swifter resort to their weapons in confrontations with black suspects – if – then perhaps white suspects are shot and killed for compelling, truly defensible reasons, i.e., more of them might be armed, or at least seen as such?

In view of that question we could thus introduce a third variable, SubjectArmed. Because three-variable pivot tables complicate readability, we could assign a Slicer to the field instead:

Rows: SubjectRace

Columns: SubjectArmed

Values: SubjectRace (count, % of Row Totals)

Slicer: Fatal

Ticking F in the Slicer for Fatal, I get:

vice 3a


Remember that the column entries above comprise the SubjectArmed items – No, Not Available, Unknown, and Yes.

We see that, so far as the data can take the question, white fatalities are actually less likely to be armed than black or Latino victims. But because the figures for B incur so many NAs and Unknowns, the conclusions are obscured. Indeed – as a consequence of the missing data, we see that white victims are also more likely to be armed.

It could also be noted that 153 fatal shootings received an NA race code, and another 80 were assigned a U. One wonders if these could be inferentially mapped against and compared with the known-race victim distributions, though that supposition is a highly adventurous one.

But turn to the OfficerRace and OfficerGender fields and you’ll encounter a new set of challenges, only one of which are the 450 OfficerRace cells containing no officer information at all, and the not-inconsiderable number of other cells bearing the code H.  Given the cities with which the code is associated I’d guess H stands for Hispanic, but how or if that letter means to distinguish itself  from L I’m not sure, particularly in light of the absence of H data from the SubjectRace field.

OfficerRace and OfficerGender accompany the straightforwardly quantified NumberOfOfficers field, and attempt to code the ethnicities of incident-participant officers – in a single cell.  That is, a cluster of items, each of which would typically populate its own record, here share one cell’s space.

Moreover, a good many cells replace the elliptical W and B entries with WHITE or BLACK respectively, a set of inconsistencies that have to be dealt with as well. We could dispose of these latter anomalies, however, via batch of finds and replaces, e.g., exchanging a W for WHITE.

But it’s the multi-item cells that call for a deeper think. If, for example, we wanted to break out the officer totals by ethnicity, say white officers for starters, we could select OfficerRace, name it offrace and enter, somewhere:


I get 2329 – but how? Those tell-tale curly braces can only mean one thing: an array formula is at work, and what it’s doing first is first summing the aggregate length of the cells committed to offrace, and from that default baseline substracting a revised aggregate length – achieved by substituting nothing, as it were, for every instance of a W throughout the offrace range (remember that SUBSTITUTE replaces every instance of the searched-for text, not merely the first one), and totalling the lengths these. The math, then, yields the 2329, and to empower the process you could do something like this:  enter these race identities in Q2:Q7 respectively: W,B,L,A,H,NA. Follow with


in R2, and copy down accordingly. I get

vice 4a

First, of course, the numbers need to be weighted alongside the aggregated ethnic representations of the police forces studied, but we don’t have that information here. But a more proximate operational concern looms above: At least two of the results are wrong.

For one thing, the formula’s search for NA (assuming that search term is in Q7) must read instead

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q2,” “)))}

What’s different about the above emendation is the space “ “ supplying the final argument in the expression. Because we need to substitute here for two characters – the N and the A – their replacement need comprise one character, in order to foster a differential of one for each original NA. Thus my redone substitution returns 1174, or precisely half the original, errant NA total.

But the other false start is considerably trickier. The count of 1299 for A enormously overstates the incidence of that code, because my formula also registers appearances of NA (and other entries as well) which, after all, contain that selfsame A. The task, then, is to account only for isolated A’s, disassociated from any other entries in the field that also happen to contain the letter – and there are other entries, for example AI/AN, and even one instance of Multi-Racial. The formula as it stands will substitute those A’s too, and be duly counted.

One far-from optimum end run around the problem would be something like this:


That formula in effect calculates all A appearances and simply subtracts the NA total from them, which holds down cell R7. It yields a far more plausible 125 but remains inexact, owing to the other A-freighted cells that continue to be counted above. What could have been done, then: a seeing to it that every coded race/ethnicity consisted of unique letter(s).

But I’d allow I’ve done enough deep thinking for one post.

US Police Shooting Data, Part 1: A Civilian’s Review

18 Dec

You’ll want to read the fine print attending Vice.com’s database on US police shootings perpetrated between 2010 and 2016 for the country’s 50 largest police departments before you turn to the data themselves here, having been exported into Excel via its native Google sheet:


That expository lead-in on the Vice site recounts the provenance of the data, the mixed cooperativeness of departments Vice encountered when requesting incident information, and some of the coding decisions that shaped the workbook as it has emerged. In fact, Vice reports the figures for 47 police departments and for what it tabulates as 4,099 episodes, though the data set itself comprises 4,381 records. That discrepancy may owe to the fact that Vice appears to have assigned one record to each civilian party to an incident, even if multiple civilians were participative (note the NumberofSubjects field in column B records 1 for each row; that unrelieved redundancy suggests the field is dispensable).

Of course, the spreadsheet can’t tell us what truly happened in any of these 4,000-plus episodes, but that should discourage neither Vice nor us from evaluating its contents; but as usual, those contents need be adjudged for their data-worthiness first.

And a few judgements need to be made about some authentic issues. When I attempted to sort the NumberofSubjects field I was detained by this message:


In other words, some merged cells have edged their way into the set, congesting the sort prohibitively. The swiftest means for unmerging (lovely Microsoft verb) the culprit cells: click the empty Select all button to the immediate left of the A column header, thus selecting the entire sheet, and click Home ribbon > Unmerge Cells, the final option waiting behind the Merge & Center button in the Alignment group. (If you need to learn the specific addresses of merged cells, look here. The merged-cell adulteration is particularly pronounced among the data for Austin, Texas, by the way, these commencing in row 105).

There’s more. My customary date-validity check of the Date field, i.e. submitting a =COUNT(A2:A4382) formula to the column in search of a total of 4381 – the count of entries in the range that would roll up should all of them prove to be numeric – presented me with 4355 instead, netting, or nettling, a shortfall of 26. And that means that 26 date bluffers – textual flim-flammers hoping to con you and me into believing they’re for real – are up to no good somewhere in there.

If you click in an actual date in A and sort the date data Oldest to Smallest, those scruffy 26 settle to the bottom of the field. They’re all phrased in European regional syntax (dd/mm/yyyy) (as they are in the native Google Sheet), and they all emanate from Baltimore County, as it turns out. But origins aside, their numeric character needs to be reclaimed.

Something you can do, then: clamber into Q4357 – the next free cell in the first row bearing the phony dates – and enter:


DATE is the function that gathers individuated year, month, and day references and compounds them into a usable date. Here I get 41134; format the result in Short Date terms (use the Number format dropdown in the Number button group) and you wind up with a for-real 8/13/2012. Copy down Q, then Copy > Paste Value the results to A4357:A4382, and format accordingly (and now delete the Q data).

That works, but now ride the elevator to A2 and consider the value – and it is a value – that sits in that cell. That 2010 you see is not some stripped-down cipher for a fuller expression, e.g. 6/5/2010. Rather, it’s nothing but, and only 2010; and if you format those four digits into Short Date terms you get 7/2/1905 – which after all, is the 2010th day distanced from Excel’s date baseline of January 1, 1900. Without the supplementary day and month data, that 2010 won’t be transposed into anything later in the calendar, and there are 593 cells in the A column – or 13.5% of all of its holdings – comprising four-digit entries that won’t evaluate to anything like the years they purport to portray.  And even if you pivot table the dates and hope to pivot them by year only, that won’t work.

These four-digit discordances inflict both a spreadsheet and a journalistic/ethical problem of sorts upon the analysis. The former is spoken to above, but the latter asks about a workaround. What can we do with these data? If we adjoin some place-holding day-month place holder – e.g. 1/1 to the numbers in the interests of lifting them to usability, is that something we’re allowed to do? Perhaps we could, if we indeed want to group the dates by year, and only by year. In that case, since our singular interest is in grouping the years that we in effect already know, any fictional days and months would play a merely technical, facilitating role in reconciling all the year data to the table. But ascribing any substantive role to the days and months, e.g., grouping the data by the latter parameter, would appear to abuse the system.

If that rationale is deemed supportable, we could then remake the 2010 in A2 via


And do the same for its fellow 592 cells (followed by Copy > Paste Values and reformatting to Short Date). Complete that task and you could engineer this pivot table:

Rows: Date (grouped by Years only)

Values: Date (Count)

 I get:


The preliminary count points to a marked diminution in police shootings since 2013 – encouraging, perhaps, but a suggestion that needs to be inspected for the completeness and comparability of the data across the 2010-2016 span.

Now if you want to enumerate shooting incidents by city – an aim bulking high on any investigative agenda (factoring population size, of course) you could proceed here:

Rows City

Values: City (Count)

The values could be sorted Largest to Smallest. I get, in excerpt:


Chicago – a city less than half as populous as New York – records a surpassingly high incident total, in fact accounting for 12% of all the reported shootings among the cities. The city’s spike in gun violence has been no stranger to media coverage, and so resort by the police to that weapon is not shocking in and of itself. Return Date to the pivot table, this time to the Columns area, and


Note Chicago police have actually pulled back on gun interventions (but note the missing data for Detroit for 2014-16, contributory in part to the lowered recent numbers).

One other point, here: I think the BaltimoreCity/BaltimoreCounty and MiamiDade/City of Miami entries are properly distinguished, and not instances of spelling inconsistencies. But there are other data issues, too.