NY Regents Exam Data, Part 2: Multiple Choices

12 Apr

Numbered among the additional conclusions we can draw from the New York Regents data is a natural next question from the aggregate test averages we reckoned last week, answered by kind of reciprocal finding: namely, the aggregate fail rates. Guided by the concerns about weighting we sounded in the previous post, I’d allow that a calculated field need be applied to the task here too, a field I’ve called PctFail.

But before we proceed we again need to contend with the not insignificant number of records who, for apparent reasons of confidentiality, won’t count their fewer-than-five students, replacing the totals with an “s”. Thus I tapped into column N, called it NFail, and entered in N2:


The formula assays the relevant cell in J for an “s”; if it’s there, a 0 is supplied. Otherwise the value in K – the Number Scoring Below 65 – is returned.

Again, we’ll copy that formula down N and proceed to calculate the PctFail field:


Once effected, this simple pivot table, abetted by the All Students Slicer selection we ticked last post, opens the story:

Rows:  Year

Values: PctFail (formatted here in Percentage mode to three decimals)

I get:


The failure rates are substantial, a near-ineluctable follow-on from the overall score averages settling in the 68 range (remember that 65 passes a Regents exam).

But you’ll want to know about failures by Regents subjects, too. Sight unseen, you’d expect appreciable variation among the test areas, and that drill down can tool its way into the data via the Regents Exam field, e.g. something like this:

Rows: Regents Exam

Columns: Year

Values: PctFail (formatted similarly to the table above)

I get:


And variation there is, some more provocative than others. You’ll note the massive leap in failure rates for English and Geometry from 2015 to 2016, a determined, ascending slope of failures for Algebra2/Trigonometry, and a restitutive, noteworthy shrinkage in failures for Common Core Algebra. (The Common Core tests are controversial, in part because of their redesign; see this report, for example).

You’ll also want to do something about those error messages. In some cases, the #DIV/0! outcomes simply key the absence of data for the exam, owing to an exam’s discontinuation or later introduction, while the (blank) label appears in virtue of the ten rows that bear no exam name. Should you want to pave over the errors, and you probably do, click anywhere in the pivot table and proceed to PivotTable Tools > Analyze > Options > Options > Layout & Format tab > and tick For error values show:. Enter some appropriate stand-in for #DIV/0!, e.g. — , and click OK. Because the dashes in the 2015 column push far left and look almost as unseemly as the original error message, you may want to select all the Values and align them right. (You could also filter out the blanks.)

Now if you want to crunch failure rates by ethnicity, for example, you’ll again have to reconcile the double-counting character of the fields we described last post. The ethnicities – Asian, Black, Hispanic, Multiple Race Categories Not Represented, and White – have been quartered in the Demographic Variable field, but so have a potpourri of other, disconnected items bound to other Variables, e.g. Female to Gender, English Proficient to ELL Status.

We’ve stubbed our toe against this odd problem in the previous post, in which Excel’s million-record limit has forced records otherwise deserving of their own field into one, messy cosmopolitan column – the one called Demographic Category, itself dispersed into no-less-heterogeneous items in Demographic Variable. It’s a confusing issue, but I think we need to tick Ethnicity in the Slicer now and slide Demographic Category –confined by the Slicer to its ethnic item entries – into Rows. Sweep Year into Columns and you get:


The disparities here are dramatic, and rather self-explanatory – the results, that is, not the accountings of them.

Next opt for Gender in the Slicer:


Women outdo men, a finding that more-or-less jibes with current understandings of gender performance differentials. The female margin, nearly equivalent across 2015 and 2016, pulls away slightly in the following year.

And what of gender outcomes by exam? Slip Regents Exam atop Demographic Category (which has been sliced to Gender) in Rows, and (in excerpt):


And (in second, remaining excerpt):


You’re looking for palpable divergences, of course, but palpable congruences mean something here, too.  The decisive female advantages in the Common Core English scores are perhaps notable but not arresting; but their edge in Common Core Algebra does a fair share of stereotype busting, even as males emerge the stronger in Common Core Algebra2. (Important note, the Grand Total pass rates vary by Demographic Category even as the Total Tested remains neatly constant across all Demographic Variables. That’s because the distribution of “s” entries across the Categories isn’t constant.)

There are plenty of other permutations in there, but let’s try one more. Column Q quantifies the number of students in the record whose score achieves what the Regents calls College Readiness (CR), i.e., a 75 in the English Language Regents or an 80 on any Math exam in the system.

And here’s where I have to own up to a bit of spreadsheeting excess. In the previous post I implemented what I termed an NFail field, embodied by its foundational formula – an expression that replaced “s” entries with a 0, the better to factor these into a calculated field. I now understand that those exertions were unnecessary, because Excel will completely ignore an “s” or any other label in any case. Thus here (and last week, too) we can work directly with the Number Scoring CR field in Q. But because we do need to acknowledge the large number of “s” and “NA” entries in Q (NA, because only some Regents’ qualify as CR exams) that will impact any denominator we also need what I call here a CRCounted field to be put in place in the next available column, punctuated by this formula that gets copied down:


We then need compose that calculated field, which I’m calling CRPass:


Remember here, and for the first time, we’re computing pass rates. This pivot table among others, awaits, under the aegis of the Demographic Category Slicer – Gender:

Rows: Demographic Variable

Columns: Year

Value: CRPass

I get:


Provided I’ve done my homework correctly the results point to a striking spike in CR attainments, a set of findings that calls for some journalistic deep-backgrounding. (Note that the absolute CR numbers are far smaller than the global Total Tested figures, because as indicated above only certain exams march under the CR banner.) We see a small-scaled but real stretching of the female advantage in the pass rates between 2015 and 2017, one that also needs to be sniffed by some nose for news.

Now let my take a break while I double-check my homework. I hear this blogster is one nasty grader.


NY Regents Exam Data, Part 1: Multiple Choices

28 Mar

We haven’t met, but I can say with a jot of confidence that I’ve likely done something you haven’t – taken a Regents exam. Exams.

I’m not clipping the achievement to my lapel as a badge of honor, you understand, just stating a biographical matter of fact. The Regents – a staple of the New York State educational system in which I spent more than a little time – comprise a series of what are termed exit exams; pass enough of them and you walk away with a high school diploma of the same name. Your correspondent took his share of Regents, his scores emobdying the “scatter” in scattergram, but no matter; I took them, and I’m a better person for having done so.

But before you file a Freedom of Information Act request to verify those abnormally-curved results you may want to review a larger, ultimately more interesting record of Regents attainments, the dataset supplied by the New York City open data site that summarizes in grand form the Regents scores of students statewide for the years 2014-17. It’s a big file, needless to say – so big you’ll need to download it yourself – and its 212,000 or so records have a lot to say about the testing profile of New York’s high schoolers.

It also has a lot to say about spreadsheet organization, more particularly the juxtaposition of column G, Demographic Category, to H, Demographic Variable. Those columns/fields in fact identify a series of putative fields and field items respectively; and as such, G’s contents could, at least in theory, have been more conventionally structured into discrete parameters, each owning a column all its own.

But that prescription calls for an elaboration. Consider this fledgling pivot table drawn from the Regents data:

Rows: Demographic Category

Demographic Variable

Values: Total Tested

I get:


Note that that all the Total Tested subtotals are equivalent, intimating that the five Demographic Categories cleave the same population into an assortment of cross-cutting attributes, and thus summing the same student count five times. The by-the-book alternative, again, would have assigned each Category to an independent field, such that an interest in test achievements by Gender, for example, would require the user to simply drag Gender into Rows. As it stands, however, a Slicer (or filter) would have to grease the process, e.g.:


Moreover, casting the potential field-bearers into mere item status beneath that singular Demographic Category banner appears to obviate a good many cross-tabulating possibilities, e.g.: a breakout of tests by both Ethnicity and Gender. How, for example, are we to learn how black female students score on the Regents when both attributes are lodged in the same field, and so must occupy the same label area?

But at the same time, the spreadsheet designers had to contend with a supplementary problem that overrides my challenges – namely, that the upgrading of say, Ethnicity and Gender to field standing would appear to require that data present themselves in individual record form, e.g. each student’s performance on each test; and those 2,000,000 scores/records would burgeon beyond Excel’s data-accommodation space.
In any case, there is indeed lots to learn structural complications notwithstanding, and we could begin by starting coarsely – by calculating the average overall Regents scores by year:

Row: Year

Values: Mean Score (Average, formatted to two decimal points)

I get:


(Note that one record, attaching to the Island School in Manhattan, exhibits a nonsensical entry for its year. By filtering and comparing the Island School data, it appears that the record belongs to 2017.)

The averages are remarkably similar, though I’d venture that, given the 2,000,000-pupil universe, the one-point differential distancing the 2015 and 2017 scores is significant. Remember that the Regents passing score is pegged at 65, suggesting that the test designers got their threshold right.

But those averages aren’t quite definitive, and for a couple of connected reasons, one subtler than the other. The first recognizes that the student double-count pinpointed above proceeds in effect to compute the average scores multiple times, because the records operate under the steam of different demographic categories and numbers per record. Thus the mean average for the Gender category alone – which nevertheless contains all students – is likely to depart at least slightly from the mean average for Ethnicity, which likewise contains all students. If, for example, we reintroduce the Slicer for Demographic Category for the current pivot table and tick All Students, we’ll get:


The differences from the initial pivot table are very small but evident, again because the Total Tested numbers per the All Students records don’t perfectly line up with the Total Tested per-record numbers for Ethnicity, for example – because each record receives an equal weight, irrespective of its Total Tested value.

And it is the matter of weighting that points its arrow directly at the second question, one we’ve seen elsewhere (here, for example). The per-record mean averages ascribe an identical mean score input to each record, even as the test taker numbers vary. And that bit of record democratization vests greater, relative influence to the smaller numbers. The result, again: a possible skewing of the averages.

But because Column I enlightens us about the actual test taker numbers, we should be able to derive a simple calculated field to impart a corrective to the weighting problem – once we deal with the very large number of records – about 75,000 – that just don’t report their score results. These are the entries dotted with the “s” code, and our formula needs to ignore them as well as their test taker numbers, which are stated, after all.
So here’s what I did. I headed next-available column S TotalPts and entered, in S2:


That expression means to assign 0 for any “s” datum, and otherwise multiply the record’s mean score by its number of students. (Important note: absent Means Scores almost always associate themselves either with a Total Tested number of five or less, or with entries possessing the SWD or Non-SWD values. (SWD stands for Students With Disabilities.) One assumes that both types exclusions are justified by reasons of confidentiality; but remember that the five-way count of students in the data set should subsume most of the SWD takers anyway, via their inclusion in the All Students item.)

After copying the formula down I titled the T column CountedStu and dropped down a row, wherein I entered:


The formula asks if the relevant row in J contains an s. If so, a zero is returned; otherwise, the number of test takers cited in I is returned.

I next devised a simple calculated field, ActAvg (for actual):


That field can now be made to work with any other field whose items are banked in Row Labels, e.g. by substituting ActAvg for Mean Score in Values (and leaving the All Students Slicer selection in place). I get:


It’s clear that our weighting refinements have uncovered a “true” higher set of averages, and that continue to sustain the near-point improvement from 2015 to 2017.

But again, there’s quite a bit more to learn from the scores. But I’m asking now to recess for lunch – and yes, I raised my hand.

U. S. Vaccine Data: A Different Treatment

14 Mar

It’s alarmingly philistine of me to say so, but I know what I like; and I like the heat maps that track the vaccine-driven advances in the United States wrought against infectious diseases, schematized by graphics boffins Tynan DeBold and Dov Friedman and featured in the Wall Street Journal in 2015. One of them looks like this:


These are of course heat maps in the broadened sense, plotting epidemiological movement across time instead of territory, and tiling their data into tightly-bound mosaics, whose colors blanche as the diseases recede. In short, a well-told story of stirring medical progress, the data for which is contributed by Project Tycho of the University of Pittsburgh and available for download there (you can sign into it for free if you’re affiliated with an educational institution). By contrast, think what a line chart bearing 50 data series would look like.

It should be added that the Tycho site itself hardwires a heat-mapping utility into its pages, in what could be construed as an ancestral precursor of the DeBold-Friedman outputs, e.g.


It is worth asking, then, about the refinements DeBold and Friedman commended to the Tycho charts, and the respective reasons why.

But in any event the map designers don’t need my everyman’s encomium; their depictions have won at least two awards – one conferred by the Global Editor’s Network (GEN), the other by the Kantar Information is Beautiful judges. The former’s site declares the maps were “…wildly popular on social media as well as with statisticians and graphics editors weighing in on how they would’ve approached this project”.

Sounds good to me, but note for the record that the Y axis above records 26 state names, pulling up 24 (really 25; the District of Columbia – the nation’s capital that possesses extra-state status – is likewise counted) short of the full American complement. Those names – at least as they’re represented above – have been invested with a font size that misaligns them with the 51 rows of data; the names are simply too large for the row heights that capture the data, a mismatch that conduces toward the follow-on question as to why the particular states above were earmarked for display, to the exclusion of the others.

You’ll also observe that the maps’ cell widths vary. Contrast the DeBold-Friedman above with this one:


Because the above map – and its 26 states – time-stamps its data from a later point of inception its cells are dilated, a refit that widens it equivalently to the other maps. Do these discrepancies matter (along with the ten-year intervals marked in the first map that have been halved in the second)? I’m not sure.

And something else about the maps provoked a thought or two. On his website Dov Friedman tells us that he used Excel “to aggregate over 100,000 data points. The data was then plotted on heatmaps using Highcharts. All sections were templatized with handlebars [sic].”

Now Highcharts and Handlebars are two applications with which I am not familiar (I told you I was a philistine); But in the course of perusing and admiring the maps a renegade idea gatecrashed my cerebellum: could the maps be emulated – at least more or less – with Excel alone?

I think the answer is yes, at least more or less; it seems to me that, with the disease data dropped into the Values area, the heat maps could be made to emerge from a pivot table that would crosstab state names and years against the data, whose numbers could be comparatively scaled through a series of conditional formats.

In receipt of that self-issued marching order, I proceeded to download the measles data for five states to see how the heat maps might be framed in Excel-only mode. The numbers were redirected into a pivot table (not a straightforward program; the download assigned each state’s figures to an independent field, and as such had to be harnessed inside a single State parameter via the Get & Transform utility. In addition the “-“ cell entries marking absent data had to be quantified, but I’ll spare you the particulars).

Once the above reconstructive work was carried out the resulting pivot table proceeded pretty straightforwardly, something like (depending on how you’d name the fields):

Rows: States

Columns: Year

Values: Incidence

And once the table was put in place, the cells would be subjected to the color-graded conditional formats described above (in tandem with white borders), and while I didn’t do a primo job of replicating the DeBold-Friedman scheme it’s all about the concept, after all.

Now of course the pivot table will line its upper border with the year (Column area) entries, even as the maps in question underscore the values with those data. The simplest route toward emulation would call for a blue-on-blue formatting of the years above, or a simple hiding of their row, after which the years could be typed below. The Vaccine introduced line was almost surely drawn, and that’s what I did. The color legend


Is a tricky one; it could perhaps be rendered though an equally simple coloring of consecutive cells per the conditional formats, though it appears that the DeBold-Friedman band comprises tints of varying width. By way of resolution, four cells or so could be allotted per 1k, to support multiple colors spanning the same interval (note that the data for the states I selected don’t report numbers in the 2-4k range.)
In light of the preceding, I’ve gotten this, by way of a demo:


The larger point has been made, one hopes. Proper equivocations aside, the map approximates the representations of the DeBold-Friedman efforts, but again without recourse to applications beyond Excel. Indeed – for me the phase of the task that threw up the most resistance was the foundational pivot table itself. But once you toss off that gauntlet, the map beckons – because the map is the pivot table (and there’s no need by the way to resize the numbers in values for visibility’s sake, because the conditional formats will obscure them anyway).

Thus it seems to me that my infographical proposal works, albeit on a slight lower end from the DeBold-Friedman portrayals. And while I’m not claiming it’s the stuff for which awards are bestowed, I’m working on my acceptance speech anyway.

US Politicians’ Worth: A Wealth of Disclosures

20 Feb

If the citizen’s right to know shades into a gossipy prurience, the right to know nevertheless prevails; thus if the disclosures of the estimated net worth of American lawmakers compiled by the Center for Responsive Politics and reposed in the Enigma public data site titillates or outrages, one assumes the greater good is being served just the same. Those estimates are here:


First, don’t be fooled by the 2014 date stamp, as I was at first blush. After all, the worth list, sorted as it is highest-to-lowest, is headed by the irrepressibly well-heeled Mr. Trump, who if memory serves only pulled into the White House in 2017, whether he wanted to or not. The 2014, then, must point to the date of the estimates instead (and as such, the year field and its relentlessly uniform 2014s could be deleted, though keep in mind that Enigma stores Center worth data for previous years as well), and if a good many of the numbers appear to you in scientific-notational form (e.g. 1.07E+.08) that’s just one more way by which Excel means to tell you that their columns are presently too narrow. Just auto-fit away, as usual. (Note that the origin field in H is likewise uninformative, comprising nothing but the Center for Responsive Politics entry, and the source field in I is even sparser, amounting to nothing but its header.)

Get past those preliminaries and you’ll need to remind yourself about the codes informing the chamber field: E – Executive, S – Senate, H – House of Representatives, and J – Judiciary. That latter code estimates the worth of the nine now-serving Supreme Court Justices, an interesting set of disclosures about a coterie one might deem slightly above the financial fray; the S and H signifiers apparently name and enumerate each and every current Senator and Congressperson, a nicely comprehensive complement. The twelve executive entries, however, stand in a different relation to the dataset; they identify presidential hopefuls for the 2016 election and not office holders, with the exception of the winner.

Perhaps the first attention-grabber in the data to seize your lapels is the trio of worth estimates lined up for each politician. The Center for Responsive Politics explains that the data spring from disclosure forms completed by office holders which “do not require exact values”, but rather “the range of value into which an asset falls”. The Center thus filled ranges with lower and upper estimates and averaged the two into the avgvalue field, those computations in turn exhibiting considerable plus-minus variation. Indeed, 77 of the politicians declared a negative lower-end asset value, with another 11 submitting negative upper estimates. Of course, our data can’t take that question further.

In any case, working with the averages we could broach a first obvious question, namely how worth associates with chamber:

Rows: chamber

Values: avgvalue (formatted to two decimals with a comma)

I get:


Remember of course that the Executive universe is small, numbering but 12 cases. As such the 1.482- billion-dollar average estimate for Donald Trump slingshots the mean skyward, in spite of negative-asset claimers Scott Walker, Rick Santorum, and Martin O’Malley. Whether that means erstwhile Maryland governor O’Malley truly has less money than you and I is perhaps a query better put to his accountant, but the data maintain as much.

But the larger, if ingenuous, conclusion urged upon us is the reality that American national politicians – including Supreme Court justices – are overwhelmingly multi-millionaires. If the averages of the averages are credible, Senators possess over $10,000,000 worth of assets, with Representatives checking in a $7,500,000, accumulations that of course can’t ascribed to their $174,000 salaries. At the same time, however, variation within the Congress is massive, including outlier Darrel Issa, a Republican Congressman from southern California worth, per his average estimate, over $436,000,000.

For a sense of the dispersion of wealth, then, you can drag avgvalue in to values again, setting it to Stddevp (for the operational difference between Stddev and Stddevp, look here):


When the standard deviations exceed the averages you’re dealing with a lot of variation, even after discounting the ten-figured Mr. Trump.

Now for perhaps the juicier question: Might net worth vary palpably by party affiliation? Conventional wisdom would respond in the affirmative, i.e. Republicans and their putative sympathies for the haves could be expected to have more as well than the latter-day Blues. But either way, the dataset should enable an answer of sorts to the question.

The marching order here is to isolate party, as it’s referenced by its initial in the name field. Because that initial appears (almost) in every case immediately after the open parenthesis alongside each surname, we can take over the empty source field, rename it Party, and write in I2 (assuming you haven’t deleted the extraneous year field):


That expression seeks the character position of the open parenthesis and adds one to it, enabling MID to grab the next single character, which denotes political party. Copy the formula down the field and that should work, with nine exceptions – the Supreme Court justices, whose party memberships have been withheld from the data, an exclusion that imposes a #VALUE message upon their formulas. The workaround: rewrite the formulas thusly:


The nifty IFERROR function returns the default, nested formula result unless it evaluates to a error, in which case it’ll interject the defined if-false result, in the above case J, for Judiciary.

Once that revamped expression is copied down the field the pivot table follows:

Rows: Party

Values: avgvalue (Average, with appropriate formatting)

I get:



The thesis is confirmed, apparently (the I represents the two independent legislators informing the list, by the way, one of whom is Bernie Sanders). Republicans are worth on average 47% more than the (very) relatively proletarian Dems; but toss Stdevp back in to the table and we see:


The mammoth standard deviation among GOP sectarians has a good deal to do with the Trump billion, and Issa’s 436,000,000, too. Thus, if one deletes the row bearing Trump’s data and refreshes the pivot table, you’ll see:


Check it out.  The Republicans, we see, in actuality average a smaller net worth and standard deviation than Democrats. Go further and delete all 12 Executive entries – none of whom, with the exception of the incumbent, are election officials, after all – and run the pivot table once more:


Here, Democrats possess 9.94% more worth on average than Republicans.

Now that’s unconventionally wise.

Tweets of a Stable Genius

11 Feb

Notch November 7, 2017 as a watershed tickpoint on your presidential timeline. It was the day on which Twitter doubled its message character allotment to 280, an allowance that induced shudders in both the twittersphere along with vast terrestrial swaths: it mightily enriched the opportunities for the President of the United States to tweet his mind to you and me.

I’ll leave the rest of the editorials to you and me, but the character enhancement begs a few novel questions we could instructively put to the downloadable spreadsheet of Mr. Trump’s communiques offered up (for a small fee) by twdocs.com, the go-to site to which I’ve gone on more than one occasion. My latest requisition of @realDonaldTrump tweets (accept no substitutes) pumped out the incumbent’s latest 2805 entries through February 8, affording us exactly three months’ worth of post-280 streams of consciousness, with the earliest of these bearing the time stamp October 24, 2016. Again, I think you’ll have to pay for the sheet yourself if you want to click along.

A first question, then, an obvious one: has the widened message space encouraged Trump to issue longer tweets? And a second, slightly subtler follow-on: might 280 characters inspire longer, but fewer of them? (Note that this submission considers features of the commander-in-chief’s twitter proclivities over and above his preferred key words, a concern which predominates in my previous posts on the matter).
An answer to the former hinges itself on the LEN function, with which I’ve overwritten the contents of the H column, declaring nothing but the name Donald J. Trump in each of its populated cells. Renaming the field LEN I entered in H7:



and copied down. (I’m not quite sure what the field name a in the otherwise empty G column is doing. On the one hand, the name suffices to enable the pivot table to enfold the fields following G, but I’m not sure why the field need be there at all.)
The next move is slightly trickier, because the defining November 7, 2017 breakpoint doesn’t lend itself to a graceful pivot table grouping, or at least I don’t think it does. That is, grouping the date data in the A column by Months and Years will naturally turn out conventionally-demarcated months, e.g. Nov and Dec, and won’t group the months beginning with the 7th. A more elegant realization would tap into a pair of array formulas, i.e.: name the LEN field range Length and name the data in Created At (in the A column) Date. Enter the 11/7/2017 boundary date anywhere, say H1, and enter, say in G1:


The formula seeks out all dates prior to the benchmark November 7 and averages their tweet lengths. Extended to two decimal points, I get 126.80. Then enter in G2:


This companion expression averages the character length of the President’s tweets dating from November 7, 2017 and evaluates to 193.29, a none-too electrifying differential, I’d allow; ask Twdocs.com to download your tweets, after all, and your stats might look much the same. Survey a few million post-November 7 tweets and then, and only then, might we be able to say something distinctive about Trump’s relative outputs.
But what about the second question posed above – namely, if the heightened loquacity promoted by the 280 characters thus curtails Trump’s need for as many tweets? That finding could again be essayed in a number of ways, including some messier array formulas that would count the available tweets on either side of November 7, 2017 and divide those respective tallies by the number of days elapsed prior to and following that date. In fact, however, a slightly more imprecise but swifter pivot table could do a representative job:

Rows: Created At (UTC + 1h) (The parenthetical notation points to the French provenance of twdocs, at its recording of tweet transmission times per that time in France, an important qualification. The User Created At field, however, merely reports and repeats the date/time at which Trump initiated his Twitter account).

Values: Created At (UTC + 1h) (Count)

I get:


True, the Nov 2017 above conflates the pre and post-November 7 numbers a bit, but the larger thrust of the counts pushes ahead. We see an inconclusive range of distributions: while tweets across the last few months have indeed downturned in relation to October of last year (and that month stands as the most profuse among the table’s holdings), tweet counts in the Feb-Jun 2017 span were decisively lower than the latter-day entries. It’s thus not clear if Trump’s tweeting inclinations have been stunted by the new opportunity to say more at any one time.

And what about Trump’s average retweet count, proposed here as a proxy popularity index? Just substitute Retweet Count for Created At (UTC + 1) in Values and tick Average (with suitable formatting). Here I’d also turn off any subtotals, thus uprooting one higher level of slightly confounding aggregation from the counts. I get:


Commit the above to a rudimentary line chart, and I get:


Is that something worth writing home about, or at least submitting on deadline? Maybe. There’s something at least statistically significant about the line’s fitful locus, and it might be worth investigating April’s retweet dip and/or the more recent spike in RT’s, which does roughly correlate with the opening of the 280-character gateway, though I’m hard-pressed to advance beyond a simple correlation – unless one wants to maintain that longer tweets bespeak deeper policy insights that warrant stepped-up circulation across the twittersphere. Sounds like a reach.

Then add Favorite Count to Values (again, Average):


Favorites correlate most significantly with retweets (I get .8997), an unsurprising resonance. One supposes the prior question, rather a generic one, would ask why favorites reliably outnumber rewteets so enormously (look here, for example, for one take on the matter).

And to fit a final, relatively novel frame atop the Trump tweets, what about the president’s means of transmission? That is:

Rows: Created At (UTC + 1) (again, grouped by Years and Months)

Columns: Source

Values: Source (count)

I get:


Note among other things Trump’s late abjuring of Twitter for Android, and a corresponding uptick in his resort to Twitter for iPhone., his clear source of choice.
Does that mean that Mr. Trump will be recruited to spearhead Apple’s next Think Different campaign?

UK Immigration Raids: Migrating the Data

16 Jan

Designing a dataset falls on the complexity band somewhere between child’s play and rocket science, though I’m not sure to which of the poles the enterprise is closer. Rows and columns, you say; what else do you need, after all? And your rhetorical smirk is right, so far as it goes.

But sometimes the rows and columns require a think-through, preferably somewhere in that design stage – and particularly, for example, if the data are farmed out to more than a single sheet for starters.

Case in point: the look, or pair of looks, at what the Bureau of Investigative Journalism calls Immigration encounters and arrests, a breakdown of city and nationality between January 2012 and January 2017 of individuals either questioned or arrested by law enforcement officials, presumably for a putative trespass of immigration law. The Bureau makes much of the large numbers of British citizens bunched with the foreign detainees, stoking the suggestion that these nationals, by definition entitled to be where they were when they were apprehended, were ethnically profiled, an inference the UK government discounts. The BIJ data are gathered here:

FOI 42354 – Annex A

(Observe the notes stationed beneath the data. If I understand the qualifications of note 2 on the ENCOUNTERED NOT ARRESTED sheet, the “total number of persons” seems to provide for the possible counting of the same person multiple times, should he/she be questioned more than once. Thus, the data appear to tabulate all encounters, not discrete individuals. And note 2 on ENCOUNTERED ARRESTED apparently makes analytical room for multiple arrests of the same person.)

And those data comprise two identically organized worksheets; and so, it seems to me, the pair should be pared. The likenesses among the two argue for a consolidation, but with a particular end in mind: to calculate the ratios marking the encountered and arrested figures, both by city of apprehension and origin of detainee (note: the misspelled city names Cardiff and Sheffield in the ENCOUNTERED ARRESTED sheet must be emended).

But before the consolidation comes the reconstitution. It’s something we’ve seen many times before, but because the worksheets assign field status to the several cities in the data set, these really need to be reined by the yoke of a single field; and to achieve that end we can review and apply Excel’s Power Query capability to the task, more particularly its doubtfully-named Unpivot feature.

First, and this script speaks to both data sets, delete the Grand Total column and row, the better to stem a double-count of the numbers. Then click anywhere in the data set and click Data > From Table/Range in the Get & Transform Data button group (remember I’m working in Excel 2016).

Next, select the fields to be incorporated into what Excel terms termed the attribute-value pairs (use the shift-arrow keyboard combination to select the fields) e.g. in excerpt:


Then click Transform > Unpivot Columns:


And follow with Home > Close & Load. The outcome should look like this, more or less:


But that doesn’t conclude the process, even when you’d do much the same as above for the second data set. We need to differentiate the records per their determination -that is, we need to clarify whether or not the stop culminated in an arrest. Thus we need to append a Status field to the results, coding and copying non-arrest outcomes Enc (Encounter, or something like it), and arrest dispositions, e.g. Arr, down the tables that require this or that status.

And once all that work has been carried out we need to merge the data, by simply copying one set immediately beneath the other – but that amalgamation stirs a small problem, too.

Note that when Power Query succeeds in unpivoting the data (and that verb won’t concede that the data it unpivots was never a pivot table to begin with), it refashions them into a table; and so when we perform the copy described above the copied table continues to retain its now-irrelevant header row, now threaded among the genuine records. But attempt a Delete> Table Rows of the header and Excel resists; the delete option is grayed out, because the second table – the one we’ve pasted beneath the first – remains a table very much in its own right, its positioning notwithstanding, and a table requires a header row, apparently. A workaround is at hand, though: click successively anywhere within the two tables and click Table Tools > Design > Convert to Range, and reply Yes to the prompt. Apart from gifting you with the answer to a question that has recurrently rustled your sleep – namely, when would you ever want to downgrade a table back to what Excel judgmentally calls a normal range – we’ve just learned the answer to a companion question – why? Here, converting a table, or tables, into ranges enables the user to delete the second, now superfluous header row, and agglomerate the two erstwhile, contiguous tables into a single unitary data set. And there are pivot tables in there for the gleaning. (And once these deeds have been done we could rename the default Attribute field City, and Value could be identified anew as Number.­)

And one such table, per our stated intention about 500 words ago, looks to discover the proportion of detainees actually arrested, broken out here by city:

Rows: City

Columns: Number

Values: Count (% of Row Total) Turn off Grand Totals.

I get:


We see, for reasons that remain to be expounded, some notable differentials. It appears that Bristol’s cohort of suspects – 1143 in absolute terms – was by far most likely to experience arrest, more than 120% more vulnerable to arraignment than Sheffield’s 4687.  But accountings for those disparities await, of course. It is at least possible, for example, to imagine that Bristol’s immigrant authorities proceeded with a greater precision, perhaps being guided by more trusted information. Far larger London exhibits a detention rate of 24.24%, a touch lower than the aggregate 25.30%.

Substituting Nationality for City in the table and restoring Grand Totals yields a dossier of mixed findings, owing in part to the broad variation in the numbers of citizens for the respective countries. Sort by Grand Totals (the sort will work, even though they all display 100%; Excel sorts the values skulking behind the percentages), and the uppermost finding will give extended pause:

British citizens are practically never arrested when after being stopped for questioning, returning us to the proper question of profiling, and the kindred fact that the native British are stopped far more often – over 19,000 times that anyone else. On the other hand, we’re left to similarly explain Italy’s low arrest-to-encounter rate, or the fact the Norwegian nationals were stopped 32 times and never arrested.

Indeed, the mighty variation in rates begs a new round of questions; but the spreadsheet develops the first round of answers.

US Police Shooting Data, Part 2: A Civilian’s Review

29 Dec

There’s more tweaking to do inside the Vice data set of big-city police-shooting incidents, though some tweaks are more concessionary than ameliorative. The abundance of NAs flecking the sheet’s fields is not a matter of slipshod data collection; Vice’s preface to the data recalls the issues the site experienced with respondent cooperativeness and data quality. But missing data are missing, and so the fact that shooting victim ages (SubjectAge, in column H) don’t appear for more than 53% of incident cases remains a fact; and while of course the available ages of the other approximately 2000 victims amounts to a sizeable cohort, one wonders if the absent data inflict a sampling skew on the ages. I suspect they don’t, but the question should be recorded.

Pivot-tabling the age data, then, will deliver a mixed yield, e.g. NA along with N/A, U and UNKNOWN, two instances of the label Juvenile, 29 hard-coded #REF! entries that seem to mark a formula gone wrong, and a few hundred records reporting estimated ages in spanned swaths that are no less textual – 0-19 and 20-29, for example. Again, the data sources are uncontrollably heterodox, but a decision about these data needs to be made – which might consist of a determination to simply ignore the age parameter, or alternatively of course to proceed with the usable information in hand.

And missing data likewise compromise the SubjectRace field. Submitting this pivot table for consideration:

Rows: SubjectRace

Values: SubjectRace

SubjectRace (again, % of Column Total)

Churns out these distributions:


The disproportion of Black victims is clear, but of course remains to be explained by a larger context framed by prior variables, e.g. number of overall incidents, number of alleged perpetrators alleged to have been carrying a weapon, etc., but again note the raft of NAs, Us, and blanks disqualifying more than 30% of the entries from the analysis (the blanks simply aren’t counted above at all). (I’m assuming, by the way, that A signifies Asian and O Oriental. But even if so, I don’t know if those categorizations denote distinct or problematically redundant ethnic niches.)

But in any case, there are things to be learned from the data, their shortcomings notwithstanding. For example, we could look at the relationship of victims’ race to shooting incidents resulting in a fataility. In that connection we could assemble this pivot table:

Rows: SubjectRace

Columns: Fatal

Values: SubjectRace (% of Row Total; filter out (blank) and turn off Grand Totals)

I get:


The results are fairly startling.  We have already learned that the absolute number of black victims far outpaces the other groups, but the ratio of fatal police shootings by race involving a white victim – 51.27% – is far higher than that for black or Latinos, a less-than-“intuitive”, literally provocative outcome.

How might these proportions be accounted for? If in fact police may make swifter resort to their weapons in confrontations with black suspects – if – then perhaps white suspects are shot and killed for compelling, truly defensible reasons, i.e., more of them might be armed, or at least seen as such?

In view of that question we could thus introduce a third variable, SubjectArmed. Because three-variable pivot tables complicate readability, we could assign a Slicer to the field instead:

Rows: SubjectRace

Columns: SubjectArmed

Values: SubjectRace (count, % of Row Totals)

Slicer: Fatal

Ticking F in the Slicer for Fatal, I get:

vice 3a


Remember that the column entries above comprise the SubjectArmed items – No, Not Available, Unknown, and Yes.

We see that, so far as the data can take the question, white fatalities are actually less likely to be armed than black or Latino victims. But because the figures for B incur so many NAs and Unknowns, the conclusions are obscured. Indeed – as a consequence of the missing data, we see that white victims are also more likely to be armed.

It could also be noted that 153 fatal shootings received an NA race code, and another 80 were assigned a U. One wonders if these could be inferentially mapped against and compared with the known-race victim distributions, though that supposition is a highly adventurous one.

But turn to the OfficerRace and OfficerGender fields and you’ll encounter a new set of challenges, only one of which are the 450 OfficerRace cells containing no officer information at all, and the not-inconsiderable number of other cells bearing the code H.  Given the cities with which the code is associated I’d guess H stands for Hispanic, but how or if that letter means to distinguish itself  from L I’m not sure, particularly in light of the absence of H data from the SubjectRace field.

OfficerRace and OfficerGender accompany the straightforwardly quantified NumberOfOfficers field, and attempt to code the ethnicities of incident-participant officers – in a single cell.  That is, a cluster of items, each of which would typically populate its own record, here share one cell’s space.

Moreover, a good many cells replace the elliptical W and B entries with WHITE or BLACK respectively, a set of inconsistencies that have to be dealt with as well. We could dispose of these latter anomalies, however, via batch of finds and replaces, e.g., exchanging a W for WHITE.

But it’s the multi-item cells that call for a deeper think. If, for example, we wanted to break out the officer totals by ethnicity, say white officers for starters, we could select OfficerRace, name it offrace and enter, somewhere:


I get 2329 – but how? Those tell-tale curly braces can only mean one thing: an array formula is at work, and what it’s doing first is first summing the aggregate length of the cells committed to offrace, and from that default baseline substracting a revised aggregate length – achieved by substituting nothing, as it were, for every instance of a W throughout the offrace range (remember that SUBSTITUTE replaces every instance of the searched-for text, not merely the first one), and totalling the lengths these. The math, then, yields the 2329, and to empower the process you could do something like this:  enter these race identities in Q2:Q7 respectively: W,B,L,A,H,NA. Follow with


in R2, and copy down accordingly. I get

vice 4a

First, of course, the numbers need to be weighted alongside the aggregated ethnic representations of the police forces studied, but we don’t have that information here. But a more proximate operational concern looms above: At least two of the results are wrong.

For one thing, the formula’s search for NA (assuming that search term is in Q7) must read instead

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q2,” “)))}

What’s different about the above emendation is the space “ “ supplying the final argument in the expression. Because we need to substitute here for two characters – the N and the A – their replacement need comprise one character, in order to foster a differential of one for each original NA. Thus my redone substitution returns 1174, or precisely half the original, errant NA total.

But the other false start is considerably trickier. The count of 1299 for A enormously overstates the incidence of that code, because my formula also registers appearances of NA (and other entries as well) which, after all, contain that selfsame A. The task, then, is to account only for isolated A’s, disassociated from any other entries in the field that also happen to contain the letter – and there are other entries, for example AI/AN, and even one instance of Multi-Racial. The formula as it stands will substitute those A’s too, and be duly counted.

One far-from optimum end run around the problem would be something like this:


That formula in effect calculates all A appearances and simply subtracts the NA total from them, which holds down cell R7. It yields a far more plausible 125 but remains inexact, owing to the other A-freighted cells that continue to be counted above. What could have been done, then: a seeing to it that every coded race/ethnicity consisted of unique letter(s).

But I’d allow I’ve done enough deep thinking for one post.