
The Verdict on Cook County Court Sentencing Data, Part 2

18 Jun

There’s plenty to learn from the feature-rich Cook County sentencing dataset, and so let the learning commence. We could, by way of first look at the data, examine what the set calls SENTENCE_TYPE, a self-evident header whose entries could be additionally broken out by another field, SENTENCE_DATE:


Columns: SENTENCE_DATE (grouped by year; see the previous post’s caution about the date-grouping challenges peculiar to these data).

Values: SENTENCE_TYPE (% of Column Total, as you turn Grand Totals off for the table’s column, which must necessarily come to 100%.)

I get:


(Bear in mind that the numbers for 2010 and 2018 are very small, a concomitant of their fraction-of-the-year data representation.) Not surprisingly, Prison – the lengths of its terms unmeasured in this field, at least – accounts for the bulk of sentences, and rates of imprisonment verge towards virtual constancy across the represented years.

Probations, however, describe a bumpier curve, peaking in 2014 but exhibiting markedly lower proportions in both 2010 and 2018 (to date). Those wobbles, doubtless both statistically and criminologically significant, call for a look beyond the numbers we have.
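For readers who would rather check the year-by-sentence-type breakout outside Excel, here is a minimal Python sketch of the % of Column Total calculation. The rows are invented placeholders, not figures from the Cook County data, and the function name is my own.

```python
from collections import Counter, defaultdict

# Illustrative rows only: (sentence year, SENTENCE_TYPE). Not real figures.
rows = [
    (2014, "Prison"), (2014, "Prison"), (2014, "Probation"), (2014, "Jail"),
    (2015, "Prison"), (2015, "Prison"), (2015, "Prison"), (2015, "Probation"),
]

def pct_of_column_total(records):
    """Return {year: {sentence_type: percent of that year's sentences}}."""
    counts = defaultdict(Counter)
    for year, sentence_type in records:
        counts[year][sentence_type] += 1
    result = {}
    for year, by_type in counts.items():
        total = sum(by_type.values())  # the column total for this year
        result[year] = {t: round(100 * n / total, 2) for t, n in by_type.items()}
    return result

shares = pct_of_column_total(rows)
print(shares)
```

Each year's percentages sum to 100, which is why the Grand Total for the columns adds nothing to the table.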

I’ve been unable to turn up a denotation for the Conversion sentence type, but I can state that the zeros, first appearing in 2011, are actual nullities, and not an artifact of the two-decimal roundoff. Slightly more bothersome to this layman, though, was the Jail sentence; I didn’t know if that event discretely named a particular mode of punition, or merely, and wrongly, reproduced Prison sentences under a second heading. It turns out that the former conjecture is in point, and that jail time is typically imposed for shorter sentences or individuals awaiting trial (see this clarification, for example. For an explication of the Cook County Boot Camp, see this piece).

The natural – and plausible – follow-on would associate sentence types and their distributions with types of crimes, but the presentational challenge proposed by a keying of the sentences to more than 1500 offense titles very much calls for a considered approach, indeed, one that could perhaps be essayed with a Slicer populated with the offense titles. And while that tack will “work”, be prepared to scroll a long way until the offense about which you want to learn rises into the Slicer window. But such is the nature of the data.

And in view of that profuseness one could, perhaps, engineer a more practicable take on the matter, for example, by inaugurating a new pivot table, filling the Rows area with say, the top 20 offense types (right-click among Rows and click Filter > Top 10, entering 20 in the resulting dialogue box). Sort the types highest to lowest, line the Columns area with the Sentence Types, drop Offense_Type again into Values, and view them via the % of Row Total lens, so that each offense’s sentence types sum to 100%.
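The top-20 filter has a simple analogue outside the spreadsheet. The sketch below uses invented offense titles and a hypothetical helper name. One difference worth knowing: Excel's Top 10 filter keeps every item tied at the cutoff, whereas `most_common` simply stops at n.

```python
from collections import Counter

def top_offenses(offense_titles, n=20):
    """Return the n most frequent offense titles with their counts, highest first."""
    return Counter(offense_titles).most_common(n)

# Placeholder titles, not drawn from the real data set.
sample = ["THEFT", "THEFT", "FORGERY", "THEFT", "FORGERY", "ROBBERY"]
top2 = top_offenses(sample, n=2)
print(top2)
```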

A screen shot here would be unacceptably fractional in view of the table’s width, but try it and you’ll see, for example, that 93.41% of the convictions for unlawful use or possession of a weapon by a felon resulted in a prison sentence, whereas perpetrators guilty of theft – however defined in this judicial context – incurred prison but 36.41% of the time, along with a 45.69% probation rate.

More troubling, however, is the small but measurable number of death sentences that appear to have been imposed on individuals sentenced for crimes not typically deemed capital. For example, .04% of the convictions for the possession of cannabis with intent to deliver/delivery of cannabis have drawn the death penalty, as have .05% of forgery convictions. These legal reprisals don’t ring true, and surely obligate the journalist in a redoubled scrutiny, if only to confirm their accuracy, and/or their propriety.

If you’re looking for other fields to irrigate, sentence length, expressed here as COMMITMENT_TERM, likewise begs for correlation with other fields; but here a new roadblock stalls the journey. Inspect the field and the adjoining COMMITMENT_UNIT column and the obstruction will loom large. Because the units present themselves in a raft of different durations, the correlations can’t proceed until some manner of unit reconciliation is brought to the data.

My early sense about the COMMITMENT_UNIT field told me that the disparities merely juxtaposed years to months; and were that the case, a simple formula could be stamped atop a new field, one in which either years could in effect be multiplied by 12 or months divided by that value, e.g. a six-year sentence could be translated into 72 months.
But in fact, the units are a good deal more numerous and qualitatively varied than that. Turn a filter on the data set and click the arrow for COMMITMENT_UNIT. You’ll see:


While it would be possible to construct an equivalence lookup table for the chronological units enumerated above, e.g., one month rephrased as 720 hours, a sentence delivered in monetary terms – dollars – can’t be subjected to a like treatment. And a sentence of Natural Life – presumably an indefinite, open-ended prison stay – is similarly unavailable for equating. Moreover, I have no idea what the two records declaring sentences of “pounds” – 30 of them for a criminal trespass of a residence, and 2 for driving with a suspended or revoked license, and both pronounced in Cook County District 5 – Bridgeview – can mean. And you may note that 19 sentences comprising 365 days each were issued as well; how these distinguish themselves from one-year terms is unclear to me. Nor do I understand the 1526 sentences consisting of what are described as Term.
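The equivalence lookup idea might look something like this in Python. Only the 720-hours-per-month figure comes from the discussion above; the unit labels and the remaining factors are my assumptions (the data set's actual spellings may differ), and anything non-chronological deliberately comes back as None rather than being forced into hours.

```python
# Hours per unit. "Months" = 720 is from the post; every other label and
# factor is an assumption. A year is taken as 12 such months (so 360 days),
# which keeps a six-year term equal to 72 months.
HOURS_PER_UNIT = {
    "Hours": 1,
    "Days": 24,
    "Weeks": 24 * 7,
    "Months": 720,
    "Years": 12 * 720,
}

def to_hours(term, unit):
    """Convert a commitment term to hours, or None if the unit isn't a duration."""
    factor = HOURS_PER_UNIT.get(unit)
    if factor is None:  # e.g. Dollars, Pounds, Natural Life, Term
        return None
    return float(term) * factor

print(to_hours(6, "Years"), to_hours(72, "Months"), to_hours(30, "Pounds"))
```

The 360-day year is a trade-off: it agrees with the months arithmetic but would make a 365-day sentence slightly longer than a one-year term, which is one more reason to treat the table as a rough reconciliation only.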

On the one hand, of course, the data set can’t be faulted for admitting all these “discrepancies” into its fold; they’re perfectly valid and pertinent records. On the other hand, they cannot, by definition, be forced into comparability with the other entries; they’re oranges to the predominating crop of apples.

The simple way out, of course, would be to sort out and excise the non-chronologicals and proceed, and on a blunt practical level that stratagem might work. But it would work here for the simple empirical reason that those incongruities are few, and as such, would not compromise the greater body of data. But what if these irregulars were formidably populous, and hence unavoidable? What would we do with them?

That is a good and interesting question.


The Verdict on Cook County Court Sentencing Data, Part 1

29 May

You can trace the course of justice in Chicago, including the direction and speed at which it travels, at the fledgling Cook County Government Open Data portal, a site brought to its URL at the behest of Kimberly Foxx, Illinois State’s Attorney for the county in which the city of the big shoulders shrugs. Four of the portal’s holdings –  Initiation, Dispositions, Sentencing, and Intake – chronologize the dispositions of cases processing through the system; I chose Sentencing for my look here.

It’s a big data set for a big city, recalling as it does sentencing records dating back to January 2010 and pulling through December 2017. With 189,000 cases and a field complement stretching to column AJ, don’t even think about calling it up in Google Sheets (the data-supporting capacity there: two million cells), but Excel is agreeable to its 41 megabytes if you are, and it’s available for download from the second link above.

And at 41 megs, the minimalist in you will be avid to put your scissors to fields that might be rightly deemed dispensable. Cases in point: the four ID parameters fronting the data set in columns A through D, none of which are likely to advance your reportorial cause (note, by the way, the interjection of commas into the large-sized identifiers, an unusual formatting fillip). Deleting the fields and their 750,000 or so entries actually slimmed my workbook down to a lithe 29.7 mb, and that’s a good thing.

You may also note the slightly extraneous formatting besetting the INCIDENT_BEGIN_DATE, RECEIVED_DATE, and ARRAIGNMENT_DATE fields, their cells bearing time stamps all reading 0:00. I suspect these superfluities owe their unwanted appearances to the data in the ARREST_DATE field, which do exhibit meaningful times of suspect apprehension. We’ve seen this kind of excess before, but again it’s proper to wonder if any of it matters. If, after all, it’s your intention to re-present the data in pivot table form, for example, you’ll attend to any formatting disconnects there, and not here. If so, a reformatting of the data source may be no less superfluous.

But whatever you decide we can proceed to some analysis, acknowledging at the same time the scatter of blank cells dotting the records. Given the welter of substantive fields in there, quite a few possibilities beckon, and we could start by breaking out types of offenses by year, once you answer the prior question submitting itself, i.e. which of the available date parameters would be properly deployed here? I’d opt for ARREST_DATE, as it affords a kind of read on Chicago’s crime rate at the point of commission – or at least the rate of crimes culminating in arrest, surely a different and smaller-sized metric.

But if you’re thinking about installing the arrest dates into the column area, think twice – because the dates accompanied by their time-stamps are sufficiently granulated that they surpass Excel’s 16,384-column frontier. You’ll thus first have to swing these data into the Rows area, group them by Year, and only then can you back them into Columns, if that’s where you want them stationed.

And that’s what I did, only to be met up with a surprise. First, remember that Excel 2016 automatically decides upon a (collapsible) default date grouping by year, like it or not; and when I corralled the arrest dates into Rows I saw, in excerpt:


Now that ladder of years seems to be fitted with a column of rickety rungs. Remember that the sentence data appear to span the years 2010-2017, and so the aggregates above hint at data entry typos, and at least some of them – e.g. the 1900 and 1915 citations – doubtless are.

The additional point, however, is that some of these putative discrepancies might tie themselves to crimes that were in fact brought to the attention of the justice system well in the past, and that took an extended while before they were actually adjudicated. Remember that our data set archives sentences, and some criminal dispositions take quite some time before a sentence is definitively pronounced.

For example, the 12 sentences associated with arrests made in 1991 reference serious crimes – seven murder or homicide charges, one armed robbery, one unlawful use of a weapon charge, one robbery and two thefts. One of the thefts, however, records an incident-began date (a separate field) of November 17, 2013, and thus appears to be erroneous.

But in any event, since our immediate concern is with arrests carried out in the 2010-17 interval I could click anywhere among the dates and proceed to group the data this way:


Note that I’ve modified the Starting at date to exclude the pre-2010 arrests, be they errantly captured or otherwise. Now after I click OK I can drag the years into the Columns area, after filtering out the residual <1/1/2010 or (blank) item.
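The same group-by-year-and-exclude step can be sketched in Python. The date format string is an assumption about how ARREST_DATE reads once exported, and the sample values are made up.

```python
from collections import Counter
from datetime import datetime

def arrests_per_year(arrest_dates, first_year=2010, fmt="%m/%d/%Y %H:%M"):
    """Count arrests by year, skipping blanks and anything before first_year."""
    counts = Counter()
    for raw in arrest_dates:
        if not raw:  # blank cell
            continue
        year = datetime.strptime(raw, fmt).year
        if year >= first_year:
            counts[year] += 1
    return dict(sorted(counts.items()))

# Invented examples, including a stray early date and two blanks.
sample_dates = ["1/5/1991 9:30", "", "3/14/2012 22:10",
                "7/1/2012 0:00", "11/17/2013 13:45", None]
by_year = arrests_per_year(sample_dates)
print(by_year)
```

Dropping the early dates here is the same blunt choice made in the grouping dialog; it says nothing about whether a given pre-2010 arrest is a typo or a genuinely old case.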

Now I can drag OFFENSE_TITLE into Rows.

Surprise. With 1268 Offense categories cascading down the area you’ll have your work cut out for you, once you decide what to do next. Do you want to work with the data as they stand, or collapse near-identical types, and vet for misspellings along the way? Good questions – but in the interests of exposition we’ll leave them be.

How about something more immediately workable then, say age at incident? Exchange AGE_AT_INCIDENT for OFFENSE_TITLE, filter out the 2300 blanks, and group the ages by say, 3 years. Reprise AGE_AT_INCIDENT into Values (count). I get:


We see an extremely orderly negative association between age and arrests, with only the 20-22 tranche exceeding its predecessor bracket among the grand totals and only slightly. You’ll also observe that the numbers for 2017 are far smaller than the previous years, a likely function of incomplete data. In addition, track down to the Grand Totals row and behold the very significant ebbing of overall arrest totals from 2013 to 2016. Again, our table records arrest, and not crime totals, but the two likely point the same way – unless one wants to contend that the downturn in the former owes more to policing inefficiencies than any genuine diminution in crime – a not overwhelmingly probable development.
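Grouping ages into three-year brackets is easy to mimic in code. In this sketch the starting age of 17 is an assumption chosen so the labels match the 17-19 and 20-22 brackets mentioned in the post; ages are assumed to be whole numbers, and the sample list is invented.

```python
from collections import Counter

def bracket(age, start=17, width=3):
    """Label the bracket an age falls into, e.g. 18 -> '17-19'."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

def count_by_bracket(ages, start=17, width=3):
    """Count ages per bracket, ignoring blanks (None)."""
    return Counter(bracket(a, start, width) for a in ages if a is not None)

bracket_counts = count_by_bracket([17, 18, 19, 20, 22, None, 45])
print(bracket_counts)
```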

I’d then move to a Show Values As > % of Column Total look to learn how the brackets contribute differentially to arrests:


(The zeroes at the lowest reaches of the table reflect round-offs.)

Among other things, note the considerable, relative pull-back in arrests of suspects in the 17-19 range.

No, I don’t have an explanation at the ready for that, but perhaps you do.

Airbnb Data, Part 2: A Tale of Three Cities

14 May

There’s such a thing as overstaying your welcome, but Airbnb landlords – and in effect that’s what they sometimes appear to be – may be prepared to happily abide your long-term self, or selves, in their abode.

And that heartening show of hospitality may be illegal. They take a dim view of Airbnb’s good neighbor policy in London, for but one example, where the powers-that-be aren’t thrilled about the kind of lucrative serial subletting some Airbnbers perpetrate, straitening as it does the market for folks who’d prefer to hang tight in an actual, lease-driven apartment.

The long and the short of it then, is that a review of Airbnb property availabilities – defined as the number of days in a year in which a given room remains on offer – could prove instructive, and our data for New York, London, and Paris devotes a field to just that question.

The analysis, then, should proceed pretty straightforwardly, once we do something about the startlingly sizeable count of rooms – about 55,000 – that presently aren’t to be had. That is, their value in their dataset’s availability_365 field states 0, indicating that, for now at least, that room has been withheld from the market. An email from Inside Airbnb compiler Murray Cox informed me that zero means the property’s next 365 days in its calendar (presumably its start date is a moveable inception point, which Cox finds in the scrape_date field in a different dataset) aren’t in play, at least temporarily.
And as such, those zeros – which are, after all, values that would contribute to and very much confound any formula result – have to be barred from the data set. Here I rely on my venerable highest-to-lowest sort of the availability_365 field, relegating the zeros to the bottom of the set; and once put in their place, an interpolated blank row immediately above the first zero will detach them from the usable data, for now (of course they can be recalled if needed via a simple deletion of the blank row).
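In code, the blank-row trick amounts to a filter. Here is a hedged Python sketch that averages availability by city while leaving the zeros out; the tuples are placeholders rather than real listings.

```python
from collections import defaultdict

def mean_availability(listings):
    """listings: iterable of (city, availability_365). Zeros are excluded."""
    totals = defaultdict(lambda: [0, 0])  # city -> [sum of days, count]
    for city, days in listings:
        if days == 0:  # withheld from the market; would drag the average down
            continue
        totals[city][0] += days
        totals[city][1] += 1
    return {city: round(s / n, 2) for city, (s, n) in totals.items()}

# Invented sample rows.
sample_listings = [("London", 0), ("London", 100), ("London", 300),
                   ("Paris", 0), ("Paris", 90)]
averages = mean_availability(sample_listings)
print(averages)
```

Had the zeros stayed in, London's sample average would fall from 200 to about 133, which is the confounding the sort-and-separate step is meant to prevent.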

And all that enables us to proceed here:

Rows: City

Values: availability_365 (Average, formatted to two decimals)

I get:


Real city variation is in force; and recall the linked article above, the one reporting the December 2016 London-approved bill “limiting Airbnb hosts to renting their property for only 90 days”. Looks as if a few thousand room owners in that city haven’t read the news lately.

We could next cross-tab the results by room type, by rolling room_type into Columns:


All the cities trend in the same direction, though not identically – itself a differentiation worth pursuing, perhaps. Availability widens as the rental space constricts, with shared rooms – defined by Airbnb as those in which “Guests sleep in a bedroom or a common area that could be shared with others”, presumably humans, who presumably might or might not be actually residents of the property – freed up for a considerably larger fraction of the year.

And the results make sense – even common sense, perhaps. Entire homes and apartments need be empty, by definition, and if so, where would their owners be expected to go for the duration of the rental?

That’s a good question, one that directs itself to one of the flashpoints of the Airbnb controversy. Are its hosts the kinds of proprietors who hoard multiple listings that might otherwise be released to the conventional rental/purchase market?

A few sure-footed steps toward an answer would require us to divide all the rentals in a city by its number of hosts, i.e., an average of properties per host; and that simple division exercise needs to fill its denominator with a unique count of hosts, thus returning us to a problem with which we’ve tangled before. To reiterate it: an owner of multiple properties will naturally appear that many times in the data set, that is, once each for each holding, even as we want him/her here to appear once. In light of that complication I think the neatest way out this time is to conduct a Remove Duplicates maneuver (Data ribbon > Data Tools), ticking the host_id field, the parameter whose entries contain the duplicates we want to shake out (again, you may want to save these results to a new workbook precisely because you’re shrinking the original data set).
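The division described here, listings over unique hosts, can be sketched without touching the original rows at all, since a set does the deduplicating. The host ids below are invented and the function name is hypothetical.

```python
def listings_per_host(listings):
    """listings: iterable of (city, host_id). Returns {city: listings / unique hosts}."""
    counts = {}
    hosts = {}
    for city, host_id in listings:
        counts[city] = counts.get(city, 0) + 1       # numerator: every listing
        hosts.setdefault(city, set()).add(host_id)   # denominator: each host once
    return {city: round(counts[city] / len(hosts[city]), 2) for city in counts}

# Invented rows: host 1 owns two London listings.
sample_hosts = [("London", 1), ("London", 1), ("London", 2), ("Paris", 7)]
per_host = listings_per_host(sample_hosts)
print(per_host)
```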

But how do the host ids, once in their respective, solitudinous states facilitate a calculation of the number of properties they own, on average? Here’s how: once we’ve identified each host id singly, we can average the calculated_host_listings_count in column P via a city pivot table breakout. That field, which restates the number of each host’s holdings in each of his/her record entries, is one I would have deemed redundant to the data set’s design. After all, the owner-property count could otherwise be derived when needed, for example, via a pivot tabling of the host ids, delivering the field to both the Rows and Values areas. But because we’ve removed all host id duplicates, that plotline has to be red-penciled – and that’s where the calculated_host_listings_count comes to salvage the script:

Rows: City

Values: calculated_host_listings_count (Average, to two decimals)

I get:


We see then, that Airbnb hosts are for the most part single-property marketers, at least for the cities we’ve gathered. For those interested in more detail, we could try this:

Row Labels: calculated_host_listings_count

Columns: City

Values: calculated_host_listings_count (Count, % of Column Total, formatted in percentage terms to two decimals)

I get, in excerpt:


You get the idea, though we see London owners are notably more likely to offer multiple properties.

Speaking of which, scroll to the bottom of the table above and you’ll find a 711, signifying the voluminous, apparent holdings of a fellow named Tom in London. But when I returned to our original, entire Airbnb dataset, including the rooms for which availability was set at 0 days, I discovered but 350 properties strewn about London associated with his name.

Now Tom’s the kind of person Murray Cox wants us to know about; he owns so many properties that he’s lost track of half of them.

Airbnb Data, Part 1: A Tale of Three Cities

27 Apr

Would you rent your apartment to me? I have references from real people, don’t smoke, clean up after myself (is my nose growing?), and probably can be counted on not to trash your living room and adjoining vicinities.

Still don’t want to take my scratch for your pad? See if I care; there are plenty of other flats out there where yours came from.

Too many, in fact, according to Murray Cox, the self-identified “data activist” whose researches into Airbnb’s rental listings opened the door on a warehouse of dodgy practices, in numerous localities, e.g. property owners who market multiple dwellings, a clear and present violation of New York state law. Cox maintains that, among other things, the outsized scale of Airbnb offerings can worrisomely constrict a city’s available rental stock, and has published a city-by-city inventory (brought to my attention by a student) of Airbnb listings that you and I can download in most convenient spreadsheet form (look for the Summary Information and metrics for listings in… link attaching to each city).

It occurred to me that, among other potential takeaways, an intercity comparison of Airbnb activity might advance the journalistic cause a mite. I thus downloaded the data for New York, London, and Paris, all nicely exhibiting the same fields. With the intention of copying and pasting the data to a single workbook I cleared out a new column to the left of A, called it City, and entered and copied down the respective names of the three locations, properly lined up with their data once pasted, culminating in 162,701 rows of data, its 20 megabytes just itching to tell me what Airbnb has been up to.

Of course, the three-city amalgamation means to prime the data for a range of comparisons, but some comparisons don’t avail. I’m thinking in the first instance about the price field in what is now column K. These entries presumably cite daily rental rates, but express themselves in disparate currencies – dollars, pounds, and euros. One supposes an exceedingly determined investigator could mobilize and apply a round of currency equivalences to the records, a prospect that would require a vast compiling of date-specific rate fixes – in short, a challenge likely to daunt a real-world, deadline-mindful journo. I’ve thus elected to leave the numbers as they stand, and if that touch of laissez-faire works against the analysis I have no one to blame but myself. The buck stops here – and maybe the euro, too.

In any case, before we get fancy, we can think about this self-effacing pivot table:

Rows: City

Values: City (Count, by definition for a textual field)

I get:


We see that Paris – by far the smallest of the three cities – nevertheless advertises the largest number of Airbnbs. An accounting for that disjuncture would probably justify a deeper look. Might tourist cachet or friendlier legalities spur the Paris margin? I don’t know. But consider that, juxtaposed to Paris’ population of around 2.25 million and its average household size of approximately 2.3 persons, the city’s Airbnb stock could house around 6% of its residents – with the point, of course, that the inventory is apparently being withheld from the permanent-residence rental market.

Other incomparables have their place among the data, too. There’s little comparing to do as such among the three cities’ neighborhoods, and indeed – the neighbourhood group (UK spelling) field for Paris and London is utterly empty (the field for New York comprises the city’s five boroughs).

But of course other workable comparisons are available. What, for example, about average minimum stay requirements by city and type of rental? We could try this:

Rows: City

Columns: room_type

Values: minimum_nights (Average, formatted to two decimals)

I get:


We see that diffident London Airbnbers expect notably briefer stays at their places on average, with those uppity Parisians insisting that you agree to set down your knapsack – and debit card – more than a day-and-a-half longer before they let you in. At the same time, New York’s shared-room minimum is disruptively anomalous.

And for more evidence of cross-cultural heterogeneity – if that’s what it is – flip the values into Count mode and hand them over to the Show Values As > % of Row Total, ratcheting the decimals down to zero and switching the Grand Totals off (because the rows must invariably figure to 100%). I get:


The overwhelming Paris proportion devoted to the Entire home/apt offering is not, I would submit, proof positive of yet one more Gallic quirk, but rather a simple function of the Paris housing stock, in which apartments predominate.
For additional, if glancing, corroboration, try this pivot table:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (tick New York)

Values: neighbourhood_group (Count, % of Row Total)

I get:


Recall that New York is the only city among our trio whose neighborhood group field is actually occupied with data – the names of its five boroughs. Note the relative Manhattan tilt towards Entire home/apt, even as the other boroughs, whose housing range features far more private homes, incline towards Private room – that is, presumably one private room among the several bedrooms in a home.

And what of daily price by city, keyed as it doubtless is to room type? It looks something like this:

Rows: City

Columns: room_type

Values: price (Average, to two decimals)

I get:


Again, that imperative qualification – namely, that the prices reflect evaluations per their indigenous currencies – need be kept in mind. As such, the New York tariffs verge much closer to the London figures when the appropriate, albeit variable, pound-to-dollar conversion is applied. With those understandings in place, the Paris Entire home/apt average seems strikingly low – because the Euro consistently exhibits a “weaker” relation to the pound, the former as of today equaling .88 of the latter. Yet at the same time, Paris’ private room charge would appear to be effectively higher.

Now again, because the data are there, we could compare average prices for New York’s boroughs:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (New York)

Values: price (Average)

I get:


No astonishments there, not if you’re a New Yorker. Manhattan expectably heads the rate table, though Staten Island’s second-place Entire home/apt standing may issue a momentary pause-giver, along with its basement-dwelling (pun intended) shared room rate.

That’s $7,300 a month for an entire place in Manhattan. And wait until you see the interest on the mortgage.

NY Regents Exam Data, Part 2: Multiple Choices

12 Apr

Numbered among the additional conclusions we can draw from the New York Regents data is a natural next question from the aggregate test averages we reckoned last week, answered by a kind of reciprocal finding: namely, the aggregate fail rates. Guided by the concerns about weighting we sounded in the previous post, I’d allow that a calculated field need be applied to the task here too, a field I’ve called PctFail.

But before we proceed we again need to contend with the not insignificant number of records that, for apparent reasons of confidentiality, won’t count their fewer-than-five students, replacing the totals with an “s”. Thus I tapped into column N, called it NFail, and entered in N2:


The formula assays the relevant cell in J for an “s”; if it’s there, a 0 is supplied. Otherwise the value in K – the Number Scoring Below 65 – is returned.
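The logic of that formula, as just described, translates to a few lines of Python. This is a sketch of the description rather than the worksheet formula itself; the parameter names are mine, and cell values are assumed to arrive as text.

```python
def n_fail(col_j_cell, col_k_cell):
    """Return 0 when column J holds the suppression marker 's',
    otherwise the Number Scoring Below 65 from column K."""
    if col_j_cell == "s":
        return 0
    return int(col_k_cell)

print(n_fail("s", "s"), n_fail("71.2", "14"))
```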

Again, we’ll copy that formula down N and proceed to calculate the PctFail field:


Once effected, this simple pivot table, abetted by the All Students Slicer selection we ticked last post, opens the story:

Rows:  Year

Values: PctFail (formatted here in Percentage mode to three decimals)

I get:


The failure rates are substantial, a near-ineluctable follow-on from the overall score averages settling in the 68 range (remember that 65 passes a Regents exam).
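The weighting concern is worth seeing in miniature. The sketch below assumes the calculated field divides summed failures by summed test-takers, which is how pivot table calculated fields behave, though the exact PctFail formula isn't shown here. It contrasts that pooled rate with a naive average of per-record rates, using invented numbers.

```python
def pooled_fail_rate(records):
    """records: list of (number_tested, number_failed). Ratio of the sums."""
    tested = sum(t for t, _ in records)
    failed = sum(f for _, f in records)
    return failed / tested if tested else None  # None stands in for #DIV/0!

def mean_of_rates(records):
    """Unweighted average of each record's own fail rate."""
    rates = [f / t for t, f in records if t]
    return sum(rates) / len(rates) if rates else None

# Invented records: one large group with few failures, one tiny group with many.
records = [(200, 20), (10, 5)]
pooled = pooled_fail_rate(records)
naive = mean_of_rates(records)
print(round(pooled, 4), round(naive, 4))
```

The tiny group pulls the naive average up to 30% even though only about 12% of all students failed, which is the distortion the calculated field avoids.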

But you’ll want to know about failures by Regents subjects, too. Sight unseen, you’d expect appreciable variation among the test areas, and that drill down can tool its way into the data via the Regents Exam field, e.g. something like this:

Rows: Regents Exam

Columns: Year

Values: PctFail (formatted similarly to the table above)

I get:


And variation there is, some more provocative than others. You’ll note the massive leap in failure rates for English and Geometry from 2015 to 2016, a determined, ascending slope of failures for Algebra2/Trigonometry, and a restitutive, noteworthy shrinkage in failures for Common Core Algebra. (The Common Core tests are controversial, in part because of their redesign; see this report, for example).

You’ll also want to do something about those error messages. In some cases, the #DIV/0! outcomes simply key the absence of data for the exam, owing to an exam’s discontinuation or later introduction, while the (blank) label appears in virtue of the ten rows that bear no exam name. Should you want to pave over the errors, and you probably do, click anywhere in the pivot table and proceed to PivotTable Tools > Analyze > Options > Options > Layout & Format tab > and tick For error values show:. Enter some appropriate stand-in for #DIV/0!, e.g. — , and click OK. Because the dashes in the 2015 column push far left and look almost as unseemly as the original error message, you may want to select all the Values and align them right. (You could also filter out the blanks.)

Now if you want to crunch failure rates by ethnicity, for example, you’ll again have to reconcile the double-counting character of the fields we described last post. The ethnicities – Asian, Black, Hispanic, Multiple Race Categories Not Represented, and White – have been quartered in the Demographic Variable field, but so have a potpourri of other, disconnected items bound to other Variables, e.g. Female to Gender, English Proficient to ELL Status.

We’ve stubbed our toe against this odd problem in the previous post, in which Excel’s million-record limit has forced records otherwise deserving of their own field into one, messy cosmopolitan column – the one called Demographic Category, itself dispersed into no-less-heterogeneous items in Demographic Variable. It’s a confusing issue, but I think we need to tick Ethnicity in the Slicer now and slide Demographic Category –confined by the Slicer to its ethnic item entries – into Rows. Sweep Year into Columns and you get:


The disparities here are dramatic, and rather self-explanatory – the results, that is, not the accountings of them.

Next opt for Gender in the Slicer:


Women outdo men, a finding that more-or-less jibes with current understandings of gender performance differentials. The female margin, nearly equivalent across 2015 and 2016, pulls away slightly in the following year.

And what of gender outcomes by exam? Slip Regents Exam atop Demographic Category (which has been sliced to Gender) in Rows, and (in excerpt):


And (in second, remaining excerpt):


You’re looking for palpable divergences, of course, but palpable congruences mean something here, too.  The decisive female advantages in the Common Core English scores are perhaps notable but not arresting; but their edge in Common Core Algebra does a fair share of stereotype busting, even as males emerge the stronger in Common Core Algebra2. (Important note, the Grand Total pass rates vary by Demographic Category even as the Total Tested remains neatly constant across all Demographic Variables. That’s because the distribution of “s” entries across the Categories isn’t constant.)

There are plenty of other permutations in there, but let’s try one more. Column Q quantifies the number of students in the record whose score achieves what the Regents calls College Readiness (CR), i.e., a 75 in the English Language Regents or an 80 on any Math exam in the system.

And here’s where I have to own up to a bit of spreadsheeting excess. In the previous post I implemented what I termed an NFail field, embodied by its foundational formula – an expression that replaced “s” entries with a 0, the better to factor these into a calculated field. I now understand that those exertions were unnecessary, because Excel will completely ignore an “s” or any other label in any case. Thus here (and last week, too) we can work directly with the Number Scoring CR field in Q. But because we do need to acknowledge the large number of “s” and “NA” entries in Q (NA, because only some Regents qualify as CR exams) that will impact any denominator we also need what I call here a CRCounted field to be put in place in the next available column, punctuated by this formula that gets copied down:


We then need to compose that calculated field, which I’m calling CRPass:


Remember that here, and for the first time, we’re computing pass rates. This pivot table, among others, awaits, under the aegis of the Demographic Category Slicer – Gender:

Rows: Demographic Variable

Columns: Year

Value: CRPass

I get:


Provided I’ve done my homework correctly the results point to a striking spike in CR attainments, a set of findings that calls for some journalistic deep-backgrounding. (Note that the absolute CR numbers are far smaller than the global Total Tested figures, because as indicated above only certain exams march under the CR banner.) We see a small-scaled but real stretching of the female advantage in the pass rates between 2015 and 2017, one that also needs to be sniffed by some nose for news.
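For anyone who’d like to see the CRCounted and CRPass logic spelled out away from the worksheet, here’s a minimal Python sketch. The records and their numbers are invented for illustration, and the dictionary keys are just my shorthand for the Total Tested and Number Scoring CR columns. It assumes, per the description above, that CRCounted returns zero for any “s” or “NA” entry and the record’s Total Tested otherwise, and that CRPass divides the summed CR scorers by the summed CRCounted.

```python
# Hypothetical records; the numbers are made up for illustration only.
records = [
    {"total_tested": 120, "number_scoring_cr": 54},
    {"total_tested": 4, "number_scoring_cr": "s"},    # suppressed result
    {"total_tested": 80, "number_scoring_cr": "NA"},  # not a CR exam
    {"total_tested": 200, "number_scoring_cr": 110},
]


def cr_counted(record):
    """Test takers who belong in the CR denominator (0 for 's'/'NA' rows)."""
    if isinstance(record["number_scoring_cr"], str):
        return 0
    return record["total_tested"]


def cr_pass(rows):
    """Summed CR scorers divided by summed counted test takers."""
    scorers = sum(
        r["number_scoring_cr"]
        for r in rows
        if not isinstance(r["number_scoring_cr"], str)  # labels are skipped
    )
    counted = sum(cr_counted(r) for r in rows)
    return scorers / counted if counted else None


print(cr_pass(records))
```

The point of the denominator helper is that the 84 test takers attached to the “s” and “NA” rows never enter the rate; only the 320 students with a reported CR count do.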

Now let me take a break while I double-check my homework. I hear this blogster is one nasty grader.

NY Regents Exam Data, Part 1: Multiple Choices

28 Mar

We haven’t met, but I can say with a jot of confidence that I’ve likely done something you haven’t – taken a Regents exam. Exams.

I’m not clipping the achievement to my lapel as a badge of honor, you understand, just stating a biographical matter of fact. The Regents – a staple of the New York State educational system in which I spent more than a little time – comprise a series of what are termed exit exams; pass enough of them and you walk away with a high school diploma of the same name. Your correspondent took his share of Regents, his scores embodying the “scatter” in scattergram, but no matter; I took them, and I’m a better person for having done so.

But before you file a Freedom of Information Act request to verify those abnormally-curved results you may want to review a larger, ultimately more interesting record of Regents attainments, the dataset supplied by the New York City open data site that summarizes in grand form the Regents scores of students statewide for the years 2014-17. It’s a big file, needless to say – so big you’ll need to download it yourself – and its 212,000 or so records have a lot to say about the testing profile of New York’s high schoolers.

It also has a lot to say about spreadsheet organization, more particularly the juxtaposition of column G, Demographic Category, to H, Demographic Variable. Those columns/fields in fact identify a series of putative fields and field items respectively; and as such, G’s contents could, at least in theory, have been more conventionally structured into discrete parameters, each owning a column all its own.

But that prescription calls for an elaboration. Consider this fledgling pivot table drawn from the Regents data:

Rows: Demographic Category

Demographic Variable

Values: Total Tested

I get:


Note that all the Total Tested subtotals are equivalent, intimating that the five Demographic Categories cleave the same population into an assortment of cross-cutting attributes, and thus sum the same student count five times. The by-the-book alternative, again, would have assigned each Category to an independent field, such that an interest in test achievements by Gender, for example, would require the user to simply drag Gender into Rows. As it stands, however, a Slicer (or filter) would have to grease the process, e.g.:


Moreover, casting the potential field-bearers into mere item status beneath that singular Demographic Category banner appears to obviate a good many cross-tabulating possibilities, e.g.: a breakout of tests by both Ethnicity and Gender. How, for example, are we to learn how black female students score on the Regents when both attributes are lodged in the same field, and so must occupy the same label area?

But at the same time, the spreadsheet designers had to contend with a supplementary problem that overrides my challenges – namely, that the upgrading of say, Ethnicity and Gender to field standing would appear to require that data present themselves in individual record form, e.g. each student’s performance on each test; and those 2,000,000 scores/records would burgeon beyond Excel’s data-accommodation space.

In any case, there is indeed lots to learn, structural complications notwithstanding, and we could begin by starting coarsely – by calculating the average overall Regents scores by year:

Row: Year

Values: Mean Score (Average, formatted to two decimal points)

I get:


(Note that one record, attaching to the Island School in Manhattan, exhibits a nonsensical entry for its year. By filtering and comparing the Island School data, it appears that the record belongs to 2017.)

The averages are remarkably similar, though I’d venture that, given the 2,000,000-pupil universe, the one-point differential distancing the 2015 and 2017 scores is significant. Remember that the Regents passing score is pegged at 65, suggesting that the test designers got their threshold right.

But those averages aren’t quite definitive, and for a couple of connected reasons, one subtler than the other. The first recognizes that the student double-count pinpointed above proceeds in effect to compute the average scores multiple times, because the records operate under the steam of different demographic categories and numbers per record. Thus the mean average for the Gender category alone – which nevertheless contains all students – is likely to depart at least slightly from the mean average for Ethnicity, which likewise contains all students. If, for example, we reintroduce the Slicer for Demographic Category for the current pivot table and tick All Students, we’ll get:


The differences from the initial pivot table are very small but evident, again because the Total Tested numbers per the All Students records don’t perfectly line up with the Total Tested per-record numbers for Ethnicity, for example – because each record receives an equal weight, irrespective of its Total Tested value.

And it is the matter of weighting that points its arrow directly at the second question, one we’ve seen elsewhere (here, for example). The per-record mean averages ascribe an identical mean score input to each record, even as the test taker numbers vary. And that bit of record democratization vests greater, relative influence to the smaller numbers. The result, again: a possible skewing of the averages.
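A toy example, with two invented records, shows how far the two kinds of average can drift apart. The pairs below are (mean score, test takers); neither comes from the Regents data.

```python
# Hypothetical records: (mean_score, total_tested).
records = [(60.0, 10), (80.0, 190)]

# Per-record average: every record counts once, whatever its size.
unweighted = sum(score for score, _ in records) / len(records)

# Weighted average: every student counts once.
weighted = sum(score * n for score, n in records) / sum(n for _, n in records)

print(unweighted, weighted)  # 70.0 79.0
```

The ten-student record pulls the per-record average down to 70, though the 200 students actually averaged 79.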

But because Column I enlightens us about the actual test taker numbers, we should be able to derive a simple calculated field to impart a corrective to the weighting problem – once we deal with the very large number of records – about 75,000 – that just don’t report their score results. These are the entries dotted with the “s” code, and our formula needs to ignore them as well as their test taker numbers, which are stated, after all.

So here’s what I did. I headed next-available column S TotalPts and entered, in S2:


That expression means to assign 0 for any “s” datum, and otherwise multiply the record’s mean score by its number of students. (Important note: absent Mean Scores almost always associate themselves either with a Total Tested number of five or less, or with entries possessing the SWD or Non-SWD values. (SWD stands for Students With Disabilities.) One assumes that both types of exclusions are justified by reasons of confidentiality; but remember that the five-way count of students in the data set should subsume most of the SWD takers anyway, via their inclusion in the All Students item.)

After copying the formula down I titled the T column CountedStu and dropped down a row, wherein I entered:


The formula asks if the relevant row in J contains an s. If so, a zero is returned; otherwise, the number of test takers cited in I is returned.

I next devised a simple calculated field, ActAvg (for actual):


That field can now be made to work with any other field whose items are banked in Row Labels, e.g. by substituting ActAvg for Mean Score in Values (and leaving the All Students Slicer selection in place). I get:


It’s clear that our weighting refinements have uncovered a “true” higher set of averages, ones that continue to sustain the near-point improvement from 2015 to 2017.
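The TotalPts, CountedStu and ActAvg steps can be rehearsed outside Excel, too. What follows is a minimal Python sketch under stated assumptions: the records and their numbers are invented, the key names are my own stand-ins for the Year, Total Tested (column I) and Mean Score (column J) fields, and ActAvg is taken to be summed TotalPts divided by summed CountedStu, as the descriptions above suggest.

```python
# Hypothetical records; "s" marks a suppressed mean score.
records = [
    {"year": 2015, "total_tested": 100, "mean_score": 70.0},
    {"year": 2015, "total_tested": 3, "mean_score": "s"},
    {"year": 2015, "total_tested": 300, "mean_score": 74.0},
    {"year": 2017, "total_tested": 200, "mean_score": 75.0},
    {"year": 2017, "total_tested": 5, "mean_score": "s"},
]


def total_pts(r):
    """0 for an 's' record, otherwise mean score times test takers."""
    return 0 if r["mean_score"] == "s" else r["mean_score"] * r["total_tested"]


def counted_stu(r):
    """0 for an 's' record, otherwise the record's test takers."""
    return 0 if r["mean_score"] == "s" else r["total_tested"]


def act_avg(rows):
    """Student-weighted average over the records that report a score."""
    counted = sum(counted_stu(r) for r in rows)
    return sum(total_pts(r) for r in rows) / counted if counted else None


# Group by year, as the pivot table's Row area would.
by_year = {}
for r in records:
    by_year.setdefault(r["year"], []).append(r)

act = {year: act_avg(rows) for year, rows in by_year.items()}
print(act)  # {2015: 73.0, 2017: 75.0}
```

Note that the “s” records drop out of both the numerator and the denominator; leaving their test takers in the denominator would drag the averages down.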

But again, there’s quite a bit more to learn from the scores. But I’m asking now to recess for lunch – and yes, I raised my hand.

U. S. Vaccine Data: A Different Treatment

14 Mar

It’s alarmingly philistine of me to say so, but I know what I like; and I like the heat maps that track the vaccine-driven advances in the United States wrought against infectious diseases, schematized by graphics boffins Tynan DeBold and Dov Friedman and featured in the Wall Street Journal in 2015. One of them looks like this:


These are of course heat maps in the broadened sense, plotting epidemiological movement across time instead of territory, and tiling their data into tightly-bound mosaics, whose colors blanch as the diseases recede. In short, a well-told story of stirring medical progress, the data for which is contributed by Project Tycho of the University of Pittsburgh and available for download there (you can sign into it for free if you’re affiliated with an educational institution). By contrast, think what a line chart bearing 50 data series would look like.

It should be added that the Tycho site itself hardwires a heat-mapping utility into its pages, in what could be construed as an ancestral precursor of the DeBold-Friedman outputs, e.g.


It is worth asking, then, about the refinements DeBold and Friedman commended to the Tycho charts, and the respective reasons why.

But in any event the map designers don’t need my everyman’s encomium; their depictions have won at least two awards – one conferred by the Global Editors Network (GEN), the other by the Kantar Information is Beautiful judges. The former’s site declares the maps were “…wildly popular on social media as well as with statisticians and graphics editors weighing in on how they would’ve approached this project”.

Sounds good to me, but note for the record that the Y axis above records 26 state names, pulling up 24 (really 25; the District of Columbia – the nation’s capital that possesses extra-state status – is likewise counted) short of the full American complement. Those names – at least as they’re represented above – have been invested with a font size that misaligns them with the 51 rows of data; the names are simply too large for the row heights that capture the data, a mismatch that conduces toward the follow-on question as to why the particular states above were earmarked for display, to the exclusion of the others.

You’ll also observe that the maps’ cell widths vary. Contrast the DeBold-Friedman above with this one:


Because the above map – and its 26 states – time-stamps its data from a later point of inception its cells are dilated, a refit that widens it equivalently to the other maps. Do these discrepancies matter (along with the ten-year intervals marked in the first map that have been halved in the second)? I’m not sure.

And something else about the maps provoked a thought or two. On his website Dov Friedman tells us that he used Excel “to aggregate over 100,000 data points. The data was then plotted on heatmaps using Highcharts. All sections were templatized with handlebars [sic].”

Now Highcharts and Handlebars are two applications with which I am not familiar (I told you I was a philistine); but in the course of perusing and admiring the maps a renegade idea gatecrashed my cerebellum: could the maps be emulated – at least more or less – with Excel alone?

I think the answer is yes, at least more or less; it seems to me that, with the disease data dropped into the Values area, the heat maps could be made to emerge from a pivot table that would crosstab state names and years against the data, whose numbers could be comparatively scaled through a series of conditional formats.

In receipt of that self-issued marching order, I proceeded to download the measles data for five states to see how the heat maps might be framed in Excel-only mode. The numbers were redirected into a pivot table (not a straightforward program; the download assigned each state’s figures to an independent field, and as such had to be harnessed inside a single State parameter via the Get & Transform utility. In addition the “-“ cell entries marking absent data had to be quantified, but I’ll spare you the particulars).
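The reshaping itself is easy enough to sketch in Python, if not the Get & Transform clicks. The layout below is a hypothetical stand-in for the download (one row per year, one column per state, figures invented), and turning the “-” entries into 0 is purely my assumption for illustration; the field names State, Year and Incidence are the ones used for the pivot table.

```python
# Hypothetical wide layout: one row per year, one column per state.
wide = [
    {"Year": 1960, "ALABAMA": "12.5", "ALASKA": "-"},
    {"Year": 1961, "ALABAMA": "8.0", "ALASKA": "3.2"},
]


def unpivot(rows, id_field="Year", missing="-", fill=0.0):
    """Turn one-column-per-state rows into State/Year/Incidence records.

    Replacing the missing marker with `fill` is an assumption made for
    this sketch, not a documented rule.
    """
    long_rows = []
    for row in rows:
        for field, value in row.items():
            if field == id_field:
                continue
            incidence = fill if value == missing else float(value)
            long_rows.append(
                {"State": field, "Year": row[id_field], "Incidence": incidence}
            )
    return long_rows


long_rows = unpivot(wide)
for record in long_rows:
    print(record)
```

Two years by two states yields four records, each of which can now sit under a single State field.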

Once the above reconstructive work was carried out the resulting pivot table proceeded pretty straightforwardly, something like (depending on how you’d name the fields):

Rows: States

Columns: Year

Values: Incidence

And once the table was put in place, the cells would be subjected to the color-graded conditional formats described above (in tandem with white borders), and while I didn’t do a primo job of replicating the DeBold-Friedman scheme it’s all about the concept, after all.
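In Python terms the concept comes down to a crosstab plus a lookup of each value against a set of color bands. The sketch below is hedged accordingly: the records are invented, and the band edges and color names are hypothetical placeholders, not the DeBold-Friedman palette or my actual conditional formats.

```python
from bisect import bisect_right

# Hypothetical State/Year/Incidence records.
long_rows = [
    {"State": "ALABAMA", "Year": 1960, "Incidence": 250.0},
    {"State": "ALABAMA", "Year": 1970, "Incidence": 0.0},
    {"State": "ALASKA", "Year": 1960, "Incidence": 1500.0},
    {"State": "ALASKA", "Year": 1970, "Incidence": 40.0},
]


def crosstab(rows):
    """States down the side, years across the top, incidence in the cells."""
    table = {}
    for r in rows:
        table.setdefault(r["State"], {})[r["Year"]] = r["Incidence"]
    return table


# Hypothetical band edges and colors, standing in for conditional formats.
EDGES = [1, 100, 500, 1000]
COLORS = ["white", "pale blue", "green", "yellow", "red"]


def color_for(value):
    """Pick the band a value falls into; values at an edge move up a band."""
    return COLORS[bisect_right(EDGES, value)]


heat = {
    state: {year: color_for(v) for year, v in years.items()}
    for state, years in crosstab(long_rows).items()
}
print(heat)
```

Each cell’s color depends only on its own value, which is why the numbers themselves can stay hidden behind the fill.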

Now of course the pivot table will line its upper border with the year (Column area) entries, even as the maps in question underscore the values with those data. The simplest route toward emulation would call for a blue-on-blue formatting of the years above, or a simple hiding of their row, after which the years could be typed below. The Vaccine introduced line was almost surely drawn, and that’s what I did. The color legend


is a tricky one; it could perhaps be rendered through an equally simple coloring of consecutive cells per the conditional formats, though it appears that the DeBold-Friedman band comprises tints of varying width. By way of resolution, four cells or so could be allotted per 1k, to support multiple colors spanning the same interval (note that the data for the states I selected don’t report numbers in the 2-4k range).

In light of the preceding, I’ve gotten this, by way of a demo:


The larger point has been made, one hopes. Proper equivocations aside, the map approximates the representations of the DeBold-Friedman efforts, but again without recourse to applications beyond Excel. Indeed – for me the phase of the task that threw up the most resistance was the foundational pivot table itself. But once you toss off that gauntlet, the map beckons – because the map is the pivot table (and there’s no need by the way to resize the numbers in values for visibility’s sake, because the conditional formats will obscure them anyway).

Thus it seems to me that my infographical proposal works, albeit on a slightly lower end from the DeBold-Friedman portrayals. And while I’m not claiming it’s the stuff for which awards are bestowed, I’m working on my acceptance speech anyway.