Data Stories: NY City Property Sales

14 Aug

You’re wondering what to do with that discretionary $2,000,000 you have lying around – you know, that sheaf of bills that’s been angling your pillow up to untenable heights. In the interests of a good night’s sleep, then, how about dropping those two big ones on some New York real estate?

It sounds like a prudent bet; even the president of the United States has been known to commit some rocks to the idea, after all, and if that example doesn’t inspire you, you may find yourself on the wrong end of an all-caps tweet from you-know-who. So now that you’re properly inspired and keen to know what you can get for your seven digits, you’ll want to consult the data set of New York’s property sales for the year February 1, 2017 through January 31, 2018 – 60,295 of them, or about 165 a day – brought to you by the folks at Enigma public data, who’ve joined the sales records from a pair of city agencies. You can download the set there, once you’ve signed into their site (for free).

Shift the default Save As format from CSV over to standard Excel and you’ll wind up with a 13 MB workbook – not a prohibitively weighty file, but with 62 fields that pull all the way into column BJ, you may find yourself grabbing onto the scroll bar with prohibitive pique. You may thus want to abolish fields not likely to serve your analysis, starting with borough_code in column A; you’d doubtless prefer to work instead with the five actual borough names filed in the curiously named borough_code_definition field in B. Other candidates for excision, at least in my view and at least for starters: community_district_definition in column Z, school_district_definition, and city_council_district_definition. All three in effect reiterate the New York neighborhood names promulgated in the neighborhood field, which was, at least at the outset, housed nearby. Those curtailments kick back a megabyte to my hard drive, and every million counts, after all.
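For the record, the same cull can be mimicked off-spreadsheet. A minimal Python sketch, assuming nothing beyond the field names cited above – the two-row CSV miniature is invented:

```python
import csv
import io

# Invented miniature of the Enigma extract; only the field names matter here.
raw = (
    "borough_code,borough_code_definition,neighborhood,"
    "community_district_definition,sale_price\n"
    "3,Brooklyn,Bay Ridge,310,675000\n"
)

# The fields the post proposes to abolish.
DROP = {
    "borough_code",
    "community_district_definition",
    "school_district_definition",
    "city_council_district_definition",
}

reader = csv.DictReader(io.StringIO(raw))
trimmed = [{k: v for k, v in row.items() if k not in DROP} for row in reader]
```

The borough names in borough_code_definition survive; the redundant code column does not.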

We can next move on to some standard, obvious need-to-knows, e.g., the number of properties sold in each of the city’s boroughs, understanding at the same time that obvious questions don’t always yield obvious answers. We could try this:

Rows: borough_code_definition

Values: borough_code_definition

I get:


You may be surprised by the smallish sales total for Manhattan, as I was. But the boroughs feature different kinds of properties in different proportions. Drag borough_code_definition into Columns and substitute building_class_category_code_definition in Rows, and you’ll get the picture, which looks something like this in excerpt:


Observe the One Family Dwellings entries, and their deeply understated Manhattan representation. While these standard residential homes contribute few jags to the island’s storied skyline they loom rather larger among the byways of New York’s more suburban boroughs.
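The borough-by-building-class pivot amounts to a two-way count; here is a hedged Python sketch of the same tabulation, with invented rows (the field names follow the dataset):

```python
from collections import Counter

# Toy stand-ins for the sales records; only the two pivot fields matter.
sales = [
    {"borough_code_definition": "Manhattan",
     "building_class_category_code_definition": "13 Condos"},
    {"borough_code_definition": "Queens",
     "building_class_category_code_definition": "01 One Family Dwellings"},
    {"borough_code_definition": "Queens",
     "building_class_category_code_definition": "01 One Family Dwellings"},
]

# Rows: building class; Columns: borough -- each Counter key is one pivot cell.
crosstab = Counter(
    (r["building_class_category_code_definition"], r["borough_code_definition"])
    for r in sales
)
```

Each key of the Counter corresponds to one cell in the pivot table’s grid.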

But what about property values? We could plausibly guess that Manhattan sales will lift themselves atop the listings; a thorough appraisal would control for building class, but for a preliminary assay:

Rows:  borough_code_definition

Values: sale_price (average, formatted to two decimals with the comma)

That yields:


Our guess is ratified, and unsurprisingly. Now augment the Values area with these two fields, both formatted as above: gross_square_feet and floors_number_of:


Manhattan properties are clearly larger – and higher (note, however, the blanks in the floor field).  A 19-floor apartment can have a penthouse; the top one in a three-floored one is called the third floor.

But it next occurred to me that those averages may have absorbed multiple apartment purchases in the same properties, a possibility that could skew the numbers – provided one is concerned to isolate unique parcels.

With that thought in mind I ran a Remove Duplicates on the data, identifying address and zip code as sources of potential redundancy (zip code, in the event properties in two different boroughs had the same address – a real possibility). 12,251 records were thus uprooted, after which I reran the above pivot table. This time I got:
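The Remove Duplicates logic – first record per address-and-zip pair wins – can be sketched like so; the column names here are my guesses at the relevant fields:

```python
def remove_duplicates(rows, keys=("address", "zip_code")):
    """Keep the first row for each combination of the key fields."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Invented rows: two sales at the same parcel, one elsewhere.
rows = [
    {"address": "220 Central Park South", "zip_code": "10019", "sale_price": 40_000_000},
    {"address": "220 Central Park South", "zip_code": "10019", "sale_price": 15_000_000},
    {"address": "10 Main Street", "zip_code": "10301", "sale_price": 450_000},
]
deduped = remove_duplicates(rows)
```

Note the caveat the post raises: only the first-listed sale price per parcel survives the cull.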


I was startled by the Manhattan decrement, so much so that I initiated the table twice for verification. Sort the floor numbers largest to smallest in the original data set, however, and you’ll indeed discover repeated sales in the same, very tall properties. Note as well the heightened average sale price for the borough, a figure that needs to be considered carefully; after all, if a single property enjoyed multiple sales at conceivably various price points, only one of them remains here.

And since the data span precisely twelve months, a look at sales by month might prove instructive (I’m back in the original, full dataset here). But here, again, we’ve run across – or afoul of – an Enigma protocol: the decision to portray the dates stored in the sale_date field in textual terms, e.g.


In fact, because all the dates possess equal character length, a formula such as =VALUE(LEFT(W2,10)) could be put to the task, refashioning the entries into authentic values that could then be formatted in date terms. But a slightly swifter if less elegant workaround might be the thing to do here: since each entry in sale_date bears the suffix T00:00:00 (for whatever reason), we can select the sale_date column and target the data with a Replace All Find and Replace routine, e.g.


And that restores quantitative standing to the field’s labels. Subtract T00:00:00 from 2017-02-01T00:00:00, and you’re left with 2017-02-01, a usable date.  And it seems to work all 60,295 times.
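Either route – the LEFT formula or the Replace All – boils down to dropping the constant T00:00:00 tail and treating the remainder as a date. In Python terms:

```python
from datetime import date

def clean_sale_date(text):
    # The =VALUE(LEFT(W2,10)) maneuver in effect: keep the ten date
    # characters and parse them as an actual date value.
    return date.fromisoformat(text[:10])

d = clean_sale_date("2017-02-01T00:00:00")
```

Because every entry shares the same length and suffix, the slice is safe for all 60,295 rows.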

And once that deed is done, try this:

Rows: sale_date (grouped by Month)

Values: sale_date (count).

sale_date (again, count > % of Column Total)

(Note that January data emanate from 2018.)

I get:


Note the conspicuous shortfall of sales in January, which again, is the most recent of the months. Apart from that discrepancy I’m not sure if a pattern could be properly declared, though the tycoons among you may have something to say on that count.
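The month-by-month pivot – a count and its % of Column Total – can be roughed out the same way, here with four invented sale dates:

```python
from collections import Counter

dates = ["2017-02-14", "2017-02-28", "2017-03-01", "2018-01-15"]

# Group by the month token; the year is ignored, as in month-level grouping.
by_month = Counter(d[5:7] for d in dates)
share = {month: n / len(dates) for month, n in by_month.items()}
```

The `share` dictionary plays the role of the % of Column Total readout.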

Now to take the analysis slightly off-piste, what about the average antiquity of the properties sold, that is, years of construction? The information is there, courtesy of the year_built field. Here’s one look at the question:

Rows:  borough_code_definition

Values: year_built (average, to two decimals, sans comma)

I get:


Note the relative recency of the Manhattan structures, befitting the borough’s protean impulses. Yet the collective provenance of the Staten Island buildings is far newer still, a likely signifier of the borough’s relatively late-in-coming appeal as a residential habitat. By extension, the notably older properties of the Bronx appear to reflect that area’s comparative loss of appeal – but being New York, watch that space.

But bear in mind that the year_built data calls for a measure of review. Sort the field Lowest to Highest, and the commercial garages on 7th Avenue in Manhattan’s Chelsea district rise to the top. If in fact the facilities were completed in the reported year 1111 – about 800 years before the advent of the car – we need to commend the owners for their foresight.

Also clearly wrong is the attribution for the property in sort position 2 – the two-story home hammered together at 509 Seguine Avenue in Staten Island. Its 1680 time stamp should qualify it for landmark status, but a couple of web sites let us know that the building in fact became available in 1920.

In addition, some pause should be given to the 28 sales of properties dated 1800, most of these ascribed to a batch of condos on offer at 365 Bridge Street in Brooklyn Heights – an elevator building. In fact, it seems it was built in 1929.

And I thought the elevators were autographed by Mr. Otis himself.

World Cup Data: Goal Seeking- and Finding

31 Jul

Like American presidential elections, the World Cup doesn’t seem to end; the four-year interregnum separating those two events seems ever more nominal; and because some prognosticators have already instated Brazil as winners of the 2022 Cup, it may be time to wonder if the tournament should be held at all.

But my latest information is that it’s all systems go, and anyway, Qatar is lovely in November; so in the interests of limning some helpful deep backgrounding of the picture, you may want to download and kick around the data set of all World Cup match outcomes extending through the 2014 go-round, niched here in Kaggle’s repository of holdings.

The set records all 836 matches contested from the Cup’s inception in 1930 through the immediately previous competition, in relatively self-evident fields, with the exceptions of the Round and Match IDs in columns Q and R. Kaggle maintains that the identifiers are unique, but the Round IDs exhibit a qualified uniqueness, i.e., they appear to signify a certain stage in the tournament (e.g. the semi-final) that by definition would have been reached by multiple teams bearing the same ID. And the Match IDs display curiously variable lengths, suggesting a mid-data shift in their coding protocol. The 2014 matches, for example, sport nine-digit identifiers; in 1998 their lengths have shrunk to four characters.

More troublesome is the small but measurable number of redundant game records, signaled by shared match ids. A Remove Duplicates routine earmarking MatchID as the offending field discovered 16 doubled records, which were promptly shown a red card.

Once you’ve stretched all the columns via the requisite auto fit, you can begin to think about what’s interesting in here. What, for example, about putative home field advantage? That vaunted residential edge is something of a legal fiction here; the first record, for example, names France as the home team and Mexico as the visitors, in a 1930 game set in host country Uruguay. But that only nominal imputation spices the question; might even a desultory home team standing impact game outcomes?

Let’s see. Any formula presuming to calculate win percentages needs to reckon with a common soccer/football eventuality – the fact that many games culminate in a draw. As such, we can take over next-available column U, title it Home Win Pct., and enter in U2:


That simple expression means to ask: if the goal entry in G exceeds the figure in H – that is, if the home team wins – then enter 1 in the appropriate U cell. If, however, the values in G and H are identical – signifying a draw – then assign .5 to the cell, the standard evaluation of an outcome in which each team in effect arrogates half a win. Once we copy the formula down U, we can simply add all the numbers and divide the total by 836, the number of records remaining in the data set (remember, we deleted 16 of the original 852). The result: a home-team win percentage of 68.42, a disproportion that raises the question as to exactly how home teams are nominated.
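That U-column logic, restated as a Python sketch (the four goal pairs below are invented):

```python
def home_points(home_goals, away_goals):
    # 1 for a home win, .5 for a draw, 0 for a home loss.
    if home_goals > away_goals:
        return 1.0
    return 0.5 if home_goals == away_goals else 0.0

games = [(4, 1), (2, 2), (0, 3), (1, 0)]  # (home, away) scorelines
win_pct = sum(home_points(h, a) for h, a in games) / len(games)
```

Summing the points and dividing by the game count reproduces the U-column arithmetic.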

For example: in the 1930 debut Cup, Argentina received home-team standing for four of its five matches, its sole “away” status assigned to its role in the final, which it lost to the authentic home team, Uruguay. Mexico, on the other hand, competed under the away rubric for all three of its games that year. And the home team in 1930 – however denoted – won all 18 matches during the tournament.

Explanations notwithstanding – though they do need to be pursued – we can calculate Cup-by-Cup home-team winning percentages via a rather neat deployment of the AVERAGEIFS function.

First, repair to a bare section of the spreadsheet and list the years in which the Cup was held, starting with 1930 and coming to a stop at 2014 (I’m commencing in X3). Once you type 1934, of course, you can drag the remaining years down their column by autofilling their four-year interval, remembering, however, that the Cup was suspended in 1942 and 1946. Then name the Year field in A yr, the winner field in U win, and enter in Y3:


And copy down the Y column.

How does simply averaging the win data – which, after all, comprise either a 1, a .5, or a 0 – forward our objective? Contemplate this example: a team winning two games and losing one receives 1, 1, and 0 points for its exertions. Average the three values and the resulting .6667 returns the winning percentage for two wins and one loss.
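The AVERAGEIFS maneuver – average the win column wherever the year matches – has this shape off-spreadsheet (the records below are invented):

```python
from collections import defaultdict

def yearly_win_pct(records):
    """records: (year, win-value) pairs, the win value being 1, .5 or 0."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [sum of wins, game count]
    for year, win in records:
        totals[year][0] += win
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in totals.items()}

records = [(1930, 1), (1930, 1), (1930, 0), (1934, 0.5), (1934, 1)]
pct = yearly_win_pct(records)
```

As with AVERAGEIFS, only the rows matching each criterion year enter that year’s average.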

If we’re happy with that understanding and then proceed to format the results in percentage terms, I get:


It is clear that somewhere, perhaps in the 70s, the idea of a home team underwent a rethink; I’m not sure what drove the apparent definitional overhaul (for a possible partial accounting see this discussion). We even see an away-team edge gained in the 2010 Cup. I’m happy to entertain surmises about these disparities.

In any case, what about goals – e.g., have their outputs ebbed or surged across the Cups? If we want to figure a winning-to-losing team metric, say the average winning and losing goal total – or really, the average score  – by game by Cup, we’ll have to improvise, because those data aren’t expressed in existing fields. A couple of simple formulas should be able to answer our question, however. I’ve moved into column V, called it Win Goals, and jotted in V2:


That expression simply declares that if the goal total in G exceeds the one in the corresponding H cell, then return the value in G; otherwise report the number in H. If a game was drawn the logical test will not have been met, of course, but no matter; since in such a case the G and H figures are identical it matters not which one the formula returns.
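In Python dress, the Win Goals test simply returns the larger of the two goal figures – and a flip-side Lose Goals test returns the smaller:

```python
def win_goals(g, h):
    # Return the winner's total; on a draw the two figures coincide anyway.
    return g if g > h else h

def lose_goals(g, h):
    return g if g < h else h

# Invented scorelines: a win, a draw, and a rout.
margins = [(3, 1), (2, 2), (0, 4)]
avg_win = sum(win_goals(g, h) for g, h in margins) / len(margins)
```

As the post notes, drawn games pose no problem, since either column returns the same figure.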

I next head into column W, label it Lose Goals, and write what is in effect the flip side of the above formula in W2:


Both formulas are copied down their respective columns, of course, and conduce toward this pivot table:

Rows: Year

Values: Win Goals (average, formatted to two decimals)

Lose Goals (same treatment as above)

I get:


The marked downturn in goal scoring is associated with the recency of the Cups; indeed, the overall winning-game average of 2.18 goals was last attained in the 1970 tournament, and the average victory margin of three goals in the 1954 contests exceeds the average per-game combined goal total for the last 14 Cups. Average winning margin for all games: 1.51 goals.

And let’s see VAR verify that .51 goal.

Hacks of Stacks of Wax: Billboard Top 100 Data, Part 2

16 Jul

The hits keep coming on the Billboard 100 dataset, and its mighty chorus of voces populi (it’s the plural; I checked) sounds an arpeggio of questions our spreadsheet is prepared to answer. Topping the survey, perhaps, is one that is both obvious and most compelling: who’s the most prodigious hit maker? The answer, again, should emerge from the trenchant Data Model Distinct Count query we described last week. It informs a pivot table that should look something like this:

Rows: Performer

Values: Song (Distinct Count)

Sort the results Highest to Lowest.

The listings, at least to this way-behind-the-curve listener, were cause for surprise:


Pulling away from the pack, and by a couple of orders of magnitude, is the vast vocal catalogue of ditties crooning your way from the Glee television show, its cover versions of other people’s hits splattering all over the charts, but with a curious aggregate brevity. Its 183 unique hits resounded through the rankings for a total of but 223 weeks, if I’ve gotten my filter right; not one-hit wonders, then, but one-week.

But those counts call for a measure of refinement. In addition to the generic Glee Cast appellation, a filtered scan of the data for the artists bearing the name Glee somewhere in their handle reports:


Filter-confining our pivot table to that expanded Glee complement, I get


Apart from the fact that I haven’t heard of half of the above collaborators, we’ve boosted the Glee count to 206 unique tracks that some time, somehow, succeeded in booking their place in the top 100.

And of course, the multi-name problem is no idiosyncrasy of the Glee phenomenon. You’ll note a Mr. Presley, whose 53 chart visits essayed in conjunction with his antiquarian colleagues the Jordanaires combine with his 49 solo efforts (we’re apparently not counting his backup singers here). That’s 102 appearances for the troubadour from Tupelo, but we’re not finished. Filter for Elvis Presley, and


I’m all shook up. (And like you, I have no idea who the Carol Lombard Trio/Quartet was. The movie star’s name was spelled Carole – but so is one of the listings up there.) And by the way, remember that the Billboard 100 data track back to August, 1958; but Elvis’ debut hit, “Heartbreak Hotel”, bears a time stamp of February 1956, and so won’t be found here (though four renditions of a song with the same name by others will).

Aim a like filter at the Beatles – that is, the word Beatles – and –


Or take James Brown. Soul brother number 1 has 47 entries under his name in stand-alone mode, but filter for all instances of the original disco man and we see:


You’ll appreciate the problem; a proper census of each and every artist’s top 100 total would appear to require a filter of the sort we’ve applied above, a necessity that, if plied, can’t be directly pivot tabled, in part because a great many songs would need to be counted more than once. You’d need to allot an entry, after all, to each artist enumerated in a tandem hit, e.g. you’d be bidden to assign one hit each to the Beatles and Billy Preston for “Don’t Let Me Down” and “Get Back”. Remember them?

Now the route to another need-to-know metric, the total number of weeks an artist’s offerings have informed the top 100, offers a smoother ride, particularly if you simply need the total:

Rows: Performer

Values: Song (Count)

Each appearance of a song in the data set amounts to a 1, after all, or one week’s visit to the top 100. Sort the outcomes by Highest to Lowest, and I get, in excerpt:


Your cultural weltanschauung will dictate your adjective, i.e., the results are interesting, surprising, confirmatory, or dismaying. I am in addition either embarrassed or proud to say I’ve never heard of Kenny Chesney, Keith Urban, or Brad Paisley; that these titans are country and western singers explains my arrant illiteracy in this character-defining matter.

But the complication we uncovered earlier reappears here. If you’re asking after the missing Elvis Presley in the above screen shot, for example, run the Performer filter for the generic Elvis Presley – again, filter for all instances of his name:


And you’ll see:


That’s 19 years’ worth of time spent in the top 100. Next filter for all mentions of Elton John:



A remarkably comparable durability, but again, we haven’t accredited Presley’s pre-August 1958 incursions into the chart.

And just for the record, here are some other all-mentions-of-an-artist/top-100 week tenures:

The Beatles: 608

Michael Jackson: 726 (none of which factor in the Jackson Five’s 212 weeks, however)

James Brown: 682

U2: 397

Kelly Clarkson: 542

Diana Ross: 626 (but the Supremes qua Supremes, sans any official allusion to Ross, contribute another 299)

Barry White: 175

The erstwhile Beatles in solo-act capacity:

Paul McCartney: 344 (but Wings brings another 201 weeks to the count)

John Lennon: 161

George Harrison: 161

Ringo Starr: 129

But just don’t ask me what any of it means.
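For what it’s worth, those all-mentions tallies rest on two moves – count every weekly chart row, then sum across each Performer string containing the artist’s name. A sketch with invented rows:

```python
from collections import Counter

# Each list entry is one weekly chart row's Performer string (invented here).
performer_rows = [
    "Elvis Presley",
    "Elvis Presley With The Jordanaires",
    "Elvis Presley",
    "Glee Cast",
]

weeks = Counter(performer_rows)  # one chart row = one week on the chart

def all_mentions_weeks(artist, week_counts):
    # The filter-for-all-instances step: any Performer containing the name.
    return sum(n for name, n in week_counts.items() if artist in name)

elvis = all_mentions_weeks("Elvis Presley", weeks)
```

A substring match is a blunt instrument – it is exactly the blunt instrument the filter wields.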

And still another need-to-know follow-on takes our analysis to its interrogative crescendo: Which tracks have enjoyed the longest stays (welcome or not) on the Billboard 100?

That question seems to admit of a pretty straightforward answer:

Rows: SongID

Values: SongID (count, of necessity; the field is text)

(Remember that SongID, and not Song, need be applied to the pivot table. SongID imparts a unique identifier to each song, in order to disambiguate multiple versions of the same song.)

I get, in excerpt:


Remember that SongID concatenates title and artist; and so oblivious am I to all these next big things that I wasn’t sure if the week leader above is entitled Radioactive Imagine by the Dragons, or Radioactive by the Imagine Dragons. I have since learned the latter formulation properly parses song and group; described by Wikipedia as a sleeper hit, Radioactive nevertheless somnambulated across the charts for 87 weeks (a figure Wikipedia corroborates), or about  1 2/3 years. That’s a long snooze; but don’t overlook their Demons, chart-resident for another 61 weeks. In fact, a scan down the list counts 55 songs that persisted somewhere in the top 100 for at least a year.

And I think the only one I know is Unchained Melody, by the Righteous Brothers. Are you amused?

Hacks of Stacks of Wax: Billboard 100 Data, Part 1

2 Jul

You don’t read big data, you analyze it. No one unrolls themselves into their hammock, reaches for their mint julep, and thrills to that page-turner of a 300,000-row data set they’ve been craving to get at all winter. Big data is meant to revel in its bigness, favoring the curious with its possibilities for aggregated, patterned and macro-leveled largesse, and largeness.

But sometimes the revel is in the details. Now and then a big data set comprises a gigantic compound of molecular bits whose very protons might be of sufficient interest to make you put your julep on hold – and I’m thinking about the 59 years of variously memorable hits filling 309,000 rows of the Billboard top 100 workbook, playing its enormous medley here on the site.

As indicated, the Billboard set recollects its chart-toppers all the way back to August, 1958, and if you’re just bursting to know for exactly how many weeks “She Loves You” oooed its way into the listings – and you probably are – or precisely when the epochal “Rapper’s Delight” first hip-hopped onto the rankings and your consciousness (15, and the week of November 10, 1979, respectively; but remember that the Beatles’ own German cover version “Sie Liebt Dich” also checked in for a week at 97 in June, 1964), you’ve assuredly come to the right place.

I don’t know about you, but I think the Billboard data – all 21.6 megabytes of it (i.e., you’ll have to download it yourself) – makes for a cracking good read – but it’s a spreadsheet, after all, and so some intriguing global findings should be in there, too. But as usual, the data need some inspecting before the work gets underway.

Note, for example, that the Peak Position and Weeks on Chart fields installed in columns I and J are, at least in theory, dispensable; one could derive both findings from a pivot tabling of the songs, subjecting Peak Position to a Min in Values, and then applying the song titles themselves to Values, realizing a count that would deliver a Weeks on Chart equivalent. That sparer approach would relieve the data of a slew of redundant entries, e.g. a song’s peak position appears identically for each week in which it appears.
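That derivation – a Min for Peak Position, a row count for Weeks on Chart – looks like this in miniature (the rows are invented):

```python
from collections import defaultdict

chart_rows = [("Hey Jude", 3), ("Hey Jude", 1), ("Hey Jude", 1)]  # (song, position)

# 101 stands in for "off the chart" as the starting peak.
stats = defaultdict(lambda: {"peak": 101, "weeks": 0})
for song, position in chart_rows:
    entry = stats[song]
    entry["peak"] = min(entry["peak"], position)  # Min of Peak Position
    entry["weeks"] += 1                           # Count = Weeks on Chart
```

One pass recovers both of the allegedly dispensable fields.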

If you’re wondering about the Instance field and what it means, you’re not alone. I originally supposed that it counts the number of times the same chart-bound song was performed by different artists (I use the term loosely), but that conjecture proved a false take. Rather, Instance seems to number a given version’s non-contiguous revisits to the charts. For example, Nicky Jam’s El Amante – a performer and song whose identities draw a pair of blanks in my uncomprehending mind – exhibits six instances; its debut at position 99 in the week of February 18, 2017 was succeeded by its disappearance the following week, only for the tenacious ditty to stage a three-week comeback dating from the week of March 4. Continuing to loll in the high 90s, El Amante submerged once again, before clambering back in at 98 on April 4, and so on. It held its place in the rankings until the week of September 2, 2017, concluding its sixth instance – before it fell back into the oblivion it likely deserved.

Note in addition the SongID field, a unique identifier crafted by a concatenation of the entries in Song and Performer. Slightly curious is the retention of the formulas in their cells; their work has been completed, and could be paved over with a Paste > Values routine, an austerity move that serves to reduce the file’s size to 19.5 MB.

And if you’re wondering what purpose a song id might fulfill – that is, what analytical need would spur the assignment of an id to each song – I can think of at least one, one that returns us to an exigency with which I’ve contended before, and not optimally, as it turns out.

If we want to learn how many discrete songs clambered into the top 100 for any particular year we need – again – to do something about the recurring weekly appearances of the same songs, songs we want to count exactly once. I had expressed a similar wish, for example, in my posts on the Supreme Court Voting data, in which I wanted to count unique cases heard by the Court per year. I developed the count by embedding case data into the Rows area, where of course they’re enumerated but one time each. I then moved to analyze that satellite table instead.

But I’ve since learned that the above exertion is unnecessary, thanks to Excel frontliner Chandoo. He inclined my attention to an unassailably more elegant maneuver that works like this:

But before I demonstrate, recall what I’m aiming to do: I want to pivot table a tabulation of the number of unique songs crashing the charts by year, and as such a prior step need be enacted upon the data before I set the table – I need to release year information from the WeekID field in B. That intention can be effected in several ways, but in the interest of simplicity I’ll scamper to next-available column K, call it Year, and enter in K2:


And copy down the column. That simple device releases the first four characters from each week id, which in every case offers up the year of the song’s chart entry (WeekID is text-formatted, by the way).
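The eventual tabulation – distinct SongIDs per extracted year – can be previewed off-spreadsheet; the rows below are invented, but the LEFT-four-characters slice is the same:

```python
rows = [
    {"WeekID": "1967-01-28", "SongID": "For What It's WorthBuffalo Springfield"},
    {"WeekID": "1967-02-04", "SongID": "For What It's WorthBuffalo Springfield"},
    {"WeekID": "1967-02-04", "SongID": "Ruby TuesdayThe Rolling Stones"},
]

unique_by_year = {}
for r in rows:
    # r["WeekID"][:4] mirrors =LEFT(B2,4); the set keeps each SongID once.
    unique_by_year.setdefault(r["WeekID"][:4], set()).add(r["SongID"])

counts = {year: len(ids) for year, ids in unique_by_year.items()}
```

The set does here what the Data Model’s Distinct Count does in the pivot table.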

When the venerable Create Pivot Table dialog box opens, tick the Add this data to the Data Model box at its lower left (and someone tell the folks in Redmond it should be these data):


That tick activates Excel’s Data Model (which first made itself freely available in the 2013 release), an add-in that enables a number of data-querying enhancements, including the potential for building relational pivot tables. But our interest here is in those unique song titles, and so once you’ve executed the tick and the Data Model loads, advance to the pivot table (notice the slightly modified field list drawn up by the Data Model) and earmark Year for the Rows area. Next move SongID into Values, right-click into Summarize Values By, click More Options…, scroll down and…


Wow – Distinct Count; what a concept. Click it, click OK, and I get (in excerpt):


(Note that the 1958 data are partial, encompassing only the last five months of that year. The 2017 listings extend to the end of October.) Subjected to a rudimentary line chart, the slope looks like this:


I’m not sure what sociological conclusions beg our attention, but the peak in song numbers in the 60s is marked, as is the decided slump in the ensuing years.

Put it this way: There’s something happening here/What it is ain’t exactly clear.

“For What It’s Worth” hit the charts the week of January 28, 1967, and stayed there for 15 weeks. Peak position: 7.

The Verdict on Cook County Court Sentencing Data, Part 2

18 Jun

There’s plenty to learn from the feature-rich Cook County sentencing dataset, and so let the learning commence. We could, by way of first look at the data, examine what the set calls SENTENCE_TYPE, a self-evident header whose entries could be additionally broken out by another field, SENTENCE_DATE:


Columns: SENTENCE_DATE (grouped by year; see the previous post’s caution about the date-grouping challenges peculiar to these data).

Values: SENTENCE_TYPE (% of Column Total, with Grand Totals turned off for the table’s columns, each of which must necessarily come to 100%).

I get:


(Bear in mind that the numbers for 2010 and 2018 are very small, a concomitant of their fraction-of-the-year data representation.) Not surprisingly, Prison – the lengths of its terms unmeasured in this field, at least – accounts for the bulk of sentences, and rates of imprisonment verge towards virtual constancy across the represented years.

Probations, however, describe a bumpier curve, peaking in 2014 but exhibiting markedly lower proportions in both 2010 and 2018 (to date). Those wobbles, doubtless both statistically and criminologically significant, call for a look beyond the numbers we have.

I’ve been unable to turn up a denotation for the Conversion sentence type, but I can state that the zeros, first appearing in 2011, are actual nullities, and not an artifact of the two-decimal roundoff. Slightly more bothersome to this layman, though, was the Jail sentence; I didn’t know if that event discretely named a particular mode of punition, or merely, and wrongly, reproduced Prison sentences under a second heading. It turns out that the former conjecture is on point: jail time is typically imposed for shorter sentences or on individuals awaiting trial (see this clarification, for example; for an explication of the Cook County Boot Camp, see this piece).

The natural – and plausible – follow-on would associate sentence types and their distributions with types of crimes, but the presentational challenge posed by a keying of the sentences to more than 1,500 offense titles very much calls for a considered approach – one that could perhaps be essayed with a Slicer populated with the offense titles. And while that tack will “work”, be prepared to scroll a long way until the offense about which you want to learn rises into the Slicer window. But such is the nature of the data.

And in view of that profuseness one could, perhaps, engineer a more practicable take on the matter, for example, by inaugurating a new pivot table, filling the Rows area with, say, the top 20 offense types (right-click among Rows and click Filter > Top 10, entering 20 in the resulting dialog box). Sort the types highest to lowest, line the Columns area with the sentence types, drop Offense_Type again into Values, and view them via the % of Column Total lens.

A screen shot here would be unacceptably fractional in view of the table’s width, but try it and you’ll see, for example, that 93.41% of the convictions for unlawful use or possession of a weapon by a felon resulted in a prison sentence, whereas perpetrators guilty of theft – however defined in this judicial context – incurred prison but 36.41% of the time, along with a 45.69% probation rate.

More troubling, however, is the small but measurable number of death sentences that appear to have been imposed on individuals sentenced for crimes not typically deemed capital. For example, .04% of the convictions for the possession of cannabis with intent to deliver/delivery of cannabis have drawn the death penalty, as have .05% of forgery convictions. These legal reprisals don’t ring true, and surely obligate the journalist in a redoubled scrutiny, if only to confirm their accuracy, and/or their propriety.

If you’re looking for other fields to irrigate, sentence length, expressed here as COMMITMENT_TERM, likewise begs for correlation with other fields; but here a new roadblock stalls the journey. Inspect the field and the adjoining COMMITMENT_UNIT column and the obstruction will loom large. Because the units present themselves in a raft of different durations, the correlations can’t proceed until some manner of unit reconciliation is brought to the data.

My early sense about the COMMITMENT_UNIT field told me that the disparities merely juxtaposed years to months; and were that the case, a simple formula could be stamped atop a new field, one in which either years could in effect be multiplied, or months divided, by 12 – e.g. a six-year sentence could be translated into 72 months.

But in fact, the units are a good deal more numerous and qualitatively varied than that. Turn a filter on the data set and click the arrow for COMMITMENT_UNIT. You’ll see:


While it would be possible to construct an equivalence lookup table for the chronological units enumerated above, e.g., one month rephrased as 720 hours, a sentence delivered in monetary terms – dollars – can’t be subjected to a like treatment. And a sentence of Natural Life – presumably an indefinite, open-ended prison stay – is similarly unavailable for equating. Moreover, I have no idea what the two records declaring sentences of “pounds” – 30 of them for a criminal trespass of a residence, and 2 for driving with a suspended or revoked license, and both pronounced in Cook County District 5 – Bridgeview – can mean. And you may note that 19 sentences comprising 365 days each were issued as well; how these distinguish themselves from one-year terms is unclear to me. Nor do I understand the 1526 sentences consisting of what are described as Term.
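That equivalence lookup table can be sketched in pandas: chronological units get a to-months multiplier, and the non-chronological entries (dollars, pounds, Natural Life, Term) simply have no entry and fall out as missing values. The unit labels and multipliers here are hypothetical; the real COMMITMENT_UNIT spellings would need to be checked against the filter list.

```python
import pandas as pd

# Hypothetical unit-to-months multipliers; units with no sensible
# chronological equivalent are deliberately absent from the dictionary.
UNIT_TO_MONTHS = {
    "Year(s)": 12,
    "Months": 1,
    "Days": 12 / 365,
    "Hours": 12 / 8760,
}

df = pd.DataFrame({
    "COMMITMENT_TERM": [6, 18, 365, 30],
    "COMMITMENT_UNIT": ["Year(s)", "Months", "Days", "Pounds"],
})

# Unmapped units (here "Pounds") yield NaN, flagging the oranges
# among the apples rather than forcing a bogus conversion.
df["TERM_MONTHS"] = df["COMMITMENT_TERM"] * df["COMMITMENT_UNIT"].map(UNIT_TO_MONTHS)
print(df)
```

Note that a 365-day sentence converts to 12 months here, which is exactly why its coexistence with one-year terms in the data is puzzling.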

On the one hand, of course, the data set can’t be faulted for admitting all these “discrepancies” into its fold; they’re perfectly valid and pertinent records. On the other hand, they cannot, by definition, be forced into comparability with the other entries; they’re oranges to the predominating crop of apples.

The simple way out, of course, would be to sort out and excise the non-chronologicals and proceed, and on a blunt practical level that stratagem might work. But it would work here for the simple empirical reason that those incongruities are few, and as such, would not compromise the greater body of data. But what if these irregulars were formidably populous, and hence unavoidable? What would we do with them?

That is a good and interesting question.

The Verdict on Cook County Court Sentencing Data, Part 1

29 May

You can trace the course of justice in Chicago, including the direction and speed at which it travels, at the fledgling Cook County Government Open Data portal, a site brought to its URL at the behest of Kimberly Foxx, Illinois State’s Attorney for the county in which the city of the big shoulders shrugs. Four of the portal’s holdings – Initiation, Dispositions, Sentencing, and Intake – chronologize the dispositions of cases proceeding through the system; I chose Sentencing for my look here.

It’s a big data set for a big city, recalling as it does sentencing records dating back to January 2010 and pulling through December 2017. With 189,000 cases and a field complement stretching to column AJ, don’t even think about calling it up in Google Sheets (the data-supporting capacity there: two million cells), but Excel is agreeable to its 41 megabytes if you are, and it’s available for download from the second link above.

And at 41 megs, the minimalist in you will be avid to put your scissors to fields that might be rightly deemed dispensable. Cases in point: the four ID parameters fronting the data set in columns A through D, none of which are likely to advance your reportorial cause (note, by the way, the interjection of commas into the large-sized identifiers, an unusual formatting fillip). Deleting the fields and their 750,000 or so entries actually slimmed my workbook down to a lithe 29.7 MB, and that’s a good thing.

You may also note the slightly extraneous formatting besetting the INCIDENT_BEGIN_DATE, RECEIVED_DATE, and ARRAIGNMENT_DATE fields, their cells bearing time stamps all reading 0:00. I suspect these superfluities owe their unwanted appearances to the data in the ARREST_DATE field, which do exhibit meaningful times of suspect apprehension. We’ve seen this kind of excess before, but again it’s proper to wonder if any of it matters. If, after all, it’s your intention to re-present the data in pivot table form, for example, you’ll attend to any formatting disconnects there, and not here. If so, a reformatting of the data source may be no less superfluous.

But whatever you decide we can proceed to some analysis, acknowledging at the same time the scatter of blank cells dotting the records. Given the welter of substantive fields in there, quite a few possibilities beckon, and we could start by breaking out types of offenses by year, once you answer the prior question submitting itself, i.e. which of the available date parameters would be properly deployed here? I’d opt for ARREST_DATE, as it affords a kind of read on Chicago’s crime rate at the point of commission – or at least the rate of crimes culminating in arrest, surely a different and smaller-sized metric.

But if you’re thinking about installing the arrest dates into the column area, think twice – because the dates, accompanied by their time-stamps, are sufficiently granulated that they surpass Excel’s 16,384-column frontier. You’ll thus first have to swing these data into the Rows area, group them by Year, and only then can you back them into Columns, if that’s where you want them stationed.
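The same year-grouping step, sketched in pandas on invented timestamps: deriving a year column collapses thousands of distinct date-times into a handful of buckets, sidestepping the column-count explosion entirely.

```python
import pandas as pd

# Toy arrest timestamps; the real field is ARREST_DATE.
df = pd.DataFrame({
    "ARREST_DATE": pd.to_datetime([
        "2011-03-05 14:20", "2011-07-01 09:00", "2015-12-31 23:10",
    ]),
})

# A derived year column is the pandas analogue of Excel's date grouping.
df["ARREST_YEAR"] = df["ARREST_DATE"].dt.year
counts = df.groupby("ARREST_YEAR").size()
print(counts)
```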

And that’s what I did, only to be met with a surprise. First, remember that Excel 2016 automatically decides upon a (collapsible) default date grouping by year, like it or not; and when I corralled the arrest dates into Rows I saw, in excerpt:


Now that ladder of years seems to be fitted with a column of rickety rungs. Remember that the sentence data appear to span the years 2010-2017, and so the aggregates above hint at data entry typos, and at least some of them – e.g. the 1900 and 1915 citations – doubtless are.

The additional point, however, is that some of these putative discrepancies might tie themselves to crimes that were in fact brought to the attention of the justice system well in the past, and that took an extended while before they were actually adjudicated. Remember that our data set archives sentences, and some criminal dispositions take quite some time before a sentence is definitively pronounced.

For example, the 12 sentences associated with arrests made in 1991 reference serious crimes – seven murder or homicide charges, one armed robbery, one unlawful use of a weapon charge, one robbery and two thefts. One of the thefts, however, records an incident-began date (a separate field) of November 17, 2013, and thus appears to be erroneous.

But in any event, since our immediate concern is with arrests carried out in the 2010-17 interval I could click anywhere among the dates and proceed to group the data this way:


Note that I’ve modified the Starting at date to exclude the pre-2010 arrests, be they errantly captured or otherwise. Now after I click OK I can drag the years into the Columns area, after filtering out the residual <1/1/2010 or (blank) item.

Now I can drag OFFENSE_TITLE into Rows.

Surprise. With 1268 Offense categories cascading down the area you’ll have your work cut out for you, once you decide what to do next. Do you want to work with the data as they stand, or collapse near-identical types, and vet for misspellings along the way? Good questions – but in the interests of exposition we’ll leave them be.

How about something more immediately workable then, say age at incident? Exchange AGE_AT_INCIDENT for OFFENSE_TITLE, filter out the 2300 blanks, and group the ages by say, 3 years. Reprise AGE_AT_INCIDENT into Values (count). I get:


We see an extremely orderly negative association between age and arrests, with only the 20-22 tranche exceeding its predecessor bracket among the grand totals, and only slightly. You’ll also observe that the numbers for 2017 are far smaller than those for previous years, a likely function of incomplete data. In addition, track down to the Grand Totals row and behold the very significant ebbing of overall arrest totals from 2013 to 2016. Again, our table records arrest totals, not crime totals, but the two likely point the same way – unless one wants to contend that the downturn in the former owes more to policing inefficiencies than any genuine diminution in crime – a not overwhelmingly probable development.
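The three-year age grouping translates to pandas as a `pd.cut` over evenly spaced bin edges; the toy ages below are invented, while AGE_AT_INCIDENT is the dataset's own field name.

```python
import pandas as pd

# Invented ages; the real column is AGE_AT_INCIDENT.
ages = pd.Series([17, 18, 19, 20, 21, 25, 40], name="AGE_AT_INCIDENT")

# Bin edges every 3 years, mirroring Excel's "group by 3".
edges = list(range(17, 50, 3))                 # 17, 20, 23, ... 47
labels = [f"{e}-{e + 2}" for e in edges[:-1]]  # "17-19", "20-22", ...

# right=False makes each bin [start, start+3), so 17-19 holds 17, 18, 19.
brackets = pd.cut(ages, bins=edges, right=False, labels=labels)
counts = brackets.value_counts().sort_index()
print(counts)
```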

I’d then move to a Show Values As > % of Column Total look to learn how the brackets contribute differentially to arrests:


(The zeroes at the lowest reaches of the table reflect round-offs.)

Among other things, note the considerable, relative pull-back in arrests of suspects in the 17-19 range.

No, I don’t have an explanation at the ready for that, but perhaps you do.

Airbnb Data, Part 2: A Tale of Three Cities

14 May

There’s such a thing as overstaying your welcome, but Airbnb landlords – and in effect that’s what they sometimes appear to be – may be prepared to happily abide your long-term self, or selves, in their abode.

And that heartening show of hospitality may be illegal. They take a dim view of Airbnb’s good neighbor policy in London, for but one example, where the powers-that-be aren’t thrilled about the kind of lucrative serial subletting some Airbnbers perpetrate, straitening as it does the market for folks who’d prefer to hang tight in an actual, lease-driven apartment.

The long and the short of it, then, is that a review of Airbnb property availabilities – defined as the number of days in a year in which a given room remains on offer – could prove instructive, and our data for New York, London, and Paris devotes a field to just that question.

The analysis, then, should proceed pretty straightforwardly, once we do something about the startlingly sizeable count of rooms – about 55,000 – that presently aren’t to be had. That is, their value in the dataset’s availability_365 field states 0, indicating that, for now at least, the room has been withheld from the market. An email from Inside Airbnb compiler Murray Cox informed me that zero means the property’s next 365 days in its calendar (presumably its start date is a moveable inception point, which Cox finds in the scrape_date field in a different dataset) aren’t in play, at least temporarily.
And as such, those zeros – which are, after all, values that would contribute to and very much confound any formula result – have to be barred from the data set. Here I rely on my venerable highest-to-lowest sort of the availability_365 field, relegating the zeros to the bottom of the set; and once put in their place, an interpolated blank row immediately above the first zero will detach them from the usable data, for now (of course they can be recalled if needed via a simple deletion of the blank row).
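In pandas the sort-and-partition workaround becomes a one-line filter; the sketch below, on invented rows, excludes the zeros and then takes the per-city average that the pivot table computes.

```python
import pandas as pd

# Invented listings; City and availability_365 mirror the fields in the text.
df = pd.DataFrame({
    "City": ["New York", "New York", "London", "Paris"],
    "availability_365": [0, 200, 100, 300],
})

# Bar the zero-availability rooms from the calculation.
active = df[df["availability_365"] > 0]

# Average availability per city, the pivot table's result.
avg = active.groupby("City")["availability_365"].mean().round(2)
print(avg)
```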

And all that enables us to proceed here:

Rows: City

Values: availability_365 (Average, formatted to two decimals)

I get:


Real city variation is in force; and recall the linked article above, the one reporting the December 2016 London-approved bill “limiting Airbnb hosts to renting their property for only 90 days”. Looks as if a few thousand room owners in that city haven’t read the news lately.

We could next cross-tab the results by room type, by rolling room_type into Columns:


All the cities trend in the same direction, though not identically – itself a differentiation worth pursuing, perhaps. Availability widens as the rental space constricts, with shared rooms – defined by Airbnb as those in which “Guests sleep in a bedroom or a common area that could be shared with others”, presumably humans, who might or might not be actual residents of the property – freed up for a considerably larger fraction of the year.

And the results make sense – even common sense, perhaps. Entire homes and apartments need be empty, by definition, and if so, where would their owners be expected to go for the duration of the rental?

That’s a good question, one that directs itself to one of the flashpoints of the Airbnb controversy. Are its hosts the kinds of proprietors who hoard multiple listings that might otherwise be released to the conventional rental/purchase market?

A few sure-footed steps toward an answer would require us to divide all the rentals in a city by its number of hosts, i.e., an average of properties per host; and that simple division exercise needs to fill its denominator with a unique count of hosts, thus returning us to a problem with which we’ve tangled before. To reiterate it: an owner of multiple properties will naturally appear that many times in the data set, that is, once for each holding, even as we want him/her here to appear once. In light of that complication I think the neatest way out this time is to conduct a Remove Duplicates maneuver (Data ribbon > Data Tools), ticking the host_id field, the parameter whose entries contain the duplicates we want to shake out (again, you may want to save these results to a new workbook precisely because you’re shrinking the original data set).
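A pandas rendition of the Remove Duplicates maneuver, on invented rows, together with the per-city average it enables; host_id and calculated_host_listings_count are the dataset's own field names, the values are not.

```python
import pandas as pd

# Invented rows: one London host listed twice, once per property.
df = pd.DataFrame({
    "City": ["London", "London", "Paris"],
    "host_id": [1, 1, 2],
    "calculated_host_listings_count": [2, 2, 1],
})

# One row per host -- the Remove Duplicates step.
hosts = df.drop_duplicates(subset="host_id")

# Average properties per host, by city.
avg = hosts.groupby("City")["calculated_host_listings_count"].mean().round(2)
print(avg)
```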

But how do the host ids, once in their respective, solitudinous states, facilitate a calculation of the number of properties they own, on average? Here’s how: once we’ve identified each host id singly, we can average the calculated_host_listings_count in column P via a city pivot table breakout. That field, which restates the number of each host’s holdings in each of his/her record entries, is one I would have deemed redundant to the data set’s design. After all, the owner-property count could otherwise be derived when needed, for example, via a pivot tabling of the host id, delivering the field to both the Rows and Values areas. But because we’ve removed all host id duplicates, that plotline has to be red-penciled – and that’s where calculated_host_listings_count comes to salvage the script:

Rows: City

Values: calculated_host_listings_count (Average, to two decimals)

I get:


We see, then, that Airbnb hosts are for the most part single-property marketers, at least for the cities we’ve gathered. For those interested in more detail, we could try this:

Row Labels: calculated_host_listings_count

Columns: City

Values: calculated_host_listings_count (Count, % of Column Total, formatted in percentage terms to two decimals)

I get, in excerpt:


You get the idea, though we see London owners are notably more likely to offer multiple properties.

Speaking of which, scroll to the bottom of the table above and you’ll find a 711, signifying the voluminous, apparent holdings of a fellow named Tom in London. But when I returned to our original, entire Airbnb dataset, including the rooms for which availability was set at 0 days, I discovered but 350 properties strewn about London associated with his name.

Now Tom’s the kind of person Murray Cox wants us to know about; he owns so many properties that he’s lost track of half of them.

Airbnb Data, Part 1: A Tale of Three Cities

27 Apr

Would you rent your apartment to me? I have references from real people, don’t smoke, clean up after myself (is my nose growing?), and probably can be counted on not to trash your living room and adjoining vicinities.

Still don’t want to take my scratch for your pad? See if I care; there are plenty of other flats out there where yours came from.

Too many, in fact, according to Murray Cox, the self-identified “data activist” whose researches into Airbnb’s rental listings opened the door on a warehouse of dodgy practices, in numerous localities, e.g. property owners who market multiple dwellings, a clear and present violation of New York state law. Cox maintains that, among other things, the outsized scale of Airbnb offerings can worrisomely constrict a city’s available rental stock, and has published a city-by-city inventory (brought to my attention by a student) of Airbnb listings that you and I can download in most convenient spreadsheet form (look for the Summary Information and metrics for listings in… link attaching to each city).

It occurred to me that, among other potential takeaways, an intercity comparison of Airbnb activity might advance the journalistic cause a mite. I thus downloaded the data for New York, London, and Paris, all nicely exhibiting the same fields. With the intention of copying and pasting the data to a single workbook I cleared out a new column to the left of A, called it City, and entered and copied down the respective names of the three locations, properly lined up with their data once pasted, culminating in 162,701 rows of data, its 20 megabytes just itching to tell me what Airbnb has been up to.
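The copy-paste-and-label consolidation has a compact pandas equivalent; the frames below stand in for the three downloaded files, which in practice would arrive via `pd.read_csv`.

```python
import pandas as pd

# Stand-ins for the three per-city listing files.
ny = pd.DataFrame({"id": [1, 2]})
london = pd.DataFrame({"id": [3]})
paris = pd.DataFrame({"id": [4, 5, 6]})

frames = []
for city, frame in [("New York", ny), ("London", london), ("Paris", paris)]:
    labeled = frame.copy()
    labeled.insert(0, "City", city)   # the new leftmost City column
    frames.append(labeled)

# One workbook's worth of rows, each tagged with its city.
combined = pd.concat(frames, ignore_index=True)
print(len(combined))
```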

Of course, the three-city amalgamation means to prime the data for a range of comparisons, but some comparisons don’t avail. I’m thinking in the first instance about the price field in what is now column K. These entries presumably cite daily rental rates, but express themselves in disparate currencies – dollars, pounds, and euros. One supposes an exceedingly determined investigator could mobilize and apply a round of currency equivalences to the records, a prospect that would require a vast compiling of date-specific rate fixes – in short, a challenge likely to deter a real-world, deadline-mindful journo. I’ve thus elected to leave the numbers as they stand, and if that touch of laissez-faire works against the analysis I have no one to blame but myself. The buck stops here – and maybe the euro, too.

In any case, before we get fancy, we can think about this self-effacing pivot table:

Rows: City

Values: City (Count, by definition for a textual field)

I get:


We see that Paris – by far the smallest of the three cities – nevertheless advertises the largest number of Airbnbs. An accounting for that disjuncture would probably justify a deeper look. Might tourist cachet or friendlier legalities spur the Paris margin? I don’t know. But consider that, juxtaposed to Paris’ population of around 2.25 million and its average household size of approximately 2.3 persons, the city’s Airbnb stock could house around 6% of its residents – with the point, of course, that the inventory is apparently being withheld from the permanent-residence rental market.

Other incomparables have their place among the data, too. There’s little comparing to do as such among the three cities’ neighborhoods, and indeed – the neighbourhood group (UK spelling) field for Paris and London is utterly empty (the field for New York comprises the city’s five boroughs).

But of course other workable comparisons are available. What, for example, about average minimum stay requirements by city and type of rental? We could try this:

Rows: City

Columns: room_type

Values: minimum_nights (Average, formatted to two decimals)

I get:


We see that diffident London Airbnbers expect notably briefer stays at their places on average, with those uppity Parisians insisting that you agree to set down your knapsack – and debit card – more than a day-and-a-half longer before they let you in. At the same time, New York’s shared-room minimum is disruptively anomalous.
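The city-by-room-type average maps onto pandas' `pivot_table`; the rows here are invented, while City, room_type, and minimum_nights are the dataset's own fields.

```python
import pandas as pd

# Invented listings for illustration.
df = pd.DataFrame({
    "City": ["London", "London", "Paris", "Paris"],
    "room_type": ["Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt"],
    "minimum_nights": [2, 1, 3, 5],
})

# City in Rows, room_type in Columns, average minimum_nights in Values.
table = df.pivot_table(index="City", columns="room_type",
                       values="minimum_nights", aggfunc="mean")
print(table.round(2))
```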

And for more evidence of cross-cultural heterogeneity – if that’s what it is – flip the values into Count mode and hand them over to Show Values As > % of Row Total, ratcheting the decimals down to zero and switching the Grand Totals off (because the rows must invariably sum to 100%). I get:


The overwhelming Paris proportion devoted to the Entire home/apt offering is not, I would submit, proof positive of yet one more Gallic quirk, but rather a simple function of the Paris housing stock, in which apartments predominate.
For additional, if glancing, corroboration, try this pivot table:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (tick New York)

Values: neighbourhood_group (Count, % of Row Total)

I get:


Recall that New York is the only city among our trio whose neighborhood group field is actually occupied with data – the names of its five boroughs. Note the relative Manhattan tilt towards Entire home/apt, even as the other boroughs, whose housing range features far more private homes, incline towards Private room – that is, presumably one private room among the several bedrooms in a home.

And what of daily price by city, keyed as it doubtless is to room type? It looks something like this:

Rows: City

Columns: room_type

Values: price (Average, to two decimals)

I get:


Again, that imperative qualification – namely, that the prices reflect evaluations per their indigenous currencies – need be kept in mind. As such, the New York tariffs verge much closer to the London figures when the appropriate, albeit variable, pound-to-dollar conversion is applied. With those understandings in place, the Paris Entire home/apt average seems strikingly low – because the Euro consistently exhibits a “weaker” relation to the pound, the former as of today equaling .88 of the latter. Yet at the same time, Paris’ private room charge would appear to be effectively higher.

Now again, because the data are there, we could compare average prices for New York’s boroughs:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (New York)

Values: price (Average)

I get:


No astonishments there, not if you’re a New Yorker. Manhattan expectably heads the rate table, though Staten Island’s second-place Entire home/apt standing may issue a momentary pause-giver, along with its basement-dwelling (pun intended) shared room rate.

That’s $7,300 a month for an entire place in Manhattan. And wait until you see the interest on the mortgage.

NY Regents Exam Data, Part 2: Multiple Choices

12 Apr

Numbered among the additional conclusions we can draw from the New York Regents data is a natural next question arising from the aggregate test averages we reckoned last week, answered by a kind of reciprocal finding: namely, the aggregate fail rates. Guided by the concerns about weighting we sounded in the previous post, I’d allow that a calculated field need be applied to the task here too, a field I’ve called PctFail.

But before we proceed we again need to contend with the not insignificant number of records that, for apparent reasons of confidentiality, won’t count their fewer-than-five students, replacing the totals with an “s”. Thus I tapped into column N, called it NFail, and entered in N2:


The formula assays the relevant cell in J for an “s”; if it’s there, a 0 is supplied. Otherwise the value in K – the Number Scoring Below 65 – is returned.

Again, we’ll copy that formula down N and proceed to calculate the PctFail field:


Once effected, this simple pivot table, abetted by the All Students Slicer selection we ticked last post, opens the story:

Rows:  Year

Values: PctFail (formatted here in Percentage mode to three decimals)

I get:


The failure rates are substantial, a near-ineluctable follow-on from the overall score averages settling in the 68 range (remember that 65 passes a Regents exam).
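For those who'd rather see the weighting logic spelled out, here is a toy pandas rendition of the NFail substitution and the PctFail calculated field, using the column names from the text and invented numbers.

```python
import pandas as pd

# Invented records; "s" marks the suppressed fewer-than-five counts.
df = pd.DataFrame({
    "Year": [2015, 2015, 2016],
    "Total Tested": [100, 50, 80],
    "Number Scoring Below 65": [30, "s", 20],
})

# The NFail step: an "s" becomes 0, otherwise the count stands.
df["NFail"] = pd.to_numeric(df["Number Scoring Below 65"],
                            errors="coerce").fillna(0)

# PctFail as a calculated field: sum first, divide second, so each
# record weighs in proportion to its test takers.
by_year = df.groupby("Year")[["NFail", "Total Tested"]].sum()
by_year["PctFail"] = by_year["NFail"] / by_year["Total Tested"]
print(by_year["PctFail"])
```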

But you’ll want to know about failures by Regents subjects, too. Sight unseen, you’d expect appreciable variation among the test areas, and that drill down can tool its way into the data via the Regents Exam field, e.g. something like this:

Rows: Regents Exam

Columns: Year

Values: PctFail (formatted similarly to the table above)

I get:


And variation there is, some more provocative than others. You’ll note the massive leap in failure rates for English and Geometry from 2015 to 2016, a determined, ascending slope of failures for Algebra2/Trigonometry, and a restitutive, noteworthy shrinkage in failures for Common Core Algebra. (The Common Core tests are controversial, in part because of their redesign; see this report, for example).

You’ll also want to do something about those error messages. In some cases, the #DIV/0! outcomes simply key the absence of data for the exam, owing to an exam’s discontinuation or later introduction, while the (blank) label appears in virtue of the ten rows that bear no exam name. Should you want to pave over the errors, and you probably do, click anywhere in the pivot table and proceed to PivotTable Tools > Analyze > Options > Options > Layout & Format tab > and tick For error values show:. Enter some appropriate stand-in for #DIV/0!, e.g. — , and click OK. Because the dashes in the 2015 column push far left and look almost as unseemly as the original error message, you may want to select all the Values and align them right. (You could also filter out the blanks.)

Now if you want to crunch failure rates by ethnicity, for example, you’ll again have to reconcile the double-counting character of the fields we described last post. The ethnicities – Asian, Black, Hispanic, Multiple Race Categories Not Represented, and White – have been quartered in the Demographic Variable field, but so have a potpourri of other, disconnected items bound to other Variables, e.g. Female to Gender, English Proficient to ELL Status.

We’ve stubbed our toe against this odd problem in the previous post, in which Excel’s million-record limit has forced records otherwise deserving of their own field into one, messy cosmopolitan column – the one called Demographic Category, itself dispersed into no-less-heterogeneous items in Demographic Variable. It’s a confusing issue, but I think we need to tick Ethnicity in the Slicer now and slide Demographic Category – confined by the Slicer to its ethnic item entries – into Rows. Sweep Year into Columns and you get:


The disparities here are dramatic, and rather self-explanatory – the results, that is, not the accountings of them.

Next opt for Gender in the Slicer:


Women outdo men, a finding that more-or-less jibes with current understandings of gender performance differentials. The female margin, nearly equivalent across 2015 and 2016, pulls away slightly in the following year.

And what of gender outcomes by exam? Slip Regents Exam atop Demographic Category (which has been sliced to Gender) in Rows, and (in excerpt):


And (in second, remaining excerpt):


You’re looking for palpable divergences, of course, but palpable congruences mean something here, too. The decisive female advantages in the Common Core English scores are perhaps notable but not arresting; but their edge in Common Core Algebra does a fair share of stereotype busting, even as males emerge the stronger in Common Core Algebra2. (Important note: the Grand Total pass rates vary by Demographic Category even as the Total Tested remains neatly constant across all Demographic Variables. That’s because the distribution of “s” entries across the Categories isn’t constant.)

There are plenty of other permutations in there, but let’s try one more. Column Q quantifies the number of students in the record whose score achieves what the Regents calls College Readiness (CR), i.e., a 75 in the English Language Regents or an 80 on any Math exam in the system.

And here’s where I have to own up to a bit of spreadsheeting excess. In the previous post I implemented what I termed an NFail field, embodied by its foundational formula – an expression that replaced “s” entries with a 0, the better to factor these into a calculated field. I now understand that those exertions were unnecessary, because Excel will completely ignore an “s” or any other label in any case. Thus here (and last week, too) we can work directly with the Number Scoring CR field in Q. But because we do need to acknowledge the large number of “s” and “NA” entries in Q (NA, because only some Regents exams qualify as CR exams) – entries that will impact any denominator – we also need what I call here a CRCounted field, to be put in place in the next available column and punctuated by this formula that gets copied down:


We then need to compose that calculated field, which I’m calling CRPass:


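A toy pandas reading of the CRCounted/CRPass pair, as I understand it from the text: a record's Total Tested counts toward the denominator only when its Number Scoring CR is an actual number rather than an "s" or "NA". The rows are invented.

```python
import pandas as pd

# Invented records; Q's labels "s" and "NA" sit among real CR counts.
df = pd.DataFrame({
    "Total Tested": [100, 60, 40],
    "Number Scoring CR": [55, "s", "NA"],
})

# Numeric CR counts; "s" and "NA" become NaN.
cr = pd.to_numeric(df["Number Scoring CR"], errors="coerce")

# CRCounted: takers enter the denominator only where CR is a number.
df["CRCounted"] = df["Total Tested"].where(cr.notna(), 0)

# CRPass: total CR scorers over the counted takers.
cr_pass = cr.sum() / df["CRCounted"].sum()
print(round(cr_pass, 3))
```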
Remember here, and for the first time, we’re computing pass rates. This pivot table, among others, awaits, under the aegis of the Demographic Category Slicer – Gender:

Rows: Demographic Variable

Columns: Year

Value: CRPass

I get:


Provided I’ve done my homework correctly the results point to a striking spike in CR attainments, a set of findings that calls for some journalistic deep-backgrounding. (Note that the absolute CR numbers are far smaller than the global Total Tested figures, because as indicated above only certain exams march under the CR banner.) We see a small-scaled but real stretching of the female advantage in the pass rates between 2015 and 2017, one that also needs to be sniffed by some nose for news.

Now let me take a break while I double-check my homework. I hear this blogster is one nasty grader.

NY Regents Exam Data, Part 1: Multiple Choices

28 Mar

We haven’t met, but I can say with a jot of confidence that I’ve likely done something you haven’t – taken a Regents exam. Exams.

I’m not clipping the achievement to my lapel as a badge of honor, you understand, just stating a biographical matter of fact. The Regents – a staple of the New York State educational system in which I spent more than a little time – comprise a series of what are termed exit exams; pass enough of them and you walk away with a high school diploma of the same name. Your correspondent took his share of Regents, his scores embodying the “scatter” in scattergram, but no matter; I took them, and I’m a better person for having done so.

But before you file a Freedom of Information Act request to verify those abnormally-curved results you may want to review a larger, ultimately more interesting record of Regents attainments, the dataset supplied by the New York City open data site that summarizes in grand form the Regents scores of students statewide for the years 2014-17. It’s a big file, needless to say – so big you’ll need to download it yourself – and its 212,000 or so records have a lot to say about the testing profile of New York’s high schoolers.

It also has a lot to say about spreadsheet organization, more particularly the juxtaposition of column G, Demographic Category, to H, Demographic Variable. Those columns/fields in fact identify a series of putative fields and field items respectively; and as such, G’s contents could, at least in theory, have been more conventionally structured into discrete parameters, each owning a column all its own.

But that prescription calls for an elaboration. Consider this fledgling pivot table drawn from the Regents data:

Rows: Demographic Category

Demographic Variable

Values: Total Tested

I get:


Note that all the Total Tested subtotals are equivalent, intimating that the five Demographic Categories cleave the same population into an assortment of cross-cutting attributes, and thus summing the same student count five times. The by-the-book alternative, again, would have assigned each Category to an independent field, such that an interest in test achievements by Gender, for example, would require the user to simply drag Gender into Rows. As it stands, however, a Slicer (or filter) would have to grease the process, e.g.:


Moreover, casting the potential field-bearers into mere item status beneath that singular Demographic Category banner appears to obviate a good many cross-tabulating possibilities, e.g.: a breakout of tests by both Ethnicity and Gender. How, for example, are we to learn how black female students score on the Regents when both attributes are lodged in the same field, and so must occupy the same label area?

But at the same time, the spreadsheet designers had to contend with a supplementary problem that overrides my challenges – namely, that the upgrading of say, Ethnicity and Gender to field standing would appear to require that data present themselves in individual record form, e.g. each student’s performance on each test; and those 2,000,000 scores/records would burgeon beyond Excel’s data-accommodation space.
In any case, there is indeed lots to learn, structural complications notwithstanding, and we could begin by starting coarsely – by calculating the average overall Regents scores by year:

Row: Year

Values: Mean Score (Average, formatted to two decimal points)

I get:


(Note that one record, attaching to the Island School in Manhattan, exhibits a nonsensical entry for its year. By filtering and comparing the Island School data, it appears that the record belongs to 2017.)

The averages are remarkably similar, though I’d venture that, given the 2,000,000-pupil universe, the one-point differential distancing the 2015 and 2017 scores is significant. Remember that the Regents passing score is pegged at 65, suggesting that the test designers got their threshold right.

But those averages aren’t quite definitive, for a couple of connected reasons, one subtler than the other. The first recognizes that the student double-count pinpointed above in effect computes the average scores multiple times, because the records operate under the steam of different demographic categories and numbers per record. Thus the mean average for the Gender category alone – which nevertheless contains all students – is likely to depart at least slightly from the mean average for Ethnicity, which likewise contains all students. If, for example, we reintroduce the Demographic Category Slicer to the current pivot table and tick All Students, we’ll get:


The differences from the initial pivot table are very small but evident – again, because the Total Tested numbers in the All Students records don’t perfectly line up with the per-record Total Tested numbers for, say, Ethnicity, and because each record receives an equal weight irrespective of its Total Tested value.

And it is the matter of weighting that points its arrow directly at the second question, one we’ve seen elsewhere (here, for example). The per-record mean averages ascribe an identical input to each record, even as the test-taker numbers vary; and that bit of record democratization vests greater relative influence in the records with smaller counts. The result, again: a possible skewing of the averages.
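A minimal illustration of the skew, with invented numbers: averaging per-record means treats a five-student record and a 500-student record as equals, while a weighted average does not.

```python
# Invented toy records: one small cohort, one large.
records = [
    {"mean_score": 80.0, "total_tested": 5},
    {"mean_score": 60.0, "total_tested": 500},
]

# Unweighted: each record's mean counts once, regardless of size.
unweighted = sum(r["mean_score"] for r in records) / len(records)

# Weighted: each record contributes in proportion to its test takers.
weighted = (sum(r["mean_score"] * r["total_tested"] for r in records)
            / sum(r["total_tested"] for r in records))

# unweighted lands at 70.0, weighted at roughly 60.2 - the
# five-student record dragged the naive average nearly ten points.
```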

But because Column I enlightens us about the actual test-taker numbers, we should be able to derive a simple calculated field that imparts a corrective to the weighting problem – once we deal with the very large number of records – about 75,000 – that simply don’t report their score results. These are the entries dotted with the “s” code, and our formula needs to ignore them along with their test-taker numbers, which are stated, after all.
So here’s what I did. I headed the next available column, S, TotalPts and entered, in S2:


That expression means to assign 0 to any “s” datum, and otherwise multiply the record’s mean score by its number of students. (Important note: absent Mean Scores almost always associate themselves either with a Total Tested number of five or fewer, or with entries possessing the SWD or Non-SWD values; SWD stands for Students With Disabilities. One assumes that both types of exclusions are justified by reasons of confidentiality; but remember that the five-way count of students in the data set should subsume most of the SWD takers anyway, via their inclusion in the All Students item.)
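Since the S2 formula itself isn’t reproduced here, the following is a hedged pandas rendering of the logic as described – toy data and assumed column names, not the actual worksheet:

```python
import pandas as pd

# Toy rows; "s" marks a suppressed Mean Score.
df = pd.DataFrame({
    "Total Tested": [24, 3, 110],
    "Mean Score":   [72.5, "s", 68.1],
})

# TotalPts: 0 for suppressed rows; otherwise the record's mean score
# multiplied by its number of test takers.
df["TotalPts"] = df.apply(
    lambda r: 0.0 if r["Mean Score"] == "s"
    else r["Mean Score"] * r["Total Tested"],
    axis=1)
```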

After copying the formula down, I titled the T column CountedStu and dropped down a row, wherein I entered:


The formula asks if the relevant row in J contains an s. If so, a zero is returned; otherwise, the number of test takers cited in I is returned.
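Again as a hedged sketch of that T-column logic – toy data and assumed column names standing in for the I and J columns described above:

```python
import pandas as pd

# Toy rows mirroring the test-taker counts and the "s" code.
df = pd.DataFrame({
    "Total Tested": [24, 3, 110],
    "Mean Score":   [72.5, "s", 68.1],
})

# CountedStu: suppressed rows contribute no test takers;
# every other row contributes its Total Tested count.
df["CountedStu"] = df.apply(
    lambda r: 0 if r["Mean Score"] == "s" else r["Total Tested"],
    axis=1)
```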

I next devised a simple calculated field, ActAvg (for actual average):


That field can now be made to work with any other field whose items are banked in Row Labels, e.g. by substituting ActAvg for Mean Score in Values (and leaving the All Students Slicer selection in place). I get:


It’s clear that our weighting refinements have uncovered a “true”, higher set of averages – ones that continue to sustain the near-point improvement from 2015 to 2017.
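The whole correction boils down to a ratio of sums rather than an average of averages; a hedged sketch with invented numbers:

```python
import pandas as pd

# Invented per-record inputs, as the TotalPts and CountedStu
# helper columns would supply them.
df = pd.DataFrame({
    "Year":       [2017, 2017, 2017],
    "TotalPts":   [1740.0, 0.0, 7491.0],
    "CountedStu": [24, 0, 110],
})

# ActAvg per year: sum of points over sum of counted students -
# a weighted average, unlike a plain mean of the Mean Score field.
grouped = df.groupby("Year")
act_avg = grouped["TotalPts"].sum() / grouped["CountedStu"].sum()
```

Note that the suppressed record contributes zero to both numerator and denominator, so it drops out of the average entirely rather than dragging it toward zero.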

There’s quite a bit more to learn from the scores, but I’m asking now to recess for lunch – and yes, I raised my hand.