NYC School Attendance Data, Part 2: What I’ve Learned

23 Aug

Once we’ve decided we’re pleased with the New York City school attendance data in their current, emended state (per last week’s post), we can move on to ask some obvious but edifying questions about what we’ve found.

First, a breakout of attendance percentages by day of the week is something we – and certainly Board of Education officials – will want to see. In that connection, we need to decide whether to break out the attendance percentages arrayed in the %_OF_ATTD_TAKEN field, the numbers we derived with our ActualTotals calculated field, or both. Because ActualTotals accords numerical parity to each and every student, it seems to me the fitter instrument here (we could of course deploy both fields, but let me err on the side of presentational tidiness).

But in the course of tooling through and sorting the data by the above %_OF_ATTD_TAKEN, I met up with a few additional complications. Sort that field Smallest to Largest, and you’ll have gathered a large number of records reporting days on which absolutely no students attended their respective schools – 7,245 to be exact; and while an accounting for these nullities can’t be developed directly from the dataset, we could be facing an instance of mass, errant data entry, and/or evidence of a requirement to furnish a daily record for a day on which classes weren’t held. And in fact, over 14,000 records attest to attendance levels beneath 50% on their days, and I don’t know what that means either. It all justifies a concerted look.

But in the interests of drawing a line somewhere, let’s sort %_OF_ATTD_TAKEN Largest to Smallest and split the data above row 513796 – the first to bear a 0 attendance percentage – with a blank row, thus preserving an operative, remaining dataset of 513,794 records. Still, I’d submit that more thinking needs to be done about the low-attendance data.

Returning now to our day-of-the-week concerns, the pivot table that follows is rather straightforward:

Rows: Day

Values: ActualTotals

I get:


(Note that you’d likely want to rename that default Sum of ActualTotals header, because the calculated field formula itself comprises an average, in effect – NumberStudents/REGISTER*100. You’ll also want to know that calculated fields gray out the Summarize Values option, and thus invariably and only sum their data. Remember also that 2 signifies Monday.)
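For readers who’d rather see the weighting logic in code, here’s a minimal Python sketch of the same breakout, assuming each record reduces to a (weekday, register, attendees) triple – my own stand-in structure, not the dataset’s field layout:

```python
from collections import defaultdict

def attendance_by_day(records):
    """Average attendance percentage per weekday, weighting every student
    equally (the ActualTotals idea): total attendees divided by total
    register, times 100."""
    present = defaultdict(int)
    enrolled = defaultdict(int)
    for day, register, attended in records:
        present[day] += attended
        enrolled[day] += register
    return {day: round(100 * present[day] / enrolled[day], 2)
            for day in enrolled}

# Two invented schools of very different size on the same Monday
# (day 2, per Excel's WEEKDAY default):
sample = [(2, 3000, 2700), (2, 200, 100)]
```

Note that the big school’s 90% dominates the little school’s 50% here, precisely because every student, not every school, gets one vote.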

I for one was surprised by the near-constancy of the above figures. I would have assumed that the centrifugal pull of the fringes of the week – Monday and Friday – would have induced a larger cohort of no-shows than the one we see, though attendance does indeed slink back a bit on those two days; the middle of the week, meanwhile, holds strikingly steady.

And what of comparative attendance rates by borough? Remember we manufactured a Borough field last week, and so:

Rows: Borough

Values: ActualTotals

I get:


By way of reorientation, those initials point to these boroughs:

K – Brooklyn

M – Manhattan

Q – Queens

R – Richmond (Staten Island)

X – The Bronx

The disparities here are instructive. Queens students turn out most faithfully, Bronx students least. Of course, these outcomes call for a close drilldown into the contributory variables – e.g., economic class, ethnicity, and more – that can’t be performed here.

We can next try to learn something about attendance rates by month, understanding that the data encompass two school years. Try

Rows: Date (grouped by Months only)

Values: ActualTotals

I get:


The school year of course commences in September, with those early days perhaps instilling a nascent, if impermanent, ardor for heading to class. We see that attendance peaks in October and begins to fall back toward the overall average by December.

The question needs to be asked about June, or Junes, in which the attendance aggregate crashes to 85.21%, a 4.69-point drop from the preceding May(s). While explanations do not volunteer themselves from the data, an obvious surmise rises to the surface – namely, that students beholding the year’s finish line, and perhaps having completed all material schoolwork and exams, may have decided to grab a few discretionary absences here and there. It’s been known to happen.

But let’s get back to those zero-attendance days and their polar opposite, days in which every student on a school’s roster appeared, or at least was there at 4 pm. The data show 1641 records in which each and every enrollee in the referenced institution was there, a count that includes 31 days’ worth of school code 02M475, i.e. Manhattan’s renowned Stuyvesant High School; a pretty extraordinary feat, in view of the school’s complement of around 3,300. And while we’re distributing kudos, mark down September 21, 2016, the day on which all 3,965 of Staten Island’s Tottenville High School registrants showed up, and June 24 of that year – a Friday, no less – on which the entire, 3,806-strong enrollment of Queens’ Forest Hills High School settled into their home rooms. But ok; you could insist that these laudable numbers should likewise be subjected to a round or two of scrutiny, and you’d probably be right.

Now for a bit of granularity, we could simply calculate the average daily attendance rates for each school and sort the results, and at the same time get some sense whether attendance correlates with school size as well. It could look something like this:


Rows: SCHOOL_DBN

Values: REGISTER (Average, to two decimals)

%_OF_ATTD_TAKEN (Average, to two decimals)

Remember first of all that a given school’s enrollment is by no means constant, swinging slightly both within and across school years. Remember as well that because you can’t average calculated field totals, I’ve reverted to the %_OF_ATTD_TAKEN field that’s native to the data set.

Sort by %_OF_ATTD_TAKEN from Largest to Smallest and I get, in excerpt:


That’s the Christa McAuliffe School (named after the astronaut who died in the Challenger explosion) in Bensonhurst, Brooklyn, on the valedictorian’s podium, followed very closely by the Vincent D. Grippo School (physically close to the McAuliffe school, too). And if you’re wondering, I find Stuyvesant High School in the 907th position, with its daily attendance average put at 90.87. Tottenville High, interestingly enough, pulls in at 1155 out of the 1590 schools, its average figuring to a below-average 87.43. At the low end, the Research and Service High School in Brooklyn’s Crown Heights reports a disturbing 41.84% reading, edged only slightly by the Bronx’s Crotona Academy High School (remember that you can learn more about each school by entering its code in Google). These readings merit a more determined look, too.

And for that correlation between school size and attendance: because my pivot data set out on row 4 and list 1,590 schools, I could enter, somewhere, something on the order of =CORREL(B4:B1593,C4:C1593):


I get .170, a small positive association between the parameters. That is, with increased school size comes a trifling increment in attendance rates.
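For the curious, here is what CORREL actually computes – a bare-bones Pearson coefficient in Python (a sketch; the sample vectors in the test are invented, not drawn from the dataset):

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, the computation behind
    Excel's CORREL: covariance divided by the product of the two
    standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1 means the two series rise together; near -1, that they move in opposition; near 0 – as with our .170 – that the association is faint.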

But if you want to learn more about correlations you’ll have to come to class.


NYC Attendance Data, Part 1: Some Necessary Homework

14 Aug

This open data thing is cool; they’ve even drummed up a report, available to every inhabitant on planet earth give or take a few censors, on my very own elementary school, P.S. 165 in Queens, New York City.

No; you won’t find my name large-typed anywhere in the record of 165’s distinguished alumni, but this is an academic status report, after all, and I’m prepared to forgive the omission. The larger point – I think – is that data – attendance data – about P.S. 165, and all of its 1,600 or so companion institutions comprising New York City’s public school system, are present and accounted for here. Click the Download link, proceed to the CSV for Excel option (there’s a Europe regional-formatted version there, too), and wait for its 524,000 records to find their place in your virtual lecture hall.

The worksheet roll-calls attendance numbers for the days filling the September-June school years 2015-16 and 2016-17, and don’t be fooled by its understated five fields; at least three additional parameters can be easily synthesized from those five, and to productive effect. (Note: Apart from the need to autofit the field columns, keep in mind that the entries in the REGISTER field tabulate the school’s official enrollment as of the given day, and not the number of students who actually made their way to class.)

But the data do require a touch of vetting before the analysis proceeds. Row 455543, if you’re keeping score, is in effect something of a blank, save a cryptic notation in the A column that won’t further our reporting interests. As such, the row should be, er…expelled. But another, more troublesome set of entries needs to be confronted and ultimately unseated from the dataset.

Start by running this pivot table at the records:

Row: DATE (ungroup if you’re on Excel 2016)

Values: REGISTER (sum)

Sort Largest to Smallest. I get, in excerpt:


The two uppermost dates – January 13, 2016 and February 2 of that year – exhibit suspiciously large register totals, just about doubling the other sums. It appears as if the records for that pair of dates have simply, if unaccountably, been duplicated – and as such one-half of these have to be sent on their way.

And that sounds to me like a call for Remove Duplicates. Click its trusty button on the Data ribbon, and tick:


Follow through, and you’ll be told that 3172 duplicate values – which sounds precisely duplicative – have been found and shed from the dataset.
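Remove Duplicates has a straightforward analogue in code; this Python sketch (with invented rows – the school codes and counts are illustrative, not from the dataset) keeps the first occurrence of each identical record, as Excel does when every column is ticked:

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order --
    the Remove Duplicates behavior when all columns are checked."""
    seen = set()
    kept = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept

# Hypothetical records: (school code, date, register)
rows = [("01M015", "2016-01-13", 171),
        ("01M015", "2016-01-13", 171),   # an errant duplicate
        ("01M019", "2016-01-13", 244)]
```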

Then inspect the lower reaches of that original sort and you’ll come upon more than 20 days, the great bulk of which are dated somewhere in June (i.e. the last month of a school year), with aggregate totals less than 800,000 – substantially beneath the totals above them. I suspect some reporting shortfall is at work, and indeed the legend attaching to the data on the Open Data site states the numbers may not be final “…as schools continue to submit data after preliminary report is generated”. We’re also left to explain April 18 of this year, for example, a date on which we’re told 819 students actually appeared, on what was the last day of the spring recess. Presumably they were there for some good educational or administrative reason – or the record is there by mistake (those 819 come from two schools, the Bard High School Early College institution in Queens and the School of Integrated Learning in Brooklyn).

Now we can go about deriving those three fields I advertised above. I’d first head into column F, title it Day, and enter in F2:

=WEEKDAY(A2)


And copy down. We’ve certainly seen this before; WEEKDAY will tease out the day of the week from a date-bearing cell (with 1 denoting Sunday by default), and so presages any look at attendance figures by day.
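Python’s datetime can stand in for WEEKDAY, though the conventions differ: Python’s weekday() counts Monday as 0, while Excel’s default counts Sunday as 1. A small shim reconciles the two:

```python
from datetime import date

def excel_weekday(d):
    """Replicate Excel's WEEKDAY(date) default numbering: Sunday=1 ...
    Saturday=7. Python's date.weekday() runs Monday=0 ... Sunday=6."""
    return (d.weekday() + 1) % 7 + 1
```

So a Wednesday comes back as 4 and a Friday as 6 – handy to keep in mind when we read the pivot results by Day.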

Next, we could entertain an interest in breaking out attendance data by each of New York’s five boroughs – these the object of a neat, single-lettered code in the SCHOOL_DBN field, squirreled in the third position of each code:

K – Brooklyn (Brooklyn is officially named Kings County)

M – Manhattan

Q – Queens

R – Richmond (aka Staten Island)

X – Bronx

Thus we could name G1 Borough and enter in G2:

=MID(C2,3,1)


That is, the above expression collects one character from C2, starting (and finishing) at the third character in the cell – in the case of C2, X, or Bronx. (By the way, if you enter a school’s code in Google you’ll be brought to its web page, including P.S. 165 and 25Q425, or John Bowne High School, another of my academic ports of call). Then copy down G.
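The MID maneuver translates to simple string slicing in any language; a Python sketch, using the two school codes cited in these posts:

```python
BOROUGHS = {"K": "Brooklyn", "M": "Manhattan", "Q": "Queens",
            "R": "Staten Island", "X": "Bronx"}

def borough(dbn):
    """MID(dbn, 3, 1): pull the single borough letter squirreled into
    position 3 of a DBN code such as '25Q425'."""
    return dbn[2]
```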

The final of our three supplementary fields would simply return the actual number of students who appeared in a given school on a given day, or at least the ones who were there to be counted at 4 pm; and while of course the dataset already reports schools’ daily attendance percentages, information that might tell us what we want to know, we’re again made to revisit an old but material problem: by manipulating percentages alone we in effect weight each school identically, irrespective of student body size. And in fact that treatment might come to serve certain analytical purposes rather well, because there may be good reason to regard schools as equivalently important units. But for the same money we may want to have it both ways, by compiling stats that give each school its proportionate due. Thus we can title H NumberStudents and enter, in H2:

=ROUND(D2/100*E2,0)


If that expression seems needlessly ornate, keep in mind that the values holding down the %_OF_ATTD_TAKEN field in D aren’t percentages, but rather percentages multiplied by 100; as such they need to be cut down to size by a like decrement. And the ROUND addendum means to quash any decimals that don’t quite do the results justice. After all, if 84.2% of a school’s 1,042 charges show up, should we let the Board of Education know that 877.36 of them have been sighted in the building? I’d bet that .36 has at least one foot out the door.
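In code, the H-column computation amounts to this (a sketch with invented figures; the dataset stores its percentages pre-multiplied by 100, e.g. 84.2 rather than 0.842):

```python
def number_students(attd_pct_times_100, register):
    """Estimate attendees from an attendance figure stored as a
    percentage-times-100 (84.2, not 0.842): divide by 100 before
    multiplying by enrollment, then round away the fractional student."""
    return round(attd_pct_times_100 / 100 * register)
```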

But do these definitional niceties matter anyway? We can find out by first setting a pivot table in motion and churning out a calculated field, which I’ll call ActualTotals:

=NumberStudents/REGISTER*100


(And no, you won’t need parentheses around the two fields.) As for the 100 multiplier, it will put the results in alignment with those %_OF_ATTD_TAKEN numbers, which again exhibit a similar order of magnitude – that is, the latter comprise percentage data times 100.

The pivot table looks then like this:


Values: %_OF_ATTD_TAKEN (average, to two decimals)

ActualTotals (average, two decimals)

I get:


The differences are small but real.
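To see why the definitional niceties matter at all, a toy comparison, with two invented schools – a large one at 95% attendance and a tiny one at 75%:

```python
# Averaging the percentages weights the schools equally; weighting by
# enrollment weights the students equally. The two figures diverge
# whenever school sizes do.
schools = [(3000, 95.0), (100, 75.0)]  # (register, attendance %)

# The %_OF_ATTD_TAKEN approach: every school counts once.
unweighted = sum(pct for _, pct in schools) / len(schools)

# The ActualTotals approach: every student counts once.
attended = sum(reg * pct / 100 for reg, pct in schools)
weighted = 100 * attended / sum(reg for reg, _ in schools)
```

With sizes this lopsided the gap is dramatic; across 1,600 real schools of merely varying size, it shrinks to the small-but-real differences above.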

But ok, class; let’s take recess. Just remember to be back by…4 pm.


Hit or Miss : Baseball’s Home Run and Strikeout Profusions

31 Jul

See the ball, hit the ball, advised the aphorist Pete Rose, with his winsome epigrammatic flair. And in fact, in the course of his day job as a major league baseball player, Mr. Rose took his reductive prescription to heart: the record shows he hit the ball 12,910 times, 4,256 of which resulted in that outcome curiously, and redundantly, called a… hit.

To be sure, Rose missed the ball – or struck out – 1,143 times, but as a percentage of his 14,053 times at bat, his 8.13% worth of misses is impressively low. And Rose would be interested to know that major league batters are missing the ball more often than at any time in the history of his chosen profession.

Batters in 2016 struck out in 23.5% of all their at-bats, or .235 per the game’s three-digit formatting heritage (so don’t ask me why earned-run averages are squeezed to merely two decimals, though I have a theory); and the strikeout proportion thus far for this year (courtesy of Baseball Reference) has ticked up to .241.

But that narrative of futility has been thickened by a sub-plot: major-league batters are also hitting more home runs than ever, and the dialectic of hit-or-miss isn’t quite the contradiction it seems. After all, you’ll agree it’s easy to miss a 98-mile-per-hour fastball – all the more so if your neurons are forced to contend with a 78-mile-per-hour curve ball instead. Surprise. But manage to make contact with that fastball and the results could go a long way (see this consideration of the home-run/strikeout relationship in the New York Times, with particular attention paid to the home-run half of the equation).

To learn a bit more about the home-run/strikeout antinomy it would be a good idea to consult Sean Lahman’s free, go-to Baseball Database, as we have in earlier posts (look here, for example; remember that Mr. Lahman will happily accept contributions). Click the 2016 – comma-delimited version link, and another click upon the resulting zip file will empty its spreadsheet files into its folder. Then call up the Teams workbook. (Note Lahman advises that the files work best in a relational setting, but our analytic interests here can be sufficiently served by Teams all by itself.) Once we’ve gotten here we can throw a few simple calculated fields into a pivot table to productive effect, and join them up with a couple of applications of the CORREL function for our information.

To start, we could drum up yearly strikeout percentages – that is, strikeouts divided by at bats, and then for presentation purposes proceed to group the outcomes by say, bins of five years:

Rows: yearID

I’d group the years thusly:


I’ve earmarked the above span because some of the pre-1910 years have no strikeout data, and because the 1912-2016 interval comprises 105 years, yielding 21 equally-sized tranches.

Next I could devise this calculated field, which I’ve called sopct:

=SO/AB



And garb the resulting values with this custom format:


And filter out the <1912 data.

When the dust settles I get:


The ascendancy of strikeouts is clear. Note the distinctly impermanent fallback in strikeout percentages in the 1972-1981 era, a predictable and intended consequence of the lowering of the pitching mound in 1969 and its literal levelling of the playing field for hitters. Nevertheless, the trend soon recovered its arc.
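The grouping-plus-calculated-field routine can be sketched outside the pivot table, too; here with made-up yearly strikeout and at-bat totals (not Lahman’s actual figures):

```python
from collections import defaultdict

def so_pct_by_tranche(yearly, start, width=5):
    """Group per-year (SO, AB) totals into fixed-width year bins and
    return each bin's strikeouts-per-at-bat -- the sopct idea applied
    to grouped years."""
    so = defaultdict(int)
    ab = defaultdict(int)
    for year, (strikeouts, at_bats) in yearly.items():
        tranche = start + (year - start) // width * width
        so[tranche] += strikeouts
        ab[tranche] += at_bats
    return {t: round(so[t] / ab[t], 3) for t in sorted(so)}

# Hypothetical totals for four scattered years:
yearly = {1912: (100, 1000), 1913: (110, 1000),
          1917: (150, 1000), 1918: (160, 1000)}
```

Summing strikeouts and at-bats within a bin before dividing, rather than averaging yearly rates, keeps heavy-traffic seasons properly weighted.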

We could next calculate a similar field for home runs, calling it hrpct:

=HR/AB


Applying a similar grouping and formatting to the field, I get:


The trendline is considerably flatter here, and indeed seems to have peaked during the 1997-2001 tranche – not surprisingly, because it was during that fin de siècle epoch that steroid use among ballplayers apparently peaked as well, thus fomenting, among other mighty anomalies, Barry Bonds’ surreal 73 home runs in 2001, and in 476 at bats.  Ungroup the yearID field momentarily and treat the numbers to a four-decimal format, and you’ll discover a home-run high of .0340 in 2000 – but note the .0339 for last year as well, succeeded in turn by this year’s all-time, relatively drug-free .0368 (again, check Baseball Reference for the to-the-minute totals).

Now what then about the association between strikeouts and home runs? Baseball common sense would predict an appreciably positive correlation between the two; the harder the swings at those fastballs, one could suppose, the more misses – but along with these flows the expectation of more long balls, too, once and if bat actually meets ball.

To palliate our curiosity, we can trot out both sopct and hrpct into Values, and continue to leave yearID ungrouped. With the strikeout and home run values stationing themselves in columns B and C and the year 1910 making itself at home on row 43 – a positioning that by extension locks the 2016 data into row 149 – I can enter, somewhere:

=CORREL(B43:B149,C43:C149)


That expression returns a most impressive, almost unnerving .845, shouting the view that, as strikeouts go, so do home runs – and virtually in lockstep.

We can then go on to ask about the relationship between strikeouts and batting average. Might it follow that, with the increasing overall failure to hit the ball, averages should falter correspondingly – simply because fewer balls will have been put into play? You can’t get a hit if you don’t hit the ball, it seems to me. Or do we allow that hard swingers hit the ball harder (when they do) – and that a hard-hit ball is harder to catch?

We can check that one out. We can mint still another calculated field, which I’ll call baagg (for batting average aggregate):

=H/AB


Replace hrpct in Values with baagg, leave sopct in place, and because the operative ranges don’t change the CORREL we authored above should rewrite itself, yielding -.477 – a notable inverse association (that’s a minus sign preceding the value, not a dash). That is, as strikeouts mount, batting averages do tend to respond downwards. And while it’s true that -.477 isn’t .845, most social scientists would be thrilled to even have that number crunched upon their desk.

And I suspect Pete Rose is smiling, smugly, too.

The Supreme Court Database, Part 2: Doing Justice to the Numbers

17 Jul

With 70 years and 61 fields worth of skinny on the Supreme Court rulings handed down from 1946 through 2015 – 78,233 Justice-level records in all – there should be a lot to learn from all those data, once we decide what questions could profitably be asked of them.

What about seasonality? That is, do Court decisions, and for whatever reason, issue from the hands of opinion writers in equal measure across session months?

They don’t. Try this pivot table:

Rows: dateDecision (grouped by Months only)

Values: dateDecision (count)

dateDecision (again, by % of Column Total)

I get:


Note that yearly court sessions commence on the first Monday in October, and typically proceed through late June or early July (for more, look here. The October inception point means in turn that the term years recorded in column K actually subtend only the later months of that year, and so extend well into the succeeding one. For example, the earliest decision disseminated in 1946 was time-stamped November 18 of that year.)

We see that more than half of all Court decisions were ratified in the April-June quarter, notably swelled by the aggregate June efflorescence. One could conjecture that a need to dispose of a case backlog pressed hard against impending summer recesses might explain the flurries in June.
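The month-share computation itself is compact; a Python sketch, using the one real date the data hand us (November 18, 1946, the earliest 1946 decision) plus a few invented ones:

```python
from collections import Counter
from datetime import date

def month_shares(decision_dates):
    """Percentage of decisions handed down in each calendar month --
    the % of Column Total breakout, in miniature."""
    months = Counter(d.month for d in decision_dates)
    total = sum(months.values())
    return {m: round(100 * n / total, 1) for m, n in sorted(months.items())}

sample = [date(1946, 11, 18), date(1947, 4, 7),
          date(1947, 6, 2), date(1947, 6, 9)]
```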

And what about what the database calls its issue areas, i.e. the categorical domains of the cases pled before this court of last resort, and their yearly distributions? The issueArea field in column AO assigns the litigations to 14 broad classificatory swaths defined here; these can be copied and pasted to an empty range in the workbook and regarded as a lookup table (which we’ll name caseIssue).

Next, enter the field title caseIssue in BK1, the free column alongside our improvised uniquecount field (see last week’s post), and type in BK2:


And copy down (you may also want to drop a copy > paste special atop the results, a modest austerity measure that stamps hard-coded equivalents of the outcomes and condensed my file by about 600K).

But now we are made to return to the question imposed upon us in the previous post: because our data set remembers the votes of individual Justices for each case, the case ids naturally appear recurrently and, for our purposes here, redundantly – up to nine times each.  In the last post we applied a formula to isolate single instances of each case in order to realize our count of cases heard by year. Here we need to conduct another count of sorts of unique entries, but one that at the same time preserves the categorical identities of the cases.

So I banged together this pivot table, for starters:

Rows: Term

caseid

caseissue



In addition, click Design > Report Layout > Repeat All Item Labels, a decision that reiterates the term (year) data for each record, thus overriding the standard pivot table default posting of each year but once. Turn off Grand Totals, too, and make sure to select the Tabular Form report layout.

Doing what pivot tables are trained to do, the table singles out each caseid, per this screen shot excerpt:


Then filter out the #VALUE item from caseissue; these represent records (485 of them) for which no issuearea data avails. Note that the Values area remains unpopulated, because we’re not concerned with any aggregated data breakout – at least not yet. Rather, we want to treat the table as a de facto data set, itself to be subject to a second pivot table.

And is that prospect thinkable? It is, and a scrounge around the internet turns up a few slightly variant means toward that end. One strategy would first have the user select the entire table and superimpose a Copy > Paste Values overwrite upon it, downgrading the table, as it were, into a mere standard set that lends itself to a perfectly standard pivot tabling.

But that rewrite isn’t necessary; in fact, you can pivot table the pivot table straight away, by first clicking outside the pivot table and, by way of one alternative, now clicking Insert > Pivot Table. At the familiar ensuing prompt:


Select the entire pivot table (though I was unable to enter the default table’s name, e.g. PivotTable1, in the field). Note in addition that while conventional pivot tables default their location to New Worksheet, Existing Worksheet has been radio-buttoned here; but I’ll revert to type and opt for New Worksheet, because I want the table to begin to unfold as far to the left of the worksheet as possible – that is, column A. Click OK, and the fledgling second pivot table and Field List check in onscreen.

Then the standard operating procedure resumes:

Rows: Term (grouped in tranches of 5 years)

Columns: caseissue

Values: caseissue (Count, necessarily, by % of Row Totals; as we’re now working with percentages, turn off the Grand Totals that will invariably figure to 100%.)

I get:


I hope the shot is legible. If not, the table is yours for the clicking.

I’m not sure the outcomes elaborate “trends”, but that hardly discommends them from analysis. You’ll note a rather decided drop in Federal Taxation and Union-related cases, for example, and a general upswing in Criminal Procedure hearings, even as other categories (Due Process and even Civil Rights) fail to slope straightforwardly; in any event the percentages need to be read against the absolute numbers contributory toward them, because they aren’t always large. With what is in effect a matrix comprising 14 year groupings by as many case categories, those 196 potential intersections map themselves to 8683 cases; do the math, and some of those quotients are likely to be small.
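The whole two-stage pivot routine – collapse the Justice-level rows to one row per case, then break each term down by issue – can be sketched in a few lines of Python (the case ids and categories below are invented miniatures):

```python
from collections import defaultdict

def issue_mix_by_term(votes):
    """Collapse Justice-level (term, caseid, issue) rows to one row per
    case, then return each term's percentage breakdown by issue area --
    the pivot-of-a-pivot move, in miniature."""
    cases = {(term, caseid): issue for term, caseid, issue in votes}
    counts = defaultdict(lambda: defaultdict(int))
    for (term, _), issue in cases.items():
        counts[term][issue] += 1
    return {term: {issue: round(100 * n / sum(issues.values()), 1)
                   for issue, n in issues.items()}
            for term, issues in counts.items()}

# Nine identical vote rows per case in the real data; three suffice here.
votes = [(1946, "c1", "Criminal Procedure")] * 3 + \
        [(1946, "c2", "Federal Taxation")] * 3
```

The dict comprehension does the deduplicating that Repeat All Item Labels plus the second pivot table accomplish in Excel.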

And there’s a postscript to all this, of course. Last week’s post resorted to a moderately fancy formula that managed to seek out but one instance of each Court decision, preparatory to our count of caseloads by year. Now I realize that we – I – could have done the pivot-table-of-pivot-table thing to bring about the same result, and with only one field. That is, for the first pivot table-cum-data set:

Rows: Term (in conjunction with Repeat All Item Labels)

Case ID (and Tabular Form report layout)

Then pivot table the above:

Rows: Term

Values: Case ID (Count)



Hey, it’s my blog – I’m allowed to learn something, too.

The Supreme Court Database, Part 1: Doing Justice to the Numbers

4 Jul

It stands to reason as a matter of near-definitional necessity that open data in a democracy would crank open a window on its judicial system; and the United States Supreme Court seems to have obliged. The Supreme Court database, a project of Washington University that has adapted the foundational labors of the late Michigan State University Professor Harold J. Spaeth to the spreadsheet medium, affords researchers an educative profusion of information about Court doings – once you understand what you’re looking at.

That qualification broadcasts a caution to unlettered laymen such as your faithful correspondent, who still can’t understand why chewing food with one’s mouth open hasn’t been declared unconstitutional.

But enough about me. In fact, the database comprises a collection of datasets that breaks the Court information along a pair of variables, Case and Justice Centered Data, each of which supplies four variably-formatted sets tied to the above parameters. Here I’ve downloaded Cases Organized by Supreme Court Citation:


The workbook details yearly vote activity of Court justices dating from 1946 through 2015 (and you’ll note the parallel compendium for votes cast between 1791 and 1945); but again, you’ll need to appreciate what the workbook’s 61 fields want to tell you. (You can also conduct detailed data searches on site’s elaborate analysis page.)

For guidance, then, turn to the site’s very useful Documentation page, wherein the meanings behind the headers/fields are defined via the links holding down the page’s right pane (one quirk: the pane arrays the header/variables by category, and not by the sequence in which they actually appear in the workbook).

But we needn’t think too deeply about the fields’ yields in order to issue a telling first read on the Court’s caseloads. We can move to break out the number of cases the Court has considered by year, by counting the number of unique case ids posted to column A. But there’s a hitch, resembling the one that has stalled us in a number of previous posts: because the records identify each justice’s vote per case, the case ids naturally appear in quantity – in the great majority of cases nine times, registering the number of sitting justices at any given time. (The number isn’t invariably nine, however, because occasional vacancies depress the total.)

But whatever the justice count, we want to reckon the number of unique cases the Court heard during a given year. We’ve entertained a number of takes on the matter in the past, and here’s another one, this a reasonably elegant formulation adapted from a solution advanced here. Slide over to column BJ, the one adjoining the dataset’s last occupied field, name it UniqueCount, and enter in BJ2:

=IF(COUNTIF(A$2:A2,A2)>1,0,1)


Copy the above down BJ (note that, owing to processing issues, the copy-down may take some time).

What is this expression doing? It’s conducting a COUNTIF for the appearance of each entry in the A column (positioned in the formula as the COUNTIF criterion), the frozen A$2 serving to progressively expand the criterion range as the copy-down proceeds. If any instance of the formula returns an aggregate count exceeding one, the IF statement instructs it to post a zero. Thus we’re left in UniqueCount with a collection of 1’s and 0’s, the 1’s in effect counting each case id once and ignoring any additional appearances of that id. (Note as well that here we need to conduct a row-by-row count of unique entries, because the 1’s need to be summed inside the pivot table. Other formulas will deliver the total number of unique elements in a solitary formula; look here, for example, for a pretty well-known array-formulaic solution.)
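The formula’s logic ports directly to code: flag the first appearance of each case id with a 1 and every repeat with a 0, then sum the flags. A Python sketch:

```python
def unique_flags(case_ids):
    """Mirror =IF(COUNTIF(A$2:A2,A2)>1,0,1): a 1 for the first
    appearance of each id, a 0 for every subsequent one, so that the
    flags sum to the number of distinct cases."""
    seen = set()
    flags = []
    for cid in case_ids:
        flags.append(0 if cid in seen else 1)
        seen.add(cid)
    return flags
```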

Once we’ve harnessed the UniqueCount values, i.e. we’ve winnowed the data for but one citation of each case, we can put these immediately to work in this pivot table:

Rows: Term

Values: uniquecount

I get (in excerpt, after having frozen the title row):


Beginning with the early 90s, the Court appears to have, for whatever reason, imposed a dramatic scale-down in cases heard (see this analysis of the phenomenon). The Court adjudged 197 cases in 1967; by 2014 the number had contracted to 70.

For another, most interesting gauge of the court’s collective disposition, we could turn to the Direction parameter in column BF. Its entries are tidily binary; a 1 attests a justice’s conservative orientation to the vote, a 2 a liberal tilt. (Of course those conclusions require something of a judgement call; for a detailed accounting of the Database’s coding determinations look here.)

Some basic math should help ground the analysis. If the Court rules conservatively through a 5-to-4 vote, its Direction aggregate will figure to 13 – five 1’s and four 2’s – for a Direction average of 1.44 (13 divided by 9). An equivalent liberal majority comes to 14, or a 1.56 average. A theoretical ideological midpoint of 1.5, then, centers the averages.
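The vote arithmetic is compact enough to script; a minimal sketch, assuming a full bench and the Database’s 1-conservative/2-liberal coding:

```python
def direction_average(conservative, liberal):
    """Average the Database's Direction codes: each conservative vote
    contributes a 1, each liberal vote a 2."""
    total = conservative * 1 + liberal * 2
    return round(total / (conservative + liberal), 2)
```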

With those understandings in tow we can develop what could be called a directional index of Court votes, e.g.

Rows: Term

Values: Direction (average, rounded to two decimals).

I get, again in excerpt:


Of course, you’ll want to peruse all the averages, but the data seem to affirm the Court’s measurable movement to the right. From a relative liberal high of 1.70 in 1963, the averages begin to descend, first falling through the 1.5 divide in 1975, bottoming out in 1998 and 2002 at 1.40. The 2015 average of 1.49, however, plunks down right in the middle of the road – and a grouping of the terms in five-year tranches situates the votes for the latest available interval right atop the center stripe (or 1.501996, if you insist).


A logical follow-on, then, would have us average the voting propensities of individual justices, names nicely paired with their votes in the justiceName field in BC. For good measure we can easily tally the number of decisions on which each Justice opined, by simply counting the number of times his or her name appears in the field (of course a Justice votes only once per decision):

Rows: justiceName

Values: Direction (average, to two decimals)

justiceName (count)
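
The scripted analogue bundles the average and the count into one groupby; the votes below are invented, with justiceName spellings following the Database’s style:

```python
import pandas as pd

# Invented votes standing in for the justiceName and direction fields.
df = pd.DataFrame({
    "justiceName": ["WODouglas", "WODouglas", "AScalia", "AScalia", "AScalia"],
    "direction": [2, 2, 1, 1, 2],
})

# Per-justice average direction plus a vote count, as in the pivot table.
per_justice = df.groupby("justiceName")["direction"].agg(["mean", "count"])
per_justice["mean"] = per_justice["mean"].round(2)
print(per_justice.sort_values("mean"))
```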

After sorting the averages smallest to largest and pasting together two screen shot excerpts I get:


Now those tabulations are instructive, albeit not wholly unpredictable. Justices Lewis Powell, Clarence Thomas, Warren Burger, and the late Antonin Scalia hold down the farthest-right positions, with William O. Douglas – a justice publicly renowned for his juridical liberality – pushing hardest towards the left, as it were. Chief Justice Earl Warren – famously lauded and/or scored for comparable takes on the law – isn’t far behind. And for the record it was William Brennan, among the post-1945 justices at least, whose 5,325 votes mark him as the most participative; but at the same time remember that the 1946 divide artificially stunts the vote totals of justices whose appointments predated that year.

But time to recess for lunch. And I know – you’ll be watching closely to see how I chew my food.










Texas Child Abuse Data: The Grim Count

18 Jun

Credit the Austin (Texas) American-Statesman with assuming the most disagreeable, but perhaps necessary, task of recording and expounding instances of child abuse perpetrated in the state – in this case, abuse taken to lethal extremes.

The paper’s Child abuse and neglect fatality database maintains data about 779 such instances, and makes them available in spreadsheet form as well:


(You’ll need to auto-fit the C column.)

Because the above-referenced web page features an array of graphical synopses of the abuse data:


We could properly ask if a spreadsheet-driven consideration could abet the information already put into place above. Perhaps it could.

Note the uppermost graphic, a gender-coded block of icons of the victims. Looking past the pink-blue gender associations, any click on an icon triggers a caption naming the victim, his/her age, and cause of death. The icons are not rigorously ordered chronologically, and I am not sure why. Moreover, and perhaps curiously, the graphic does not supplement its content with a simple numeric count of victims by gender; and while a scan of the icons inclines the view toward a male predominance, precision can be served via this simple pivot table:

Rows: Gender

Values: Gender (count; the data are textual)

Gender (again, % of Column Total, and turn off Grand Total)
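
The same count-and-percentage pair takes two lines of pandas; the five entries below are an invented stand-in for the sheet’s 779-row Gender column:

```python
import pandas as pd

# Five invented rows standing in for the 779-row Gender column.
gender = pd.Series(["Male", "Male", "Male", "Female", "Female"])

counts = gender.value_counts()                         # the raw tallies
shares = gender.value_counts(normalize=True).round(2)  # % of column total
print(counts["Male"], shares["Male"])
```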

I get:


I for one am surprised by the disparity; why boys account for 60% of the reported victims poses a real and important question, though the finding would perhaps need to be cross-checked against data from other studies compiled for other regions. Could it be that boys are seen as somehow sturdier and more accustomed to roughhousing treatment, and hence putatively better able to weather abusive assaults? I don’t know, but again, the gender question is worth asking.

Indeed, if one reconstructs the pivot table:

Columns: Gender

Rows: Cause of Death

Values: Gender (count)
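
pd.crosstab reproduces that Rows/Columns arrangement directly; the entries below are invented stand-ins for the two fields:

```python
import pandas as pd

# Invented stand-ins for the Cause of Death and Gender fields.
df = pd.DataFrame({
    "Cause of Death": ["Blunt force", "Blunt force", "Drowning",
                       "Blunt force", "Drowning"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
})

# Cause of Death down the rows, Gender across the columns, counts in the cells.
table = pd.crosstab(df["Cause of Death"], df["Gender"])
print(table)
```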

One sees in excerpt:


We learn here that the boy-girl differential persists steadily across abuse categories.

Calculating the average age of victims could be understood as a grim but significant task, one complicated by the structure of the worksheet. Column E presents age data variously – and tragically – denominated in the month, year, and week categories recorded in F; as such we need to unify the parameters. Perhaps the simplest means of reconciliation is to pry open a column between F and G, and enter in what is now G2:


The nested IF expression identifies the time categories stored in the F column, and calibrates these into fractions of years, or whole years. Once we copy the formula down G we can execute a Copy > Paste Values into E, delete the temporary G column, and at the same time delete the Age Type field, containing as it does the now-inaccurate variable age categories. We could then, in the interests of clarity, retitle the field in E Age in Years.
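
The formula itself isn’t reproduced above, but its job is plain: in Excel terms, something on the order of =IF(F2="Months",E2/12,IF(F2="Weeks",E2/52,E2)), where the exact category labels in F are my guess. The same logic in Python:

```python
# Convert an age figure into years, given its unit category -- the nested
# IF's job. The labels "Months", "Weeks", "Years" are my assumption about
# what the F column actually holds.
def age_in_years(value, unit):
    if unit == "Months":
        return value / 12
    if unit == "Weeks":
        return value / 52
    return value  # already denominated in years

print(age_in_years(6, "Months"))   # 0.5
print(age_in_years(26, "Weeks"))   # 0.5
```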

We need at the same time to appreciate the aggregate understatement of ages the data convey. Because ages in years were clearly rounded down, we cannot know if a three-year-old victim was in actuality three years and eleven months old, for example. An important measure of precision could have been introduced with a date of birth column, and one assumes those data are available.

In any event, we can proceed to group the victim ages, perhaps in bins of one year:

Rows: Age in Years (grouped by a value of 1)

Values: Age in Years (count)

Age (again, by % of Column Total)
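
In pandas, Excel’s group-by-1 can be approximated by flooring each age to a whole year; the ages below are invented and assumed already unified into years:

```python
import pandas as pd

# Invented ages (already converted to years); group-by-1 is approximated
# by truncating each age to a whole year.
ages = pd.Series([0.3, 0.8, 1.5, 2.0, 4.2])

bins = ages.astype(int).value_counts().sort_index()
shares = (bins / len(ages)).round(2)  # % of column total
print(shares)
```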

I get:


We see that more than half the victims were two years old or less, qualified by the caution entered above. A first, if distressingly obvious, interpretation would associate vulnerability negatively with age: the younger the child, the more susceptible to deadly abuse.

Another point, apropos the Fault assignment field: the pertinent Statesman graphic – a pie chart captioned Paramour at fault – implicates a relationship partner (Paramour) in 137 of the 779 deaths recorded. That total can be swiftly verified by turning on a standard filter and simply entering Paramour in the vacant search field – it indeed turns up 137 such instances. But you’ll observe that many of these simultaneously arraign a parent; enter that term in the search field and you’ll turn up 581 records (remember that the filter will dredge up any record that features the search term anywhere within a cell). Search again for Relative, for example, and 85 more cases populate the filter. In other words, a totalling of all perpetrator types will exceed the 779 records – even as that actual, lower total of 779 is 779 too many.
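
A short pandas sketch makes the overlap concrete; the four invented entries below mimic Fault assignment cells that arraign more than one party:

```python
import pandas as pd

# Four invented Fault assignment cells; several name more than one party,
# which is why the per-category tallies overlap and sum past the record total.
fault = pd.Series([
    "Paramour; Parent",
    "Parent",
    "Relative",
    "Paramour",
])

# Mimic the filter's search box: count every cell containing each term.
for term in ["Paramour", "Parent", "Relative"]:
    print(term, fault.str.contains(term).sum())
```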

New York City Park Clean-up Data: Sweeping Through the Numbers

30 May

Thinking twice about cleaning the bedroom – you know, that bespoke chamber of horrors heaving in entropic fury, the one housing the bed on which you sleep – sometime? If you are, then how many times would you think about cleaning New York playgrounds and parks?

How about 880,000, the number of tidy-ups the city’s Sanitation Department performed on its recreational spaces for the 2015-16 fiscal year? Mighty big number – but even so, they don’t do windows.

They do just about everything else, though, and a canvass of those 880,000 (more or less – please read on) contributions to New York’s civic betterment that have been carted into the data set on the city’s open data site may inspire you to consider haulage as a next-career move. (You’ll understand I’m much too nice a guy to force WordPress to absorb 130 MB worth of download from a greedy subscriber. You’ll need to get this one yourself, I’m afraid; just click the Download button on the above link, opting for CSV for Excel or CSV for Excel (Europe), the latter presumably flipping day-month entries in order to comport with regional defaults.)

But even if you put your CV on hold, the data still have something to say, though their fields call for a spell of reflection before you proceed, apart from the necessary auto-fitting of the several date fields. Indeed – your reflection may need to be rather sustained.

If, for example, you want to break out the task activity by New York’s five boroughs – and you probably will – you may want to extract the borough codes from the gispropnum entries in column A: B=Brooklyn, M=Manhattan, Q=Queens, R=Richmond (the official name for Staten Island), and X=Bronx (you may want to consult the helpful field descriptions here). It seems to me that convenience could be served by deleting the propid data in B (these code the city properties on which work was performed, but may not contribute substantively to your analysis, though you may decide otherwise), renaming the field Borough, and entering in B2


Copy the expression down B and the borough information readies itself for grouping, after which you could presumably run a Copy > Paste Values on the B results and delete A, again in the interests of economizing. You’ll also of course observe the Null entries, which presumably report on activities not carried out on site, e.g., Staff Pickup and Lunch.
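
The expression isn’t shown above, but since the borough code is presumably just the leading letter of gispropnum, Excel’s =LEFT(A2,1) would serve; its pandas analogue, on invented property numbers:

```python
import pandas as pd

# Invented property numbers; the borough code is presumably the leading letter.
df = pd.DataFrame({"gispropnum": ["B001", "M010", "X145", "R022"]})

# First character of each entry becomes the Borough field.
df["Borough"] = df["gispropnum"].str[0]
print(df["Borough"].tolist())  # ['B', 'M', 'X', 'R']
```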

You may opt for additional field deletions as well in light of the workbook’s size, and the corollary understanding that each such deletion will relieve the book of about 880,000 data-bearing cells. You may have little use for the vehicle_number information, for example, and perhaps route_id, though I suppose both dislodgings may have to be thought through.

But there are deeper concerns. One might want to begin simply, by pivot-tabling task totals by month of the fiscal year:

Rows: date_worked (grouped by months and years. Remember that this fiscal year was booted up on July 1, 2015)
Values: date_worked (% of Column total. You may want to turn off subtotals, and certainly Grand Totals, as the latter will ineluctably add to 100%).
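
Scripted equivalently in pandas (the five date_worked values below are invented), the month-by-month shares look like this:

```python
import pandas as pd

# Five invented date_worked values from the 2015-16 fiscal year
# (which opened on July 1, 2015).
dates = pd.to_datetime(pd.Series(
    ["2015-07-14", "2015-07-30", "2015-09-02", "2016-03-18", "2016-03-21"]))

# Task counts per month, expressed as a share of the column total.
monthly = dates.groupby(dates.dt.to_period("M")).size()
shares = (monthly / monthly.sum()).round(2)
print(shares)
```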

I get:


Note the apparent, pronounced pullback of activity in September – though I don’t know why – and the orderly step-down in work from March through September, though of course that trajectory plots a seasonality course that runs backwards against the actual yearly grain.

But that read isn’t necessarily that simple, or that accurate. That’s because a good many of the records seem to cite tasks that don’t quite qualify as actual cleaning chores, e.g., Lunch, De-Mobilization. So here’s a plan B: install the activity field in both the Columns and Values areas, and in the former, filter out all activities save the one termed Work, which seems to serve as a catch-all item for…actual work on the parks. Continue to impose % of Column total on the values, and I get:
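
The plan-B filter translates to a one-line boolean subset in pandas; the toy frame below mixes invented Work rows with the non-work items mentioned above:

```python
import pandas as pd

# Invented records mixing genuine "Work" tasks with non-work items.
df = pd.DataFrame({
    "activity": ["Work", "Lunch", "Work", "De-Mobilization", "Work"],
    "date_worked": pd.to_datetime(
        ["2015-07-01", "2015-07-01", "2015-08-15", "2015-08-15", "2015-09-03"]),
})

# Keep only actual park work, then redo the monthly distribution.
work = df[df["activity"] == "Work"]
monthly = work.groupby(work["date_worked"].dt.to_period("M")).size()
work_shares = (monthly / monthly.sum()).round(2)
print(work_shares)
```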


Once we’ve so trimmed the activity data – a retrenchment that shears about 174,000 records from the whole – the monthly distribution redraws its curve. The conspicuous disparities bulge around July and August, each month here grabbing nearly two additional percentage points from the other months, even as the two September figures hold fast. Thus the obvious task-distributional suggestion – that non-work activity somehow features more numerously in July and August – remains obvious; but accountings for the numbers are not. I’m open to suggestions.

In any case, once we’ve put the work item in its place we could key the tasks to their boroughs, by exchanging the Borough field we fashioned earlier for date_worked. I get:


(Remember that N signifies the NULL entry.) It is perhaps notable that the assignments seem to have been similarly distributed across four of the city’s boroughs, with the far-less-populous Richmond well back. But these likenesses call for a deeper think, because they need to reconcile population (and of course Manhattan’s effective daytime census enormously exceeds its residential cohort) with aggregate park/playground counts and areas.

And what about the time expended on the tasks, again proceeding with the work-item filter as above? Here a New York open data field legend enters an important qualification about the nhours field: “A count of the adjusted number of hours a given task was performed [my emphasis].” You’ll see that for some instances – say row 43, a task whose start-to-end times evaluate to 35 minutes but is yet recorded by nhours as .292 hours, or 17.52 minutes – the lower estimate may reflect the simultaneous efforts of two workers, even as compensation of course would require the higher figure.
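
The row-43 arithmetic checks out: .292 adjusted hours is 17.52 minutes, i.e., roughly half the 35-minute clock span:

```python
# .292 adjusted hours expressed in minutes -- about half the 35-minute
# start-to-end span, consistent with two workers sharing the task.
print(round(0.292 * 60, 2))  # 17.52
print(round(35 / 2, 1))      # 17.5
```
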

Going with nhours, then, we could replace activity in Values with nhours and format the resulting sums accordingly:


You’ll note that even as the task numbers stood as near-equivalents for Queens and the Bronx (the latter again represented by the letter X), work in the former borough seemed considerably more time-intensive.

Again, I don’t know why, but I’m just a spreadsheet guy, after all; if you want that Pulitzer you’ll have to take it from here.