NYC Attendance Data, Part 1: Some Necessary Homework

14 Aug

This open data thing is cool; they’ve even drummed up a report, available to every inhabitant on planet earth give or take a few censors, on my very own elementary school, P.S. 165 in Queens, New York City.

No; you won’t find my name large-typed anywhere in the record of 165’s distinguished alumni, but this is an academic status report, after all, and I’m prepared to forgive the omission. The larger point – I think – is that data – attendance data – about P.S. 165, and all of its 1600 or so companion institutions comprising the New York’s public school system, are present and accounted for here. Click the Download link, proceed to the CSV for Excel option (there’s a Europe regional-formatted version there, too), and wait for its 524,000 records to find their place in your virtual lecture hall.

The worksheet roll-calls attendance numbers for the days filling the September-June school years 2015-16 and 2016-17, and don’t be fooled by its understated five fields; at least three additional parameters can be easily synthesized from those five, and to productive effect. (Note: Apart from the need to autofit the field columns, keep in mind that the entries in the REGISTER field tabulate the school’s official enrollment as of the given day, and not the number of students who actually made their way to class.)

But the data do require a touch of vetting before the analysis precedes. Row 455543, if you’re keeping score, is in effect something of a blank, save a cryptic notation in the A column that won’t further our reporting interests. As such, the row should be, er…expelled. But another, more troublesome set of entries needs to be confronted and ultimately unseated from the dataset.

Start by running this pivot table at the records:

Row: DATE (ungroup if you’re on Excel 2016)

Values: REGISTER (sum)

Sort Largest to Smallest. I get, in excerpt:


The two uppermost dates – January 13, 2016 and February 2 of that year – exhibit suspiciously large register totals, just about doubling the other sums. It appears as if the records for that pair of dates have simply, if unaccountably, been duplicated – and as such one-half of these have to be sent on their way.

And that sounds to me like a call for Remove Duplicates. Click its trusty button on the Data ribbon, and tick:


Follow through, and you’ll be told that 3172 duplicate values – which sounds precisely duplicative – have been found and shed from the dataset.

Then inspect the lower reaches of that original sort and you’ll come upon more than 20 days, the great bulk of which are dated somewhere in June (i.e. the last month of a school year), with aggregate totals less than 800,000 – substantially beneath the totals above them. I suspect some reporting shortfall is at work, and indeed the legend attaching to the data on the Open Data site states the numbers may not be final “…as schools continue to submit data after preliminary report is generated”. We’re also left to explain April 18 of this year, for example, a date on which we’re told 819 students actually appeared, on what was the last day of the spring recess. Presumably they were there for some good educational or administrative reason – or the record is there by mistake (those 819 come from two schools, the Bard High School Early College institution in Queens and the School of Integrated Learning in Brooklyn).

Now we can go about deriving those three fields I advertised above. I’d first head into column F, title it Day, and enter in F2:


And copy down. We’ve certainly seen this before; WEEKDAY will tease out the day of the week from a date-bearing cell (with 1 denoting Sunday by default), and so presages any look at attendance figures by day.

Next, we could entertain an interest in breaking out attendance data by each of New York’s five boroughs – these the object of a neat, single-lettered code in the SCHOOL_DBN field, squirreled in the third position of each code:

K – Brooklyn (Brooklyn is officially named Kings County)

M – Manhattan

Q – Queens

R- Richmond (aka Staten Island)

X – Bronx

Thus we could name G1 Borough and enter in G2:


That is, the above expression collects one character from C2, starting (and finishing) at the third character in the cell – in the case of C2, X, or Bronx. (By the way, if you enter a school’s code in Google you’ll be brought to its web page, including P.S. 165 and 25Q425, or John Bowne High School, another of my academic ports of call). Then copy down G.

The final of our three supplementary fields would simply return the actual number of students who appeared in a given school on a given day, or at least the ones who were there to be counted at 4 pm; and while of course the dataset already reports schools’ daily attendance percentages, information that might tell us what we want to know, we’re again made to revisit an old but material problem: by  manipulating percentages alone we in effect weight each school identically, irrespective of student body size. And in fact that treatment might come to serve certain analytical purposes rather well, because there may be good reason to regard schools as equivalently important units. But for the same money we may want to have it both ways, by compiling stats that give each school its proportionate due. Thus we can title H NumberStudents and enter, in H2:


If that expression seems needlessly ornate, keep in mind that the values holding down the %_OF_ATTD_TAKEN field in D aren’t percentages, but rather percentages multiplied by 100; as such they need to be cut down to size by a like decrement. And the ROUND addendum means to quash any decimals that don’t quite do the results justice. After all, if 84.2% of a school’s 1,042 charges show up, should we let the Board of Education know that 905.15 of them have been sighted in the building? I’d bet that .15 has at least foot out the door.

But do these definitional niceties matter anyway? We can find out by first setting a pivot table in motion and churning out a calculated field, which I’ll call ActualTotals:


(And no, you won’t need parentheses around the two fields.) As for the 100 multiplier, it will put the results in alignment with those %_OF_ATTD_TAKEN numbers, which again exhibit a similar order of magnitude – that is, the latter comprise percentage data times 100.

The pivot table looks then like this:


Values: %_OF_ATTD_TAKEN (average, to two decimals)

ActualTotals (average, two decimals)

I get:


The differences are small but real.

But ok, class; let’s take recess. Just remember to be back by…4 pm.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: