America’s Schools, Part 1: Some Truants Among the Data

8 Sep

The segue is inadvertent, but calling up a census of America’s schools right after our look at New York school attendance data makes for a deft transition, if I may say so myself, and I think I’ve just granted myself permission to do so.

This nationwide listing – 105,000 institutions strong – is curiously archived in a site putatively devoted to information about Hurricane Harvey. I’m not sure about the placement, but that’s where it is.

And it’s pretty big, as you’d expect, counting 105,087 schools in its fold and pitching 24 MB at your hard drive, after a download and a save as an Excel workbook. (Note: the data unroll stats for the country’s public – that is, government operated – schools only. The very large number of private and sectarian institutions diffused across the US are thus excluded from the inventory.) And if you run a simple SUM at the base of column AC, the field storing student enrollment numbers, you’ll wind up with 50,038,887, and that’s pretty big, too.

But of course, that number can’t get it exactly right. For one thing, the overview introducing the workbook tells us that the data feature “…all Public elementary and secondary education facilities in the United States as defined by…National Center for Education Statistics…for the 2012-2013 year”. And since then a few hundred thousand little Justins and Caitlins will have moved on to other venues, to be replaced by littler Treys and Emmas – and the turnover just can’t be equivalent. Moreover, the (apparent) Source Dates recorded in X track back to 2009 in many cases, though I don’t completely know how those dates are to be squared with the 2012-2013 reporting period.

Now apart from the as-usual column autofits to which the dataset obliges you, you may also want to shear those fields unlikely to figure in any analysis, though that of course is something of a judgement call. In view of the virtual equivalence of the X and Y data in A and B with those in the LATITUDE and LONGITUDE parameters in S and T, I’d do away with the former pair. I’d also mothball ADDRESS2 (and maybe ADDRESS, too – will you need their contents?). I’d surely dispense with the NAICS_CODE entries, as each and every cell among them declaims the same 611110. And I think VAL_METHOD, VAL_DATE, SOURCE (storing basic information about the school committed to web sites), and probably SHELTER_ID could be asked to leave as well, lightening my workbook by about 5.3 MB all told. On the other hand, WEBSITE appears to have done nothing but clone the contents of SOURCE and as such could presumably be dispatched as well, but I’ve since learned that the sites offer up some useful corroborating information about the schools, and so I’d retain it. But a field I would assuredly not delete, in spite of my early determination to do so, is COUNTRY. I had misled myself into believing the field comprised nothing but the USA legend, but in fact it entertains a smattering of other geopolitical references, e.g. GU for Guam, PR for Puerto Rico, and ASM for what I take to be American Samoa.
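For readers who would rather do the pruning outside Excel, here is a minimal Python sketch of the same idea. It assumes the workbook has been saved as CSV; the two sample rows are made up for illustration, and only the field names come from the workbook as described above.

```python
import csv
import io

# Fields proposed for removal above. X and Y duplicate LATITUDE/LONGITUDE,
# and NAICS_CODE holds the same value in every row.
DROP = {"X", "Y", "ADDRESS2", "NAICS_CODE", "VAL_METHOD",
        "VAL_DATE", "SOURCE", "SHELTER_ID"}

# Hypothetical two-row sample standing in for the real CSV export.
SAMPLE = """X,Y,NAME,ADDRESS2,NAICS_CODE,COUNTRY,LATITUDE,LONGITUDE
-73.98,40.77,School A,,611110,USA,40.77,-73.98
144.79,13.44,School B,,611110,GU,13.44,144.79
"""

def drop_fields(rows, unwanted):
    """Return copies of the rows without the unwanted fields."""
    return [{k: v for k, v in row.items() if k not in unwanted}
            for row in rows]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
trimmed = drop_fields(rows, DROP)

# Listing the distinct COUNTRY values before deleting that field is a
# quick way to discover that it holds more than USA.
countries = sorted({row["COUNTRY"] for row in trimmed})

print(list(trimmed[0].keys()))
print(countries)
```

Checking a field's distinct values before discarding it would have spared me my early misjudgement about COUNTRY.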

I’m also not sure all the Manhattan schools (the ones in New York county, that is) display their correct zip codes, for what it’s worth, and it might be worth something. The Beacon High School on West 61st Street is zip-coded 10022, even as it belongs, or belonged, to 10023 (though that wrong zip code informs a review of the school by US News and World Report); but the error may be excused by an updated reality: the Beacon School moved to West 44th Street in 2015, calling the timeliness of our data into a reiterated question. I’m equally uncertain why the Growing Up Green Charter School in Long Island City, Queens is mapped into New York county.

More pause-giving, perhaps, are the 1773 schools discovered inside New York City’s five counties – New York, Queens, the Bronx, Brooklyn (Kings County), and Richmond (or Staten Island; note that a Richmond county appears in several states in addition to New York). You’ll recall that our posts on New York’s attendance data, drawn from the city’s open data site, numbered about 1590 institutions. Thus any story-monger would need to research the discrepancy, but in any case it is clear that the dataset before us errs on the side of inclusiveness.
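Because Richmond county turns up in more than one state, any count of New York City schools has to filter on state as well as county. A rough sketch of that filter follows; the field names COUNTY and STATE and the sample rows are my assumptions for illustration, not names confirmed from the workbook.

```python
# County names for the five boroughs, upper-cased for comparison.
NYC_COUNTIES = {"NEW YORK", "QUEENS", "BRONX", "KINGS", "RICHMOND"}

# Hypothetical rows; COUNTY and STATE are assumed field names.
schools = [
    {"NAME": "School A", "COUNTY": "New York", "STATE": "NY"},
    {"NAME": "School B", "COUNTY": "Richmond", "STATE": "VA"},
    {"NAME": "School C", "COUNTY": "Richmond", "STATE": "NY"},
    {"NAME": "School D", "COUNTY": "Albany", "STATE": "NY"},
]

def in_nyc(row):
    """True only when both the state and the county match."""
    return (row["STATE"] == "NY"
            and row["COUNTY"].strip().upper() in NYC_COUNTIES)

nyc_schools = [row["NAME"] for row in schools if in_nyc(row)]
print(len(nyc_schools), nyc_schools)
```

Filtering on county alone would have swept School B, a Virginian, into the New York City total.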

But a lengthier pause punctuates a Largest-to-Smallest sort of the ENROLLMENT field. Drop down to the lowest reaches of the sort and you’ll find 1186 schools registering a population of 0, another 1462 reporting -1, 4493 sighting -2 persons on their premises, and 91 more submitting a contingent of -9. Moreover, you’ll have to think about the 5399 schools counting a POPULATION (a composite of the ENROLLMENT and FT_TEACHER fields) of -999. It’s not too adventurous to suggest that these have been appointed stand-ins for NA.
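A tally of the non-positive values makes the same point without a sort, and it also shows how such placeholders quietly depress a simple SUM. This is only a sketch: the enrollment figures below are made up, and reading the negative codes as NA is my surmise, not something the workbook documents.

```python
from collections import Counter

# Made-up enrollment figures, including the suspect codes seen above.
enrollments = [450, 0, -1, -2, -2, 1200, -9, 0, 3]

def tally_suspect(values):
    """Count each zero or negative value (presumed NA placeholders)."""
    return Counter(v for v in values if v <= 0)

suspect = tally_suspect(enrollments)

# A raw sum absorbs the negative codes; a guarded sum ignores them.
raw_total = sum(enrollments)
clean_total = sum(v for v in enrollments if v > 0)

print(dict(suspect))
print(raw_total, clean_total)
```

The gap between the two totals is small here, but it is a reminder that the 50,038,887 figure reported earlier was summed over these codes as well.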

In addition, we need to think about the schools declaring only 1 or 2 students on their rolls. Consider for example the Marion Technical Institute in Ocala, Florida and its 1 student and 34 full-time teachers. Visit its web site, however, and we encounter a more current student enrollment of 3 and an FTE (full-time equivalent) instructional complement of 37 (as of the 2015-16 school year), not very far from what our database maintains. But at the same time many of the 1-student schools are accompanied by FT_TEACHER values of 1 or 0 as well, and these microscopic demographics demand scrutiny. The web site for Bald Rock Community Day school in Berry Creek, California, for example, reveals no enrolment/teacher information.

What to do, then? It seems to me that any school disclosing a negative or zero enrollment – and now sorting the ENROLLMENT field highest-to-lowest will jam all of these to the bottom of the data set – should be disowned from the data set via our standard interpolation of a blank row atop row 97407, where the first zero figure sits. We’ve thus preserved these curious entries for subsequent use should their other fields prove material.
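The blank-row manoeuvre amounts to splitting the records in two while throwing nothing away. A minimal Python sketch of that split, using made-up records and keeping only the ENROLLMENT field name from the workbook:

```python
# Hypothetical records; only the ENROLLMENT field name is from the workbook.
records = [
    {"NAME": "School A", "ENROLLMENT": 450},
    {"NAME": "School B", "ENROLLMENT": -2},
    {"NAME": "School C", "ENROLLMENT": 1200},
    {"NAME": "School D", "ENROLLMENT": 0},
    {"NAME": "School E", "ENROLLMENT": 1},
]

def split_by_enrollment(rows):
    """Sort highest-to-lowest, then separate positive enrollments
    from the zero and negative ones."""
    ordered = sorted(rows, key=lambda r: r["ENROLLMENT"], reverse=True)
    kept = [r for r in ordered if r["ENROLLMENT"] > 0]
    set_aside = [r for r in ordered if r["ENROLLMENT"] <= 0]
    return kept, set_aside

kept, set_aside = split_by_enrollment(records)
print([r["NAME"] for r in kept])
print([r["NAME"] for r in set_aside])
```

Like the blank row, the split leaves the set-aside schools available should their other fields prove useful later.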

And all that begs the larger question tramping, or trampling, through the data: How much time, effort, and money should be properly outlaid in order to support the vetting of 100,000 records? Multiple answers could be proposed, but there’s a follow-on question, too: In light of the issues encountered above, should the data in the public schools workbook be analysed at all?

Well, if we’ve come this far, why not?

