New York City Restaurant Inspection Data: Tips Included

9 Jan

Expressing an interest in getting a bite to eat in New York calls for a bit of narrowing down. You will need to get a little more specific about your preferences, in light of the 26,000 or so restaurants in the city happy to fill your mouth, and evacuate your wallet.

Indeed – advising your New York crew “Let’s go to a restaurant” reminds me of the woman who stood in front of me in a Starbucks and requested a cup of coffee, the kind of order that drives baristas into a hand-crafted frenzy.

But once you’ve finally sat yourselves down you may want to learn a little more about what exactly it is you’ve gotten yourself into – literally – and the restaurant inspection data reorganized by the Enigma public data site may go a ways towards telling you more than you wanted to know. (The data are free to you, but Enigma wants you to sign up first. Remember that the lower the inspection score, the more salubrious the restaurant.)

I say “reorganized”, although Enigma will tell you they’ve “curated” the data, because the inspection outcomes have presumably been culled from New York’s remarkably near-real-time and far larger official data set, available on the city’s open data site (and Enigma’s too, though their version is three months old). The revision opens an interesting point of entry, then, to an understanding of how one party’s data have been re-presented by another.

In what, then, does Enigma’s remake of the original data consist? For one thing, they’ve proposed to distill the source data set down to a unique entry for each restaurant (keep that stratagem in mind), each of which, after all, has been subjected to several inspections. By way of verification I aimed a Remove Duplicates check at the camis field comprising restaurant ids, and came away with but six redundancies – not too bad for a compendium of nearly 25,000 records.
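For readers working outside a spreadsheet, the Remove Duplicates check can be approximated in a few lines of Python. This is only a sketch: the rows, ids, and the dba name field below are invented for illustration, and only the camis field name comes from the data set as described here.

```python
from collections import Counter

# Illustrative rows only: the ids and names are made up.
# "camis" is the restaurant-id field; "dba" is an assumed name field.
rows = [
    {"camis": "1001", "dba": "RESTAURANT A"},
    {"camis": "1002", "dba": "RESTAURANT B"},
    {"camis": "1001", "dba": "RESTAURANT A"},
    {"camis": "1003", "dba": "RESTAURANT C"},
]

def find_duplicate_ids(rows, key="camis"):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {rid: n for rid, n in counts.items() if n > 1}

def drop_duplicates(rows, key="camis"):
    """Keep the first row seen for each id, much as Remove Duplicates does."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

duplicates = find_duplicate_ids(rows)
unique_rows = drop_duplicates(rows)
print(f"{len(duplicates)} repeated id(s); {len(unique_rows)} rows kept")
```

Reporting the repeated ids before discarding anything lets you eyeball the redundancies first, which seems prudent when you don't know why they are there.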

And once having completed that chore we can run a simple but revealing pivot-tabled census of New York’s eateries by borough:

Rows: boro

Values: boro (count)

boro (again, by % of Column Total)

I get:


No one will be surprised by Manhattan’s restaurant plurality, though it should be added that the residential populations of both Brooklyn and Queens far exceed that of the storied island. In addition, keep in mind that the endless turnover of restaurants (the Quora article linked above declares an annual restaurant closure rate of 26%, though that assertion should probably be researched) turns the count into an implacably moving target.
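The borough census described above (a count of boro, plus the same count as a percentage of the column total) can also be sketched in Python. The sample rows here are invented and stand in for the de-duplicated data; only the boro field name is taken from the data set.

```python
from collections import Counter

# Invented sample rows; real data would have one row per restaurant.
sample = [
    {"boro": "MANHATTAN"},
    {"boro": "BROOKLYN"},
    {"boro": "MANHATTAN"},
    {"boro": "QUEENS"},
]

def borough_census(rows, key="boro"):
    """Return (borough, count, percent of total) tuples, largest count first."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return [
        (boro, n, round(100 * n / total, 1))
        for boro, n in counts.most_common()
    ]

census = borough_census(sample)
for boro, n, pct in census:
    print(f"{boro:<12}{n:>6}{pct:>7.1f}%")
```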

And for another thing, the Enigma set has padded the progenitor data with each restaurant’s geo-coordinates (latitude-longitude), thus priming a mapping capability. But they’ve also, befitting one of Enigma’s enigmatic apparent first principles, reformatted the inspection dates into text mode.
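Text-mode dates can be coaxed back into real ones before any sorting or comparison. The sketch below is hedged accordingly: the exact text layout of Enigma's dates isn't stated here, so the candidate formats and the function name parse_inspection_date are assumptions to be adjusted against the actual file.

```python
from datetime import datetime

# Assumed layouts; swap in whatever the downloaded file really uses.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%dT%H:%M:%S")

def parse_inspection_date(text):
    """Return a date for the first matching format, or None if none match."""
    text = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # wrong layout; try the next one
    return None

print(parse_inspection_date("2018-05-11"))
print(parse_inspection_date("05/11/2018"))
```

Returning None for unparseable text, rather than raising an error, makes it easy to count how many cells resisted conversion.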

And Enigma’s alternate take has also put the scissors to some of the set’s original fields. The Critical Flag field, naming restaurants that incurred what the Department of Health and Mental Hygiene terms critical violations, “…those most likely to contribute to food-borne illness”, is gone, and I’m not sure why. Those data sound like something you’d want to know about, and analyze.

But there’s a pointedly more serious issue besetting the data that I haven’t quite figured out. Because Enigma determined to squeeze the data into a one-record-per-restaurant yield, it had to decide exactly which record would be earmarked for retention; and common analytical sense would commend the latest such record, conveying the current inspection standing for each restaurant. But it appears that Enigma hasn’t always nominated the latest record. A spot comparison of the records across the two datasets turned up some Enigma selections that predate more current inspections for the same restaurant in the official New York workbook. And if those kinds of discrepancies riddle the Enigma data, then we need to wonder about the decision rule that authorized their inclusion – and I don’t know what it is. What would an aggregate averaging of inspection scores purport to say, if some of the scores have been superseded by newer ones? (My emailed query to Enigma about the matter remains unanswered as of this writing.)
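The spot comparison could be run across the whole data set rather than a handful of records. What follows is a sketch of the latest-record rule that common sense would commend, not a reconstruction of whatever rule Enigma applied; the inspection_date field name, the function names, and the sample rows are all hypothetical, and the dates are assumed to have been converted to real date values already.

```python
from datetime import date

def latest_per_restaurant(rows):
    """Map each camis id to its row with the most recent inspection date."""
    latest = {}
    for row in rows:
        current = latest.get(row["camis"])
        if current is None or row["inspection_date"] > current["inspection_date"]:
            latest[row["camis"]] = row
    return latest

def stale_entries(curated_rows, official_rows):
    """List ids whose curated record predates the newest official record."""
    latest = latest_per_restaurant(official_rows)
    return [
        row["camis"]
        for row in curated_rows
        if row["camis"] in latest
        and row["inspection_date"] < latest[row["camis"]]["inspection_date"]
    ]

# Invented rows: the official set has several inspections per restaurant.
official = [
    {"camis": "1001", "inspection_date": date(2017, 3, 1), "score": 12},
    {"camis": "1001", "inspection_date": date(2018, 5, 11), "score": 5},
    {"camis": "1002", "inspection_date": date(2018, 1, 20), "score": 9},
]
# The curated set has one row each; 1001 carries an older inspection.
curated = [
    {"camis": "1001", "inspection_date": date(2017, 3, 1), "score": 12},
    {"camis": "1002", "inspection_date": date(2018, 1, 20), "score": 9},
]

stale = stale_entries(curated, official)
print(stale)
```

A long list of stale ids would confirm that the discrepancies are systematic; a short one would suggest a timing artifact between the two downloads.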

Moreover, because the one-record stipulation is in force, Enigma was impelled to collapse disparate violation codes in that eponymous field. The very first record, for example, for the Morris Park Bake Shop, reports two violations coded 10F and 8C, both filed on May 11, 2018. But New York’s precedent dataset has assigned a distinct record to each of the two, easing a pivot table breakout by code.

And those code consolidations – an ineluctable follow-on of the one-record-per-restaurant decision – probably explain Enigma’s omission in turn of the original Violation Description field. Boxing multiple violations in the space of one cell might confound legibility for both researchers and readers, and so Enigma likely concluded the whole field was best expurgated – at a price, of course, because now we don’t know what the violation codes mean.

Now to be fair, Enigma also furnishes a worksheet-housed directory of those codes, which makes for a most serviceable lookup array; but the multiple-code cell structure of its inspection data makes for an exceedingly messy prospect for 24,000-plus lookup values, which must be individuated somehow.
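One way to individuate those multi-code cells is to split each one apart before attempting any lookup. The sketch below assumes the codes within a cell are separated by commas, semicolons, or spaces, which I haven't confirmed against Enigma's file; the descriptions in the lookup table are placeholders, not the Department's actual wording. The codes 10F and 8C are the ones cited above.

```python
import re
from collections import Counter

# Placeholder descriptions standing in for Enigma's code directory.
code_directory = {
    "10F": "description for 10F",
    "8C": "description for 8C",
}

def split_codes(cell):
    """Break a multi-code cell into individual codes (delimiter is assumed)."""
    return [code for code in re.split(r"[,;\s]+", cell.strip()) if code]

def describe(cell, lookup):
    """Pair each code in the cell with its directory description."""
    return [(code, lookup.get(code, "unknown code")) for code in split_codes(cell)]

def count_by_code(cells):
    """Tally violations per code, as a pivot table breakout would."""
    return Counter(code for cell in cells for code in split_codes(cell))

print(describe("10F, 8C", code_directory))
print(count_by_code(["10F, 8C", "8C", ""]))
```

Once the codes sit one per row, the tally by code becomes as easy as it is in New York's original, one-violation-per-record layout.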

But all these cogitations have given me the munchies. Where do you want to eat? You want Chinese? Fine – that pares the choices to around 2,400. Your treat.


