Chicago Story, Part 1: The Restaurant Inspection Episode

12 Oct

Hungry? Then here are about 28,000 dining suggestions aimed at your need to feed, courtesy of the Chicago Data Portal, and standing by for your download at the Food Inspections link on the Portal’s home page – as of this writing the eighth data set in the Portal directory.

(As the Chicago site sports what appears to be the standard American-city open data interface, you can click the blue export button in the result window’s upper right and select the CSV for Excel option. Tell your hard drive to prepare to be visited by about 39.5 megabytes, at least for starters.)

I’m assuming at least one eatery among the multitude is sure to slake your gustatory hankerings, provided you’re happy to add the travel expense to the Windy City to your tab, of course. And there’s another catch: while the long list will indeed supply a qualitative advisory about your the destination of your next meal, don’t expect a Michelin-like account of the likely dining experience there. These ratings, rather, are health inspections – which, upon reflection, may be the more important.
And something else, here: the 28,000 I counted out for you in the inaugural paragraph doesn’t square with the 111,000 records that actually bulk up the data set. And that’s because the set squirrels away multiple inspections for the cited restaurants, some dating back as far as 2012. Is that problematic? Could be.

But before we have a go at the data and its provocations some preliminaries need to be addressed first. 39.5 megs worth of data could stand a bit of slimming where possible, and I’d put the scissors to the Location field, a near- ubiquitous parameter in US open data sites that I’d suggest is near-redundant, too; its paired restaurant latitude and longitudes – reduced to text format precisely in virtue of the pairing – add nothing to the set, given the neighboring Latitude and Longitude fields which convey the same data, and in usable numeric mode. I’d also give the pink slip to the State field, comprising as it does nothing but the US state abbreviation IL (for Illinois) all the way down. With no differentiation comes no utility, even as that State field likewise seems to insist on making its presence felt across so many US open data sets. There’s a generic programming protocol in there that perhaps needs to be rethought. (I wouldn’t necessarily show the same disregard to the City field, though, because its complement features suburb names as well; in other words, localities other than “Chicago” have found their way in there.) And the AKA field – shoring up possible alternative names for the inspected establishments – could also probably be cast aside. You’re also likely to encounter some nasty text wraps in the Violations fields that shoot some row heights way above the Sears Building. You may want to thus flatten or unwrap these, though a meticulous vetting of 110,000 row heights could get a mite burdensome. On the other hand, those outsized verticals won’t in any way impact the data, qua data. (I’ve also had trouble scrolling up and down the sheet, and I don’t quite know why.)

But there are other, more substantive data claims on our attention that need attending, the loudest claimant bellowing at us from the Risk field, wherein the actual inspection assessments await. These are scaled on a simple 1-to-3 span (the British inspection ratings we considered here run from 1 to 5) , in ascending order of restaurant salubriousness (a 1 denotes the most problematic venues). The problem here – the obvious problem – is the puzzling, contaminative, explanatory text in the field that flanks the actual inspection values and subverts their serviceable quantitative standing. I’d respectfully submit that the clarifying text could have been productively confined to a single text box or legend somewhere else, leaving the data in the Risk field to their pure, usable numeric devices.

But that means we can’t do anything about that snarl, once a few additional complications are likewise brought to light. First of all a number of Risk-field cells are blank, with yet another cache of contrarian addresses reporting the “All” remark in lieu of a rating. Given those challenges, I’d invade the next available column (whose letter depends on how many fields you may have already deleted) and write, in row 2 (assuming the Risk field remains in E):
=IF(OR(E2=0,E2=”all”),0,VALUE(MID(E2,6,1)))

Attempting to be all things to all contingencies, the formula substitutes the value 0 for blank, or “all”-bearing cells, otherwise salvaging the inspection rating number via MID, and re-quantifying it through the good offices of VALUE. We’re helped here, in fact, by the constant, Risk[space] expression that fronts every cell – and formulas like constants. And once this deed is done you can work with your results in its own space (you’ll then need to the head the field too, of course), or you can even run a Copy > Paste Values atop the Risk field itself and replace the confounding text entries with gleaming numbers.

And now that the inspection ratings are duly countable, sum-able, and average-able, you can put those 110,000 values to work – once you’ve worked a few other things out.

How’s that for a teaser – or an appetizer?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: