Search results for 'aver'

Political Protest Data, Part 1: A Few Demonstrations

23 Dec

If a spreadsheet is here, can a chart be far behind?

Apart from giving new meaning to the phrase “poetic license”, it’s a question nevertheless worth asking, one which piqued me anew when I came across Alyssa Fowers’ Points of Light: Protest in America site, a chart-driven catalogue of the mighty spate of political demonstrations flooding the country across the 2017-19 span, dating more specifically from January 16, 2017, or four days prior to the inauguration of the current president, whose name escapes me. But in addition to the columns, bars, and bubbles she’s , the native data in their own, primordial columns offer themselves for download here (click the “download a full export…” link beneath the Can I use your protest data for…? paragraph head).

More than 22,000 protests are there for the counting, or about 20 a day if you’re averaging, and the data are remarkably current. My download records demonstrations mounted as recently as December 21, including last week’s the pro-impeachment rallies speaking their collective piece about the aforementioned chief executive. Fowers has broken out and represented the data by protest theme, state of incidence, month of staging, and the like, and touches off my question in turn: are these data best studied via the chart/datz viz medium, relegating alternative spreadsheet scrutiny to the hopelessly character based?

A fair question, though one that perhaps slopes the argument toward a reductio ad absurdum: given the choice between communicating data by chart or spreadsheet, is the former modality always to be preferred? Absurd indeed, or disconcertingly plausible?

After all, it seems people tend to think about data visually, and the charted portrayal of quantitative information with its language of proportion – e.g. this bar protrudes farther than that one, or a plotted line negotiates time in its leaps and troughs. What, then, can a spreadsheet bring to the picture – so to speak? If the answer is nothing, we will have flung the question beyond the vanishing point, and that sounds a little…absurd.

Perhaps I should have addressed so elementary – and unnervingly pertinent – a question in my very first blog post, but it probably hadn’t occurred to me then – an elephant in the room, perhaps, that must’ve been crouching behind my laptop.

So here are about three answers. First, spreadsheets afford the virtue of precision. The value signified by a jutting bar, or column, or flitting line point, or crescent of a pie slice (the latter is particular disesteemed, even by data-viz wizzes), may not tell you exactly and unambiguously want you want to know. Indeed – the very fact that so many charts garnish their outputs with data labels tells you as much. Reportage is not well served by revelations that Donald Trump received about 300 electoral votes in 2016. You want 304, not a good guess of a column’s height.

Second, a multi-parameter look at the data might – might – profit from the hard-edged clarity of a spreadsheet. For example, a distribution of demonstration themes by state committed to a more-or-less conventional chart must reserve one axis for the demonstration counts and the other to say, the themes. The state identities, then, are necessarily assigned to the legend, where things can get a touch messy. But a pivot table can align demonstration with row labels and states along the columnar orientation, nestling the counts in the interior values terrain. Dense, but clear.

And of course charts in effect need spreadsheets, are hinged upon them. Yet-to-be-aggregated data resist charting; thus if you want to learn about Civil Rights demonstrations in California you need to bring them all together first before they pull across a bar.

In any case, we could contrast the data-capturing virtues of charts and pivot tables by culling screen shots of a few of Fowers’ charts, companioning them with a pivot-tabled take of the same data, and letting you decide.

But before those chart/table tandems set forth the data need to be sharpened, to wit: the Location field bears the state names that figure in a number of the charts – but these are joined in their cells to the cities in which the demonstrations were held. Because the two-character state names (probably their postal codes) appear in each cells’ endpoint, by marking out a new State field in column I and entering in the first data cell

=RIGHT(B2,2)

we can isolate the respective state codes – with three exceptions, as it turns out. In 19 instances the formula culled the two-character “am”, delivering not a state abbreviation but rather the last two letters of the US territory Guam, which appears in its cells in full. In fact Guam has been assigned its own two-character postal code, GU; and so by running this Find and Replace:

Chart1

We could refit Guam with the prevailing nomenclature.

Another two formula outcomes yielded a #VALUE error, because neither sported a delimiting comma and corresponding state reference. One, location, Fredon Township, apparently maps somewhere in New Jersey (so saith Google; the entry Fredon, NJ appears in other records), and so I merely added a restorative, “, NJ” to the offending cell. The second discrepancy, recalling a demonstration held in Space, denotes the April 12, 2017 anti-Trump weather balloon floated by the Autonomous Space Agency Network, a decidedly non-governmental confederation of hackers, scientists, and artists. My cell rewrite, then: “Space, SP”.

And another field likewise begged our attention before the pivot tables could be commissioned: the Tags parameter in E classifying demonstration by theme, but fine-tunes each with a sub-theme of sorts set off by a semi-colon, e.g. Civil Rights; For racial justice; Martin Luther King, Jr. Because the Civil Rights rubric has in particular constellated so many of these modifiers, any pivot table invoking the Tags field would teem with granularity. I thus extracted the superordinate demonstration motifs by heading column J Theme and entering this formula in J2 and copying down:

=LEFT(E2,FIND(“;”,E2)-1)

(Note that the Civil Rights example above features two semi-colons, but FIND will pinpoint only the first instance of the entry to be found).

Now we can think about those chart/pivot table comparisons. Just don’t ask me to map that anti-Trump balloon. I think in 2-D.

 

 

Best U. S. Cities: Letting the Data Decide

11 Nov

Thinking of a change of scenery? No; I’m not talking about that fetching screen saver you just downloaded. I’m talking a real move, of the bricks and mortar kind – apocryphal floor plans, burly guys manhandling your Wedgwood and your pet, change of address forms, having two keys for three locks – in short, big fun.

But if you’re still planning on going ahead, take a look at and a download of Ben Jones’ Data World-sited ranking of what he terms 125 of the Best US Cities, a reworking of the listing assembled by US News and World Report.

It’s all a judgement call, of course, but US News has marshalled a set of parameters, which when judiciously factored, yield their rankings. City number 1, and for the third year in a row: Austin, Texas, the state’s capital and a college town, leading the pack for its “value for the money, strong job market, high quality of life and being a desirable place to live.”

OK – that last plaudit seems a touch tautological, but Austin it is. And if you’re wondering, New York City pulls in at 90, Los Angeles at 107, and San Juan, Puerto Rico, – the latter situated in a not-quite-a-state – floors the list at 125. But the larger point, of course, is to make some aggregate sense of the ranking criteria dropped across the sheet’s columns. I see no master formula into which the criteria contribute their evidence and output the rankings, so we’ll have to make do with some homemade reads on the data.

To start with, we could draw up a series of correlations pairing the rankings with the numerically-driven parameters in columns D-Q, e.g., correlate ranking with Median Month Rent. Enough of those might deliver a crude but sharpened sense of city stature.

But you’ll respectfully demur. You’ll submit that the ranking data in column A is ordinal, that is, expressive of a position in a sequence without delineating the relative space between any two positions. Thus, for example, the grade average of a school valedictorian isn’t twice as high as the student holding down the second position, even as their respective class standings are quantified 1 and 2. On the other hand, a median monthly rent comprises interval data, whereby a rent of $2,000 is indeed twice as high as a charge of $1,000.

You’re right, and as such the correlation between city ranking and rents, for example, is assailable, but not, in my view meaningless. In spite of the gainsaying, I’d allow that higher-ranking cities should in the main associate more favorably with the variables the sheet reports. But let’s see.

We can figure the correlations easily enough by aiming Excel’s CORREL function at the fields. For example, to correlate city ranking and Median Home Price in H:

=CORREL(A2:A126,H2:H126)

I get -.191, or a small negative association with home prices and ranking. With higher city rankings (denoted by the lower the number) then, comes a small but not overwhelming uptick in home prices, befitting those cities’ desirability. But again, the connection is small. (Note that if one of the paired values in a row features an NA the CORREL simply ignores the row.)

We can than apply the correlations to some of the other parameters:

With Metro Population: .426

Average Annual Salary: -.382

Median Age: -.105

Unemployment Rate: .679

Median Monthly Rent: -.211

Percent Single: .515

Violent Crime: .329 (By the way – I’m assuming the crime rates are superimposed atop a per-100,000 denominator, which appears to serve as the standard.)

Property Crime: .251

You’ll note the emphatic, not terribly surprising correlation between ranking and unemployment rate (here both dimensions reward a lower score). A city’s plenteous job market should naturally magnetize people through its perimeters; and note in turn the trenchant, but less decisive, association with ranking and average salary. The fairly high correlation with Metro Population suggests that a city’s appeal heightens somewhat with its relative smallness.

More curious and decidedly less obvious is the impressive correlation of ranking and percent single. Here loftiness of a city’s standing comports with a smaller rate of unmarried residents (but no, I don’t know if live-in partners are identified as single). (Again, don’t get confused; a “smaller” ranking of course denotes the more desirable city.) That finding could stand a round of analysis or two; after all, city quality is stereotypically paired with young, hip, unattached urban types. And by the way – though there’s nothing revelatory here – the correlations between annual salary and median home price, and median monthly rent: .658 and .756.

But of course, it’s a trio of fields, e.g. Avg High/Low Temps, AVG Annual Rainfall and Avg Commute Time – we haven’t correlated that begs the next questions: namely, why not? The answer – and we’ve encountered this complication before – is that the values staking this collection of averages have been consigned to textual status. Austin’s average commute time of 26.8 minutes, expressed precisely in those terms, is hard to quantify, because it isn’t a quantity. But couldn’t the field have been headed Avg Commute Time in Minutes instead, with authentic values posted down the column?

But the field is what it is, and so in order to restore the commute times to numerical good standing we could select the Avg Commute Time cells and point this Find and Replace at the range:

City1

That’s a space fronting the word “minutes”, a necessary detail that obviates results of the 26.8(space) kind, which would remain textual.

Now with the textual contaminants purged the correlation can proceed, evaluating to a most indifferent .012.

But while we’re at it, this array formula can correlate average commute times and city ranking without removing the minutes term:

{=CORREL(A2:A126,VALUE(LEFT(L2:L126,4)))}

The formula extracts the four leftmost characters from each Avg Commute Time cell (all of whose first four characters are numerics), convert them into values, and gets on with the correlation. (But note, on the other hand, that the array option doesn’t appear to comport with the AVG Annual Rainfall data in I, because the NAs obstruct the array calculation. The Find and Replace alternative will work, however).

And by the same token, analytical ease would have been served by allotting a field each to the average high and low temperatures, and without the degree symbol. Here you might have to make room for a couple of blank columns and enter something like

=IFERROR(VALUE(LEFT(F2,4)),”N/A”)

 

Fo the average high temperature, and copy down. You need all that because a straight-ahead VALUE(LEFT ) will turn up a good many VALUE errors, which will stymie the correlation. If you go ahead you’ll realize an association between city ranking and high average temperature of .1975 – something, but not much.

And for low temperatures, don’t ask. But if you do, try:

=IFERROR(VALUE(MID(F2, 9,4)),”N/A”)

Correlation, if you’re still with me: .2159.

Maybe not worth the effort.

 

 

 

 

 

The Grey Lady Learns Pivot Tables: NY Times J-Course Data, Part 2

15 Jul

The Intermediate tine of the three-pronged New York Times data journalistic syllabus casts its analytic lot with pivot tables, and the kinds of questions properly placed before that mighty aggregating tool. Among its several data offerings awaits a 2900-record gathering of New York (i.e. Manhattan) real estate data, naming property owners, their holdings and their assessed value, and a trove of additional metrics. Per the course’s pedagogical remit, the learners are prompted to pose a few questions of their own of the data – surely a useful heuristic – before they turn to the assignments at hand, which among other things asks “What are the things you notice about this dataset that shows it isn’t perfect?”

A good, important, and generic question, one that nevertheless isn’t advanced by the Census data sheet we reviewed in the previous post. In fact, worksheet data imperfections can assume at least two forms: a discernible scruffiness among the records, and/or design impediments that could constrain analysis, and I’m conjecturing the Times wants it staffers to concern themselves with flaws of the former stripe.

For example, if this qualifies as a blemish: once downloaded to Excel, both the start and end_date year entries present themselves in text form, thus obstructing any unmediated attempt to group those data. Yet the year_built data remain indubitably numeric, and I can’t account for the discrepancy. But at the same time, however, these data in their native Google sheet mode appear appropriately numeric, and when I copied and pasted some of the start dates to a Google sheet of my own they behaved like good little values; and moreover, the left orientation imparted to the numbers in the end_date field suddenly righted themselves (that was a pun intended) via my paste. Another Google-Microsoft lost-in-translation flashpoint, perhaps, and not a species of data imperfection, if one remains in Sheets. (Another note: I’ve been heretofore unable to actually work with the Times sheets in the their Google trappings, as access to them defaults to View Only, and seems to require permission from the sheet owners in order to actually work with them. My requests for permission have gone unrequited to date, but in fact you can copy and paste the data to a blank Google sheet and go ahead. The data are open-sourced, aren’t they?)

Far more problematic however, and presumably one of the data failings over which the Times hoped its learners would puzzle, are the disparate spellings in the owner_name field of what appears to be the one and the same New York City Department of Housing:

nyt1

(Note the last entry above is simply misspelled. The data were drawn from the coredata site, by the way, a project of New York University’s Furman Center.) And, while we’re at it:

nyt2

But the Times’ marching orders don’t oblige its learners to proceed and do something about the inconsistencies. Absent that determination, no accurate answer to the Times’ question (number 6) – “Which owner has the most buildings?” – can be enabled. Remember that the Intermediate unit is pivot-table-driven, and any table invoking the owner_name field is going to loose the untidy spate of spellings reported above.

Yet one more imperfection besetting the selfsame owner_name field is the formidable complement of cells – 381 of them, to be exact, or about 13% of all the records – that contain no owner name at all, a lacuna that likewise comprises the analysis. The Times asks its learners “Who are the biggest owners in each neighborhood based on the number of units? Limit your table to owners who have more than 1,000 units”, an exercise which would appear to call for a pivot table that looks something like this:

Rows:  Neighborhood

owner_name

Values: res_units (filtered in the Rows area for sum of res_units equal to or greater than 1000)

And that alignment of parts kicks out a set of results that, in excerpt, embody the problem:

nyt3

Indeed, both data shortcomings – the blanks and the variant spellings – degrade the findings prohibitively.

The Times also wants its learners to “Compare the average value per unit for different neighborhoods. Which is the most expensive and which is the cheapest?” That chore seems to call for a calculated field, e.g. in Excel:

nyt4

I’m just wondering if the Times cohort learned the equivalent feature for Google Sheets; perhaps it did, after all.  Its Data Training Skills List merely records the Pivot Tables rubric without elaboration. (Note in addition that the housing data sheet hoards an Income sheet from which the Neighborhood population, income, and diversity fields on the Housing sheet have been presumably drawn, probably through a series of VLOOKUPS whose yields have been subject to a mass Copy > Paste Special routine directed to the Housing sheet.)

Of course, that surmise points to a larger question: the breadth of spreadsheet capabilities spanned by the Times training. How, for example, were learners expected to apply themselves to this assignment: “Which neighborhoods will be most affected (in terms of number of units) in each of the next 10 years by expiring subsidies and which one is the most secure?” I’d try this:

Rows: Neighborhood

Columns: end_date (filtered for years 2019-2028)

Values: program_name (Count, necessarily; the data are textual)

And my table looks like this:

nyt5

Thus Central Harlem is the neighborhood most vulnerable to short-term expirations of program subsidies – by far – with the Stuyvesant Town/Turtle Bay district, really a mélange of sections on Manhattan’s East Side, the least exposed. But does my pivot table approximate toward the strategy the Times was seeking?  Again I don’t know, but a conversation with the paper’s syllabus architects about their intentions for the exercises would prove instructive – at least for me.

And that conduces toward the inexorable follow-on, then: I’m happy to offer my services to the Times, in the edifying form of a weekly column on spreadsheets, and for a magnanimously modest emolument; and I’d make myself available to help with the in-house training, too.

Just one question: will my press pass get me into all the Yankees games for free?


Addendum to the above: My previous post recounted my inability to access and edit the Times’ files in native Google Sheet mode. The paper’s Elaine Chen did get back to me yesterday (July 16), pointing to the File > Download alternative. One assumes, after all, that Times doesn’t want to approve shared file access for the multitudes, and probably for good reason. I should add that if one downloads the data in CSV instead of Excel mode, the formatting discrepancies I described in Part 1 seem to disappear.

 

The Grey Lady Learns Pivot Tables: NY Times J-Course Data, Part 1

28 Jun

This just in: The newspaper of record is rebranding itself into the newspaper of records. The Times – the one from New York, that is – has moved to evangelize the data-journalistic thing among its staff, and towards that admirable end has crafted an extended in-house workshop, syllabus and practice files/exercises made available to all the rest of us in Google Sheets here and here, respectively (ok, ok, call me the Luddite; I’m downloading the files into Excel).

The latter link above points to a sheaf of workbooks classed Advanced, Intermediate and Beginner (these rubrics sorted alphabetically, thus interpolating Beginner between the two ostensibly more challenging data collections. And note the Times cautions that even as the data sets have been mined from real-world repositories they’ve been alloyed, the better to serve their instructional purposes), and it occurred to me that a look some of the course contents might prove instructive in its own right.

We can begin with the Beginner Census_Characteristics of Older Americans (2016, 2019) workbook, whose associated Census: Worksheet exercise document asks us to unhide all its sequestered columns (about 65 of them in fact, most of which are massed at the far end of the data, something I missed repeatedly). Remember I’m downloading the data to Excel, an improvised recourse that bunches the field headers into ill-fitting Wrap Text mode. But by directing the download in Open Document mode instead the book nevertheless returns to Excel, but with the headers properly, visibly wrapped, though the columns could do with a bit of resizing (I don’t know if these little disjunctions bear traces of Google-Microsoft woofing).

The exercise text proceeds to let us know “We roughly color the group of categories. For example, the race and Hispanic stats are in light orange, and green columns are about marital status”. But no; these tinted ranges aren’t conditionally formatted, and to be fair can’t really lend themselves to those cellular  ornamentations. What shared textual/numeric datum, for example, could encourage all the ethnic data cells in columns K through V to turn orange? On the other hand, the columns brandish their colors all the way down to row 999, Google Sheet’s default row allotment maximum, though the data come to a halt at row 52.

Next, among other questions the exercise puts to us, we’re asked to “Take the average of the state mean earnings [presumably for Americans age 60 and over] and then look up the mean average for the US. Why do these numbers differ? “ Again, devoting ourselves to the 60-and-older data in the “states, 2016” sheet, and more particularly the 60+Pop; mean earnings field in column BB, that average is realized easily enough. But what mean average for the US does the Times want us to look up, and how? Of course, that very requisition may contribute to the exercise; and so after a bracing scroll-through across the 419 fields bulking up the “2016, US” sheet I stepped atop its cell K2, the one reporting mean household earnings for the 60+ plus demographic of $65,289 (sans curency format). But my lookup was eyeball-driven, and  certainly not under the steam of any maneuver typically entrusted to the redoutable V, or HLOOKUP function. Those instruments, after all, assume we know the identity of the pertinent lookup value – and we can’t know that the value reads “60 years and over; Estimate; INCOME IN THE PAST 12 MONTHS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Households – With earnings – Mean earnings (dollars)”, the header entry in cell KL1:

Times1

And so by “look up” I’m constrained to assume that the Times is asking of us a simple, unmediated, visual hunt for that information. In other words: look up, not LOOKUP.

And with the respective means – the national average recorded unitarily in the “2016, US” sheet and the state-by-state average figured in “states, 2016” – in hand, we can propose an answer to the question the exercise puts to us: namely why the two averages differ. The answer, I hope: that the state average accords equivalent weight to each of the 51 (Washington DC appears in the list) income figures irrespective of population size, while the single national figure in effect tabulates the earnings of every American, thus flattening out any skew.

And speaking as a mere auditor of the Times workshop, I’d pose the same strategic conjecture about the exercise question “Which 3 states have had the largest percentage increase in their residents who are above 60 over that time?” That is, I’d wonder if the Times expects its tutees to simply do the basic math and literally look for the three most prominent state increases – or rather, filter out a top three, a la Excel’s filter option.

But the filtering alternatives in Google Sheets can pull users in two very directions. One pathway transports them to a filter view resembling the standard Excel dropdown-menu mechanism – but I can’t find a Top 10 (or 3) possibility registered among its capabilities here. The other byway to a top 3 drops us off at the most cool FILTER function, a de facto progenitor of the yet-to-be-released Excel dynamic array function of the same name; but its workings demand an intricacy not likely to be broached in a beginner class. Thus, I suspect that the Times again wants its learners to resort to simple visual inspection in order for them to glean that top 3.

As for the actual line-item math here, should you or a Times staffer choose to proceed with the exercise, I’d hammer in a new column somewhere in the “states, 2016” sheet and slot this formula in for the District of Columbia, the first “state” in row 2:

=D2/C2-J2/I2

The column references – D,C, J, and I – offer up the base and 60+ population data for 2016 and 2009. (And yes, the formula can make do without parentheses: The order of operations take care of themselves.)

Copy down the ad hoc column, and the trio of states divulging the largest increments in the 60+ cohort will be in there, somewhere.

And if the filter isn’t working for you, why not sort the numbers largest to smallest, and check out rows 2 through 4?

 

Philadelphia Police Complaints, Part 2: One Civilian’s Review

20 May

Now that I’ve spoken my piece about the putative redundancy of the two Philadelphia civilian police complaint workbooks, fairness insists that I actually take a look into the data that’s roosted inside worksheet number two – because there are findings to be found in there, in spite of all.

That book, ppd_complaints_disciplines, concentrates its gaze on the officers against whom civilians preferred a grievance. The complaints are additionally parsed into the districts in which the alleged offenses were perpetrated, and class the complaints along with their dispositions.

Once we’ve widened the fields via the usual auto-fit we see that a good many of the complaint incidents identify multiple officers, as well as different allegations. We could then move to determine, for starters, the average number of allegations lodged per complaint. But that simple-sounding intention isn’t realized quite so straightforwardly, because we need to isolate the number of unique complaint ids in column A before we divide them into all the complaint particulars; and the elegant way out would have us travel here, to this array formula:

=COUNTA(A2:A6313)/SUM(1/COUNTIF(A2:6313,A2:6313))

The denominator – or rather the pair of denominators commencing with the SUM function – exemplifies a well-known array formula for calculating unique values in a range. The COUNTIF element subjects the A2:A6313 range of complaint ids to what are in effect criteria furnished by each and every record. Thus each of the four instances of id 15-0001 are assessed against every id entry, four of which of course happen to present the selfsame 15-0001. Thus each instance here evaluates to a count of 4, and the formula’s “1/ “ numerator reduces each to ¼ – and by adding 4 1/4s a 1 is returned – tantamount to treating 15-0001 as a single instance. That reciprocal routine is applied to each and every value in column A and then summed – yielding in our case 2779. Divide that outcome into the field’s 6312 records and we wind up with average of 2.24 allegations per complaint. (It should be added that Excel’s dynamic-array UNIQUE function would streamline the approach on which I’ve embarked here, but the dynamic arrays remain behind a locked door somewhere in Redmond, and I have no idea when the key will be found. Note as well that the dynamic arrays will only download to the Office 365 iteration of Excel.)

But that average, however informative, doesn’t apprise us of the number of actual, discrete officers implicated by each complaint, because the same officer is often cited for multiple allegations laid to the same complaint. Again, for example, complaint 15-0001 and its four allegations actually identify but two different officers – and that is the number we’re seeking here, as it would go on to contribute to a real officer-per-case average.

One way – an inelegant one – for getting there would be to pour the data through the Remove Duplicates sieve, selecting the complaint_id and officer_id fields for the duplicate search. Click through and you’ll wind up with exactly 4700 unique, remaindered records, of which 313 are blank, however; and we can’t know how many of those ciphers do, and do not, point to a given officer but once per complaint. On the other hand, because most officers are in fact identified we can acceptably assume that for those complaints directed at multiple officers the unknown party is likely not the one(s) who is named.  That supposition can’t dispel all our questions, of course, but divide 4700 by the 2779 unique complaints we derived above, and we learn that 1.69 distinct officers fell under investigative scrutiny per case – although the real quotient is probably a bit smaller.

In any event, that figure emerges at the cost of dismissing 1600 records from the data set, after which can we subject the stragglers to a formula, e.g.

{=COUNTA(B2:B4701)/SUM(1/(COUNTIF(A2:A4701,A2:A4701)))}

Inelegant indeed. For a sleeker alternative, we could first concatenate the complaint and officer ids in a new field in column I that I’m calling complaintofficer, e.g. in I2:

=A2&B2

That step positions us to cull unique officer ids by case; by running the unique-record array formula at these data we should be able to emulate the 4700 total and divide it by the other unique-finding expression aimed at the complaint ids:

=SUM(1/COUNTIF(I2:I6313,I2:I6313))}/{=SUM(1/COUNTIF(A2:I6313,A2:I6313))}

Elegance is a relative term, mind you, but it works.

On the other hand, if you wanted to develop a racial breakout of the number of individual officers charged with at least one complaint, you may here want to mobilize a Remove Duplicates by the po_race variable, because the breakout comprises multiple items (i.e. “races”; and if you adopt this tack you could save the results under a different file name, thus conserving the original, complete data set). That sequence yields 2549 separate officers, and conduces toward this pivot table:

Rows: officer_id

Value: officer_id

officer_id (again, by % of Column Total)

I get:

blogphil1

Now of course the proportions tell us little, because we need to spread them atop the racial makeup of the Philadelphia police force before any exposition can rightly commence. Note by the way that only one UNKNOWN officer informs the count here, even as we earlier turned up 313 such blank records; that’s because, of course, all the unknowns have the same blank, “duplicate” id.

Returning to the data set in toto, we can distribute allegations by their varieties. Remember of course that the 2779 complaints have accumulated an average of 2.24 charges, but each charge is exactly that – a complaint in its own right. Thus this conventional pivot table:

Rows: allegations_investigated

Values: allegations_investigated

Allegations_investigated (again, % of Column Total)

reports:

blogphil2

Apart from the indefiniteness of the modal Departmental Violation type, we need to recall that our previous post numbered 2782 such allegations populating the ppd_complaints workbook we reviewed then. It’s seems clear that the count enumerated there imputed but one allegation per complaint, a coarser take on the data than the more detailed table we’ve just minted above. In the earlier sheet, for example, Departmental Violations contribute 24.05% of all complaints; here they amount to 31.07%.

We also need to explain why our array formula here totaled 2779 unique complaint ids, when the count in ppd_complaints came to 2782.  In that connection I simply copied the already-unique ids in the ppd_complaints to a new sheet, and directed a Remove Duplicates to ppd_complaint_disciplines keyed to the same field there, and copied these as well to the new sheet. Scrolling about and doing some due diligence, I did find a few discrepancies, e.g. an absent 15-0176 among the ppd_complaint_disciplines ids.

But what’s a mere three records between spreadsheets?

 

The Hockey Stick Effect: Wayne Gretzky’s Goals, Part 2

12 Apr

There’s another parameter-in-waiting pacing behind the Wayne Gretzky goal data, one that might be worth dragging in front of the footlights and placed into dialogue with the Date field in column B. National Hockey League seasons bridge two calendar years, generally strapping on their blades in October and unlacing them in April. For example, Gretzky’s last goal – time-stamped March 29, 1999 – belongs to the 1998-1999 season, encouraging us to ask how those yearly parings might be sprung from the data, because they’re not there yet.

Of course, a catalogue of Gretzky’s season-by-season scoring accumulations is no gnostic secret; that bundle of information been in orbit in cyberspace for some time (here, for example), and so developing those data won’t presume to teach us something we don’t already know. But the seasonal goal breakdowns could be joined to other, more putatively novel findings awaiting discovery among the data, and so the exercise could be justified.

So here’s my season-welding formula. Pull into next-available-column R, head it Season, and enter in R2:

=IF(MONTH(B2)>=5,YEAR(B2)&”-“&YEAR(B2)+1,YEAR(B2)-1&”-“&YEAR(B2))

We’re looking to concatenate two consecutive years, and so the formula asks if the month of any given entry in B equals or exceeds 5, or May, or falls beneath that value. If the former, the year in B is joined to the following year, via the +1 addendum. If the month equals or postdates May, then the preceding years, operationalized by the -1, is concatenated with the year returned in the B column.

The formulas seemed to work, but as a precision check I rolled out this simple pivot table:

Row: Season

Values: Season (count, of necessity; the data are textual. The values should denote goal total by respective year).

I wound up with this, in excerpt:

Gretz1

Cross-referencing the results with the Gretzky goal data in the above hockey-reference.com link yielded a pleasing equivalence across the years.

Now for some looks in earnest at the data. Starting simply, we can juxtapose Gretzky’s goals scored at home to the ones he netted in away games:

Row: Home/Away

Values: Home/Away (count)

Home/Away (again, % of Column Total)

I get:

Gretz2

We learn that Gretzky scored a palpable majority of his goals at home, but we’d expect as much. As in nearly all team sports, NHL teams enjoy the proverbial home advantage, winning about 55% of the time – a near-equivalence to Gretzky’s ratio. That is, if home teams prevail disproportionately then their goal totals should exhibit a kindred disproportion, kind of. One difference with Gretzky, of course, is that he simply scored more of them.

And does the distribution of his goals by period pattern randomly? Let’s see:

Rows: Per (for Period)

Values: Per (Count)

Per (% of Column Totals)

I get:

Gretz3

Gretzky’s production appears to mount in games’ later stages (OT stands for the overtime period), but that finding needs to be qualified on a number of counts. We’d need first of all to track Gretzky’s average presence times on the ice; that is, was he deployed more often as games advanced toward their denouements and his valuable self was sent ice-bound at clutch time? And we’d also need to plot Gretzky’s goal timings against the league averages for such things; and while I haven’t seen those data, we can assume they’re out there somewhere.

Next, it occurred to me that a look at the winning percentages of games in which Gretzky scored might prove enlightening, once the task was properly conceived. Remember that, as a consequence of his numerous multi-goal flourishes, Gretzky’s goals scatter across far fewer than 894 games. The idea, then, is to fashion a discrete game count across which the goals were distributed; and that sounds like a call for the Discrete Count operation we’ve encountered elsewhere (here, for example). Once we isolate the actual-game totals – which should be associated uniquely with game dates – our answer should follow.

And this pivot table seems to do the job, enabled again by a tick of the Add this data to the Data Model box:

gretz4

Rows: Result

Values: Date (Distinct Count, % of Column Total)

I get:

gretz5

What have we learned? Apart from the up-front factoid that Gretzky scored in 638 of the 1487 games he played across his NHL career (638 is the numeric Grand Total above, before it was supplanted by the 100% figure in the pivot table; note Gretzky also appeared in 160 games in the World Hockey Association), we don’t know how his when-scoring 64.89% win percentage compares with his teams’ success rate when he didn’t score. I don’t have that information, and don’t know where to track it down. But it too is doubtless available.

For another analytical look-see, we can ask if Gretzky’s goals experienced some differential in the number of contributory assists that prefaced them. That is, players (up to two of them) whose passes to a teammate conduce toward the latter’s goal are rewarded with an assist; and the question could be asked, and answered here, if Gretzky’s assist-per-goal average fluctuated meaningfully.  We might to seek to know, for example, if during Gretzky’s heyday his improvisatory acumen freed him to score more unaided goals than in his career dotage, when he may have been bidden to rely more concertedly on his mates.

Since two Assist fields, one for each of the two potential per-goal assists, accompany each goal, the simplest way perhaps to initiate our query would be to enter column S, title it something like AssistCount, and enter in S2:

=COUNTA(J2:K2)

And copy down. That insurgent field readies this straightforward pivot table:

Rows: Season

Values: AssistCount (average, formatted to two decimals)

I get:

gretz6

Not much pattern guiding the data, but if you want to group the seasons in say, five-year bins, remember that because the season entries are purely textual you’ll have to mouse-select five seasons at a time and only then successively click the standard Group Selection command, ticking the collapse button as well if you wish:

gretz7

Even here, then, the variation, is minute – strikingly so.

Now for a last question we could ask about those teammates who were literally Gretzky’s most reliable assistants – that is, the players whose assist counts top their collaborative pairings with the Great One. The problem here is the two-columned distribution of the assist names, one for the first assist on a goal, the other for the (possible) second. I don’t know how a pivot table can return a unique complement of names across two fields simultaneously, preparatory to a count. If you do, get back to me; but in the meantime I turned again to the Get & Transform Data button group in the Data ribbon and moved to unpivot the data set via Power Query, by merging only the assist fields, e.g.:

gretz8

By selecting Assist1 and Assist2 and advancing to Transform > Unpivot Columns and Home > Close and Load the result looked like this, in excerpt:

gretz9

And of course you can rename Attribute and Value, say to Assist and Player.

Once there, this pivot table beckons – after you click TableTools > Tools > Summarize with Pivot Table:

Rows: Player

Values: Player (Count, sort Highest to Lowest)

I got, in excerpt:

gretz10

Nearly 22% of Gretzky’s goals received a helping hand – at least one wrapped around a stick – from his erstwhile Edmonton Oiler and Los Angeles King colleague Jari Kurri, no scoring slouch either with 601 goals of his own – a great many doubtless the beneficiary of a Gretzky assist. Then slip the Assist field beneath Player in Rows and:

gretz11

Now we learn that more than 60% of Kurri’s assists were of the proximate kind; that is, he was the penultimate custodian of the puck, before he shipped it to Gretzky for delivery into the net.

Now that’s how you Kurri favor with the Great One.

 

 

 

The Hockey Stick Effect: Wayne Gretzky’s Goals, Part 1

1 Apr

What is the measure of greatness? How about 894 records, one for each of the goals driven home by the National Hockey League’s Wayne Gretzky, aka the Great one?

That spreadsheet is as large as it gets for NHL scorers, and Tableau ace Ben Jones has infused the goal count with lots of supplementary background about each and every one of the 894, archiving the data for download on the data.world site here.

In fact the workbook makes itself available in both Excel and CSV mode, the latter requiring a text-to-columns parsing that likens it to the former. Either way, a few organizational points need to be entered.

For one thing, you’ll note that what’s called the Rank field in column A numerically ids Gretzky’s goals, in effect sorting them by newest to oldest. That is, Gretzky’s first goal – scored on October 14, 1979 – has received id 894, with the numbers decrementing ahead in time until his final score – tallied almost exactly 20 years ago on March 29, 1999 – has bottomed out with the number 1. It seems to me – and I suspect you’ll share the opinion – that the enumeration should have pulled in the opposite direction, with Gretzky’s last goal more properly checking in at 894. With that determination in mind I reversed the sequence via a standard autofill, entering 894 in cell A2, 893 in A3, and copying down.

You’ll also be struck by the unremittingly monotonic entries in the Scorer field, comprising 894 iterations of the name Wayne Gretzky. We’ve seen this before in other data sets, of course, being dragged into the data set as a likely accessory to some generic download protocol. Again, you can either ignore the field or delete it. Either way, you’re not going to use it.

And your curiosity will be stirred anew by the blank column-heading cells idling atop columns D, F, and G. It’s difficult to believe that Ben Jones, who doubtless knows whereof he speaks, would allow these most rudimentary oversights to escape his notice, but alternative explanations notwithstanding, the headings aren’t there and must be supplied.

Column D reports a binary datum – whether a Gretzky goal was scored at his team’s arena or at the rink to which his team traveled for an away game. I’ll thus entitle the field Home/Away and proceed to do something about the data themselves, whose cells remain empty when signifying a home goal and register an @ for “at”, that is, a goal netted at someone else’s arena. A pair of finds and replaces – the first, substituting an H for the blank cells, with the second supplanting the @ signs with a companion, alphabetized A – should sharpen the field’s intelligibility.

The headless column F archives game outcomes, i.e. wins, losses, or ties, and so I’ll call the field Result, or something like it. Column G denotes the phase of a game when the goal was scored, either during regulation time or overtime – or so I assumed. But a second thought soon followed on the heels of that hunch, if I may mangle the metaphor: it occurred to me that the Regulation/Overtime opposition simply recalls whether or not the game itself swung into an overtime period, irrespective of the actual times at which Gretzky scored. Could that uncertainty be relieved?

I think so, and I played it this way: first, I named the doubtful field Reg/OT, and ran a find and replace at the F column, substituting Reg for any empty cell therein. I then moved toward a pivot table:

Row Labels: Date (ungrouped, in order to exhibit each date)

Columns: Reg/OT

Values: Date (Count)

What I found is that no game date featured a value for both a regulation and overtime goal, a discovery that goes quite some way toward clinching the second speculation – namely, that the Reg/OT field entries do no more than inform us if the games necessitated an overtime period.

After all, if we confine the analysis momentarily to the games that spilled into overtime, one could most reasonably imagine that a scorer with Gretzky’s gifts would have occasionally lodged a goal in both the regulation and overtime phases of the same game; but the pivot table uncovers no such evidence. For any given date, Gretzky’s score(s) appear in either the OT or the Reg column. Moreover, some of the games – for example, November 27, 1985 – record two overtime goals, a unicorn-like impossibility in a sport in which overtime ends when the first goal is scored. (You’ll note by the way that the overtime-column goals only begin to appear in 1983, when a five-minute overtime period was instituted.)

Thus I’d aver that the Reg/OT field conveys little understanding of Gretzky’s scoring proclivities; all it does is identify games that happened to have extended themselves into overtime, and in which he scored – some time.

The Strength field cites the demographic possibilities under which Gretzky accrued his goals: EV refers to even strength, when both teams’ numeric complements on ice were equal, PP, or power play, during which the scoring team team temporarily outnumbered the other after a player was remanded to the penalty box, and SH or shorthanded, the rarest eventuality – when Gretzky scored while his team was outnumbered.

I do not, however, know with certainty what the EN entry in the Other field represents even though I probably should, and I see nothing in Data World’s data dictionary that moves to define it. It may very well stand for end, as in end of game, however; each of its 56 instances are joined to goals there were scored with fewer than two minutes left in their respective games. EN may then stand for scores achieved after the opposing goalie skated off in a losing cause and was replaced by offensive player, in order to buttress a desperate try at equalizing the game. Indeed – all 56 of the EN goals were scored in wins by Gretzky’s team.

As a matter of fact, I think I’m right. Filter the Other field for its ENs and look leftward at the Goalie field in L. There’s nothing there.

Screen Shots: Charting Movie Ratings

1 Mar

Granted; the line between an iconoclast and a smart aleck is but a few pixels wide, and so it’s not impossible to plant one foot on either side of the characterological divide. And while I’m straddling that gossamer border I’ll voice the notion, something at which I’ve hinted before: that Excel can do some things that other people do by mustering what feels like heavier artillery. That is, even as some practitioners of the programming arts unleash charts and vizzes of no mean quality upon their blogs, I’ll suggest in a whisper that highly reasonable facsimiles of the above can be emulated by Excel.

I’ve intimated as much before, e.g. my previous post, wherein I framed a scatter plot dotting NBA players’ offensive-to-defensive margins that successfully (if I do say so myself) imitated the chart synthesized by a corps of programmers applying a language or two (ggplot2) to the task. You may also recall my Excel-driven approximation of the celebrated charts of disease eradication here; and now I’m at it again.

My model this time is the chart shaped by the graphics adepts at the BBC that positions the respective film ratings of critics and viewers of 2017 Oscar-nominated films, e.g in excerpt:

movie1

Audience evaluations are captured by the red dot with the greenish speck symbolizing aggregated critics’ scores, and the interposing gray bar conveying the scope of the discrepancy between the two assessments.

It’s a neat representation, one redolent of the controls you’d slide across this or that hi-tech contraption, though of course – excuse the iconoclasm – you could ask if the above visualization delivers its message more crisply than an off-the-shelf, industry-standard column chart that treats each film to a pair of comparative bars.

But that judgement aside, a surmise followed on: could I do something similar in Excel? I suspect you know the answer.

First, the data, which was gleaned from the Metacritic site that compiles review scores. Not being able to contrive out how to access these directly I simply transcribed the numbers from screen shots of the two sheets bearing the figures, e.g.

movie2

There are indeed two sheets – one reporting movies having received relatively more favorable critic ratings, the other enumerating the ones for which audience scores were the higher. In practice for our purposes, of course, only one sheet is required; there’s no operational reason for parting the sheets on the basis of the critic/audience margins.

Moreover and as you see, the screen shots feature the critic/audience rating gap for the movies, a differential that, for our charting intentions, is irrelevant. After all, the intervening gray bar should portray those differences as a matter of course (it appears by the way that some of the critic-preferred subtractive differences aren’t quite correct).

For the sake of demonstration I then simply transcribed the data onto a spreadsheet for 20 records (say A4:C24, reserving the top row for headers) – ten in which critics’ estimations topped the popular appraisals and ten in which the support trended the other way, and rounding off the numbers when required in the interests of demo simplicity.

I then selected D5:CY5 – thus bridging the 100 columns following C and so making room for any potential audience/critic rating – and proceeded to clip their widths to .42 (a measure on which you may want to experiment), and entered in D5:

=IF(OR($B5=COLUMN(D5)-3,$C5=COLUMN(D5)-3),CHAR(149),0)

What is this formula doing? First, it needs to subtract 3 from each column reference (COLUMN identifies the column number of a reference, e.g. =COLUMN(D4) returns 4) in order to for D5:CZ5 to span values 1 through 100. It then asks if a given cell’s column value equates to the film rating stored in either B5 or C5; if so, the qualifying cell(s) (one of which should satisfy the stipulation in B5, the other in C5) installs a dot, via the CHAR(149) expression, the character number for that symbol in Calibri. And that’s a dot, not a sentence-capping period. (Note that I’ve elected to size the dots to 13 pts, given the column widths, a reading you may want to tweak.) I then copied the formula across D5:CY5.

In view of the tact that my first film – the one I recorded in row 5, Warcraft – received an average 38 ranking from critics and a boffo 82 from filmgoers, the dots should instate themselves in positions 38 and 82 – in default black. A hip color, perhaps, but we want to emulate the green and red buttons the BBC set forth – the green for the audience rating, and red for the critics’ assay. That sounds like a job for a couple of conditional formats; and so after selecting D5:CY I can write

movie3

and

movie4

These expressions ask if any cell falling within D5 and CY5 equals the value in B5 – the critics’ rating – or C5, that of the general viewers. If the former condition is met the dot turns the critics’ red; fulfillment of the latter logical test colors the dot green. Note that because the CHAR(149) realizes a textual result – that is, the dot is just that and not a conditional formatting icon – the conditional format invokes a font color change.

Now we next need to engineer the gray band that connects the dots, as it were. I selected D5:CY5, and fired up another conditional format formula:

=AND(COLUMN(D5)-3>MIN($B5:$C5),COLUMN(D5)-3<MAX($B5:$C5))

The AND statement looks for cells whose column values register a number between the respective critic and audience ratings. Cells that conform to the criteria receive a gray fill color.

And all that means that the charted scores for Warcraft look like this:

movie5

Not too bad, but I’m biased.

If you’re happy with that take, you can copy the contents of D5:CY5 down through the other films, a move that’ll bring the conditional formats along with them. Next, in the interests of presentational clarity, I’d insert a blank row between each pair of films. (If that stratagem sounds slightly tedious, see this alternative that might or might meet with your approval.)

That chore attended to, the ratings start to look something like this:

movie6

I think it’s ready for release; I may submit it to Metacritic, in fact. I, for one, give it an 84.

NBA Field Goal Data, Part 2: More Than 3 Points to Make

11 Feb

Among the metrics figured and plotted by the Medium look at NBA shot-making (and missing) is a two-way-player analysis, a comparison of the points scored by a given player to the points surrendered in his defensive capacity. As the study authors allow, the measure’s validity need be qualified by several cautions, e.g. the fact that an offensive dynamo might be assigned to guard a scorer of lesser prowess, thus padding his points differential. In any case, we can ask how a spreadsheet might applied to the task.

And that task is encouraged by the data’s CLOSEST_DEFENDER field, which identifies the player nearest a shooter at the point when he launched the shot. Thus by totalling a player’s points and subtracting the sum scored “against” him in his closest-defender role, the metric is realized. (Remember that the data issue from three-quarters of the games comprising the 2014-15 season.) But in view of the way in which the data present themselves, calculating that difference is far from straightforward.

It’s simple enough to drop this pivot table into the equation:

Rows: player_id

player_name

Values: PTS

That resultant – indubitably straightforward – apprises us of the number of points (scored via field goals, but not foul shots) credited to each player, and we’ve earmarked player_id here for inclusion in the table in order to play the standard defense against the prospect of multiple players with identical names. (Subsidiary point, one that’s been confronting my uncomprehending gaze for quite some time: fashioning a pivot table in tabular layout mode substitutes the actual data source field names in the header for those dull “Row Label” defaults. Thanks to Barbara and her How to Excel at Excel newsletter.)

But it turns out that an another, data-set-specific requirement for player_id imposes itself on the process. In fact, the player names in CLOSEST_DEFENDER are ordered last name first, surname distanced from the first by a comma, Yet the entries in player_name hew to the conventional first name/surname protocol, daubing a viscous blob between the two fields. Excuse the pun, but properly comparing the fields would call for a round of hoop-jumping that won’t propel me off my couch – not when I can make a far simpler resort to both player_id and CLOSEST_DEFENDER_PLAYER_ID, which should encourage a more useful match-up (but I can’t account for the mixed caps/lower-case usages spread across the field headings).

That understanding in tow, we can plot a second pivot table, one I’ve positioned on the same sheet as its predecessor, set down in the same row:

Rows: CLOSEST_DEFENDER_PLAYER_ID

Values:  PTS

The paired tables should look something like this, in excerpt:

nba21

Once you’ve gotten this far you may be lightly jarred by an additional curiosity scattered across the data: namely, that the closest defender outcomes comprise far more players, a few more hundred, in actuality. I looked at the stats for two of the players who appear in CLOSEST_DEFENDER_PLAYER_ID only – ids 1737 and 1882, i.e. Nazr Mohammed and Elton Brand, and learned that their offensive stats for 2014-15 were rather sparse (check out www.basketball-reference.com for the data); Mohammed averaged 1.3 shots per game that years, with Brand checking in at 2.6. It may be, then, that the data compilers decided to omit field goal stats for players falling beneath an operationalized threshold, but that conjecture is precisely that.

You may also wonder why two pivot tables need to be impressed into service, when both train their aggregating gaze at the same PTS field. It’s because we’re directed different parameters to the respective Row Label areas – one identifying the scorers, the other naming what are in effect the same players but in their capacity of defender, there lined up with someone else’s points, so to speak.

In any case, once we’ve established the tables, we can dash off a column of relatively simple lookup formulas alongside the first pivot table, one that searches for the equivalent player id in the second. I’ve named the data range in the second table defense, and can enter, assuming the first receiving cell is stationed in A4 (I’ve named the budding field Points Surrendered in A3. Remember of course that the field is external to the actual pivot table):

=VLOOKUP(A4,defense,2,FALSE)

And copy down the column.

(The FALSE argument is probably unnecessary, as the ids in both pivot tables should have been sorted as a matter of course.)

The lookups track down the ids of the players listed in the first pivot table, and grab their points surrendered totals, culminating in a joint scenario resembling this shot in excerpt:

nba22

And once engineered you can, among other things, subject the lookup results to a simple subtractive relation with the sum of pts to develop the offense/defense differential on which the Medium piece reports. You could also divide players’ points by points surrendered instead, developing a ratio that would look past absolute point totals.

Remember, however, that Point Surrendered “field” and the suggested follow-on formulas are grafts alongside, but not concomitant to, the pivot table, and as such you could unify all the fields’ status by selecting the first pivot table and running a Copy > Paste Values upon the results, thereby sieving the pivot data into a simple data set now of a piece with Points Surrendered and kindred formulas.

If we go ahead and divide pts by points surrendered and sort the results highest to lowest we see, in excerpt:

nba23

The findings are both interesting and cautionary.  Dwayne Wade’s and Lebron James’ enormous differentials may have more to do with their offensive puissance than their preventive talents, offset by the understanding, on the other hand, that a good offense may well be the best defense. What’s really needed, however, is a finer scrutiny of the players to which they’ve been assigned – and the same could be said about those who cede far more points than they score on the other end of the sort.

Of course, with all those parameters there’s no shortage of looks you can cast at the data. For example, try this pivot table:

Rows:  CLOSEST_DEFENDER

CLOSEST_DEFENDER_PLAYER_ID

Values: PTS

ShotPct (the calculated field we hammered together in the previous post).

I get in excerpt:

nba24

The intent here is to compare players’ points surrendered – an absolute measure – and the shooting percentages of the players they’ve guarded. Scan the list and you’ll see that Lebron James “held” his shooters to a middling .442 percentage, but Dwayne Wade restricted his opponents to a .394 mark, suggesting his defensive goods are for real. But again – the numbers need to be checked against the overall percentages of shooters. It may be that Mr. Wade has been issued a light workload.

And for a concluding, graphical touch, the Medium piece offers a scatter plot pairing players’ points scored by and against:

nba25

I don’t know with what tool the authors plied the chart, but I very much doubt it was Excel. In any case I managed to achieve something very similar with that application:

nba26

How? Well first, recognize that Excel simply can’t put together a scatter plot from a pivot table. If you try, you’ll be told “Please select a different chart type, or copy the data outside the Pivot Table.”

Opted for the latter counsel, I copied these data, for example:

nba27

And pasted them into a blank sheet area via Copy > Paste Values. I then selected the two columns of data, and headed toward Insert > Insert Scatter (X, Y) or Bubble Chart (to add data labels, see this You Tube video).

I did all this stuff without a programming language in sight. Does that make me a philistine?

 

New York City Restaurant Inspection Data: Tips Included

9 Jan

Expressing an interest in getting a bite to eat in New York calls for a bit of narrowing down. You will need to get a little more specific about your preferences, in light of the 26,000 or so restaurants in the city happy to fill your mouth, and evacuate your wallet.

Indeed – advising your New York crew “Let’s go to a restaurant” reminds me of the woman who stood in front of me in a Starbucks and requested a cup of coffee, the kind of order that drives baristas into a hand-crafted frenzy.

But once you’ve finally sat yourselves down you may want to learn a little more about what exactly it is you’ve gotten yourself into – literally – and the restaurant inspection data reorganized by the Enigma public data site may go a ways towards telling you more than you wanted to know (the data are free to you, but they want you to sign up first. Remember that the lower the inspection score, the more salubrious.)

I say “reorganized” – although Enigma will tell you they’ve “curated”-  the data, because the inspection outcomes have presumably been culled from New York’s remarkably near-real-time and far larger official data set, available on the city’s open data site (and Enigma’s too, though their version is three months old). The revision opens an interesting point of entry, then, to an understanding of how someone else’s data have been re-presented by someone else.

In what, then, does Enigma’s remake of the original data consist? For one thing, they’ve proposed to distill the source data set down to a unique entry for each restaurant (keep that that stratagem in mind), each of which, after all, have been subjected to several inspections.  By means of verification I aimed a Remove Duplicates check at the camis field comprising restaurant ids, and came away with but six redundancies – not too bad for a compendium of nearly 25,000 records.

And once having completed that chore we can run a simple but revealing pivot-tabled census of New York’s eateries by borough:

Rows: boro

Values: boro (count)

boro (again, by % of Column Total)

I get:

resto1

No one will be surprised by Manhattan’s restaurant plurality, though it should be added that the residential populations of both Brooklyn and Queens far exceed that of the storied island. In addition, keep in mind that the endless turnover of restaurants (the Quora article linked above declares an annual restaurant closure rate of 26%, though that assertion should probably be researched), turns the count into an implacably moving target.

And for another thing, the Engima set has padded the progenitor data with each restaurant’s geo-coordinates (latitude-longitude), thus priming a mapping capability. But they’ve also, befitting one of Enigma’s enigmatic apparent first principles, reformatted the inspection dates into text mode.

And Enigma’s alternate take has also put the scissors to some of the set’s original fields. The Critical Flag field – naming restaurants that incurred what the Department of Health and Hygiene terms critical violations, “…those most likely to contribute to food-borne illness”, is gone, and I’m not sure why. Those data sound like something you’d want to know about, and analyze.

But there’s a pointedly more serious issue besetting the data that I haven’t quite figured out. Because Engima determined to squeeze the data into a one-record-per-restaurant yield, it had to decide exactly which record would be earmarked for retention; and common analytical sense would commend the latest such record, conveying the current inspection standing for each restaurant. But it appears that Enigma hasn’t always nominated the latest record. A spot comparison of the records across the two datasets turned up some Enigma selections that predate more current inspections for the same restaurant in the official New York workbook. And if those kinds of discrepancies riddle the Enigma data, then we need to wonder about the decision rule that authorized their inclusion – and I don’t know what it is. What would an aggregate averaging of inspection scores purport to say, if some of the scores have been superseded by newer ones? (My emailed query to Enigma about the matter remains unanswered as of this writing.)

Moreover, because the one-record stipulation is in force, Enigma was impelled to collapse disparate violation codes in that eponymous field. The very first record, for example, for the Morris Park Bake Shop, reports two violations coded 10F and 8C, both filed on May 11, 2018. But New York’s precedent dataset has assigned a distinct record to each of the two, easing a pivot table breakout by code.

And those code consolidations – an ineluctable follow-on of the one-record-per-restaurant decision – probably explains Enigma’s omission in turn of the original Violation Description field. Boxing multiple violations in the space of one cell might confound legibility for both researchers and readers, and so Enigma likely concluded the whole field was best expurgated – at a price of course, because now we don’t know what the violation codes mean.

Now to be fair, Enigma also furnishes a worksheet-housed directory of those codes, which make for a most serviceable lookup array; but the multiple-code cell structure of its inspection data makes for an exceedingly messy prospect for 24,000-plus lookup values, which must be individuated somehow.

But all these cogitations have given me the munchies. Where do you want to eat? You want Chinese? Fine – that pares the choices to around 2,400. Your treat.