# Crime and City Size: Big Data, Small Associations

4 May

A literal page out of Harvard’s Shorenstein Center on Media, Politics and Public Policy, the journalistsresource.org site makes instruction, compendia, and illustrative data available to its readers, one of which spreadsheets crime data for the first six months of 2013 for the 269 American cities exceeding 100,000 inhabitants:

Data-analysis-with-crime-statistics-20131 copy

(As usual, you should consult the fine print on the Notes tab that qualifies some of the data.) The sheet – really in effect a practice file with live information – and its accompanying textual exposition presses a concern with the statistical significance of per-1,000 violent crime rates compared across the cities, operationalized on the sheet by a city’s totalled murder, rape, robbery, and aggravated assault figures (summed in column 0) multiplied by 1,000 (column P). (Note, by the way, the blank Aggravated Assault cell for Phoenix and the missing rape figure for Chicago; both omissions are accounted for by footnotes numbered in the City_notes field.) The rates are set atop a denominator furnished by Oakland, California, the city evaluating to the highest rate and compiled in column R; the resulting %less numbers (the formula could have alternatively written =(P\$2-P3)/P\$2 in cell R3, for example) forge the comparisons, of which you’ll find the statistical synopsis in N272:P279 that in turn feeds much of the discussion on the web page.

Still, there may be some additional things to say about the data. For one you may well want to draw that %less field into a pivot table – but because it’s been ferried to the other side of a blank column (presumably because it was appended to the sheet by the article’s authors, Leighton Walter Kille and John Wihbey , who don’t direct their readers to the pivot table option in the piece) you’ll need to send the column – the blank one, that is – packing, thereby readdressing the %less data to the Q column. Moreover, the %less entry currently in R2 is, and is not, a header, having been lowered one row beneath the other headers. One assumes, then, that the %less means to shore up the tautological comparison of Oakland’s data with themselves by refusing to quantify what would after all yield a zero, were the city’s figures subjected to its own denominator. But a) the field still needs a header, after all, and b) a zero in R2 would better serve the field than a textual caption that stands in the way of any attempt to group its numbers, because you can’t group a field that’s been infested by even a single cell’s worth of text – at least not in a pivot table. And then we could at last call the field %less – in cell Q1.

And once those deeds are done I’d run the field through a very standard Copy > Paste Values routine, but not in the interests of slimming the byte count by replacing 269 formulas with their hard-coded yields. Rather, the pasted values will preserve the citys’ percentage relations to Oakland- because, if, for example, we were to leave the formulas as they’re written in place and sort the data by say, the Population1 field largest to smallest, New York’s ascendency to row 2 would instate its per1000 value in the P2 references in all the %Less formulas, that replacement denominator completely rewriting the rate comparisons as a consequence. Sort New York to the top of the data set, then, and Flint, the city whose violent crime rate stands only .58% lower than Oakland’s, now returns a -248.71% figure –a violent crime rate about three-and-a-half times greater than New York’s pre-emptory baseline (remember that a -100% differential attests a value that’s twice as numerous as the value to which it’s compared). You’ll note by the way that the bullet-pointed citation of Philadelphia and Houston near the close of the web-page discussion does describe a sort by population, but doesn’t reckon with the problem we’ve just flagged.

You’ll also note a small but meaningful typo in cell A2 of the Histogram sheet that charts the cities’ per1000 frequencies in intervals (or bins) of 1. It’s clear that the FREQUENCY function – one of Excel’s off-the-shelf array formulas – has been put to the task, though all evidence of its formulaic operation has been cleared from the sheet. (Note the unattended bins column in T3:T10 that was likely bound to now-gone FREQUENCY formulas in U3:U10.) FREQUENCY’s user-defined bins/intervals gather values that fall beneath each threshold, a functional reality that the >=1 label misrepresents.
For example – with FREQUENCY, a crime rate of 1.8 would attach itself to the 2 bin in U4 on the data set worksheet, and by extension the 49 stationed among the histogram data:

Should have been labelled <=1. In fact, the actual frequencies were almost surely calculated in the main sheet in T3:U10, and were simply transcribed anew into the Histogram data. (For a pretty intelligible how-to on FREQUENCY, look here. Take particular care to read the NOTE that concludes the discussion.)

But apart from or in addition to rate frequencies, one might want to learn something about the relation of city size to the violent crime rate. A simple first pass at the question would attempt to tie size and rates via CORRELATION:

=CORREL(E2:E270,P2:P270)

I get an association of .113, bespeaking a small, but not overwhelming, positive relation between a city’s population and its violent crime (remember that the ViolentC metric composites four crime types).

And while we’re correlating, it might be worth eliciting the relationship between city size and crimes against property, those infractions counted in columns J through N (for a distinction between the US-legal understanding of robbery and burglary look here). The former assumes the victim is present and has been subjected to some assault or threat). We can take over column R, name it propper1000, and enter, in R2:

=SUM(J2:N2)/E2*1000

Then copy down R (but note that the arson figures for nine cities are missing). We could then write, somewhere:

=CORREL(E2:E270,R2:R270)

The resulting -.011 registers an even less persuasive association than that realized by city size and violent crime. That is, larger cities seem to be no more provocative of property crimes then smaller ones, common perceptions notwithstanding.

Maybe there’s a byline or two in there.