London Police Stats, Part 2: A Second Look at the Evidence

30 Jan

I trust the suspense has been bearable in the intervening seven days, and that your nails have remained immaculately unbitten during the wait for my explanation of the

=COUNTIF(B\$7:B7,B7)

formula with which I closed last week’s post.

I commended that expression as a means of easing the unsettlements wrought by our attempts last week simply to title a set of fields in the Wards workbook with which we’re working. If my WordPress search engine is to be trusted, I don’t seem to have brought COUNTIF to your attention before, so if you’re new to the function, your attention, please.

COUNTIF counts the number of entries in a range that satisfy an identified criterion (for the record, COUNTIFS, a kindred function which works with multiple criteria, is also out there, but let’s work with the original. COUNTIFS didn’t get nailed into the function list  until the 2007 release, by the way.)

In our case, we’re concerned to count the number of different London wards tracked during each month of the crime report before us, operationalized as the number of times each month reference informs the B column – one reference per ward. When a month gives way in that column to the next month, the ward count is thus supposed to start anew.

COUNTIF comprises two elements, or arguments: the range of values to be counted, and the criterion any value in the range needs to meet before it in fact gets counted. The above formula, strapped into cell E7, counts all instances of Jan-2012 in the accompanying range – right here, nothing but the uni-celled B7:B7. But in virtue of the anchored, absolute-referenced B\$7, a formula copied all the way down E will, for example read

=COUNTIF(B\$7:B112,B112)

in E112, and so on. And because the B column is sorted, and so brings all the Jan-2012, etc., entries together, the idea is to count the number of cells meeting the B, or month, criterion up to that point. Thus the above formula yields 116, or 116 different wards counted for Jan-2012 so far. The next row’s formula will evaluate to 117, and so on (remember that the actual data commence in row 7). And when you skid down to row 631 – the first of the Feb-2012 data – its attendant expression in E:

=COUNTIF(B\$7:B631,B631)

hands over the number 1, because again, a new month debuts among the data.

It seems to me that the COUNTIF alternative delivers, if I may resort to a phrase I don’t like very much, something like best practice here, because among others things, and as opposed to the formulas we’ve encountered, it’s purely cell-referenced, and as such is better sensitized to the data it’s been asked to count. The original formulas I’m proposing to replace with COUNTIF – e.g.,

=IF(F6=624,1,F6+1)

hard-code the number 624 among other things, and what that means is that should you sort the data by the Borough Ward field, for example, you’ll continue to realize a 624 at the accumulation of every 624 rows – but those rows won’t be continuing unique wards, since each ward has been sorted with its namesakes. And if that explication leaves you bleary-eyed, try the sort out (before you label those troublesome fields, thus keeping those error messages at arm’s length).

Moreover, and as a matter of standard operating procedure, impressing a fixed 624 into the formulas won’t resonate to any change in the number of wards. If London were to suddenly redraw a new ward, and the workbook were to receive its crime data, the formula would disavow all knowledge of ward 625 – and contrariwise, if for whatever reason one ward’s data were carted away from the workbook, our formula would continue to cling to the superannuated total of 624.

Now in the matter of field organization…first, the several Total columns/fields need to renamed, else Excel will insist on its own crude differentiations, Total, Total1, Total2, etc. But something else needs to be said about all those totals.

In previous posts (January 24, 2013, for example), I sounded the alert on rows that whose data comprise subtotals of the rows immediately above them, with the deleterious, potential effect of summing the same numbers twice, once they’re pivot tabled. And here, the Total columns, which after all compute subtotals, might appear to bode the same kinds of redundancy. But they don’t; they don’t because the totals here stand their ground discretely as fields, not data rows, and as a result, in fact, do an admirable thing; they spare us the chore of dreaming up the calculated fields that would achieve what our total fields are already achieving; that is, for example, totalling all Violence Against the Person offenses for the Barking and Dagenham borough ward of Abbey.

This issue – and it’s a most significant one, banging its spade atop one of the foundational elements of pivot tables – is one I’ve already taken up in my August 22 post, one that inquires into the relation of fields to items in the fields, and how these might be properly structured.

As per that tricky August excursus, our data should ideally have been promulgated along these lines:

such that each offense type and offense category would have been documented in record, and not field terms (again, you may want to review the August 22 discussion), a visioning of the data that would have facilitated pivot table summing far more fluently. But because we have the data as we have them, the multiple total – again, really, sub-total – fields have in effect done the calculated field work for us, and as such can’t truly be called redundant. In other words, let’s leave them be, albeit with new names.

Just note, by way of a concluding observation here, that the wards data as we received them overarch the records with supercategories, e.g., Violence Against the Person, Sexual Offenses, Robbery, etc.; but at the very outset we segregated these from the rest of the data precisely because they overarched them, in merged cells that would have, had they remained “in” the data, layered what in effect would have a second set of field headings upon the data, e.g.

And even if the merges were to be disassembled, and the Violence heading landed atop each column:

That revision would redefine all the current, actual, field headings (now in the second row above) as records – and that doesn’t sound right, does it?

Whew – and we haven’t even done anything with the data yet. A Part 3, anyone?