It’s a big country, the UK, but not so big that a unitary police force can’t stand watch over the nation’s railways, from London to Aberdeen and the still higher latitudes. The British Transport Police has 2600 stations to patrol, and it’s made its crime data a matter of public knowledge on the data.gov.uk website, more specifically here:
Don’t be fooled by the link’s name; you’ll see that the relevant download bears the date 2013, and indeed – the data themselves are end-pointed by February 2011 and January 2013, an ungainly, but precisely two-year, interval that needs to be dealt with one way or another. And at 15MB it’s a big file; but it’s a big country.
And once you’ve stationed yourself in front of the data you’ll find their 106,000 records read pretty crisply, and exhibit the right type of organization , more or less (e.g., the three reported years themselves a pack single field, rather than fare-beating their way into separate columns). But those records have been made to shoulder a weighty, onerous excess. About 40,000 of the 106,000 are of the All Crime and ASB (Anti-social Behaviour, as they spell it in the UK) stripe, and in my view these are simply dispensable – because each one realizes the aggregate crime total for each station – for each reported month and year. The upshot: a massive potential double-count of the numbers.
For example: The data for the first-alphabetized station, Abbey Road (no, not that one; this Road is way over in London’s east, near Canary Wharf), comprises 13 records – seven of which report a specific crime-type incidence for a specific month of a specific year, with the other six tabulating all crimes for each month/year. (Because two crime types – Robbery and Pubic Disorder and Weapons – were committed in May of 2012, both fold into the May 2012 Abbey Road All Crime and ASB record.)
Thus if you pivot table the data thusly:
Row Labels: Station_Name
Column Labels: Crime_Type
Values: Crime Count
You’ll discover that the 14 Grand-Totalled crimes ascribed to the station really enumerate seven crimes, because the seven offenses counted by All Crime and ASB item merely duplicate the actual total – and that’s a problem.
We’ve gone toe-to-toe with double-counts before (see here, for example), but those earlier confrontations brandished avowed, self-described sub/grand totals rows that presented themselves as such, but wormed into the larger data set just the same. Here, the All Crime and ASB records are unmarked cars, as it were – they simply back themselves into their appointed rows like any other record, with no subtotal pretensions whatsoever. (Note: the March station in Cambridgeshire is oddly sorted atop the station list because Excel reads the name as a month, which in fact and in effect possesses numeric standing. If March’s peculiar positioning bothers you, the fix is surprisingly begrudging; the classic label-casting strategies, e.g., prefacing the text with an apostrophe or tapping a space after the “h”, just won’t work. I’d simply sort the field, aim a Find command at the first instance of March, add the word” Station” to it and copy down. Refresh the pivot table and now you’ll find March where it belongs.)
What do to about this supererogation? Something simple, I think. Because it turns out that the All Crime and ASB rubric happens to appear earliest in the alphabet in its field, I’d run a Z to A sort, race down to row 66045, the present lodging of the first All Crime and ASB, and insert our tried-and-true blank row. While as always I’m happy to be persuaded otherwise, I don’t think the remaining data and the analyses they encourage will be in any way impaired. And had the All Crime and ASB designation not cooperated and sorted itself somewhere in the middle of the pack instead, I’d have introduced a new, temporary column somewhere and loaded its first row with something like this:
=IF(A2=”All Crime and ASB”,”zzz”,A2)
I’d have then sorted all the zzzs to the bottom, and deleted accordingly.
But because neurosis must be served, I’d leave those 40,000 outcasts in the sheet, one row away – just in case someone decides they’re needed after all.
Now think about another issue, this one likewise raising a double-count concern, albeit bearing a slightly distinct provenance. Most stations report a passenger number, what I take to be an aggregate station traffic estimate (and estimates some of them clearly are, what with their zeroes-laden totals, e.g., the Acton Town and Aldershot stations). Have these numbers gathered themselves across all 24 of the recorded months? I don’t know, but the larger point is that each station number appears to be invariant, featuring in each and every station record. Thus the 84 rows of Abbey Wood (not Road, and my count assumes you’ve sorted out and shelved the All Crime and ASB records) data all disclose the same 1515106 passenger number, and therein lays the complication. Because even as each record marks out a subset of all Abbey Wood crime, each passenger number upholds the same overall figure, record after record. This granular disconnect needs to be refit too, because, for example, this pivot table
Row Labels: Station_Name
Values: Crime Count (Sum)
Passenger numbers (Sum)
won’t work, because a station’s constant Passenger Number will, by default, be added again and again, along with the crime counts. And that’s not a double-count there – it could be a 100-count.
So let’s all think about that one.