Chance has nothing to do with it, according to the just-published New York Times scrutiny of the race-driven maldistribution of police traffic stops in Greensboro, North Carolina, the piece finding black drivers far more susceptible to stops than their white fellow travellers. That investigation had me describing a beeline to the nearest substantiating data, and I found something here that seemed to fulfill the requisition:
Those data are enlightening and seem to prove the Times’ point (assuming controls for ethnic distributions and miles driven by race have been instituted) but at the same time they could be deemed limited, with apologies to the Greensboro Police; I had something less summatory, something more microscopic in mind, as it were, a grander data set sporting its record-by-record details and other, hitherto uncharted fields that could, when properly cultivated, take some analytical roads less traveled. And in fact I did find what I was looking for, more or less, but only after this post went to press in its (ahem) first edition: a collection of North Carolina-city spreadsheets recounting traffic-stop data, courtesy of Professor Frank Baumgartner, an expert on the matter.
Still, those data aggregate stops and searches by police officer IDs, and don’t drill down to the contributory incident records. My pickaxe did, however, clang against this huge trove of traffic stop data from the state of Connecticut here:
I said huge – as in 841,000 records huge, its accumulation of police interventions tracking back to October 1, 2013. In other words, if there’s a lawn out there you’ve been waiting to cut, grab the mower now and get out there while the file oozes across your bedraggled RAM. (Be advised that the latter-cited link up there, having been trucked into a Connecticut state open data warehouse with its standard-issue US interface, supports a filtering option you may well want to exercise, should the data behemoth in the room set your walls bulging. In other words, you’ll want to save this workbook very occasionally.)
And because this data set just won’t quit you’ll probably look to push the byte count back where you can, by throwing the barricades around fields upon which you’ll likely never call. I’d start with the completely superfluous Organization Activity Text field, appearing to incant the phrase Racial Profile (presumably to name the thematic concern of the group conducting the study) ceaselessly down the column. I’m tempted to tell Philip Glass about it. I’d also probably hang the Closed sign atop the Intervention Location Description Text, a parameter that adds little value to the superordinate Intervention Location Name.
The same dispensability could be stamped upon Statutary Citation; it’s overwhelmingly spiked with NAs and probably not advancing the plot any further than the companion Statute Code Identification ID. Organization Identification ID appears to merely code the adjoining and more forthcoming Department Name field, and as such I’d debar it too. I’m also tempted to bid farewell to Day of Week, an inarguably valuable bit of information that could nevertheless be derived from a pivot table deployment via the Group option (albeit in numeric, as opposed to named, day-of-the-week form). But because I don’t yet know what the three Intervention Code fields mean to tell us I’m leaving them be, at least for now. But even if we stop here, we’ve blown away 3.3 million cells’ worth of data, not a shabby day’s work.
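For readers who would rather do the pruning outside the spreadsheet, here is a minimal Python sketch of the same column cull. The field names are the ones quoted above; the two sample rows, the department IDs, and the helper name drop_columns are invented purely for illustration, and the real download may label or order its fields differently.

```python
import csv
import io

# Field names as quoted in the post; everything else here is hypothetical.
DROP = {
    "Organization Activity Text",
    "Intervention Location Description Text",
    "Statutary Citation",
    "Organization Identification ID",
}

def drop_columns(rows, unwanted):
    """Return copies of the row dicts without the unwanted fields."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]

# Two made-up records standing in for the 841,909-row file.
sample_csv = """Department Name,Organization Identification ID,Organization Activity Text,Intervention Location Name,Intervention Location Description Text,Statutary Citation,Subject Age
Derby,X001,Racial Profile,Derby,Main St,NA,34
Hartford,X002,Racial Profile,Hartford,Elm St,NA,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
trimmed = drop_columns(rows, DROP)
print(list(trimmed[0].keys()))

# Cells removed from the full file: four fields times the post's record count.
cells_removed = len(DROP) * 841_909
print(cells_removed)
```

Four fields across 841,909 records is where the figure of roughly 3.3 million cells comes from; Day of Week is left in place in this sketch.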
Our next remit, one born of experience, is to do some stopping and searching of our own for any gremlins haunting the data. For example, a temporary pivot table comprising the following structure:
Row Labels: Subject Age
Values: Subject Age (Count)
will apprise you of the 1237 zero-year-olds whose behind-the-wheel conduct forced the authorities to pull them over – and I don’t blame them. And the seven -28-year-old drivers on Connecticut roads surely had no complaint when they were waved to the shoulder by the local constabulary. Of course these unassailably mistaken entries amount to nothing but roughly 0.15 percent of the whole (1,244 of 841,909 records), and in any event I’m not sure they call for excision from the data set – because the information in their other fields may yet be intelligibly applied to other, non-age-driven analyses. And besides, one could, perhaps, write off a -28 as a wrongly-signed 28, and entitle oneself to treat the value as such – perhaps; and those zeros might portend nothing more than a Not Available indication. But the 55 one-year-olds whose dodgy driving beamed them across the radar need to be considered, along with the 1876-year-old motorist flagged down at 3:12PM on November 13, 2013 in the town of Derby.
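The same count-by-age check can be sketched in a few lines of Python, for anyone who wants it outside a pivot table. The sample ages below are invented stand-ins for the Subject Age column, and the plausibility bounds of 16 and 100 are my own assumption, not anything the data set declares.

```python
from collections import Counter

def age_counts(ages):
    """Mimic the pivot table: each distinct age and how often it occurs, sorted by age."""
    return dict(sorted(Counter(ages).items()))

def flag_implausible(ages, low=16, high=100):
    """Return ages outside an assumed plausible driving range (bounds are a guess)."""
    return [a for a in ages if not low <= a <= high]

# Hypothetical values echoing the oddities described above.
sample_ages = [34, 0, 0, -28, 1, 52, 1876, 34, 19]

print(age_counts(sample_ages))
print(flag_implausible(sample_ages))
```

Flagging rather than deleting keeps the other fields of those records available for analyses that don't care about age.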
On the other hand, the Subject Sex Code data are admirably binary, offering up nothing but Fs and Ms and no blanks, and the date and time data remain resolutely numeric throughout – no small attainment for 841,909 records. The inappropriately formatted dates, needlessly freighted by 0:00 times throughout, do no damage to their usability, and can be cosmetized by the Short Date format, for example.
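A rough Python equivalent of those two checks might look like the sketch below. The timestamp layout (month/day/year followed by a 0:00 time) is an assumption on my part about how the dates arrive, and the function names are hypothetical.

```python
from datetime import datetime

def unexpected_codes(values, allowed=("F", "M")):
    """Return any sex codes other than the allowed ones, blanks included."""
    return sorted({v for v in values if v not in allowed})

def to_short_date(stamp, fmt="%m/%d/%Y %H:%M"):
    """Drop the superfluous midnight time; the input format is assumed."""
    return datetime.strptime(stamp, fmt).date().isoformat()

def weekday_name(stamp, fmt="%m/%d/%Y %H:%M"):
    """Derive the day of the week from the date itself."""
    return datetime.strptime(stamp, fmt).strftime("%A")

# Hypothetical sample values.
print(unexpected_codes(["F", "M", "M", "F"]))
print(to_short_date("11/13/2013 0:00"))
print(weekday_name("11/13/2013 0:00"))
```

An empty list from unexpected_codes corresponds to the clean, binary column described above; a blank or stray code would show up in the result.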
All in all, then, the data quality isn’t bad. No excuses, then, if we can’t do anything interesting with them.