Big data is a relative term, I’d allow. Redolent of prodigious, recondite servers and command-central cowboys programming their way through all that digital effluvia, big data could nevertheless be made to implode into spreadsheet dimension, comprising something like the 685,000 records overrunning the New York Police Department’s ledger of stop-and-frisk encounters for 2011. That workbook, a gorilla far too mammoth to be reposed in WordPress’s holding area, can be hauled up here:
And you thought my 348,000 food ratings worth of data (August 14 and 21 posts) was big.
But once you’ve succeeded in finding a place in your creaking RAM for all those rows and columns – if you’ve succeeded – the analytical prospects massed within should justify the exertion. Stop-and-frisk searches of citizens who have, rightly or wrongly, been placed under on-the-fly suspicions of the potential for wrongdoing, are cause for perennial controversy in the States (see, for example this recent New Yorker piece), what with allegations that young African-American men are a good deal more likely to be placed under this sort of improvised scrutiny than everyone else; and so the data – which are, among other parameters, coded by race – are likely to beam some light on the matter.
But again, you need to get the workbook on your screen first. Truth to be told, I was surprised my proletarian, workaday laptop and its demotic 4-gigabyte stash of RAM was up to the task, but it was, eventually, though you can probably wash your car during the time the file takes from call-up to appearance if your machine is similarly outfitted. When the smoke cleared following my first access, I found myself mano y mano with 232 MB worth of spreadsheet; but by liquidating columns that seemed unintelligible or less-than-serviceable, I managed to squeeze 129 MB worth of air out of the workbook – after which my RAM was able to exhale, at last.
And after which I was able to move from file to data management, where there was work to be done too. While many of the fields and their contents appeared self-evident and important, e.g., suspect height, ages, and weights – a few thousand or so weight readings seemed improbably, or impossibly large. Over 700 records featured suspect weights of 400 pounds of more, including a fearsome class of over 500 999-pounders – these, one assumes, data-entry proxies for no information. It’s hard to know where to draw the line here – there are 400-pound folks out there, after all – but there is where I drew it and introduced a blank row immediately before that number first appears, after having sorted by the weight field (it should be added that other parameters associated with these individuals – age, for example – seemed perfectly plausible). By the same token, it’s hard to know what to make of the detainees listed at 36 inches tall; it’s not inconceivable these were children, but some pause need be given here and at the other end of the ruler as well, at which 90-plus inch suspects have likewise apparently been sighted. And I remain clueless about the 20 one-year-olds who seem to have been subjected to police scrutiny.
Those issues matter, to be sure, but two other fields call for a more thoroughgoing think-through. Datestop and timestop, respectively, denoting the date and time at which a putative suspect was stopped, both exhibit the barest, unqualified numerical form e.g.:
(Pct, by the way, stands for police precinct.) That first date stop means to report December 7, 2011 (remember these are US-formatted data); its associated time purports to evaluate to 19:24. But that date is, in its present incarnation, nothing but 12,072,011, and its time merely 1,924. Any intention to group or otherwise understand the data in the spirit in which they were clearly intended needs to be anticipated by a tweak or two.
Turning to the dates, we can entitle ourselves to a breezy disregard of each record’s final four digits, because the spreadsheet devotes itself to 2011 data – and a result those digits teach us nothing we didn’t already know. It seems to me the simplest expedient here is to isolate month and day outcomes in independent columns, rather than jump through some formulaic ring of fire culminating in a wholly transfigured, actual date. I’d inset a column, then, between B and C, call it Month, and enter in C2:
This expression looks at the number in B2, inquiring after its length – which should comprise either 7 or 8 digits, either/or because the data don’t sport trailing month zeros. If the length is 8, the formula simply captures two month-signifying digits from the value’s left; otherwise, it dusts off but one. Couching the expression in VALUE prepares the result for numerical grouping, in the event you want to so treat the data (the implication clued here then is that, absent the application of VALUE, the results will assume textual form).
Then add a column between what is now C and E, dub it Day, and enter in D2:
That careening between LEFT and RIGHT seems a little dizzying, but it works like this: We can infer, through an inspection of a sample of values, that, in contradistinction to months, every day does present two digits, including a trailing zero when necessary. Since we know that the six rightmost digits in every entry contain day and year information, we can carry away the two leftmost digits of those six, and that’s what the formula endeavors to do, that is: take the six rightmost digits from each cell value and from these uncouple the leftmost two. Again, VALUE quantifies the result.
The Time variable jams a stickier wicket into the turf, though, because of time’s 24-hour, 60-minute nomenclature. A stop-and-frisk-recorded time of 2055, for example, is simply not re-readable as 20:55, and the variable lengths of the time expressions here – two to four characters – throws another pylon into the road (54 in the data, for example, presumably means 00:54, and 140, 1:40). To make some usable sense of the data, I’d set these suggestions forth:
Clear out a column between E and G, head it Hour, and in F2 write
Welcome to a nested IF statement, and note the syntax therein, which declares the following: if the time reading in E2 amounts to three characters, then return a 0 in the formula cell, because again, a two-digit number must register a time in the midnight (00) hour. If, on the other hand, the cell offers three characters, grab its first digit (denoting hours 1 through 9). If neither condition is satisfied, that is if in effect the cell comprises four characters, then pinch two characters from the cell’s left. Note that the last-cited condition appears solo, not joined to an IF; that’s because it serves as a default, fall-back consequence when no other condition is met.
And if you need minute data, force another column, this one between what is presently F and G, call is Minute, and enter in G1:
That’s all. We can afford the simplicity here because all the time data cells seem always to devote their right two characters to minute data (understanding, of course, that some cells consist of only two characters).
One last point for now – I told you it’s a big file; don’t try to save it too often, unless you have another car to wash.