Cell Breakout: Crime Data and Dates

14 Aug

Time for a bit of review. Spreadsheetjournalism is in the first instance about how spreadsheets might be placed in the service of story-seeking, a marching order coupled to a corollary remit – namely, to attempt, through a judicious tweaking of the data as they currently stand, to synergize some novel, reportable take on the information that wasn’t there before, or at least that had been hiding-and-seeking coyly in the brush. I’m not necessarily in love with the term, but we could call this sort of enterprise value-added journalism, an instance of which we saw it in the post on the 2008 election. Here’s instance number two.

Crime, like it or not, is a reporter’s staple, and crime data abound on the net, variously guised and variously usable. One notable stockpile of data occupies the shelves at data.dc.gov, a site devoted to informational bits and pieces on the nation’s capital; and I’ve adapted their Crime Incidents 2011 workbook for download here:

WashDC crime data 2011

(The data in their original form had to be taken back to the shop for some retooling; had I asked you to call up the workbook directly from the data.dc site the attendant issues would have pushed us off message).

Look around, and begin to think about what’s up with the data. (Click here for an explanation of some the more recondite fields on the sheet; thanks to Christopher Marshall of dc.gov for the heads-up.)

Of course you could begin unprepossessingly, by simply aggregating all crimes by their categories (needless to say, our view is confined to reported crimes) by running them through a pivot table, or by subjecting them to the DSUM function or even, if you insist, trotting out the hoary and ungainly Subtotal option, a feature I never use. You could do any of these things, to be sure, but your findings won’t quite qualify as stop-the-presses material; we’re entitled to assume that somewhere in DC that story has already been filed. But wait – I think I’ve experienced a low-grade epiphany: what if we were to analyze the distribution of crimes by day of the week? That might stake us to a less-than-obvious angle on the matter, particularly if we could bare day-specific variations in crime types. There’s just one problem with all that: even as the workbook before us indeed sports a REPORTDATETIME field in the C column, actual days of the week are nowhere recorded. So that’s where we come in; we need to somehow pluck days from the dates streaming down C.

You’ll be pleased to know that the means for doing this is rather simple.  Make your way to the K column, rewrite its heading to WEEKDAY or something kindred, and enter in K2:

=WEEKDAY(C2)

(You’re doubtless wondering why I elected the K column. It’s because the data there are completely redundant, consisting exclusively of the word WASHINGTON, thus adding nothing to the narrative. Rather than advancing to the next empty column – W – I’ve opted for a column far closer to the data with which we’re working.)

You should return a number somewhere between 1 and 7. I get 1 (but if you’ve sorted the dates you may see a different value), standing for the first day of the week – i.e., Sunday, day one in Excel’s calendrical view. (And don’t be fooled, by the way – WEEKDAY gives back any day of the week, not just Monday through Friday.) Copy that formula down the column, and you should get something like this:

You see what WEEKDAY does, and does easily – it extracts the day, in numerical terms, from a date entry. But I know what you’re thinking: wouldn’t readability be served by replacing the numerical day entry with its associated name, e.g., Sunday in lieu of 1? Agreed; and here’s what I’d do: in a blank area, say starting in cell Z1, enter the values 1 through 7 down the Z column. In the adjoining column AA enter the days of the week beginning with Sunday (you may know that you need only actually type Sunday in AA1; return to that cell, click on its fill handle and drag down, and the other day names appear in sequence, a consequence of one of Excel’s built-in fill routines). Then select cells Z1:AA7 and name the range DAYS (by clicking in the Name box and typing DAYS and tapping enter:

There’s a very good reason to name a range, as we’ll see.  Now return to K2 – the first of the day-identifying cells – and now type and follow with Enter:

=VLOOKUP(WEEKDAY(C2),DAYS,2)

What’s happening here? We’ve deployed a VLOOKUP function, a near-indispensable tool in the box that consists, in its default mode, of three elements or arguments, as they’re officially termed. The first alludes to that which is being looked up – in our case the numbered day realized by WEEKDAY, which has now been nested in the lookup. The second argument – the reference to DAYS – names the lookup range in which the numerical value we’ve just returned with WEEKDAY is to be looked up; and the final argument, here the number 2 – pinpoints the column in which the “answer” – that is, the day name – is stored.

In other words, and assuming the day in cell C2 is Sunday, the lookup returns the number 1 via the nested WEEKDAY formula, then reckons that number against the DAYS range (in that range’s first column), and then “looks” in turn to the second column – the 2 in the VLOOKUP formula – for the day name, in this case Sunday. In other words, 1=Sunday. And naming Z1:AA7 liberates you from the chore of having to deal with all those nasty absolute-reference dollar signs that beg your attention when you copy the range down a column. Naming Z1:AA7 as DAYS means that those coordinates always remain Z1:AA7, irrespective of any new, copied location.

And once the weekday names have been engineered in K, it’s but a short commute to a revealing pivot table (note: you should first click the Refresh All button in the Data ribbon before you proceed, so as to incorporate that amended WEEKDAY K-column heading into the table-to-be). Then once you’ve inserted the pivot table

Drag WEEKDAY to the Row Labels area.

Drag OFFENSE to the Column Labels area.

Drag OFFENSE (yes, the selfsame field – see the quick note-in-passing on this issue in my August 7 post) to the Values area.

You should see:

(ADW stands for assault with a dangerous weapon. THEFT F/AUTO signifies theft from an auto.)

OK – that looks moderately cool, but before we scan the data you probably want to know why Monday is instated first among the days, particularly after we observed that the WEEKDAY function nominates Sunday as day 1. Good question, the answer to which seems to turn on the day listing in Excel’s Custom Lists area (reachable via the File tab>Options>Advanced>General drill-down), in which Monday indeed is assigned the first position.

In any event, if you’re happy with the alignment above, fine. If, however, you do want Sunday to head the day list, right click on Sunday and maneuver here:

and click accordingly.

Now for the data. It’s notable that far fewer thefts are committed on Sunday; whether that ties in to a day-of-rest mindset I don’t know. Note in addition the homicide and ADW totals for Saturday, for example. There’s a good deal to think about here, as some crimes don’t appear to be distributed randomly – and the sample size of over 32,000 offenses is pretty formidable.

Now watch this. Click in any cell in the pivot table and click Pivot Table Tools>Options>Show Values As>% of Column Total:

34% of all arsons committed on Tuesdays? Hmmm. 44% of all homicides perpetrated on the two weekend days? The next step might be to contact a criminologist at a local university, or some police bigwig for some learned explication. In any case, one trusts you see the possibilities.

Of course, this is Washington DC, and you probably don’t live there. But I’d suggest it’s likely that your local police keep day-sensitive crime data; whether they’ll release it to you is another matter. If they won’t, get back to me: I’ll write you a supporting letter on my stationery.

6 Responses to “Cell Breakout: Crime Data and Dates”

  1. Stephen August 16, 2012 at 12:55 am #

    Big fan of the blog (I cam across it via the shout-out from the Freakonomics blog). I remember analyzing crime data from Chicago and correlating that with city-complaint data in a computer programming class in college. We used Python instead of Excel, but I’m beginning to think Excel would have been the easier option. I’m only beginning to understand the power of Excel and your blog is really helping me apply Excel to data beyond the usual finance data I work with. Thank you, and I look forward to reading more posts.

    • Abbott Katz August 16, 2012 at 10:05 am #

      Thanks much for your kind words. Once you get the hang of it – and needless to say there’s always more to learn (for me, too) – Excel becomes a really potent – and flexible – means for getting where you want to go. And it can be fun, too, if that isn’t too confessional. And the imminent 2013 version enables relational database capability between pivot tables, though I haven’t gotten into that yet. Thanks again and glad to be of service.

  2. Tzvi August 19, 2012 at 3:28 pm #

    I really love all of your posts (especially the intro to pivot tables). I have one brief question. How do you apply the =WEEKDAY(C2) formula to the remaining cells in column K? The only way I know of is to use the fill handle and drag it down the column, but with 30k+ rows, this would take a few minutes. I am certain there is a much quicker way to do this. May you please elaborate on how you did it?

    • Abbott Katz August 19, 2012 at 3:44 pm #

      Thanks again! One way to effect a quick copy is to hover over the fill handle and double-click it. All consecutive cells downward should be copied to, provided there is associated data in the immediately left or rightward column. I think I allude to this in an earlier post. An alternative: click on the cell you want to copy, and executive the standard copy command. Then type the range to which you want to copy the expression in the Name box (e.g., A2:A30000), and tap Enter twice. The first tap selects that range; the second copies to the range.

    • Stephen August 20, 2012 at 3:50 am #

      You can also select a large row or column of data at a single time by holding CTRL+SHIFT+Down/Up/Left/Right. So, if you want to copy the WEEKDAY(C2) formula down from K2 to the bottom of the K column, you can hit CTRL+C to on K2 to copy the contents of K2 (the formula in this case) and then hit CTRL+SHIFT+Down to select the all the cells in column K that are continuously filled, and then hit CTRL+V to past the formula from K2 into your entire selection.

      The CTRL+SHIFT+Down/Up/Left/Right will select all the cells in a in a continuous row of column, but it will not extend over any blank cells. For instance, if K34 were empty, hitting CTRL+SHIFT+Down from cell K2 would only select the range K2:K34. Thus, this shortcut is a good way to select large ranges of data that are continuous. It also allows you to be efficient, since you can can select and copy a formula across 3000+ cells without having to use your mouse at all.

      • Abbott Katz August 20, 2012 at 12:50 pm #

        Quite right – thanks. As with the fill-handle-double-click technique, contiguous data-bearing cells will be populated by the CTRL+SHIFT sequence, with an added advantage: whereas the fill-handle approach only copies cells situated alongside another column, CTRL+SHIFT doesn’t require that stricture.

Leave a Reply to Stephen Cancel reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: