Archive | September, 2012

Addendum to Previous Post/Out-Of-Office Message

30 Sep

My WordPress contract grants me a vacation every 14 posts, so according to my math now’s the time. Anyway, I’ve been working too hard and I need a break. (Did I just say that? I must have; my nose is growing.) Absent a late-breaking development I hope to be back here around October 11, though I should have email access intermittently.

Now for the addendum, a quick one. The kind of spreadsheet you want turns in large measure on what it is you need to do with it. The UN birth-data sheet I commended in the post below works as a primary research source, but clearly won’t serve a public keen to tune into recapitulated, ready-to-read results. As such, the NPP worksheet might very well speak to that public in its language.

The question, one which denominates a real spreadsheet conceptual divide, is thus one of Before and After:  is the sheet something you want to read right now, or something on which you want and need to work? I’d allow that for the story generator, you want a Before – the data backstage, 1,000 rows and 50,000 characters in search of an author.

@abbottkatz

abbott@abbottkatz.com

 

Field Work: Item-izing the Issues

28 Sep

You’ll recall my last post (and if you don’t, just scroll down; I’ll wait) broached the not-awesomely-exciting but integral matter of spreadsheet data, their contents and their discontents – that is, challenges of working with information that just won’t cooperate with your plans to refurbish and ready it for a round of analysis pegged to your story-seeking needs. And there’s is more to consider on this count, including a recurring, foundational, spreadsheet design question you do need to think about. So prepare to be unexcited again.

The problem is pointedly typified by a worksheet I’ve carted off from

http://data.nationalpriorities.org/searchtool/

a government site that breaks out and maps various federal expenditures along a number of budgetary variables.  The one I’ve brought to your attention, a state-by-state (and US-territorial) gender census for the years 1990-2010, comes to you right here:

NPP_FPDB_PopulationbyGender

(You’ll note first of all that no data avails for any of the territories, but that’s the way it goes.)

The communicative intent of the spreadsheet is clear – to tally state populations by year and gender, and as far as they go the data are nothing if not intelligible; so if you’re bringing a casual viewer’s eye to the sheet, the tableau unfolding before you may suffice as is. But if you’re expecting more from the data, a harder look may be in order.

For one thing, the worksheet has more columns than the Parthenon, appointing female, female percent, year male, male percent, and year total fields to each of the 21 years compiled and culminating in a distended, awkward read:

Moreover, the alternation and reprising of fields, i.e., female, female percent, male, male percent, year total for any year, and then again for the next year,  inflicts a crimp in the formula-writing process – not a prohibitive one to be sure, but an irritant at the least. It would be easy enough to sum each column – say, all females for 2010 – but calculating the average female percentage by state across the 21 years, for example, would involve an expression starting (say for Alabama)

=SUM(B2,G2,L2….) and so on, to be divided by =SUM(F2,K2,P2…) and so on.

And you’d have to do something similar for male percentages. Again, these conjurations aren’t forbiddingly complex, but you’d have to plant them in column DC at best (that is, the first free column), and additional formulas grounded in the data streaming horizontally would have to lurch farther out still (one need not be troubled by all those N/As, however; those are text entries and not formula-resultant error statements, and so standard formulas will ignore them without hazard. You could, on the other hand, simply delete the N/A-beset rows, because they contain no data at all.).

Once thought through, the sheet’s presentational inelegance and impediments to productive data breakouts are problem enough; but once you decide to commit the data to pivot table scrutiny, the problem burgeons – for a very – er, pivotal – reason.

The problem – a classic one haunting this kind of work – is this: the worksheet under your steely investigative gaze has awarded individual field status to what should be more properly accredited as items.

And exactly what does that mean? It means that the year fields – all 21 of them  (these in fact hybrid fields twinned with a gender property) – could have, and in my humble estimation should have, been compliantly huddled beneath the umbrella of a single field – something called Year, or something like it. And it means that Female and Male – the attributes bolted to those 21 years – should have been spirited away and sequestered beneath a new, liberated field called Gender. In my view, the data should have been arrayed more or less like this, using four years’ worth of data in this truncated example:

And so on, down the respective columns for all the states and all the years. Now while these records appear at first and perhaps even second blush to suffer from a kind of data profligacy – after all, we’d need to enter each state name 42 times (21 years times 2), even as each was keyed in exactly once in our downloaded worksheet – yet major analytical and presentational efficiencies follow as a result, because when these redrawn data are visited upon a set of pivot tables, you can get this:

And this:

And this:

This rather abstruse formulation delivers the percent increase in men by state, relative to the preceding year – and there are notable divergences by state.

Granted, these data don’t conduce toward Eureka-level findings, but the larger point is there to be made, namely this: that data of what could be called the same order of phenomenon belong together, enrolled as items in the same field. Thus here, year data should be accounted as discrete instances, or items, of the overarching field called Year. The two genders should likewise be made to share the space plotted beneath the field Gender, or some such suitable name. And the states should not in turn be allotted separate field statuses – i.e., a field called Alabama, a field called Alaska, etc. – as indeed they are not here.

The advantages of this strategy may or may not be wholly transparent to you, but they are real, and include these:

  • Extending the above example: if every year were to possess its own field status, each one would have to be dragged separately into a pivot table – and that, is after all, a drag. But gathering all the years under the Year field aegis would require only one field drag.
  • If each year were to arrogate separate field status, for example, each one would have to be formatted separately once having been roped into a pivot table. A unitary Year field requires only one formatting operation, and allows its data to be immediately grouped:

Try carrying out this elementary pivot task on the downloaded worksheet. I’m not sure you can, because when two fields co-reside in the Row Labels area, one necessarily subordinates itself to the other, e.g.

Thus put two independent year fields in play and one needs to appear “first” in Row Labels, in the superordinate position; the other won’t and can’t occupy the same space, appearing only subordinately in the next Row Label column as above.

  • The pivot table’s native mathematical complement – that is, the Summarize by Sum, Average, etc., and Show Values As capabilities – work directly on same-field data, bypassing any need for those calculated formulas discussed in an earlier post (September 6). For example, this little pivot table elicited from my demo data:

simply enumerates the aggregate increase in US population across adjacent years (that’s why 2007 is blank – it serves as the base year). Again, the labors required to enable this sequence if 2007, 2008, 2009, and 2010 were to hold down individual field statuses is a pause-giver.

And if you’re looking for corroboration of all this, return to that UN birth data spreadsheet I   referenced in my post on Olympic birth months. On second thought, don’t bother – just check out this shot:

You see the point – the ostinato of Albanias and year entries, both items complying with the Country or Area and Year field rubrics respectively. That’s the way it should be done.

But what then of our census data as they stand – are these workable, or grounds for another write-off? I’d suggest that some room for rapprochement could be set aside here. For one thing, and unlike the adjunct data worksheet we reviewed last post, the data themselves – the actual cell-specific entries that inform the sheet – are ok in and of themselves. One could leave the data in their as-is state and throw in some of the formulas I had suggested about 1,000 words ago. But it also seems to me that with a measure of due diligence – and perhaps not as much as you might initially fear – you could, with some sage copying and pasting, restructure the sheet commensurate with the counsel I’ve proffered above (though you’d probably want to do the copying to a completely new sheet). I managed to assemble  the four-year census demo above in about  15 minutes, so if you have reason to believe that, given the proper exertions, the data  may have something to say, the project might be worth a go.

So go.

@abbottkatz

abbott@abbottkatz.com

Knowing When to Fold ‘Em: When Tweaking Isn’t Enough

21 Sep

All due props to Josh Boldt, fighting the good fight for adjunct college instructors worldwide. It’s a dirty job, and Boldt decided he’s gotta do it.

You know these adjunct folks well.  After all, you probably owe more than half your college credits to them, those front-line staff with flat-line salaries, keepers of the peace in anarchic lecture halls teeming with text-addled, twittered-out undergrads with anything but today’s assignment on their mind. Take it from me, I’ve been there;  and just because the words “adjunct” and “tenure-track” never appear in the same sentence is no reason not accord adjuncts the full eye contact they deserve when you catch them thumbing through What Color Is Your Parachute? in the Barnes and Noble.

What Boldt, a writing teacher and freelance writer in Athens (Georgia), has done and continues to do is accrete a dynamic record of the adjunct experience via a pair of spreadsheets to which adjuncts are asked to volunteer basic workplace information on a form. Each response – comprising data such as school name and location, department, salary, contract duration, etc. – is meant to add a few pixels to the big picture of the adjunct circa 2012, which may not be terribly pretty. As Boldt puts in on his site:

This website was designed collaboratively by the new majority of motivated, intelligent, and driven academics who are struggling to use their experience and knowledge in a meaningful way that benefits both themselves and society.

The spreadsheets, one earmarked for adjuncts in the US, the other for international part-timers , might thus have some interesting things to say about this oft-bedraggled work force. You can download the US rendition in Excel form here:

 adjunct data

 (Note: While you could in principle grab the spreadsheet from the Adjunct site via an Excel web query – a capability we’ve yet to broach in this blog – my repeated tries at reeling the data in through via this route failed, for reasons not entirely clear to me. But no matter – web queries refresh their data at identified intervals, normally a cool and desirable thing. But the refresh shakes out any changes made to the base spreadsheet, restoring it to its primeval state, and that’s not what we want in any event.)

I’ve saved the sheet precisely as it looked when alighted onto my blank workbook, and as such seems to be in need of a variety of repairs:

So let’s put on the hard hat and get to work. First, if we want to put the sheet in the service of additional analysis, e.g., through a pivot table etc.., we need to delete rows 3 and 4 – they’re blank, and as strand the header row from the data beneath.

Second – and I originally missed this – we need to delete the A column; look closely and you’ll see dots or periods speckling the A cells. These are data of the most purposeless kind, and should be chased away.

Next – the header row, along with many if not all the data rows, has clearly been subjected to the Wrap Text option, a fillip to which Excel isn’t doing much justice (see above). Rather than refit individual columns to widths that would suitably befit the wrap typography, what I’d simply do is select the entire worksheet (here you should click the Select All button, that unassuming blank button to the immediate left of the A column header) and click the Wrap Text button (Home tab, Alignment button group) which is currently illuminated. By turning the effect off all the data straighten themselves out atop the floor of their cells, even as the header labels pierce the columns to their right. Note that the worksheet remains selected, and so you need only double-click any column boundary and institute a mass auto-fit across the data (true, some columns will dilate to an untenable width; these can be cinched manually). Note the data cells entries are middle-aligned – that is, they ascend to the vertical centers of their cells – but I’m not bothered by any of that, and I’d leave that effect alone.

Now we need to clear away another data obstruction. Each state’s data is topped by what could be properly regarded as a sub-heading, in which only the state’s name occupies the row. These geographical identifiers can’t properly be termed data, as they bear no information about any institution and simply subsume the records immediately below. I tried a number of quick workarounds of the sort-these-rows-to-the-bottom variety we’ve discussed in earlier posts, but because the pertinent cells are so sporadically populated I had to turn to Plan B. In the first free column – K, by my reckoning – enter a header in K2 (it could be anything – I simply called it Sort), and write this formula in K3:

=IF(COUNTA(A3:J3)=1,”zz”,”aa”)

This expression does the following: it counts the number of cells sporting any sort of data in the named range (COUNT, on the other hand, tallies only those cells containing numeric values). If the count evaluates to 1, you’ve apparently nailed a row containing a state name only. If it is 1, the formula enters the text zz in its cell; otherwise, it returns the letters aa.

Then copy the formula and copy it to the remainder of the records – only here, because so many blank cells have presumed upon the data, click in the name box (the white field to the left of the Formula Bar that typically reports the current address of the cell pointer) and enter K3:K1984 (1984 happens to be the last data row on the sheet). Tap Enter twice (the first tap selects the range; the second actually does the pasting). Then sort A to Z on column K. And guess what? You’re rudely interrupted by the “This operation requires the merged cells to be identically sized” error message, about which something obviously needs to be done. Presumably because merged cells absorb two or more cells, the sort command gets flummoxed by the resulting cell address conflations and can’t do its thing. Now, I don’t particularly know which cells are in actuality the merged ones, but I’ll just select the entire worksheet again, click the down arrow fused to the Merge & Center button (Home tab, Alignment button group), and select Unmerge Cells (“unmerge”, and its cousin “unhide”, rank as two of the most stilted verbs in the lexicon, and people at Microsoft get paid a lot more than adjuncts to come up with stuff like this). Now you can sort A to Z on K, and insert a blank row right above 1928, which is where my state labels now start (remember that state abbreviations are supplied in any case in the B column if you need them).

All of which raises the question I was afraid you’d ask: Now what? And that’s precisely the point. Take a look at the Pay Per Course field, likely to propose itself as the essential parameter in the worksheet. But the data here are so multifarious, so apple-and-oranged, that our next analytical move seems to have been checked – if not check-mated.

What to do? I’m not sure. While many of the data are standard numeric fare, many are not, and banishing the latter from the worksheet will comprise the data too dearly. Would it be possible to recondition those non-numerics into quantitative form? I wouldn’t declare the chore impossible, but you’d have to bounce some headachy cost-benefit equations off your medulla, and think hard about how you’d make over a cell like this:

There are three salary figures in that cell – informative, to be sure, but try to add them.

When the data don’t cooperate –when they exhibit snarly recalcitrance – you may have little option but to direct your attention to a more compliant workbook –in other words, break out the white flag, cry Uncle, and punt (is that what they mean by a mixed metaphor?). (You might also ask yourself, by the way, how you would have constructed the worksheet from scratch had you been entrusted with that assignment.)

But none of this should be read as an arraignment of what Josh Boldt is trying to do. He’s putting together a public record, one answerable to his intentions, not ours. That the worksheet isn’t all that analyst-friendly is a discontent I’ve superimposed on the data. The sheet is what it is. And were I still an adjunct I might dial into his site and add my own info, too.

But I need work on a more pressing issue now. I need to find out if people still say “props”.

@abbottkatz

abbott@abbottkatz.com

The Summer Games, 2012: The Birth-Month Question Redux

13 Sep

St. Pancras Station, London, Aug 8.

Remember the Olympics, that quadrennial competition between the world’s elite corporate sponsors? McDonald’s-by-the-Thames, Lebron and Kobe on the same team, scandal on the badminton court, wholesome sublimated paint-ball nationalism (you know, my guy ran faster than your guy, especially if you live in Jamaica) and all that?

It was in all the papers. In any case, it seemed to me that the topical concern of a previous post (August 24) – namely, the modal incidence of August-born baseball players in the major leagues – could be naturally transposed to the Olympic context. Does birth month in any way broaden or constrict an athlete’s (Summer) Olympic prospects, or is the correlation spurious, or scant?

Postscript and Prelude

But before we put the lens to the data, you recall that the August birth-month skew among ballplayers is typically ascribed to the July 31 birthday cut-off for kids enrolling in any given year’s Little League cohort. That closing date makes the August-born oldest in their admitting class, so to speak, and thus bigger stronger, more adroit in the baseball arts, and thus possessive of a sustained developmental edge.

In this regard, and you may have missed it, reader James Zhang’s September 9 comment thumb-tacked to the baseball birth-month post asks a most salient, purloined-letter of a question: what if the August birth-month differential could be more properly laid not to Little League registration policies, but rather to the simple demographic possibility that more children are born in August?

A very good question, Mr. Zhang, and it turns out there’s something to it. By peeling off birth data from a melange of sources on the net (including Vital Statistics of the US) and grafting these to equivalent-year stats from Sean Lahman’s baseball database, I got, by way of a sample:

Birth Month

US August Births

Baseball August Births

1941

9.13%

11.21%

1950

9.15%

14.39%

1960

9.21%

12.23%

1965

8.93%

10.34%

1970

8.86%

7.78%

1975

8.90%

11.11%

The baseball numbers aren’t enormous (averaging about 168 players per year), but the aggregate effect is pretty evident, particularly when keyed to the 8.48% “chance”-driven expectation for August births. The conclusion then is multivariate: both national and ballplayer August births better chance likelihoods, but it is the baseball effect that pulls farther away from default expectations. In any case, a birth-month tug does seem to be in force here, something to be kept in mind.

But what then about the Olympics and its vastly more heterodox, Little League-less athlete pool? What sort of birth-month association, if any, marks the data?

Let’s see. Our data source comprises the Guardian’s spreadsheet of Olympic 2012 medal winners saved to Excel and made available here:

All London 2012 athletes and medal data

(N.B. These medal data are incomplete, the workbook having been compiled before event outcomes were realized in the Games’ last three days, e.g., basketball and the marathon. I think it would be a reach to insist that the deficit degrades our analysis, but in any event John Burn-Murdoch, who works on such projects for the Guardian, tells me the completed rendition should be out there soon. For a 2008 census of National Basketball Association player birth-months see

http://statsheet.com/blog/applying-outliers-does-birth-month-matter-in-basketball ).

Some Notes About the Book

Before we broach our question – whether a birth-month effect in any way correlates with Olympic medal success – a few words about the workbook’s design would be instructive.

  • Note the coupling of first and last athlete names in column A – not the by-the-book way of handling such data, particularly if you need to sort these by last name.  That task can be dealt with even as the data stand, but the task can get messy. On the other hand of course, the names are extraneous to our purposes right now – but maybe not next time.
  • The Age field could be regarded as an ever-so-slight redundancy, given the companion Date of Birth field and the latter’s enabler role in helping derive real-time athlete ages. Example: if you want to learn A Lam Shin’s right-now age, replace the 25 in cell C2 and write

=(TODAY()-H2)/365.25  (the denominator represents the average number of days per year).

Her age result – 25.97 as of today – will change daily. Then copy down the C column. Just remember to send Ms. Shin a birthday card. (Of course, this formula can fall victim to its own volatility. View the workbook in ten years and you probably won’t need to know that she’s now 36.)

  • Moreover the Age Group field, impressed into Row Label service in Pivot Table 10, comprises text data, and as such cannot be further manipulated absent a concerted round of hoop-jumping. If in fact you want to group the athletes by age tranches you’re far better advised to substitute the Age field and dice the data with the Pivot Table’s Group Selection option.
  • By sounding an unrelieved drone of YESes in its cells, the Medal Winner? field consigns itself to dispensability, teach us nothing. By definition, the workbook means to catalogue none other than medal winners. On the other hand of course, if you have no plans either to use these data or present the field to readers you can merely ignore it.
  • Note the G, S, and B fields (Gold, Silver, and Bronze) and their fractional medal representations, the Guardian’s attempt to align overall country medal totals with the grand total of actual athlete medallists, an obeisance to team-based competitions. Thus Croatia’s Valent Sinkovic in A932 comes away with .25 of a silver medal, having competed in the Men’s Quadruple Sculls Rowing event.  These apportionments do make a certain sense, but other analytical necessities may bid you to award him one, indivisible silver.

And Something Else…

Now here’s the real problem, to which you may have already altered yourself. Remember that we want to see if athlete birth-month data stack higher for this or that month, and if so, why. But unlike the Lahman baseball records that informed our baseball birth-month post, no immediate month data avails in the Olympic workbook. However, Excel’s MONTH function enables us to do what we want, and very simply:

=MONTH(cell containing date).

Thus

=MONTH(A2)

will return the value 9 (September) for Ms. Shin. Once in place, we can copy that expression down a column, and we’re done – maybe.

But look at cell H11; the date for Mr. Zielinksi exhibits a left alignment that intimates, without quite proving, that H11 contains a text entry, which simply can’t comport with MONTH, which is looking for numerical data. Remember that dates, beneath all the formatting rouge and lipstick, are numbers (see the August 20 post), and while it’s quite possible and perfectly legal to left-align a number, the anomaly in H11 hints otherwise. To clinch the point, click in any blank cell and type

=H11*2

Don’t you just love error messages?

I can’t explain why these text entries infiltrated the larger complement of workable dates but they did, and to dangle a preposition, it’s something we need to deal with. Before we break out birth months we need to see to it that all our data-to-be in fact qualify as months.

One possibility: we could sort the DOB column by Oldest to Newest and eye the bottom of the record stack. I see 21 text-formatted, faux-date entries; and while we could return to the insert-a-blank-row-above-these-records expedient (something I’ve explained in earlier posts, including August 30), thus estranging the bad apples from the usable data (they only contribute about 2% of all records, after all), we don’t have to. The reality is that we can retrieve the months from these date pretenders, and here’s how I’d to it.

Insert a column to the right of DOB and title it Birth Month. Select the new column and select Number in the drop-down menu in the Number button group:

We’ve taken that step to insure that our results here look like numbers, and not dates. Then in cell I2 write:

=IF(ISTEXT(H2),VALUE(MID(H2,4,2)),MONTH(H2))

OK – you’re entitled to an explanation of this rococo expression, though you’ve probably managed to make some fledgling sense out of it already. The ISTEXT function, written simply

=ISTEXT(cell reference)

is invariably soldered to an IF statement, and inspects the referenced cell for its data type. If the cell comprises text, one thing happens; if it’s other than text (e.g., a number), something else happens. In our case we stipulate that if the cell evinces a text content, Excel will proceed to extract the month from the cell via the nested MID function (how that happens will be detailed in a moment); if the entry is not text, that is a number, then the formula will apply MONTH to the cell’s date/number.

And how does MID work? Its three elements, or arguments as they’re known in the trade, do the following, respectively

  1. Identify the cell in which MID will perform its work (H2 above)
  2. Cites the position of the character at which MID will begin to extract characters
  3. Declares how many characters will be appropriated from that inception point.

Thus if I type

SPREADSHEET

in say, cell A7, and enter

=MID(A7,3,4)

in A8, I should realize the character sequence READ.

Thus for our text-date data, those data to which we want to direct MID, we see that the month segment of  text-bearing cells always situates itself in characters 4 and 5 – a good thing, because had single-digit months (January through September) been conveyed in single-digit terms (9 instead of 09) we’d have had to wrap our hands around a much stickier wicket; we’d have to have written a formula that sometimes extracts 1, and sometimes extracts 2, month characters.

(Note also that the text-date cells are expressed in European date format, in which day of month precedes month, even as the other data appear in American, month-first style).

And finally, we need to brace MID with the VALUE function because our results would otherwise remain text, and we want the resulting month extraction to hold numeric status.  VALUE simply mutates a number formatted as text into its quantitative equivalent:

=VALUE(A7)

would realize the value 7, had you entered a text-formatted 7 in the cell.

Now that you’re panting from all that heavy lifting, kick back and smoke ‘em if you’ve got ‘em, because the rest should be a comparative day at the beach. Copy the formula you’ve inscribed (and you need all those parentheses in place) in I2 down the I column and you should be treated to a medley of values ranging from 1 to 12, something like this:

(Ignore the decimal points if you see them; they don’t matter.) Next, let’s pivot table the months. Boot up a table and

Drag the Birth Month field to the Row Labels area.

Drag Birth Month into the Values area. Click in the Values area, and in turn click PivotTable Tools > Options > Summarize Values As > Count (the data have defaulted to Sum simply because they are numeric). You’ll see:

I’ll take a wild guess and allow that the largest birth month happens to be…August, the only one stepping up to three figures. Note in addition the no-less-striking, precipitous fall-off in the succeeding months.

Now remain in that field and click Options (if necessary) > Show Values As > % of Column Total (again, I’m directing you through the Excel 2010 interface). You should see:

August rules – again – but of course a satisfactory accounting awaits.

Now there is some evidence for a global August birth predominance, but a closer look is clearly in order. A UN international, country-by-country birth-month spreadsheet (click the small download link, but there’s some hoop-jumping required here in order to whip the data into shape) shows a July-September birth skew, but you’d have to toil to reconcile Olympic countries and team size with these data.

Note as well that if you swing the Country field into the Report Filter area and click on United States of America (athlete total here: 120) you’ll get

Still another win for August, albeit for a relatively small universe. Another remarkable stat: filter for People’s Republic of China and you’ll behold a 23.68% birth contribution from January (athlete total: 76), an extraordinary outlier I’ll leave to the Sinologists.

Sure there are other pivot table permutations to be crunched (you could break out for gender, for example, and sport), but the principal research remit is already out there: tracking and explaining the birth-month curve. An artifact of fertility trends, or the consequence of subtle, worldwide athlete recruitment protocols – or a bit of both?

Well, there’s your assignment, and I know you can handle it. It’s why, after all, you make the big money.

Crime Reporting, Part 2: What’s the Story?

6 Sep

The Plot Thus Far

You like to tweet? I like to tweak. Before I so rudely interrupted myself with that simpering apology below, the one about which you remain undecided, you’ll recall we spent the better part of the first installment (even farther below) assaying the King County crime-tracking spreadsheet, with the intention of reframing it for a new set of purpose-driven looks at the data. Again, the practice of emending someone else’s data needn’t always be construed as a headlong critique of the original; rather, the overhauls we want to carry out here should be read as an attempt to personalize the data, so to speak. If you’ve yet to download the workbook you can do it here:

KC crime rates 1985-2011

(You should then review the data modifications we enacted in Part 1.)

Picking up the sheet, then, at the point at which we had left it at the close of Part 1, another perambulation across the data yields 20 records (at least that’s what I get) whose Months Reported amount to fewer than 12, or in other words, less than a year’s worth of data. Of these, four rows offer up 0 months, meaning that, for whatever reason, there’s nothing here to report – no crimes, nothing. Leaving aside the reasons for that data dearth, I’d sort Months Reported descendingly and jam a blank row immediately above those zeroes, thus isolating them from the substantive rows.

But those remaining, fewer-than-12-month rows need to be thought about, and thought through. After all, these 16 or so records do divulge crime data, but because their totals are necessarily fractional we need to extrapolate these to an entire year. For example – the Index Crime Total (the sum of all reported crimes) in column E for Algona in 2004 shows 40 for its nine months’ worth of tracking; but 40 doesn’t represent a rate, but rather an absolute figure, and as such needs to be mapped to 12 normalized months. What I’d do here is motor to the next available column – T – head it something like Projected Crime Total (if you’re bothered by the disparate title formatting you can click on S5, click Format Painter on the Clipboard button group, and click T5), and enter this formula in T6:

=IF(S6=12,E6,E6*(12/S6))

and copy it down the column. (Important note: if you’ve copied here with the fill-handle-double-click technique you may find the copied results extending even to those cells on the other side of the blank-row divides we imposed on the data. This curious overreaction on Excel’s part seems to have been stirred by the auto filter buttons sewn to the column headings, though I don’t have a confident interpretation yet as to exactly why. Turning the filter buttons off, though, appears to forestall the complication; but if you do find zeroes in the blank, demarcating row near the bottom of the data you need to delete them.)

In any case the formula above considers the value in the S column, simply returning the Index Crime Total if the S value is 12 (thus signifying an entire year), or if not, multiplying the Index Crime Total by the appropriate reciprocal fraction. Thus for our Algona case, the 9-month Crime Total of 40 will be swelled by 12/9, ratcheting the total to 53.33. Of course this sort of linear, straight-line conjecture is likely imperfect, but 53.33 better surmises the Algona data than the original, but partial, 40.

In view of the above, prudence recommends we put an effective crime rate into play as well. Hang a right to the U column, title it Effective Crime Rate, and enter the following in U6:

=(T6/D6)*1000

and copy it down the column (keeping that caution about the blank-row-zeros in mind).

There are Rates, and There are Rates

Now what about these crime rates – that is, aggregated, year-by-year rates for King County writ large? Those data aren’t in place in the worksheet yet; what are available to us right now are the County’s city-by-year rates, and true – we could orchestrate a pivot table assay of a yearly average of the rates, by simply dragging the Total Crime Rates per 1,000 Pop. (column F) into the Values area, breaking it all out by the Year field, and selecting the Summarize by > Average option.  In fact, that tack is in my view an arguable strategy, but it accords disproportionate weight to smaller cities (think, for example, about simply averaging two batting averages of .200 and .300, in which one player has 10 at-bats, and the other has 600), and doesn’t hold any pride of place as a standard metric.

What I think we really want, then, is to hard-wire this basic equation

(All Crimes/Total King County Population)*1,000

into a pivot table and see how these recombined data play out by year. And in order to put those numbers on the board we need to introduce a new term into the equation – the calculated field.

I suspect that a great many pivot table devotees have yet to scavenge it from the toolbox, but the calculated field is something you’d do well to know about, and bringing it to your table isn’t a fearsome deal either.

A calculated field is an on-the-fly field gleaned, or calculated, chiefly from existing fields in the pivot table’s data source, but at the same time it remains accessible only through a pivot table. Now that’s a terribly abstract first pass at a definition, but as we proceed the concept should sharpen into intelligibility.

Keep in mind what we want to do: Total all the crimes in King County by year, divide that total by that year’s King County combined population that year, and multiply the yield by 1,000. But precisely because the crime data in our spreadsheet is atomized by year and city, it falls to us to do the aggregating, and here’s how:

Start up a pivot table and

Drag Year to the Row Labels area

Make sure you’ve stationed the cell pointer anywhere in the Row Labels area, and click PivotTable Tools> Options > Fields, Items, and Sets > Calculated Field. You should see:

Type a name in the Name field, say County Crime Rate, click in the Formula field, and then click on Projected Crime Total (remember this is the field we appended to the data source) in the Fields area and click Insert Field (you can also double-click the field name). Type a “/”, the division sign, and click Population and Insert Field. Enter parentheses on either side of this incipient expression so far (but to the right of the equal sign) and conclude the process by entering *1000 to the right of the close parenthesis. You should see

Click OK. Now you should see

The calculated field immediately whisks into the pivot table, and also enrolls in the Pivot Table Field List:

Remember that as a calculated field, County Crime Rate won’t suddenly – or ever – line up in the next available source data column; it remains pivot-table specific, just sort of out there, and confined to guest appearances in pivot tables alone.

If you want to refine these results with decimals, proceed as usual – right-click anywhere in its column in the table and select Value Field Settings > Number Format > Decimal Places. I’ve added two decimals, but that’s up to you. (I should add that the outcome we’ve just choreographed could have been reproduced via some SUMIF formula-based derring-do, but that approach is the more ungainly one, and I’ve passed on it here.)

You’ll note the obvious – the extraordinary scale-down in King County crime across the 1985-2011 swath by nearly two-thirds from its 1987 peak, again to be qualified with the understanding that numerous County cities are AWOL from some of those 27 years.

And while we haven’t gotten into charting here (at least not yet), you can boldface the finding above by churning out a Pivot Chart tout suite. Just click anywhere among the data and click PivotTable Tools > Options > Pivot Chart. Click on a basic Line Chart option, click OK, and for starters you get something along these lines (or line):

(We won’t pursue chart formatting issues here, though.)

And to exemplify those missing city years – which, after all, could “artificially” depress or dilate the crime rates – let’s reconstruct the pivot table.

Remove our County Crime Rate calculated field (it’ll always remains on call, however), and drag the year field to the Column Labels area.

Drag City to the Row Labels area.

Drag Effective Crime Rate to the Values area. In excerpted form, you should see something like this:

You’ll note the vacant cells, denoting years for which a given city has no data. (You probably should also turn off the Grand Totals; their sums are meaningless in our context. Adding crime rates is akin to adding a group of batting averages.)And if you don’t like all those geeky decimals, you can again right-click any cell in the Value area and take the Value Field Setting > Number Format route. (Note by the way that, multiple columns notwithstanding, all the year data in there belong to the same Year field – and because pivot tables treat a field’s data of a piece, formatting one cell here formats them all.)

A Top-Rank Feature

Now watch this space – but before you rigidify your gaze make way for abject apology no. 2, really a reiterated one (see previous post). So far as I know, the feature we’re about to expound swept in with the 2010 release, and so 2007 users will have to sit this one out. I told you I was sorry.

Click anywhere among the data and then click PivotTable Tools > Options > Show Values As > Rank Largest to Smallest. You should see:

Click OK. What you get is an incisive, year-by-year arraying of crime rankings, vesting the lower numbers in the more crime-ridden cities – that is, largest to smallest:

Had you nominated Year as the Base Field instead, you’ve have strung together an intra-city crime ranking, in which each city’s yearly rates would be compared to that selfsame city’s other years (try it).

I’d say that’s pretty striking. Note Tukwila’s absolute hegemony over the number 1 crime rank, holding fast across all 27 years. And if you spot-check the source data, you’ll also note the spike in the city’s population from 1989 to 1990, along with a commensurate, nearly 50% drop in crime across these adjoining years (not enough to displace it from the first position, alas). I placed that curiosity before Dr. Nathan Brown, the spreadsheet’s designer, and he ventured that some manner of territorial annexation might explain both, i.e., a city merger with nearby, more peaceable districts. But that’s something to be scrutinized, along with all sorts of other data possibilities in there (for example – how about another calculated field shaping the yearly proportions of property to violent crime?).

Advising me to watch for new articles on the data, Dr. Brown allowed that Tukwila and other cities show some “interesting multi-year patterns”. Is it too nerdy to suggest that he’s right?

Errata: A Quick Mea Culpa to 2007 Users

4 Sep

With all due contrition, it is high time I owned up to an error of omission of sorts, one that may have inflicted no small measure of vexation upon users of Excel’s 2007 version. So here goes; I know I can count on you to be kind.

On a couple of occasions I’ve asked readers to carry out commands of the

PivotTable Tools > Options > Show Values As variety:

These instructions assumed that readers were clicking away on the 2010 iteration; but if you aren’t, you immediately knew something I didn’t: that you couldn’t get there from here. That’s because the pivot table Calculations button group and its concomitant Show Values As (as well as Summarize Values By, i.e., Sum, Average, etc.) button didn’t barge into the Pivot Table ribbon until 2010; and so what I should have added is that command options such as % of Column Total (phrased without the word Total in 2007) can also be actuated by right-clicking anywhere in the relevant field and proceeding to click Value Settings > Show Values As, transporting you somewhere here:

(That’s a 2010 screen shot – again, the 2007 version goes without the word Total.)

There – I’ve said it, and the catharsis is doing me good. But in the interests of transparency, I am duty -bound to acknowledge a catch – namely, the fact that 2010 has stocked at least two Show Values As options that 2007 simply doesn’t have –anywhere. They’re the Rank Largest to Smallest/Rank Smallest to Largest tandem -very good things to know, too and as a matter of fact I may be talking about them soon. Don’t say I didn’t warn you – but remember, there’s always the free 2013 beta to download.