Cracking the Code, Part 2: Taking the Temperature of the NCDC Data

24 Jan

It’s a man thing, this problem with following instructions, but if I’ve guessed right, the NCDC temperature spreadsheet on which we’ve been working should look something like this right now, in excerpted form:

ncdc1

And you can get what you see above here:

NCDC temperature stats 2

(By the way, ignore those green markers clinging to the upper-left corners of the Code cells; they signal a Number Stored as Text “error” message, a state of affairs we actually desire here, and one accounted for in last week’s post. In addition, you don’t need to introduce decimal-point consistency among the temperatures right here; since these are to be shuttled into a set of pivot tables, the pertinent formatting can be carried out there.)

You’ll recall I closed Part 1 with the gentle augury of more tweaks to come, and I don’t aim to disappoint. I’m proposing now that we thread in still one more column, to be interposed between C and D and named Area Status. In what is now D2 enter

=VLOOKUP(LEFT(A2,3),Areas,3)

An expression that looks up the kind of area embodied by the data in the C column, e.g., Alabama is a State, and the Soybean Belt is a…Belt. And why do we need to know any of this? Because a mass pivot tabling of all the records – that is, a wholly inclusionary tabling of each of the 11564 records, needs to come to terms with the overlapping character of those areas – but it won’t. The Eastern Region, for example (code 121), comprises a zone of 16 proximate states (see the NCDC Read Me page I referenced in last week’s post for an enumeration of these, along with some methodological morsels that won’t make their way into the popular press); and batching these among the individual state data is tantamount to counting the same data twice, or perhaps even thrice, a messy glop of redundancy that’ll throttle any and all attempts at analysis. We need Area Status, then, the better to filter one status or another at a time – to properly curtail our gaze to only state, or only regional data at one time.

And last, and very possibly least, flash over to the Q column and enter, in Q2:

=AVERAGE(E2:P2)

That recommendation is fairly straightforward. We want to be able to return state/regional averages by year, I think, and that deed is now hereby done (note I’ve formatted the averages to two decimal points). (You might, however, want to think about how a design alternative, whereby the 12 month columns or fields could be piled into one, unitary Month field, would serve the data. It’s true, of course, that you’d be pumping 12 times as many records into the source worksheet – one record for each state/region, year, and month – and 120,000-plus rows of data might bring your computer’s processor to its knees. Nevertheless, other explicative gains could be realized as a result, because as the data currently stand, each month field has to be wheeled into the pivot table separately – and because you’d sooner or later want to aggregate the months into yearly composites, things start would start getting labor-intensive as a result. Hence my AVERAGE field, which does some of that aggregating work before the data reach any pivot table.)

Now once all the fields have been tilled you can begin to do the pivot table thing, starting with some of the standard come-to-mind orchestrations, e.g.

Report Filter: Area Status (click State)

Row Labels: State/Region (these will return only state names at the behest of the Report Filter State selection above).

Column Labels: Year. These data could, in the interest of a visual streamlining, be grouped by say, 5-year intervals, thus pinching 118 columns worth of years into 24.

Values: Average (these values to be summarized by Average).

ncdc2

These outputs could then be nicely Sparklined (remember them? See my November 29 post), e.g., this excerpted shot of a few of the state trajectories (the shot takes advantage of the Freeze Panes > Freeze First Column option posted to the View tab:

ncdc3

(Note that you’d want to turn off the grand totals for rows here (PivotTable Tools > Design > Grand Totals in the Layout button group), because those aggregated data would be read by the Sparklines as just another yearly data point, and they aren’t.)

What you should find, across the pivot table permutations, is a broad if not wholly ineluctable upwards angling in temperatures – as expected, of course, but I’m prepared to believe there are plenty of nuances among the numbers.

So there you have it. Everyone talks about the weather, so now go do something about it – or at least do something with it.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: