Cracking the Code, Part 1: NCDC Temperature Stats

17 Jan

Last week’s meteorological revelation about 2012 – its one-degree edging of the previous annual US average temperature high – received its just media due, adding one more metaphorical log onto the fire of controversy about global warming (though the New York Times article linked above equivocates about the evidentiary conclusions). Those on the lookout for some hard substantiating numbers can tune into the National Climatic Data Center (NCDC), more specifically this file:

ftp://ftp.ncdc.noaa.gov/pub/data/cirs/drd964x.tmpst.txt

But not to worry; I’ve adapted the above to spreadsheet form here:

NCDC temperature stats

(Two notes: this spreadsheet updates an earlier version. And note the lookup table I’ve manufactured for the Areas Lookup sheet, range-named Areas. We need to get to this later.)

We can register a few conjectures about what we might want to learn from these data – say, a breakout of monthly temperatures by years or groupings of years and/or states (the 48-contiguous ones, that is), and perhaps even by coarser, regional rubrics, provided of course those parameters are in there, and don’t worry – they are. And of course you’ll be anticipating a fitful but dogged rise in temperatures across the 118 years’ worth of data, something you should be able to demonstrate or disconfirm here.

In any case, I’ve put the file contents in their pace starting at row 2 because the native NCDC data are headerless, and that just won’t do; you’ll thus want to reserve row 1 for some apt titling.

But first things first; we can’t top columns with headers without columns to top, and right now, and first appearances notwithstanding, all the data are presently crowded into the A column only – because the source file whence the data come is a good old all-American text file, in which the data have yet to be spliced into spreadsheet-friendly columns.  As a result, you’ll need to trot out the Text to Columns option (see my October 11 and 18 posts, which explain Text-to-Columns in some detail); only then, once you’ve safely tucked the data into their respective columns, can you go about heading them all (and you should hold to the Fixed Width default at ticked at Text to Columns’ inception.) Just note something else – you’ll need to depart from the standard Text to Columns run-through in one regard for reasons that should become intelligible a bit later: at Step 3, be sure to tick the Text Column data format radio button for the first incipient column only:

blog weath

When the smoke clears your sheet should have sawed the data into 13 columns, in which numbers 2 through 13 clearly report monthly temperature data. These’ll naturally be headed Jan-Dec, or something cognate. But it’s column one that holds out the more formidable challenge, and naming it is the least of it.

I was directed by a helpful NCDC tech rep to a site read me file that breaks down the composite contents of those 10-digited data in column one, which I’ll title Code. We’re told:

blog weath2

That is, digits 1-3 attest a state or other regional code, such as basin or shelf (these are enumerated in the accompanying lookup table, and keep those data in mind).  Digits 4-6 are unvarying across all the records, and as such can be ignored (digits 5-6 among our data always yield 02 for temperature, as per the above screen shot).  And digits 7-10 always denote the record’s year, a datum we want to isolate in order to abet various pivot-tabled breakouts and the like. Thus you can insert a column to the immediate right of Code, name it Year, and format the column in Number mode. Why bother with that last step? Because the new Year column will inherit the prevailing formatting of the column to its left – in our case, text – any formula entered in the column will, absent user intercession, degrade into unreconstructed typing – that is, nothing but text.

Then entire this formula in cell B2, which should be copied down its column:

=VALUE(RIGHT(A2,4))

This expression cadges digits 7-10 from A2, etc., streaming the relevant years down the Year column in B (and you’ll probably want to lose any decimal points that may have squeezed their way into the cells). We need VALUE here because that 10-digit code in A2, etc. from which the year has been finessed is text, after all; and VALUE forces numeric status onto numeric-looking data that have been formatted as text.

Next, we’d likely want to supplement each record with the name of the state/region associated with its reported temperatures, and we can do that job by making resort to a relatively uncontroversial VLOOKUP formula. Insert a new column to the right of Year, call it State/Region or some such, and in C2 enter

=VLOOKUP(LEFT(A2,3),Areas,2)

Translated, the expression Identifies the first three characters of any A-column entry, looks these up in the Areas lookup table (again, it’s stored in the Areas Lookup sheet), and returns the appropriate item (e.g., state or region name) from the table’s second column. Copy the formula down the C column.

That looks good, but the issue I raised about 550 words ago remains unrequited: Namely, in the course of conducting the Text to Columns routine why did I insist that we invest the first, Code-bearing column with Text status in the first place?

Here’s why: left to its default devices, Text to Columns will format its columns in General mode – meaning that for each column Excel will try to determine the data type befitting the data therein. And by default Excel regards the first column – the one we’re calling Code – as a numerical field, and thus will, for example, streamline the very first Code:

0010021895

to look like this:

10021895

Because Excel deems the first two characters as leading zeros – numerically meaningless frippery – Text to Columns simply throws them out, leaving us with an 8-digit code for Alabama, and labelling its state identity 1, in lieu of the original 001. But New York, coded 030, suffers a loss of but one leading zero, and its data are thus remaindered into a 9-digit value. And the Eastern Region – code 121 – hoards all 10-digits, and all this digital disparity causes big problems for the VLOOKUPs – not insurmountable ones, mind you, but the round of additional hoop-jumping required to set all the data straight isn’t worth the bother. But once we format the Code data column into text form, thus overruling Excel’s General default, each record preserves its original 10-character span, and frees us in turn to invoke the 3-character state codes in every case – a more consistent take on the data.

And yeah – there’s a bit more tweaking to be done, but once my word count hits four figures I know the natives get restless, and the collective slosh of watering eyes gets disconcerting on this side of the lectern. I’ll close this post, then, with a different kind of question, this one historical: you’ll observe that the temperature data for Arizona (002) dates back to 1895, as it does for all the other states. Yet Arizona did not enter the Union until 1912, a point I raised in an email to the NCDC. Meteorologist Mark Seiderman got back to me quickly, noting that data were nevertheless tracked in Arizona in its pre-statehood phase, but at a lower “station density” – meaning fewer readings were taken.

OK, that’s it for today; meet me at the rathskellar in a half-hour.

4 Responses to “Cracking the Code, Part 1: NCDC Temperature Stats”

  1. Lital January 18, 2013 at 4:31 pm #

    For some reason when I apply the VLOOKUP formula in the State/Region column nothing is returned and the formula remains. The code is formatted as text and I didn’t have a problem with inputting the year in the other column. Thoughts?

    • Abbott Katz January 19, 2013 at 10:49 pm #

      Hi and thanks for your note. Is it possible that you transposed the order of the two inserted columns, that is, inserted State/Region to the immediate left of Code, and followed by Year? Also, did you format the column containing Years as numeric? One other point – I think the state and region codes in the Areas lookup table were mistakenly formatted as numeric when they should be text. I’ve posted a revised version of the spreadsheet to the blog post in the new format. Otherwise, I’ll be happy to look at the workbook as you have it if you email it to me.

      • Lital November 4, 2013 at 5:59 pm #

        Thank you! That did help.

  2. Abbott Katz November 4, 2013 at 6:57 pm #

    You’re welcome – glad it’s working.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: