The Case of the Superfluous Space

30 Jul

Welcome back. Before we cut to the chase, a few words of orientation need be entered into the record here. As this blog is about the business of espousing Excel and the journalistic uses to which it might be put, mission accomplishment will be pinned in large measure to the aggregate savvy of its readership, something that can’t be presently known. Much of the exposition here assumes, for example, some understanding of pivot tables on your part, an assumption I can’t take to the bank, however. There are book-length treatments of the feature by Deborah Dalgleish and Tim Hill, and a host of web-based explications which I suspect are of variable quality. I’m about to dicker with my publisher for permission to link to my chapter on the topic in my Excel 2010 Made Simple volume, but that disposition awaits. In any event, your feedback about where you’re at is most valuable.

Another preliminary note: I’m using Excel 2010 (and spending some additional time with the 2013 beta iteration, freely downloadable at 2007 users won’t experience vertigo. But if you’re standardized on a pre-2003 release, you’ve been singing the interface blues for some time now, alas – try to bear with me.

In any event, it’s time to say a few words about data entry – that irremediably prosaic but downright integral concomitant of the spreadsheet enterprise. Remember that the spreadsheet journalist will very often make recurrent use of someone else’s data – data likely instated somewhere on the net and comprising data organized around someone else’s purposes, which may or not sweetly dovetail with yours. Thus your mission – should you choose to accept it – may oblige you to tweak those data and incline them toward your reportorial intent; and that intent in turn carries with it a mandate to see to it that the data work.

Let’s begin to see what I mean by downloading the University Tables 2012 workbook from the Guardian’s (UK) Datastore page. Among other holdings, the page warehouses a wildly ecumenical catalogue of spreadsheets drawn from a welter of sources and saved to Google docs format, but downloadable easily enough into Excel motif (click its File menu command and you’ll figure out how).

Turning to the workbook, you’ll find that the curiously-named 6 Ranking worksheet arrays British institutions along a variety of parameters, including their ranking across three years and what’s called Uni(versity) Group. These are consortia of universities that lobby for their respective interests, the most of tony of which appears to be the Russell Group (numbering Oxbridge, etc.)

You’ll see that the analytical and story possibilities abound, but for starters, I merely wanted to census the Uni groups by membership size. Tossing the pertinent variables into a pivot table, I came up with this count:

All of which raises the musical question: What’s wrong with this picture? Apart from the blanks, which seem to simply suggest that some universities have elected to remain ungrouped, a larger conundrum spooks the data. There is, after all, a near-cosmologic assumption about pivot tables: that items assigned to the row/column label areas appear once, the better to foster the aggregating activity that pivot tables mean to perform. Here, though, we see the 1994 Group cited twice, and the Million + association thrice. Why?

Now the latter discrepancy is partially explained right away: we see that some for 18 institutions the Million + name has been inscribed in contiguous characters, i.e., no space separates the Million from the +. And in data entry terms Million + is completely different from Million+, and regarded as such by the pivot table.

But what of the 1994 Group? In fact, a similar blemish defaces the data here. It turns out that the entry for Loughborough (don’t try to pronounce it) was typed 1994 Group[space], accounting for the one instance of that uni group that’s been estranged from the other 16. Check it out; click cell F11, tap the standard F2 edit key, and you’ll see:


Trust me – there’s an extra space right there


Note how the cursor distances itself from the plus sign. You’re looking at a superfluous space, an invisible, confounding character that can cause big problems for data organization – because again, 1994 Group is simply nowhere equivalent to 1994 Group[space]. They’re treated as different data, and can’t be grouped as a result And the same issue besets the Million + disparities as well. Five of these were entered Million +[space], and five sans space.

The simple remedy – even if large numbers of records were so affected – is to execute a standard search-and-replace operation. Turning to the Million + inconsistency, click the F column header, thus selecting the entire column, and click Ctrl-H. In the Find what: field type Million +[space], and enter Million + in Replace with: . Click Replace all and the deed is done. As the dialog box remains open, now type Million+ (that is, set the + hard against the text) in Find what:, and post Million + in Replace with:. Again click Replace all (you should have executed 18 replacements here), and you’ve instituted a felicitous, workable uniformity to the data. Refresh the pivot table and you’ll see:

That’s more like it

Note I’ve also eliminated the truant space from that solitary 1994 Group[space] entry.

Now you can start to rock and roll with confidence.

I’ve seen this needless-space complication more than once, which, if undetected, can be the cause of no small operational vexation. The moral of the story: if your pivot table is seeing double, the chances are that the data aren’t really duplicative – rather, they’re probably really different. That’s what an extra, unseen space can do.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: