The British census has done its work for 2011 (their remit carried out in years ending with 1), and with it has bestowed upon us a rather grand favor, or favour, as they would have it. Its prodigious researches have in very large measure found their ways into Excel spreadsheets, and been parachuted into the public domain, headed right into our importunate hands:
That domain is indeed public, and colossal – a mighty myriad of hyperlinks cascading down its endless directory (including pre-2011 data, too), massifying into one giant, irrefutable conclusion: for journos, and academics, there’s got to be something there.
And doubtless there is; one could imagine a moneyed news organization stationing a reporter on the Census beat alone, pressed with the mandate to liberate all those stories yearning to breathe free behind the rows and columns.
But with the plenitude comes the questions. I’ve skulked around some of the data – a neutrino-sized fraction of them to be sure – and can’t help entering what is at base perhaps an unfair demurral: that the spreadsheets I’ve brought to my attention on the Census site aren’t terribly suited for what I’m calling here the next move. That is, that data they present, as they are presented, often don’t easily facilitate the sorts of secondary analysis this blog promotes: the pivot tabling, the charting, the value-added tweaking, the movement from the data as we find them to a round of conclusions that originally weren’t quite there.
As indicated, that cavil could be ruled out of order, driving itself as it does from the assumption that the data are there to be put through a new set of investigatory paces. But that may not be the case. It is possible the Census had, or has, something else in mind: an interest in simply styling the data lucidly, to be read and learned from and reported on as the reader finds it, and nothing more.
Could be. But where does that put the journalist? In an earlier day, when popular access to data was either check-pointed or even proscribed, the journalist could play the part of messenger, bringing information to light that might otherwise be consigned to the shadows. But today, when we’ve all been entrusted with the keys, we might thus come to expect more of the journo, the data professional.
That’s not a novel desideratum, of course, but in any case if we do want to make more of the data they have to cooperate, or be made to cooperate; and if the trammels pull too tightly the job gets harder.
Just a few examples, none of which would necessarily bring the enterprise to a halt, but which points to the kind of work that needs in order to get data in shape for that next move:
Note this excerpt from a Census report on religious affiliation:
Again, the numbers are, by themselves, intelligible, but the next move is more uncertain. The data break themselves out by regions (e.g. A NORTH EAST), but then undergo a second subdivision: the regions are parsed by localities (e.g., Hartlepool, Middlesbrough, etc.), subjecting the numbers to a classic, nettling redundancy. The same totals are gathered twice – really, more than twice; the All People total for the North East is accumulated again, through a summing of the localities, and again, by a summing of the Remainder and Urban areas subheads (these indented very curiously by a serial tapping of the space bar – a textbook no-no). Disentangling all those totals makes the job harder to be sure, if you want to do more with the data.
In addition, many of the workbooks I’ve seen suffer from another classic, if remediable, error – the laying down of blank rows and columns between data-bearing cells, e.g. this excerpt from a workbook on youth education and employment:
And check out the dates. An expression such Mar-May 1992 isn’t a date – it’s text, and as such can’t be grouped in a pivot table, for example. A lethal miscue? No, but an impediment.
And how about this one, from a workbook categorizing crime in England and Wales:
That category heading isn’t availing itself of the Wrap Text feature; rather, it comprises text that’s been entered in two different rows. Put simply, you shouldn’t do that – it instates the same record, or at least part of the same record, in two rows, and that’s just too problematic.
In his cite of my Wednesday post, in which I served up the Guardian’s Olympic athlete data, digital journalist savant Paul Bradshaw asked his readers to “Let us know if you do anything with it!”. That is indeed the point – as they presently stand, what can you do with the Census data?
In sum, I can’t help wondering if the Census and like concerns would stand to gain by some alternative thinking about their spreadsheet data and how they’re realized, in the interests of instilling a measure of analyst-friendliness to the files. Would it be too high-minded to recommend that the Census, et al, place the matter before consultants, who’d be deputized to perform the relevant makeovers? On the other hand, a reconsideration of a few thousand workbooks is no small gig, and with the government locked into austerity mode the expense would likely be deemed too discretionary to justify the outlay.
Maybe then it’s time for me to step up and offer my services. And I’d prepared to offer the government a break – I’ll bill them in dollars, and eat the exachange rate.