Scratch a data viz and you can bet the house you’ll uncover a spreadsheet, chilling compliantly beneath. Yeah, a spreadsheet – the primeval, character-freighted, yet indispensable ground atop which the viz’s figure raises its superstructural prodigies – is happy to play silent partner to the graphical histrionics emoting above. And that’s fine. But training your gaze worshipfully at the viz means turning your head from the elephant stalking the room – the prosaic, but inarguable, truth that spreadsheets can do things that data vizzes can’t.
True, that avowal is by turns self-serving, near-Luddite, and determinedly misses the point that vizzes can do things spreadsheets can’; and in the interests of cross-platform amity I’ll plead guilty to all of the above. Using what you use depends on what you want to say.
Still, let us not undersell the spreadsheet’s value in the data-communicative process. Even a homely sort says something about the numbers (or text) that data maps can’t – although a chart admittedly could, once you get past the inescapable necessity to sort the data first.
It’s apples vs. oranges, to be sure, but consider Métropolitain.io (requires Google Chrome), a viz-driven site picturing the hurly-burly of the Paris Métro, e.g.:
I assume the distributed intensities of light scale passenger activity either to or from the selected station, in this case Gare du Nord, Paris’ busiest, but I’m not sure. I suppose that’s my way of saying I don’t quite understand the viz (which to be fair, has a number of parts – check it out). But Métropolitain’s blog kindly points to the spreadsheet from which the viz springs.
That’s the RATP (Regie Autonome des Transports Parisiens), or Paris transit authority, site. Click the XLS link and you’re there.
The data are heaped into two tiers worth of sorts, the first ranking stops on the Paris Métro by rider traffic, that is, entries into the respective stations (these are obviously official, and hence deeply understated, totals. Fare-beating is so pandemic across the RATP that some evaders contribute to an insurance pool that indemnifies their fines when and if they’re detained. See this piece to learn more). The second sort arrays traffic on the RER, the system that connects Paris’ suburbs to the city.
As for the sheet field headings for those of you who ne parlez pas francais, Rang is rank, Réseau means network more or less, Ligne de correspondances denotes the train line numbers/letters associated with each stop; multiple entries signify line transfer options at the stop. Ville is city, and Arrondissement pour Paris refers to the city’s districts, 20 all told.
In order to get the data to do what you want you’ll need to do some fine-tuning first. For one thing, the Cluny La Sorbonne stop on row 197 should say Paris in its Ville field, and its arrondissement should read 6 (note the customized numeric format at work here; typing the 6 yields 06).
Next, the field headings need a bit of refurbishing; by selecting B3:L3, turning Wrap Text off and conducting an auto column fit, you’ll restore those misshapen headings to wholeness. Then perhaps more importantly, click F3 and turn Merge and Center off; five newly-independent columns reappear, each in need of a heading. In the interests of keeping it simple, you can name them Line (or Ligne)1, Line2, and so on. (note: auto-fill works here; if you enter Line1, click back on that cell and drag its fill handle across the next four columns, Line2, etc. enumerate themselves) .
But these columns harbour another curiosity, recondite and untoward. A great many of the cells dotting the G4:J370 range – the last stop, so to speak, takes the data to row 370 – remain ostensibly empty, for the simple reason that only a very few stops serve as the nexus of five lines. But the RATP, or whoever it was who framed the spreadsheet, chose to fill every deservedly blank cell with the number zero, and conditionally formatted each one of these to affect the color white:
(That’s the actual color white tincturing the Format field up there.) But why does any of this matter? It matters if, for example, you want to correlate stations’ traffic by the number of lines making tracks through them; and if you’re planning on running a simple COUNT for each station, those zeros are no less countable than any other, actual line number holding down a cell. Thus right now, every station will number five lines – and that’s not right at all.
There is, one could intelligibly assume, a standard remedy awaiting its call-up – selecting the relevant range, triggering a Find and Replace, finding all the zeros, and replacing it with nothing. Sounds intelligible to me, but it won’t work – because foraging for zeros will also capture, and zap, the zeros in the ones column of the number 10 – and there is a line 10 on the RATP.
But there is a Plan B – a bit step-intensive and short on elegance, but it works:
- Select columns G through H (you needn’t bother with column F; all its cells are filled with actual line numbers) and
- Find 10 and Replace with some extreme, frivolous value, say 555 (whatever the selection, make sure it’s shorn of zeros).
- Find 0 and Replace with nothing
- Find 555 and Replace with 10.
- Then name column M Line Count, and write
=COUNTA(F4:J4) in M2, and copy down the M column.
That appears to be the shortest distance between two points here. You need COUNTA because some stations, e.g., 3bis, are text entries, and COUNT only treats those cells storing values (Note: this problem can be more neatly tied and bowed with an array formula, but we haven’t the column inches to describe those here.).
Now we can go about correlating Paris Métro passenger traffic by the number of lines streaming through each station. We can try this pivot table:
Row Labels: Line Count
Values: Line Count (Count)
Report Filter: Réseau (Métro)
Those filters confine our scrutiny to Métro stops roped inside Paris’ city limits. We need both, because some Métro stops penetrate the city’s suburbs, and because many RER stops brake into Paris proper. I get:
Unsurprising, perhaps, but interesting. The more stops, the more traffic, though if you divide average traffic by number of lines the traffic average steps up with each additional line.
But there’s more you can do with the data, even without ramping them up into the vizosphere; hope to say more la prochaine fois.