Edward Snowden might beg to disagree, but it seems to be open season for open data. Fasten the name of your favorite city or country to that term, toss it into Google, and odds are you’ll be brought face-to-screen with a site showcasing scads of info about said venue. Indeed – this US government-compiled site:
will link you to over 280 city, state, and national vaults of information (though we’ll leave the altogether reasonable question as to how open is “open” to Mr. Snowden).
One of those link-bearers is none other than Paris, whose site
is not to be confused with that city’s transportation-specific RATP pages, the object of two May posts.
The good news is that a great many of the opendata.paris datasets come to you in spreadsheet form. (The bad news, depending on your linguistic chops, is that the site is in French.)
And it was in the course of an initial foray across the site that this sheet made a claim on my attention, which I have re-presented in Excel form:
Its name, Plaques Commemoratives de la Seconde Guerre Mondiale – Commemorative Plaques from the Second World War – presages its story, told by the 1200 plaques riveted into Paris addresses at which something dreadful happened. 347 of the plaques, for example – more than 28% of the total -bear this inscription:
Eleves de ces ecoles morts en déportation parce que nés juifs
Rough translation: The students in these schools were deported, and died, because they were born Jewish.
The psycho-historical impulsion driving the installation of the plaques, veneered atop France’s equivocal behavior during the war, is for another post (see this one for a bit of deep background); but whatever the accounting, the plaques are there.
And analyzing them takes a bit of work. The workbook comprises three sheets (Feuilles in French), one of which lists but 48 plaques and halts somewhere in the 2nd Arrondissement, or district (of which there are 20 in Paris). The other two apparently register 1219 plaques, but the sheets are not identical, though it very much appears as if they could have alloyed into one without experiencing any loss of telling information. One sheet, number 3, is the cleaner of the two, matching addresses to each of the plaques along with its arrondissement and remaining faithful to the French character set; but number 4 affects the supplementary virtue of citing address latitudes and longitudes.
Starting with sheet 3, and assuming we want to ultimately subject the data to a round of pivot table breakouts, by say, arrondissement, we need to insert a row above 2, thereby detaching data artisan Martin Maginier’s Twitter handle in A1 from the records. Next, select all the data and turn off the Wrap Text effect glowing in the Home tab > Alignment button group (the addresses have been duplexed into tiers within their cells in the B column, an Excel loss-in-translation from the spreadsheet’s native Google Doc platform).
Now because I also have reason to believe that some of the 1219 records are duplicative (a suspicion promoted by apparent discrepancies in the Sheet 4 rendition of the data), we’d be advised to roll the data through the Remove Duplicates utility (Data tab > Remove Duplicates in the Data Tools tab, after clicking anywhere in the data), having in mind to tick the “My data has headers” box en route, because if you don’t Remove Duplicates will present itself thusly:
Absent My data has headers, Excel assumes in our case that the uppermost row is nothing but another record, and hence vulnerable to excision once Remove Duplicates starts removing. That’s because the first row exhibits the selfsame formatting as all the other data, i.e., bare, frill-free text. Clicking My data has headers, then, instructs Excel about the special character of that first row, its appearance notwithstanding, and properly seats its field names in the dialog box:
Had the header row been distinctively formatted at the outset – even by the mere imposition of a boldface, for example – Excel would have ticked My data has headers without prompting.
Now the key to Remove Duplicates is to appreciate which fields need be inspected for trouble-causing, as opposed to perfectly well-intended, duplications. For example – if you tick Arrondissement alone, you know what’ll happen – about 1190 records will be carted away, because each of the 20 arrondissement names are all but sure to appear recurrently in that field. In our case, then, we’re best off probing for duplicate records that exhibit equivalent data in all their fields – because those one are surely superfluous.
And once we actually get around to clicking OK, Excel broadcasts an immediate prompt, announcing the offloading of 13 duplicate records.
Now we can begin to do something about the Arrondissement field, its contents freighted by those “…ème ARRONDISSEMENT” qualifiers that, in my view, distend the records unprofitably. It seems to me that a simple 2 in a cell would serve the analysis every bit as well as 2ème ARRONDISSEMENT; the question is how to pare the excess text.
In fact there are a couple of ways to realize that end. The fleetest, but perhaps heaviest-handed, tack is to jet through a find and replace routine:
- Copy say, any entry for 2ème ARRONDISSEMENT among the data.
- Turn on Find and Replace and paste the above into the Find field (yes, you can paste here)
- Delete the number 2, because we want to conduct a generic Find across the all arrondissement data, all of which display the above suffix.
- Leave Replace with blank and click Replace All.
- Then type or copy 1er ARRONDISSEMENT in the Find field, because we need to contend with that “er” following the 1.
- Delete the1, leave Replace with blank, and click Replace All.
It sounds longer than it takes, and once completed you’re left with a column of freshly shorn digits.
But if you’re interested, a couple of formulaic alternatives could also be put to the task. Here’s one: Name column E Arrond or something like it and enter in E3:
To interpret: only double-figured Arrondissement data exhibit a length of 20 characters (trust me; I’ve checked). Thus the formula above stipulates that two characters be wrenched from the left of any such expression – even as any other-lengthened entry will be made to surrender but one – that is, one digit.
Either way, once the arrondissements are relieved of their textual fetters (for the sake of the exposition I’ll go with Plan A, the Find and Replace), you can blueprint a simple but noteworthy pivot table (you may need to un-Wrap the text of the field headers here, too):
Row Labels: Arrondissement
Values: Arrondissement (Count)
Keep in mind that numerically-consecutive arrondissements aren’t necessarily adjacent in space; the 13th is at some considerable remove from the 12th, for example.
And I hope to have more la semaine prochaine.