Lights, Camera, Data: Paris Movie Locations, Part 1

20 Feb

You won’t be learning anything you didn’t already know when you’re told Paris is a first rank cinematic – or Seine-ematic – go-to site for hundreds of filmmakers.

But hundreds is but a number, and so once the discussion turns quantitative a turn to our favorite application might help us texturize the clichés with a sprinkling of data-based somethings we didn’t already know about the Paris big-screened experience. The Lieux de Tournage de films (long métrage) – (very freely translated, Locations for Feature Films, these dating back to 2002) –a spreadsheet on the bill at the  Paris Open Data Site that names and counts the Paris streets and allées backdropping all those movies – is playing here:

http://parisdata.opendatasoft.com/explore/dataset/tournagesdefilmsparis2011/?tab=table

Download the sheet by clicking the Export button and deciding next between the Télécharger en CSV or Excel links, stationed immediately beneath those respective headings.  Experience would point me to the CSV alternative, because I’ve met with occasional file-truncation issues in the Excel mode and portlier file sizes besides. Start by whisking both into your system – but make sure to save both varieties to XLSX format (after you’d swerved past roadblocks like the one pictured below), and you’ll be advised that the dedicated Excel offering checks in at 612 KB, about 30% weightier than the CSV 477 KB – but either way, your hard drive will find itself swallowing an eminently digestible byte count.

                       loca1

 Now, while I’ll ultimately want to migrate to the Excel version in this case – its data have already tucked themselves into their appointed columns, for one thing – there is an attention-getting contingency in the CSV that you may want to think about.

First open the CSV take and you’ll see (in excerpt):

 loca2

As often seems to be the case with Paris’ open-data workbooks, we’re brought midway into the plot and asked to supply its denouement – by separating the data above into suitable columns, via the Text to Columns feature (click on cell A2, for example, to sample its semi-colon-separated, imminent fields); launch Text to Columns (again, return to the January 13, 2013 post for a procedural review), holding to the Delimited default and ticking the semi-colon box in Step 2 of the Convert Text to Columns wizard). Answer OK to the Do you want to replace the contents of the destination cells? question, as I did initially, and you’ll get here, in excerpt:

 loca3

 Don’t worry about the shortfalls in column width here; look instead at the geo_coordinates field, comprising location latitudes – and only latitudes, the cartographic equivalent of a stick with one end. This is literally a case of you can’t get there from here – because, after all, spreadsheet mapping of any species is adamantly two-dimensional; got no longitudes, got no map. So wither the longitudes? I tweeted this question to Open Data Paris and received a couple of kind replies (one from a private Tweeter), neither of which quite relieved my curiosity, perhaps because I hadn’t specified I was looking at the CSV variation on the theme.

But when I reloaded the CSV offering and eyed the data anew I actually beheld those fugitive longitudes, presenting their credentials right there in column B. And where did had these data gone, then? They had, in fact, been dispossessed by the Text to Columns routine, at the point when we chose to click OK to the “Do you want to replace the contents of the destination cells?” request, an inquiry which means exactly what it says – namely, that Text to Columns proceeds to overrun and overwrite all the entries in the columns into which it parses and empties its data. Bottom line: indulge your itchy trigger finger on Text to Columns, and you wipe out the longitudes in column B.

The way out? I’d copy the longitudes first, paste them somewhere far away (onto a new sheet, perhaps), step through Text to Columns, and copy and paste the longitudes alongside what should now be the latitudes field (that is, the next available column).

But that episode was a trailer of sorts. Let’s get back to the main attraction – the Excel-versioned download – for some close-up shots of the data.

First, I’d click away the Wrap Text effect, which seems to have been played upon each and every cell in the sheet (note the fractionated text in the F, G, and I columns, a time-honored Wrap Text correlate) For an en-masse Wrap Text turn-off click the Select All button, that empty rectangle minding its own business in the sheet’s upper-left corner:

 loca4

Click Wrap Text, double-click of couple column autofits, and all the data will begin to squint in the daylight. Next, we see that the geo-coordinates field embraces both site latitude and longitudes (in that order), and so a necessary separation need be effected with Text to Columns (the data are Delimited, and check the comma delimiter in Step 2). Label the new independent columns Latitude and Longitude (and you’d probably want to autofit Latitude in column J).

Now those pesky green error indicators roofed in the upper-left of all the cells in the H, arrondissement (district), column signal the storied Numbers Stored as Text message, a warning we’ve confronted several times before, first in my August 14, 2012 post. I’d allow that we can let that “error” stand here, simply because we’re dealing with what are in effect Paris zip codes, and as such they’re not likely to play an authentic quantitative role in any later analysis and can remain textual. If you want to suppress the indicators, though, the protocol for doing so is actually slightly tricky. Once you select all the affected cells you need only click that exclamation-marked error flag and select Ignore Error, whereupon all the green meanies will get the heave-ho. But you can’t take the easy way out, by selecting all the H cells by clicking the H column heading; for whatever reason that simply won’t work. What you can do, though, is implement a little-known but elementary selection tool by clicking in the Name Box, the white strip just atop the Select All button:

loca5

Enter

H2:H8618

in the box, the range down which all data-laden cells happen to stretch (there are some blanks in the range, by the way), and press Enter. Once that cell-specific deed is done, click the error marker and select Ignore Error.

Now for some analysis. We could start by breaking out the numbers of scene shoots by their arrondissements, thus drawing up a kind of directors’ site popularity index. An easy pivot table suggests itself here:

Row Labels: Arrondissement

Values: Arrondissement (Count, even if you’ve retooled these into numeric values)

The results could be then sorted (in excerpt):

loca6

We see that the 8th Arrondissement (arrondissements are prefixed 750), subsuming the Champs Elyseés among other venues, ranks as number one scene stealer, with the 18th (think Montmatre) and the posh 16th holding down positions two and three (observe that the Eiffel Tower, standing bestride the 7th Arrondissement, finishes out of the medal running in a far away 9th place; but the stirring sightings of the structure obtain from the 8th, just across the river). It’s also more than a little noteworthy that all 20 of Paris’ arrondissements were deemed scene-worthy, each attracting at least 185 shootings.

But scale down the list and some problems – numerically few, to be sure – compromise the data. The arrondissements are coded in five digits, and so one should be right to assume that the outlaw 750008 entry has been overloaded with an excess zero, a defect easily repaired by conducting a search 750008/replace 75008 mission. By the same token, I’d almost bet the house that the 7502, 7518, and 7009 abridgements are really 75002, 75018, and 75009 entries, but I’d keep my money in my wallet if you’re asking odds on  750052, which could be 75005 or 75002. Faced with that kind of geographic indeterminacy I’d run the address in question – Pont (Bridge) de l’Archevêch – through Google, and come away with 75005, which is just what I did. And here it is:

loca7

Nothing if not scenic –but that explains why the 743 different movies in our data decided to plunk their cameras there. (Click here for a photo source attribution.)

And the 750O7 record is surely a well-intentioned 75007, with a capital O monkey-wrenching the code. (But the 92 and 93 prefixes aren’t misplaced, by the way; they reference Paris suburbs.)

But there’s another problem with our pivot table, a conceptual one. But this post has gone on long enough; do we need to cut 20 minutes from it?

Hope to a have a sequel soon. In the meantime, here’s some popcorn.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: