Its Byte is Worse Than Its Bark: New York’s Tree Census

15 Dec

I am two with nature, professes Woody Allen, and this New Yorker is only too happy to corroborate his math.  I know nothing about the flora that roots itself in the city’s spaces, and with my gaze locked in perceptual auto-filter, I hardly see them at all. The only branch I’ve noticed lately is the one that belongs to my bank.

In other words, I can’t see the trees for the forest, but they’re there – about 623,000 of them, and the New York City Open Data site’s borough-wide arboreal inventory fills me in on each and every one of them, should I at last decide to make the acquaintance. Look here for the data:

https://data.cityofnewyork.us/browse?q=Street%20Tree%20Census&sortBy=relevance&utf8=%E2%9C%93

The census has pretensions to completeness, and given its address-level molecularity I’d guess it comes pretty close. In the interests of cross-county comparisons (borough equals county) I downloaded and copied-and-pasted the records to a consolidating master workbook, thus harvesting all 600,000 or so trees into one Sequoia-large data set, one that at 60 MB would do severe ecological damage to Word Press’s downloading channel. In other words, you’ll need to consolidate the above link’s offerings on your own.  (Note: the Staten Island data originate in DBF, or dBase mode, but follow the standard File> Open check-in with a specification for dBase Files (*.dbf), and the data will effect its proper spreadsheet mien. Note in addition that Staten Island’s Treepit numbers – referencing a field about which I know absolutely nothing – were minted in label format, and I don’t know why; but I’ve yet to work with them, either, so the matter remains moot, though curious.)

And the meaning of the entries in the set’s fields isn’t entirely forthcoming; preeminent among the recalcitrant is Species, whose coded tree names had me searching for a Boy Scout, at least until stepped on this document in the wilderness:

http://www.nycgovparks.org/sub_your_park/trees_greenstreets/treescount/images/species_list.pdf

Courtesy of none other than New York’s tree census.

After saving it all as a text file and pasting it into a new workbook sheet, feeding the names into a text-to-columns grinder and mashing the columnized names via a concatenating ampersand or two (e.g. =B3&” “&C3&” “&D3&” “&E3), I nailed and glued the whole goopy effluvium into a tottering VLOOKUP table that seems to be standing up to New York’s December gusts and showers. After next inundating the sheet’s next available column (which I titled Species Name) with lookup formulas that exchanged the codes for the full species names, and paved the formulas over with a copy Paste Special > Values hard-coded surface, the deed was done.

Now what? Well, note for starters that the folks of the New York design studio Cloudred  have cultivated an interactive chart of the trees and their species/borough distributions, leaving your character-based correspondent here to wonder about such other tree novella in there as might be learned. So how might the tree census tally broken out by the city’s zip (i.e. postal) codes work? Try running this by or through a pivot table:

Slicer (or Report Filter): BOROUGH

Row Labels: ZIPCODE

Values: ZIPCODE (Count)

Then try Slicing Manhattan, for example, and sorting the results Largest to Smallest. I get, in excerpt:

 trees1

Now here is where some New York-percolated smarts would hand-craft your understanding of the numbers should you want to key tree demographics to zip codes and their constituent neighborhoods (and if you’re in need of percolation, look to this zip code/service-area legend:)

https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm

Heading the count is 10021, coding none other than Manhattan’s fabled Upper East Side, among the country’s poshest districts and 50% greener than silver medalist 10025, right across Central Park in the island’s Upper West Side, itself literally just above third-place 10024, somewhere in the middle reaches of the Upper West Side. The sociologists among you will wonder if class, politics, or the politics of class have committed their invidious inputs to the tree populations, or rather if territorial receptivities to the planting enterprise better explain the variance.  And for the curious, Staten Island –  containing far fewer human New Yorkers than any other borough but second only to Queens in the hardwood department  – can market its 10312 as the New York’s most tree-swarmed zip code, reporting 23,156 of those life forms to the authorities. (Note by the way the 24,000 zip-code-less entries in the data set, though, and I see no enumerations of park-resident trees, clearly an enormous and apparently uncounted cohort. It appears as if our data surveys address-bearing trees alone, unless I’ve got it way wrong.)

And you’ve been asking about New York’s tree diameters, and I hear you. They’re recorded here in what must be inches, and we can kick off with something simple:

Row Labels: BOROUGH

Values: DIAMETER (Average, squeezed to two decimal points. I get

The averages diverge notably, though you’ll need a greener thumb than mine pressing these keys to understand and explain why. Perhaps Manhattan’s penumbrous skyline stunts tree girths, but again you’re hearing that supposition from the wrong person – just ask Woody. If, on the other hand, you try this table out:

Row Labels: Species Name

Column Labels: BOROUGH

Values: DIAMETER (Average)

trees2

You’ll find some inter-borough comparisons across the same species that place the Manhattan diameters on the low side.

A few final points: As could be anticipated with a 623,000-record data set, some distinctly outsized numbers appear to have crept into the park. The 148-32 86th Avenue address in Queens’ Jamaica neighborhood boasts a 2100-inch-diametered  tree, and my Google Maps photo just doesn’t capture it. At 175 or so feet in width, I think it would show up.

The last point: the data with which we’ve been working hold up the back end of what New York Open Data showcases in the first instance as a set of borough tree-positioning maps of the census. Here’s the one for Manhattan:

trees3

Maximum coverage – maximized information? That’s called a rhetorical question.

Putting OEDC Asylum Data on the Table

5 Dec

Shopping for spreadsheets? If you’re in the neighborhood, poke through the inventory at the Organization for Economic Co-ordination and Development, acronymized OECD, you’ll be happy to know. We’ve been here before, on July 25 to be exact;  but I told you it’s a large inventory.

In fact there appear to be several points of ingress to the holdings, but here’s one

http://data.oecd.org/

a self-described Public Beta page that enables you to “Find core datasets in the data warehouse and make your own tables and charts in OECD Stat Extracts”, a triflingly ambiguous inducement, appearing as it does to conflate the term “datasets” and the “Database” informing the title atop the quoted passage, but again, that’s a trifle.  More to the point is the availability of the data in either unmediated spreadsheet form, or downloadable as such via an Export button.

In any case one such set, a brief of statistics on asylum seekers to OECD circa 2013, seized the attention of the Guardian’s datablog (note that site’s redesign, and the disappearance of its erstwhile Data link from its home page), and is the one I think I’ve managed to capture here -even as the OECD heads its source page Stacks and flows of immigrants, 2002-2012 and so seems to have made off with a year (I’ve also retitled the workbook, overwriting its long, number-beset native name):

Inflows of Asylum seekers in OECD members and the Russian Federation

A first view of the sheet exposes it in Page Break Preview, a perspective on the data that won’t particularly advantage the analyst, and so you’ll want to make your way back to the standard Normal View. Moreover, the recourse toward the Freeze Panes selection at the columnar intersection of A and B seems puzzlingly needless, as the sheet bares all its numbers and country names immediately. I’d unfreeze the panes right away, even as I continue to wonder about the dictionary form which Microsoft mines its verbs. What would your teacher say about unfreeze? In addition, the missing field header in column A needs to be supplied.

Note, on the other hand, the proper distancing of both the sheet’s title and its yearly grand totals from the body of the data with the expedient of choice, the blank rows. But at the same time recall the OECD’s invitation to its visitors to “Find core datasets…and make your own charts and tables…”.  Again, our sheet assigns its yearly data to one field per year, when in fact all the years and their associated numbers would more profitably be reined into greater, subsuming fields. Not this, then:

oedc1

 

Butt again, this:

oedc2

(For a straightforward review of the means toward the above transformation over and above what I’ve said about it, look here.)

The companion question, then, is what plans the OECD have for the data once they hand them over to us. True, the workbook’s as-it-stands configuration does happily supports a round of charting, but less certain are the data’s congeniality to making “your own tables”. Once, however, the information fills the pared-parameter shape assumed by the second of the above screen shots, the prospects for this kind of pivot table loom a lot larger:

Row Labels:  Country

Column Labels: Year

Values: Values

Now this facile breakout of the asylum numbers by year is ready to happen, because all the years share the same Year field as an item. Try the above pivot table that when the each and every year occupies a field to itself.

But hold on. When the pontificating is done, my self-assured amelioration cooks up nothing but precisely the worksheet that already presents itself at the point of download:

oedc3

Same numbers, same juxtaposition of fields. So what was my point?

That’s a sweat-inducingly good question. But once all those years and their concomitant data recuse themselves from fields into items you can also treat the values to the % of Row Total styling, and learn, for example, that 30.88% of Turkey’s asylum seekers during the 2002-13 interval made their way to that country in 2013 alone, a clear attestation of the Syrian conflict’s scattering effect upon its populace. Less obvious is Hungary’s 2013 percentage of 37.22 of all the seekers it received across the 12 years, a skew I can’t yet explain.

Now consider this OECD book, counting member populations by their residents’ country of origin:

OEDC Immigration data 2011

Compare that far larger dataset’s field structure with that of the one above, e.g.

oedc4

That’s really more like it, even if the data themselves could stand some clarification (I’d bang out a lookup table that would return country names in full, and the data are available on the OECD site in a PDF for doing so. Is AUS, after all, Austria or Australia?).

And these variously-structured workbooks drag one ore question into the fray: why should some OEDC workbooks be stitched along this, or that, pattern of design – fields as items, or items as fields? I don’t know, that answer; nor do I know if the OEDC has endeavored to think about it.

But I’ll be happy to think about it, folks.  Here’ – I’ve attached my business card.

 

Eurostat Data: Some Footnotes and NUTs

30 Nov

Sorry to butt in in the middle of breakfast/lunch/dinner (tick one – I’m requesting your location just once, and my intentions are cookie-free), but allow me to ask: if a web site makes its data – the same data – available in both spreadsheet and PDF incarnations, what are we being told? It seems to me that the latter possibility speaks to the former; that is, the shipping of information into a read-only, do-not-touch mode suggests the other shipment is meant to be used.

One man’s opinion; and having so opined, I took my self-satisfied take on the matter to the data depot in the European Union’s Eurostat site, the statistical office for that far-flung association. There’s a lot of stuff in there; climb its tree structure and you’ll come away with all manner of facts and figures and nuts and bolts about member nations, a great deal of which (perhaps even all) clothe themselves in spreadsheet raiment. And again for seekers of alternatives, a good many of these parcels of stuff can come to you in PDF suit-ups too.

And so I shot an arrow into the air, and it came down upon this spreadsheet

Eurostat_Table_guest establishments

(It’s original name isn’t nearly as graceful)

and headed in row 1 “Number of establishments and bed-places by NUTS 2 regions”, that pun-evocative acronym abbreviating Nomenclature of territorial units for statistics (look here for more), a vocabulary for hierarchically tiering EU territories by are.

In fact the downloads offer themselves to us thusly:

 euro1

For starters, at least, I went for the XLS With Footnotes and short description, the workbook I’ve linked above.

Once you’ve auto-fit the columns (and as usual, the spreadsheet author hasn’t ) you’ll understand that rows 1 and 2 have to be pushed away from the data below with your true-blue blank row in between, as does 321, directly beneath the cell that among others names the Batman region (face it – you can’t help but admire my comedic restraint here). Next lower yourself into the data and think about the colons (e.g. row 139), that plant data-not-available markers all over the sheet. True, a replacement zero wouldn’t work here, representing as it would an authentic, quantified data point; but still, the labels are data weeds in a field otherwise marked out for values (and hold on to that thought).

But then there’s column C, and E, and G, and all those other empty corridors elbowing their way between the yearly numbers. Were these vacuities nothing but nothing, that is, airy space-makers between the data, we could simply delete them. But in fact the columns aren’t quite all empty, and have after all received a marching order. Remember my download asked for a workbook with footnotes somewhere in the workbook, even as I couldn’t quite fathom what that ask was going to deliver. It turns out that C, E, and G (I know; sounds like a major chord) and their compatriots have reserved their spaces for those very footnotes. Look for example at S130:S133 and their (u) entries; and if you look in turn back to what should now be row 322, its (u) explication, flagging a datum of low reliability, reads loud and clear.

And those largely but not wholly unattended footnote columns really mean well, because had the (u)s been made to cling to the values they footnote in their several cells, the value would  have regressed to a text standing as a result, and would have repelled their data-purporting charge. Thus the sheet designer has rightly put the footnote-labels in their place – a place at a decisive remove from the numbers to which they’re keyed, in a column all their own. And yet the colons, themselves footnoted in row 322 as well, punctuate the columns of the values, albeit in otherwise unoccupied cells.

But cavils aside, what is the analyst to do with those footnote-spattered columns? How, for example, can a pivot table look past those extraneous, heartily non-contributory soft spots in the data set, and succeed in doing something manageable with the whole potpourri?

I think it can, by doing something exceedingly simple. How about naming the footnote columns? Once named, the columns advertise themselves as bona fide fields, however sparse, which can be then be properly ignored by the pivot tabler who wants to study only the annual numbers in the year fields.

Simple indeed, but putting new fields into play won’t complete the assignment. You’ll still need to do something about the colons, and because they’re sprinkled among the data-toting year columns, that something won’t be our standard sort-them-to-the-bottom-and-jab-in-an-empty-row patch, a nice but futile try here that would continue to leave colons above the bottom of the data by which you sorted in many of the other columns, by which you hadn’t sorted. Better here, then, might be a find and replace, in which the colons could be overwritten by zeros. Those numeric replenishers would at least restore some numeric standing to the cells, and so qualifying the data to run through their quantitative paces, including some grouping activity. Again, however, the zeros won’t be entirely truth-telling, as the NUT in question will be understood to have no guesting places, when in fact we don’t really know if they do or not.

The larger question, then, is what we do with data of the NA kind, however labelled. Another alternative here, then: find the colons and replace with say, the number 1,000,000, sort the numbers ascendingly, and hammer in the blank row right there. That is, a prudent surrender. There isn’t much else you can do with records that don’t know what’s in their fields.

I also feel obliged here to return the discussion to the field-item disquisition I’ve conducted on these pages, and the analytically advisable demotion of the year fields to item status, all rounded up beneath one, grand field.

And the plea to fairness again obliges me to begin to close the circle and track back to Eurostat’s square-one download options. Had I downloaded the data Without footnotes, I would have spread this sheet before me instead (in excerpt):

euro2

And that looks more like it, doesn’t it? But we still have those colons, don’t we?

But please – go back to breakfast/lunch/dinner. And sorry about the coffee spill  – dry cleaning should get it out. Don’t forget to bill me.

Territorial Imperative: The Data Behind the Map

21 Nov

Who was it that said a picture is worth a 1000 cells? I don’t know either, but while you word-search your Bartlett’s I’ll slip out the back door and tip toe to the DataShine site.

That’s where the exceedingly cool and protean map plotted by the University College London’s Oliver O’Brien and his associates shakes its booty of riches at you, the one that fixes in space the data points for all sorts of demographia about the United Kingdom population (sourced by the country’s 2011 census) – ethnicity, religion, modes of transport to work, educational, second homes outside the UK, the whole polymorphous, multi-regressed, information-rife, full whack of sociological goodies.

Just click through the options in the floating Data Chooser

 shine1

And observe the color coded (we might call it conditionally formatted) distributions splash across any one of ten UK cities; let your mouse roll atop the maps’ microscopic OAs, or Output Areas (“…the lowest geographical area at which census estimates are provided”, according to the page linked in this sentence) and watch the Key’s scaled percentages hip hop up and down its rungs (and view the fine-printed OA codes as you roll). All told, it’s one glorious depiction. Or many.

But grab the map, squeeze, and watch it crumple into a million rows and columns.  Shake it hard, and a torrent of spreadsheets showers out of that wad, each one keyed to parameter’s worth of data, each one as prosaic as the chart is brilliant.

For example, if I tune into Data Chooser > Housing > Number of Bedrooms > 3 Bedrooms, and proceed to click Data in the dialog box at the base of the map:

shine2

 

(and it’s here too where you get to identify the city to be mapped, say London for example’s sake)

 

This workbook (in excerpt) unpacks before you:

shine3

Leaving aside the perennial column-auto-fit  issue, the sheet names the pertinent OAs, their superordinate Local Authority and region, and in this case, two coded references, QS411EW0005 and QS411EW0005 (base). These count the number of three-bedroom households and all homes (I’m assuming the base denotes homes here, and not residents) respectively, culminating in the OA Percentage  – in a manner of speaking.  The numbers in OA Percentage don’t attest a simple division of the coded data by their base, because they’re integers, an order of magnitude 100 times larger than authentic percentages; and that would at least matter were an analyst to format the values in Percentage terms.

And so on. Any click upon any parameter and its specified value (i.e., 3-bedroom), and a follow-on click Data does much the same. We see then that the Datashine map(s), if you will excuse this bit of reductive sophistry, “really” comprise the additive magic of a legion of homely spreadsheets. Put otherwise, the tables give rise to the tableaus (I’m sorry – I like that).

Of course I’ve stretched the point, and probably until the rubber band has snapped. Of course the maps are hardly “nothing but” their foundational workbooks, but the arrow of dependency between workbook and map is nothing if not striking. The spectacular, emergent cartography of the Datashine maps flowers atop the grittiest topsoil, and indispensably so.

But of course the comparative dimensionalties of map and sheet suggest different functionalities. You can’t simultaneously map, say, all of London’s ethnicities by AO, at least not without resorting to a round of implausible histrionics I can’t even imagine performing. To frame the challenge more universally, I don’t think you can map multiple items of the same parameter at the same time. You could, however, map the numbers of Sikhs who own three-bedroom houses and bicycle to work – i.e. one item each of several parameters – once you generate a sheet that lifts those items, and just those items, from the Ethnicity, home ownership, and mode-of-transportation data sets.

But a spreadsheet could crosstab all ethnicities by any and all AOs, apart from the technical trammels laid upon the data by the 1,000,000+ record limit. Indeed, were it operationally possible to columnize the data for every variable for every census respondent – and given London’s population it isn’t – you could break out any variable by any other one, by pivot tabling them and returning breakouts by every item in any field you choose. But that at-least-conceivable swoop across the data simply can’t be imagined by a map, unless I’m missing something.

Certainly a single mapped data point could capture the variable magnitudes of two fields in tandem. For example, the stirring MIT Senselab viz Trains of Data/Trains in Time viz of French train activity across a day’s worth of traffic batches both punctuality and passenger capacity numbers in each point, by color-coding timeliness and registering capacity by point size:

 shine4

But the Data Shine maps can’t get with that program, because their data points are coterminous with fixed territories – the Output Areas. Resizing the points isn’t an option, not if the map wants to hold its shape beneath each and every parameter hueing atop it.

So what’s the punchline, then? Simple (alright – maybe even too simple): Sometimes you need a great map, and sometimes you need a great spreadsheet. It’s just that you can’t have the former without the latter.

Paris’ Budget Participatif: Voici le Data, Part 2

16 Nov

6,200 discrepancies can get to a guy, particularly the 6,200 uncommonly literate zero-year-olds who seem to have crept past the examiners for the Budget Paticipatif vote and had their pudgy-fingered clicks duly recorded.

OK – that diapered constituency couldn’t have done what we’ve been told they did, but how the misidentifications could be understood makes for one very good question.

Remember first of all that the peculiarly plural character of the Budget vote, in which citizens were perfectly free to vote on up to five projects – militates against a straightforward demographic summary of the actual individuals who took part in the referendum. We’ve seen this problem before, but because the same individuals with the same ids could have, and did, call upon the data one to five times variably weights their presence. And that means, for example, that any intention to learn how many discrete, voting Parisians live in the 18th Arrondissement (district) suffers from an undue complication, because Pierre from the 18th may have voted for 5 projects, even as his neighbor Amelie may have elected to ratify but three. (For a graphically-driven look at individual votes look here.)

That’s one issue, along with those zeros. First, if you peel away the now-blank row 97750 and reunite the zero records with the pack and strike up a pivot table comprising exactly one field:

Row Labels: id

You will have done what Row Labels were designed to do – enumerate unique instances of each id/voter, of which I get 24001. Now roll age into Row Labels, too, sign off on the Tabular form, and stanch its subtotals (these decisions are motivated by Tabular’s collinear effect, which levell super and subordinate labels along the same row. You also don’t really need any Grand Totals here either). The new count: 24028 rows – and that means that 27 voters have something more than dual citizenship – they have dual ages, and I’ve lost the number of my local relativity theorist.

But seriously, folks, that means 27 voters incurred a second age assignment somewhere among their one to five votes. Considered in their numeric relation to all voters, the error-bearers amount to barely, and merely, one-tenth of a percent of the electorate, and statistically speaking that’s not bad. And what that also apparently means is that all the other zero-aged voters – about 1,400, according to my stress- testing of the numbers – received only zeros across all their votes.

And so while the error-trapping yield above is scant, we might as well do something to restore those 27 multi-aged participants to the unitary value they deserve. Start by lifting Age over the great divide and settling it into the Values area, and moving to summarize the numbers therein by Max. That decision puts every voter ID in touch with his/her “largest” age, i.e. it defines the 27 by their non-zero, actual age. Then select the entire pivot table (PivotTable Tools > Options > Actions > Select > Entire PivotTable) and name the whole thing Ages.

Then click back to the source data and in F2, or the free cell hard by the first record and enter

=VLOOKUP(A2, Ages,2,FALSE)

Then copy down F (note: if your computer’s like mine, all that processing take some time). That expression matches each id in the data source with its twin in the pivot table (now operating under that Ages range name); and hence the 27 ids with zeros somewhere among their several records should experience a replacement of  those ciphers with the maximum value associated with those ids – that is, their apparently real age. Then Copy > Paste Special > Values the F results onto the corresponding entries in E (the age field), and delete the VLOOKUPs, their work having been completed. Then sort the data by age, again Largest to Smallest, and the zero-aged cells now debut at row 97792 – a felicitous demotion, as it were. In other words, we’ve refit some of the ages.

Now in fact the row numbers mean we’ve added authentic ages to 42 records, then, even as we earlier counted 27 ids suffering from that zero imputation. The answer – on assumes – is that some of those 27 had more than one zero among their multiple votes. Now you can reinstitute that blank row to send the recalcitrant zero records packing – but the row goes above 97792.

But of course we’re still left with the 455 voters falling, or crawling, somewhere between the ages of one and ten. Are they any more genuine than the zeros? I don’t know, but if you throw a pivot table at the question:

Row Labels: age (grouped in bins of ten years, for example)

Column labels: projets

Values: project (Count)

You’ll see that these petite partisans allotted greatest support to the Rendre la rue aux enfants project (approved, by the way), an idea for cordoning off play streets for the younger set, and the Cultiver dans les Ecoles educational initiative for school-based gardens. A savvy – and self-interested group. The implication, then: perhaps those ages are for real. Just implying, you understand. And as for you, Francois Hollande: keep an eye on them. Don’t say you haven’t been warned.

Paris’ Budget Participatif Vote: Voici le Data, Part 1

6 Nov

Vote early and often, the wags advise, their hard-bitten bromide borne of the storied, dodgy business conducted in certain American electoral districts. But it seems as if some jurisdictions actually promote the one-person-multiple-vote equation, and for the common good besides; witness the recent Paris’ Budget Participatif (for some unaccountable reason the web site is in French), a referendum on budget allocations to a range of projects conducted both online and in hard-copy mode (they used to call those ballots), in the week endpointed by September 24 to October 1 (ok, that’s eight days, but who’s counting?). And among other niceties, each Parisian was duly enfranchised to vote for up to five projects, the numbers from the online segment recorded on the Paris Open Data site and brought to you here:

 Budget participatif

Democracy is a messy business, and by fitting corollary perhaps, the workbook’s 104,000 votes (out of a grand total of 174,090) aren’t completely tidy either (remember that the workbook reports the online votes only). Sort the age field Smallest to Largest and a spray of negative numbers douses the top of the column, and the existential impossibility they portend needs to set aright. But those minus signs are simply errors attaching to the voters’ actual age (as confirmed to me by Paris Open Data); and because they are I’d jump into the unattended cell in F2 and pencil in the ABS function:

=ABS(E2)

Excel tells us that ABS “Returns the absolute value of a number, a number without its sign”, which sounds good enough to me, restoring as it does the putative negative ages to their intended, usable positive state. Copy the ABS down F, follow with a Copy> Paste Value > Values into and down E, and then delete all those ABSes, if that isn’t too ungrateful.

But one repair doesn’t fit all. Sort the ages again, this time by largest to smallest, and take note of the 6,200 or so voters aged zero, all too happy to add their squeaky sonorities to the vox populi. These clear and present mistakes give us literally nothing to go on, and because they do I’d bail out, by scampering to row 97750 and preparing to inlay the trusty blank row that’ll hold those these pretenders at arm’s length from the ranks of the usable. But redirect your glance just a mite upward and view the 19 votes cast by one-year olds, and the 352 more registered by a precocious electorate ranging from two to nine. Remember that Budget Participatif imposed no age limit on its potential constituents, and so apparently nothing could have detained these jejune activists from the process, unless it was past their bedtime. Absent verification of these junior politicos from the powers that be – and I’ve sought it – I’d provisionally aim that blank row right instead above 97374, where the nine-year-olds give way to the tens (again we’re assuming the ages have been sorted highest to lowest). Is that drawling line arbitrary? A bit, but I’m separating those records, not deleting them; if events warrant I’ll be happy to bring them back into the fold.

Have completed those prefatory chores we can buckle up take the data out for a spin. Start with something simple, say a pivot table that breaks turnout figures by the referendum’s eight dates:

Row Labels: date_de_vote

Values: date_de_vote

date_de_vote (again, this time via Show Values As > % of Column Total)

I get:

vote1

 

(Note: Your machine’s regional time format may trend the row-label chronology differently. My US format pushed 1/10/2014 to the top of the labels for starters, presumably treating October 1 as January 10.By right-clicking the date and selecting Move > Move “1/10/2014″ to end, I succeeded in reassigning it to its rightful place.)

What’s noteworthy here is the voter shortfall for September 26 and 27, a Friday and Saturday respectively. Remember these are internet votes, an electoral medium whose greased access route to its virtual polling place should have flattened any and all daily variation in turnout. Not so, apparently; perhaps it was the Parisian fondness for le Weekend that succeeded in distracting the collective attention away from all things political for those couple of days (on the other hand, the participatory peak in the referendum’s final two days may attest a procrastinators’ late dash to their screens).

We can next take an age-graded look at the results by lining up average voter age by project preference, something like

Row Labels: projets

Values: age (by Average, formatted to two decimals, and sorted Largest to Smallest)

age (again, by Count)

 vote2

Don’t, first of all, be flummoxed by the Count of Age heading, which in fact delivers the vote count for each project, and merely grabs onto the age data as a device for aligning the votes with their project (In fact any field could have served the same end and posted identical totals; think about what is, after all, a standard pivot table tack). But I was authentically surprised by what appear to be the meaningful associations between the projects and the average age of their sectaries.

Les Evenements Sur Grand Ecrans (Google-translated as Events on the Big Screen), a (losing) proposal to boom important events to the multitudes on a giant screen in a public space, captivated the youngest average cohort, even as Musees parisiens 3.0, i.e. an digital hoard of the holdings of fourteen city museums and another idea whose time hadn’t come, piqued a demographic more than 6.5 years older. As both failed ideas seem to have belonged to the more-or-less hi-tech species, the dissonance in age between them becomes all the more curious.

But vive la differences, or something like that.

Air Traffic Data: Some Reservations, Part 2

31 Oct

svid-calculatorThe line between instructive reiteration and a point insufferably belabored may be molecularly slight, but I’ll take my chances with your judgment and go ahead. There are, again, two types of spreadsheets: ones to be perused and read with a passive regard, and ones to be analyzed, and pushed and pulled about for some new, information-gleaning end. The anno.aero data in our collective sights must, as they’re presently organized, answer to the first criterion. After all, the spate of inconsistencies flooding the sheets won’t matter if the reader simply needs to know what’s up with air traffic numbers and move on. Maybe.

I’m equivocating even here, because similar data suited up in different clothes can give pause, even to readers happy to give the data the barest minimum of attention. I’m thinking about the Var. field at the back end of the european-airlines-traffic-trends LF (load factors) sheets, which do nothing more than subtract the previous year’s airline-specific LFs from the ones in the given year’s sheet.

Check out, for example, the 2012 LF sheet’s column P, and its alphabetically apt pp format:

 res1

We learn first of all that Excel happily attends purely textual addenda with its numeric entries, as per the screen shot. Move into Format Cells > Custom, type a 0 in the Type: field, and tack on the text you want. (That 0 is a code representing a numeric entry).  Cells subject to the makeover thus receive any numbers along with their alpha paddings, and with no loss of quantitative value – a very good thing to know, and good for me, too. The question here, though, is what the pp – abbreviating “percentage points”, I think – is doing for or with the numbers. I’d allow that, because the actual numbers in the column simply report an absolute, subtracted percent change in load factor from 2012 to 2011, the designers wanted to stave off the impression that the figures stand for some divided result, e.g., 2012/2011-1, or some sort. That’s a sage clarification on their part, but apart from a coarse rounding off of the values (the number in P4 is closer to .42% than the reported 1%) the numbers are simply hard-coded; that is, they were simply entered, sans formula. Except for the result in P22, that is, which emerges from a

=O22-B22

calculation.

And what deepens my puzzlement, though, is the Var. field in the 2011 sheet:

res2

Those +/- 0 entries are labels, and in view of the fact that real, determinate values could have issued from the data via simple cell subtractions I don’t know why they’re there; and in any case we don’t see these labels in the 2012 sheet. Load Factors for 2013? They’re not there, even though with the contributory data in place, they could have been.

Now for something completely different but problematic: open the anno.aero svid calculator below:

svid-calculator

(Sorry this wasn’t made available sooner.) The sheet sires a kind of cousin metric to standard deviation, measuring the monthly traffic variability in a given airport’s traffic and culminating in the index in D19 (the readings in F19 and H19 work with demo data). Note first of all the hidden E column and its operative formula, e.g.

=((100*D5/(D$17/12))-100)^2

 

It seems to me that

=AVERAGE(D$5:D$16)

might properly replace the (D$17/12) term, by overwriting the hard-coded 12 value (formulas respond to data-entry contingencies; the number 12 can’t), and I think the expression’s outermost open parenthesis and a corresponding closed one following the 12 could be safely banished.

The calculator asks the user to paste traffic data from one of the sheets drawn from one of the links stowed beneath the calculator (the American traffic link doesn’t work, by the way; you’ll need to divert to the anno.aero database page directly for those data); but note that the data therein are planed (sorry!) horizontally, even as the calculator wants them oriented along the vertical. As such, you’ll need to hurtle through a Copy-Paste Special-Transpose hoop, one that could have been cleared from the process.

But it’s the formula in D20 – the one that returns the textual airline rating – that’s the grabber. It reads:

=IF(D19<$A$24,$D$23,IF(D19<$A$25,$D$24,IF(D19<$A$26,$D$25,IF(D19<$B$27,$D$26, $D$27))))

 

It’s nesting a series of IF statements, each assessing a value falling somewhere between the numbers in A23:A27. (Note, by the way, the space poking between the $D$26 and $D$27 references at the back end of the formula. I didn’t know you could do that.) Now if you’re wondering why the iterations above, which propose alternative after alternative in the event the number in D19 meets or fails to meet each logical test, couldn’t have been emulated by a garden-variety VLOOKUP formula I share your curiosity, but there’s a reason – sort of.

The reason is supplied by the isolated > sign, nothing but a label all by its lonesome in A27, and in a range otherwise populated by values (those hyphenated constructs, e.g. 0.0 – 1.9 – pull across three different cells; only the 0.0 finds itself in A23). And because that “greater than” is there, VLOOKUP can’t work with it. Thus the sheet designer plumped for the formulaic plan B above.

 res3

And if you doubt me, consider this excerpt from the formula:

IF(D19<$B$27,$D$26, $D$27))))

 

All the other IFs in there test values in A23:A26, but suddenly we’re testing a value in the B column – B27 – because that’s where the actual, quantifiable, 100 sits – and because that’s where it is, it can’t contribute to the lookup column in A. And that’s why that flurry of IFs jam the formula.

But I’ll continue to allow that a VLOOKUP will work, particularly once you scan back a couple hundred words, and rethink that pp numeric format we reviewed all the way up there. If you click in A27, return to Format Cells > Custom, type >0 (again, the zero is a coded placeholder for a numeric value) and ratify it with OK. Then simply type 100 in A27, and you should see >100, in all its adorned but value-sporting glory. Then enter

=VLOOKUP(D19,A23:D27,4)

 in D20. Now when and if you paste air traffic numbers into D5:D16, you’ll evoke the same textual results evoked by the IF-laden formula (you’ll also note D20’s text-resonant, colored conditional formats).

I like that solution, if I do say so myself. As a matter of fact, I may even email anna.aero about it. And I’ll even accept frequent flyer miles in lieu of my customary fee.

 

 

 

 

 

Follow

Get every new post delivered to your Inbox.

Join 173 other followers