A Tree Grows in Paris: Les Arbres d’Alignement, Part 2

6 Jul

Our nature walk through Paris’ 100,000 trees is about to get us somewhere, now that we seem to have completed the formulaic fix that should let us distill the arrondissement numbers from the locales in the Adresse field and free us to break out the tree totals by arrondissement:

=VALUE(MID(C2,FIND(" ",C2)+1,FIND("E",C2)-FIND(" ",C2)-1))

That expression is still good as far as it goes, but it has occurred to me since I last went to press that we might want to invite back the 10,000 bits of flora sown in the Bois de Boulogne and Bois de Vincennes, the ones we had separated from the greater pack of tree data because they had no arrondissements to report. Upon reflection, I’d allow we still could and probably should count them, and so by deleting the interposing row separating the forests from the trees we could amend the above formula thusly:

=IF(LEFT(C2,1)="B",LEFT(C2,FIND("-",C2)-2),VALUE(MID(C2,FIND(" ",C2)+1,FIND("E",C2)-FIND(" ",C2)-1)))

This extension to the existing formula poses a condition that asks: if an Adresse starts with the letter B (confirming a Bois adresse), then find the sequential position of its dash somewhere in the cell and subtract 2 from it, thus ordering the LEFT function to retrieve the text segment exhibiting that length. For example, the "-" appears in the 18th position of cells featuring the Bois de Boulogne; thus we know that a 16-character LEFT extraction will return that name.

Once this new edition is copied down the now-enlarged Arr field we improvised last week, our pivot table is duly poised for action:

Row Labels: Arr

Values: Arr (Count)

Arr (again, this time % of Column Total; dispense with Grand Totals)

I get:

[Screenshot: Paristrees21]
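If you’d like to corroborate the pivot table’s counts with a freestanding formula, COUNTIF obliges – a minimal sketch, assuming (as per last week’s improvisation) that the Arr results occupy column D; amend the references to your own sheet:

=COUNTIF(D:D,16)

=COUNTIF(D:D,"Bois de Boulogne")

The first expression tallies the trees assigned to the 16th; the second counts the Bois entries, which remain text and so require the quotation marks.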

Again, I don’t know what pretensions to completeness the census entertains, but we see the Left Bank’s 13th and the posh 16th have set aside the largest spaces for their trees. The smallish totals for the two Bois entries may be wholly complete, but they furnish cause for us to wonder just the same. The figures also need to be calibrated to arrondissement size, of course; the inconsiderable totals for the 1st and 2nd do in fact scale to their pinched areas, for example. And a correlation is available, thanks to Wikipedia’s square-kilometer enumerations of the arrondissements:

[Screenshot: paristrees22]

Throwing Excel’s CORREL function at all the numbers, I get .927, a massive association suggesting that Paris’ urban designers strove ardently to smooth their tree distributions across the districts as per their size. (Note, however, that the Wikipedia listings pair two area figures for the 12th and 16th, the larger incorporating the Bois de Vincennes and Bois de Boulogne respectively. The above table displays the smaller area totals.)
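For the record, the formula takes a shape something like this – a sketch assuming you’ve stacked the arrondissements’ tree totals and square-kilometer areas in adjacent columns B and C of a 20-row summary table (your ranges will surely differ):

=CORREL(B2:B21,C2:C21)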

And some causes for wonder abound in the Date Plantation field, too. We’ve already taken stock of the two-thirds of all putative date stamps in there that are in fact inactive text, nullities that I assume mean to tell the analyst that these are trees for which no planting dates are available, though they may also want to let us know that some of these in fact took root prior to 1900, the year before which Excel just can’t compute. The great bulk of these dormancies show the 1700-01-01 pseudo-date, but some others insist on curious variations, e.g., 1700-03-06. Either way, these are text. There are in addition exactly 100 trees submitting a 1/1/1900 planting date – a real date this time, but given that January 1, 1900 is the earliest date with which Excel can work, one needs to ask what these entries are trying to tell us. We are also told about a tree with a circumference of 165 centimeters ostensibly planted on May 27, 2015, about a month-and-a-half ago. Some potent fertilizer’s in that turf.

But if, in spite of all, you do want to learn something about the planting dates, you first need to do something about all the text entries before you dare to group the dates by year, for example. I’d take over the column to the right of Date Plantation, call it Date or whatever, and enter in what should be I2 (remember our Arr field is in there):

=IF(ISTEXT(H2),1,H2)

And what is this formula doing? It’s telling its cell: if the entry in H2 is indeed text, replace it with a numeric 1. Otherwise, if H2 is truly dated, return that date. And why 1? Because 1 is really January 1, 1900, the first date in Excel’s chronology, the one naturally receiving the very first number in the date/number sequence. Thus every 1700-01-01 will be treated as January 1, 1900, a date we’ll then decide to filter away from any grouping.
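You can audit that claim in any free cell – under US regional settings, at least – by asking TEXT to dress the number 1 in date clothing:

=TEXT(1,"m/d/yyyy")

which returns 1/1/1900.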

Once you’ve copied the formula down, a pivot table along these lines can take shape:

Row Labels: Date

Values: Date (Count, and group by Years alone; filter out the year 1900 and the residual-category <1/1/1900).

You’ll see a bounce in plantings starting with 1987; but given my caveat above I’m not sure what that means. The dates require a second – and perhaps even a third – look.

And what about tree heights and their averages, say crosstabbed (remember that term?) by arrondissement?  Here the data are all quantitative (remember that last week I split off all the  ,0 decimals that can’t be read by those insular US regional settings), but about half of them record a tree height of 0 meters – and that can’t be, particularly since many of these possess measurable circumference data. Moreover, one particularly strapping stripling in the Bois de Vincennes stands 881,818 meters tall, which puts the tree’s upper branches somewhere in the Van Allen belt. Ten others register heights in the four-figured meter range, turning the Eiffel Tower into a Lego knock-off by comparison. Verdict, then: these data don’t seem terribly usable.

And the circumference (Circonference) data aren’t much better. I’m thinking of the Platanus orientalis in the 18th arrondissement, and its 120,170 centimeter girth; at more than 12,000 meters around it could fence off the whole district, and tempt the 18th’s residents into declaring their independence from Paris. And about 16,000 trees state no circumference at all. (By the way, if you rest your mouse atop any Geo point in the source data on the Paris Open Data site, the tree associated with that point is promptly reckoned on a pop-up map.)

All of which may compel me to rewrite the brief I filed on behalf of trees in my previous post.  They’re not quite as cooperative as I might have thought. After all, even the most  vain of humans wouldn’t have the nerve to declare a waistline of 0 inches.

A Tree Grows in Paris: Les Arbres d’Alignement, Part 1

28 Jun

Faced with the choice between counting trees or humans I’d professionally recommend that you opt for the former. Humans, after all, are mobile and moody; they’ve been known to spurn or even flee from the ones doing the counting, or slam the door in counters’ faces, or lie in answer to their honorably-intended questions.

Trees, on the other hand, have nothing to hide, and couldn’t hide even if they did. A wondrously compliant and agreeable life form, those trees; and the Paris factotums charged with counting them have put together a census of their findings in the Open Data Paris site here. (Tip: click the Export button in order to start up the download, and follow on with a click on the Télécharger en Excel link.)

Now if you think you’ve heard the tree theme before in this space you’d be right; we’ve already foraged through the count of New York’s trees in our December 14 post of last year; and after having refreshed my own recollection I see that the Apple’s dataset, at least as of that writing, comprised 623,000 or so entries, give or take a sapling. We see that the Paris itemization, on the other hand, returns 104,000 trees, a perfectly plausible proportion in view of the fact that the city’s square-mile area measures about one-seventh of New York’s. At the same time, however, I can’t say if the data are comprehensive, or presume to be.

In any case, a first look at the data exposes an Open Data Paris commonplace: a spreadsheet-wide application of the Wrap Text option and its nettling consequent, an obscuring of much of the text in a whole lot of the sheet’s cells. I’d thus hasten to click through a global Wrap Text turn-off, and an ensuing standard but necessary column auto-fit.

And about the numbers themselves: I don’t know about yours, but my download aggrandizes the Circonference (tree circumference) and Hauteur (height) values with a comma – the French representation of a decimal point – and a meaningless trailing zero, neither of which appears in the data as framed on the Open Data site:

[Screenshot: paris trees1]

I suspect these elements are implicit in the web data and as such are being suppressed by some preemptive formatting scheme, but explanations aside I for one don’t need them, in part because my obstinate American regional settings can’t compute the comma, and so cast all the numbers into fallback text mode.

The way out, aside from handing my regionals over into French hands: do a find and replace on columns B and D, in which a find for ,0 is to be replaced by nothing. That works – kind of – because while my version indeed restores the entries to their honorable numeric standing the find and replace also somehow performs a hyper-substitution, in which the comma has been made to give way to a decimal point and three zeros for which I haven’t asked.  Ok – the zeros will respond to three decrease-decimal clicks, but I don’t quite know why they showed up to begin with.

And the Date Plantation (planting) numbers in G are no less curious. First, of course, there are all those 1 janvier 1700 entries that the download rewrites into 1700-01-01 text form, text in part because Excel can’t deal with a date preceding January 1 1900. But either way I suspect 1 janvier 1700 stands as a placeholder for a tree sans birth certificate. The other dates, however, have remarkably rewritten themselves, guarding their true quantitative character – or at least they’ve done so on my system, e.g., 19 avril 1998 to 4/19/1998, even negotiating my American regional month-first syntax with panache. But those 1700s – of which more than 68,000 are sprouting among the records – are a formidable cluster, and will stand in the way of your attempts to group the data by date.

But the tweaks keep coming. It seems to me that, among other objectives, a simple breakout of tree totals by arrondissement (or district, of which there are 20 in Paris) would be of service, but the Adresse data in which the arrondissements are sealed aren’t yet up to the task (as per the above screen shot).

The pivot tabler would want record 1, for example, to release its 15 alone to the analysis, with the item shedding its redundant Paris and Arrdt identifiers and address. To get there, I’d first conduct a Z to A sort on the Adresse field, because around 10,000 of the tree records bear the provenance of either the Bois de Boulogne or Bois de Vincennes, large parks flanking Paris’ western and eastern edges, respectively. These carry no arrondissement information; and so the sort jams these to the bottom of the data set, from which they can (at least temporarily) be parted from the other, usable records via the faithful blank row.

Next I’d open a new column to the right of Adresse, call it Arrond or some such, and immediately format this upstart field in General terms – because Arrond has inherited the text format of the neighboring Adresse field, and left to those default devices will deactivate your formulas into impressive but useless inscriptions.

And here’s my formula, entered in what is now D2:

=VALUE(MID(C2,FIND(" ",C2)+1,FIND("E",C2)-FIND(" ",C2)-1))

Remember that we’re looking to extract the arrondissement numbers peppering the Adresses in C, a search issuing a call for the MID function’s search-and-pluck routine. Here MID identifies its operational starting point by finding the position of every entry’s (first) space – invariably 6 here, by the way – and adding 1, because the arrondissement number itself starts at the next position, 7. It then figures the length of the MID character extraction by subtracting the position number of the " " from that of the E that invariably shadows the right end of the arrondissement number, and subtracting 1 in turn – because the E sits outside the arrondissement number and would thus overstate the extraction length by 1. (The E abbreviates a French ordinal suffix, by the way, the equivalent of "st" or "th". Thus onzième means 11th, deuxième 2nd, and so on. And the ER that attaches to the 1st arrondissement stands for premier.) If you’re happy with all that, copy the formula down D.
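And if you want to watch the gears turn, consider a hypothetical Adresse reading PARIS 15E ARRDT (yours may be framed a bit differently). The pieces evaluate like so:

=FIND(" ","PARIS 15E ARRDT") returns 6

=FIND("E","PARIS 15E ARRDT") returns 9

=MID("PARIS 15E ARRDT",7,9-6-1) returns 15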

And why the VALUE? Because absent that revisionist touch, the results would assume text form, by itself not dire; we’re not interested in adding arrondissements, after all. But reel text-formatted arrondissements into the Row Labels area of a pivot table and they’ll sort like this:

[Screenshot: paris trees2]

That’s how text-formatted numerics sort: Excel regards each character position distinctly, comparing them left to right.

Now we have to figure out what we can learn from all this.

UK Election Poll Data: A Vote for Change(s)

21 Jun

It was Emerson who had little time for a foolish consistency, but Emerson never designed a spreadsheet; and had that 21st Century chore been thrust upon him he might well have backed away from the adage.

Because consistency – in all its unthrilling, small-minded, philistine tenacity – is part of the deal. Nothing foolish here; datasets, and databases (I’m not equating the two, but humor me), won’t work until each of their fields treats its records with the evenest of hands. After all, we’ve more than once observed and experienced the oily slicks a subtotal row can spill upon a pivot table, and the havoc wrought by a column of dates, half of whose entrants are text frauds, flashing their bogus IDs at the bouncer. There’s such a thing as a foolish inconsistency, too.

Bad metaphor, but get your data on the same page, then, if you want them to work properly. And the UK election poll data lined up by the Guardian and the market researcher ICM  that’s tugging at your sleeve in this spreadsheet

[Workbook: Copy of All Guardian_ICM poll results]

may have a lesson or two to teach about the consistency thing.

The sheet, founded in Google Docs mode (and you can get that iteration here), has an interesting story to tell, hammering in place an extended track of UK national-election poll data proceeding from June 1984, with the actual election numbers booking passage en route. First things first: I would rule the Fieldwork dates field disposable by itself (you’ll see what that means shortly); its holdings merely submit text-formatted corroborations of the more-or-less (we’ll see) genuine dates holding down the other side of the data set in A – and so I’d drop a blank row between G and H, thus distancing – but not obliterating – the field.

But you’ll also note the sudden capitulation of the data in A to the text format picking up in row 390, and worsened in 400 by those campaign poll entries, which harbor no date pretensions whatsoever (something about consistency). You can’t use these as they stand, of course, and I don’t know how they found their way into the field at all. In any case, rather than stepping through a gauntlet of formulaic repairs, I’d simply latch on to A389 – which owns up to a date-worthy format – click the Format Painter button on it, copy its format through A390:A496, and enter the first date of the corresponding Fieldwork date hyphenated spans (e.g., entering 13 June 2014 in A390), thus properly dating all the items in A. (Note: if you’re trying this in the US, be vigilant about the regional date format disparities.)

Next, the informational rows 407-409 must be furloughed from the principal data; again, a simple deletion or an intervening blank row will make that happen. On the other hand, the field colorings of the party-specific poll percentages hueing columns B through E are, for the investigator, neither here nor there; they bear no conditionally-formatted utility, and rather do what they purport to do – simply differentiate the parties by color.

Slightly more eyebrow-furrowing, though, is the sudden reversion in row 320 to the two decimal places now puffing all the percentages, in addition to the actual election-day rows in which the decimals have already been vested – except for the election report in row 405, which shows up with but one decimal:

[Screenshot: poll1]

I should add that the native Google Docs sheet doesn’t give way to the two decimals in mid-sheet. It seems the document has undergone some loss in the translation to Excel that I can’t yet explain, though the data as such are of course none the worse for the experience. 37.00% is one and the same with 37%, but the reader is left to wonder about the intermittency of those decimals as a result, particularly as their zeroes bring no material clarification of the numbers. A little more consistency would seem in order here.

And the same could be said about the most variable row heights oscillating across the records, a consequence of the Wrap Text enforcement meant to rein in the extended GENERAL ELECTION RESULT postings in the Sample field in column G. The simplest palliative: sizing an unvarying, maximized row height that by definition makes room for every entry.

Of course these contrarieties only beset the formatting properties of the workbook, and as such don’t adulterate the data whose values ultimately matter most. But a book’s patchwork presentational mien gives pause about the fitness of the data as well, a perception to which no spreadsheet designer wants to give succor.

Still another presentational curiosity lands itself in the CON LEAD OVER LABOUR field, wherein Labour poll margins over the Conservatives appear in reddened characters and Tory advantages effect a blue tint. These could and should have been conditionally formatted, and I see no evidence that they were so treated, neither here nor in the first-instance Google Doc (I’m prepared to be corrected here, however). Moreover, I cannot account for the surprise insurgency of a formula in F322 in lieu of the hard codes evidenced in the field till then; and after a two-row reset the formulas resume in earnest in F325 and don’t let up for the remainder of the sheet. Again, that little shock doesn’t contaminate the numbers – but why start writing formulas in F322?

And finally, perhaps, we need to take note and stock of the seven actual General Election result cells mingling with the other records. By definition of course these aren’t poll-sampled results, and as such perhaps don’t belong here at all, though they do not inflict the same double-count contortions as a subtotal. If you decide they need to be disinvited, you could conduct a seat-of-the-pants find and replace, finding the word General in the Sample column and replacing it with ZZ, sorting these new entries to the bottom of the data set, and severing them from the pack with a blank row (you need that ZZ replacement to begin with because many of the other cells in the field report an n/a, which would sort beneath the original General Election Result legends).

Now you could, for example, do some poll result grouping by date, provided you have in fact reformatted those text scammers in the End of fieldwork/election data field in A.

But we need to ask: had you or I designed this sheet, wouldn’t you have proceeded with a bit more consistency?

Iowa’s Data on Bullying: Some Spreadsheet Mediation

12 Jun

School bullying is a serious business, of course, and data on this devastating social practice need be treated with all due sobriety. Some cheering, current information on the matter finds incidence down in the US, at least; and for a state-specific look, the education department of the American state of Iowa makes its data on bullying available to you and me on its www.educateiowa.gov site. The latest of its spreadsheets counts instances for 2011-12 and awaits here:

[Workbook: 2011-2012 Bullying Data]

You’ll note the tabs detailing bullying allegations directed at both staff and students, and still a third sheet recalling a single, racially-motivated incident aimed at a Volunteer, whom I take to be an individual who had devoted some free time to a school in the Davenport Community School District. Clicking into the STUDENT sheet you’ll discover the incident data pulled apart into Public and Non-Public school tranches, a decision I would have overruled (you’ll be pleased that my name is nowhere to be found among the cards in Iowa’s official rolodex), consolidating instead all the records into a unitary dataset boosted by a School Status field, with institutions receiving the Public or Non-Public tag. That revamp of course would facilitate useful pivot-tabled breakouts along those lines, and in fact the deed can be done easily enough (though for purposes of exposition below I’m simply respecting Public and Non-Public’s separate spaces as they stand): dub H5 School Status, enter Public in H6 and copy down to row 1063 (just double-click the autofill button). Then delete rows 1063 to 1065 (that first-mentioned row offers grand totals, whose retention would subject any future pivot table to a double-count), enter Non-Public in what is now 1063, and again copy down, remembering to delete the now-1192 row and its Non-Public grand totals.

Note as well the cell-merged Public header leaning atop the data in row 4. That needs to be sent away, either through deletion or by the interposition of the ever-reliable blank row.

And there’s something else: because we’ve taken over the H column we suddenly need to do something about the Total field in I. That field’s school-by-school, row-driven totals had been nicely distanced from their contributory data by the heretofore blank H, heading off yet another double-count prospect. The question is what to do about it.

I’m posing the question because the by-the-book counsel, the one in fact endorsed not three paragraphs ago, is to dismiss total rows and their odious double-count potential. The problem here, though, is that the types of bullying the data consider, e.g. Physical Attributes, etc., are each accorded a field all their own – and if you want for example to total all bullying instances for a school you can’t pivot table them into the very standard Summarize Values by Sum recourse. You can’t – again – because Summarize Values works upon the numbers in one field, and not the values across fields. We’ve sounded this concern many times (e.g. the August 29, 2013 post), in part because it does require repeating soundings.

Thus the data set’s first record:

[Screenshot: bully7]

Would ideally have been properly dispersed thusly:

[Screenshot: bully1]

Again, by downgrading each bullying type from its independent field status to an item in a larger, subsuming field, standard totalling, and other, no-less-standard operations (e.g. % of Row), resolve into practicability. Thus in the case before us we might as well retain the existing totals, because we’d need to manufacture them anyway, through a messy calculated field or… a Total column in the data set that’s already there.
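For the record, manufacturing such a total yourself would amount to no more than a SUM stretched across the type columns – supposing, for illustration only, that those fields run from D through G (check your own layout before committing):

=SUM(D6:G6)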

If, then, for illustration’s sake we do retain the services of Total, we can go on to at least break out all bullying (Public school) occurrences by district:

Row Labels: District Name

Values: Total (Sum, and sorted by Largest to Smallest).

District Name (Count)

In excerpt I get

[Screenshot: bully2]

It was the Davenport (public) School District that suffered by far the largest number of reported – repeat, reported – bullying episodes, but absent any control for district size the analysis stalls here, though the numbers are instructive nevertheless. Doing the same to the Non-Public districts (and remember that if you’ve left the Public and Non-Public data as you found them you’ll of course require a header row for each) I get (again, the screen shot is fractional):

[Screenshot: bully3]

Far fewer reports visit the table, though I suspect the districts above are notably smaller. If they in fact aren’t, we’d then need to think about willingness to report and school decorum across institutions and districts. In short, the data don’t yet read definitively, but they supply the goad for additional researches.

Then there is the matter of what Iowa calls the report Consequence. The public school data set counts nine of these:

[Screenshot: bully4]

The consequences critically dichotomize the reports into Founded and Unfounded allegations, with a pair of residual Consequences items filling in some of the blanks. It thus seemed to me that a global breakout along Founded/Unfounded lines would do much to inform an understanding of the reports. I’d thus take over the next free column (probably I or J, depending on whether you’ve adopted the School Status field), call it Report Status, and enter in the next row down:

=IF(LEFT(C6,12)="Consequences","Consequences",LEFT(C6,FIND("-",C6)-2))

What is this expression saying? It’s propounding an IF statement that asks if the first 12 characters of any entry in C spell that very word – Consequences. If so, the formula returns it. If not, the statement looks for the character position of the "-", subtracts 2 (where the pre-hyphen text in such cells terminates), and lifts those leftmost characters into the cell. In this latter case, either Founded or Unfounded should turn up.
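Run that logic past a hypothetical entry reading Founded - Suspension (the post-hyphen text is my invention): FIND locates the dash at position 9, the formula subtracts 2, and LEFT duly retrieves the seven characters spelling Founded:

=LEFT("Founded - Suspension",FIND("-","Founded - Suspension")-2)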

Once we’re happy with its workings, copy the formula down its column. You can then pivot table, on the public school data:

Row Labels: Report Status

Values: Report Status (Count, Show Values As > % of Column Total).

(Again, you don’t want Grand Totals; of necessity they’ll return 100%.)

I get

[Screenshot: bully5]

That’s informative. Executing the same routine for the non-public districts:

[Screenshot: bully6]

Far fewer cases here, to be sure, but when reported, the charges seem to exhibit slightly more traction. Note also the near-identical Consequences percentages.

As earlier conceded, our looks at the bullying reports may supply some of the necessary, but insufficient, conditions for any conclusive take on the problem.  But necessary.

Big Bucks in Boston: The Checkbook Explorer

5 Jun

Money-management issues, you say? Can’t stem that outflow, fettered debtor? I hear you, but you sure didn’t put your good name to 140,179 checks last year, the number of payments flowing from the pock-marked bank account of the city of Boston in 2014 alone. That’s 384 autographs a day, even more than David Beckham signs on his milk runs to his bodega; and you can read all about it here, and test your auditing smarts against the 556,104 checks disbursed by the city from July 1 2011 to April 28 of this year, all saved to the city’s Checkbook Explorer sheet.

Sorry, folks; you’ll need to actually go there in order to download the 556,000: click the Checkbook Explorer link, follow with a subsequent click of the Export button, and get the download going with a click on the CSV or CSV for Excel download option.

As with most US Open Data governmental sites the data are pretty good, albeit with a few curiosities. Rows 426220 through 426226 (at least in my download) restate redundant header data, as do a few other rows farther down, even as columns B through F and H and I remain unpopulated in those records. The quick fix is to sort the data by one of the blank cell columns (via A to Z), thus forcing these discrepancies to the bottom of the data set, after which the good old blank-row insert will send them into oblivion.

Note in addition the Year and Fiscal Year fields. The latter begins every July 1, and so July 17, 2013 belongs to fiscal 2014, for example. The conventional Year field could, in the interests of slimming down the file for the analyst, be deleted, because years can be derived as needed from the dates staking the Entered field, via the pivot table’s Grouping feature; and the same could be said about, and done with, the months in Months (Number) and Month, the latter supplying their names. Savings: about 1.5 million cells. And while you’re at it you could probably disengage Voucher and Voucher Line as well as Dept, those latter numeric IDs receiving their user-friendlier names in Department Name. And that greater total economy of 3 million cells would doubtless please the Boston accountants.
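Should you want those year and month derivations in formula rather than pivot-grouped form, they’re one-liners – assuming, and you’ll want to verify this against your own download, that the Entered dates hold down column G:

=YEAR(G2)

=MONTH(G2)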

Once those sweeps have been conducted the data read and work pretty straightforwardly. It occurred to me, then, that a monthly aggregate breakout of expenditures might prove of interest, given June’s end-of-fiscal-year status. Might that month bear more than its share of outlays, with unspent budgetary remainders burning a hole in departmental coffers? Not a bad investigatory question, and answerable by this pivot table:

Row Labels: Entered

Values: Monetary Amount (grouped by Month and formatted in currency mode):

[Screenshot: check1]

In fact the results stand my question on its head. We see a remarkable downturn in July spending, remarkable because the months flanking it – in different fiscal years, at that – register far higher numbers. It seems that the bean counters in Beantown are happy to spend their due allotments with time to spare, and their creditors aren’t complaining either.

But the above table also imparts a real-world pertinence to a point to which I call students’ attention in my Excel classes. Note the data span at least parts of four years, and because they do, a grouping by month alone quite overlooks the plurality of Januarys and the eleven other 30-or-so-day units the year comprises. Our table sums spending activity for all the Januarys, then, a level of abstraction normally too high for most data-analytical intents. But because we want to understand the role of months qua months in the spending calendar, the table makes sense. Indeed, if you retool the above numbers via % of Column Totals:

[Screenshot: check2]

You’ll see just how understated the June expenses are, though at the same time I’m not sure what to make of April’s lion’s share.

And if you next substitute Department Name for Entered, turn off the % of Columns Total calculation, run a Top 10 filter for Departments (start by right-clicking somewhere in Row Labels, not the Values), and sort the numbers Largest to Smallest you’ll see

[Screenshot: check3]

Boston’s schools rise to the top of the expenditures class; note the $3.6 billion grand total. If you turn off the Top 10 and thus expose all the departments figuring in the dataset, you’ll count 76 of these, along with their $4.1 billion grand total. And that means the top 10 departments account for about 87% of all spending. The most frugal department? The Office of Women’s Advancement, whose one and only $92.00 payment earmarked itself for printing expenses at a local Sir Speedy. The office is connected to the Boston Women’s Commission, whose none-too-profligate $6,145.34 places it in the austere 74th position. Conclusion, then: we need more women in the Boston Public School Department.
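And should you care to corroborate any single department’s aggregate without the pivot table, SUMIF will oblige – a sketch that assumes Department Name in column E, Monetary Amount in column H, and the schools’ departmental name as I’ve guessed at it; your post-deletion column letters will likely differ:

=SUMIF(E:E,"Boston Public Schools",H:H)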

And as for the receiving end, switch Vendor for Department and run another Top 10/sort:

[Screenshot: check4]

The health insurance provider Harvard Pilgrim Health Care accounts by itself for more than 21% of all of Boston’s outgoings. Indeed – the top 10 takes in 41%, and that’s out of 12,643 vendors.

And among other things that means Boston has a whole lot of potholes; Mario Susi & Son, you see, sell asphalt.

UK Election Data: Results May Vary

29 May

Now that the UK parliamentary election has come and gone it’s time – not for a recount, because even Labour isn’t going to the barricades on this one, but rather for a recount-ing of what happened, and where. And if it’s detail about the votes you crave you’ve come to the right place, namely this workbook put together by the electionresources site (and you’ll find election data for quite a few countries here):

http://electionresources.org/uk/

(Click the May 7, 2015 link).

The outcomes of all 650 House of Commons contests and their 3971 contestants, some bearing party affiliations of which you’re likely to hear for the first time, come right at you, auguring some interesting analytical prospects, and some questions too.

Beginning with the prosaic but necessary housework, you’ll need to spruce up the field headings in columns B and G that just aren’t there. I think it’s clear that the Constituency header in A was really dedicated to B, and so I’ll move it there, and name the vacated A field Number. Column G clearly computes the percent of respective constituency turnouts, thus earning the name Pct. Turnout or something like it.

And it’s that Electorate/Turnout zone that throws an organizational – that is, spreadsheet-organizational – question at us. You’ll note the redundant Electorate, Turnout, and Pct. Turnout numbers/percentages counted off for each candidate in his/her constituency, and the clear failure of these spare parts to teach us anything new beyond their first mention:

[Screenshot: ukvote1]

Nevertheless, leaner alternatives do not come easily.

What does that mean? Start with the Turnout data, recording the absolute numbers of constituency-eligible voters making their way to the polls. Apropos the first-listed Aberavon district in Wales, its row 1 turnout figure of 31,523 is obviously in no way enhanced or value-added by the eight additional recitations of the number attaching to each Aberavon candidate. But how else, then, could turnout be represented?

The answer to that quasi-rhetorical question depends – again – on who’s doing the asking. For seekers of novellae among the data it seems to me that this pivot table:

Row Labels: Constituency

Values:  Votes (Sum)

Does the work of Turnout, and because it does, it frees us to retire Turnout from the dataset altogether. If on the other hand, you’re merely inspecting the numbers, Turnout may have an edifying role to play in the presentational scheme – though even here I’d submit the percent of turnout is what most readers really want to know; and if that be the case, Turnout could still be responsibly evicted from the sheet.

But that fix remains partial, and doesn’t look the problem squarely in the eyes, i.e., how to integrate what are in effect aggregated data – overall turnouts – with individuated records made to dangle their toes on the same row. The problem is apples/oranges, or, if you’re a UK voter, chalk-and-cheese: i.e. specific, per-record candidate results having to live with what is in effect subtotal data in the Electorate and Turnout fields. It’s one thing to derive constituency turnout from the records, as the pivot table above succeeds in doing easily enough. But it’s quite another to calculate the numbers of eligible voters on the basis of the actual vote totals we have. I’d venture to say it’s impossible, and because those eligibility numbers have to be supplied exogenously electionresources.org obviously chose to simply post them in each and every record, bad form notwithstanding. I’m not making a citizen’s arrest, mind you; I’m simply alerting the authorities to the problem.

In any case, interesting conclusions from the data are nevertheless there to be drawn, once you begin to look at them. For example – we see that the Place field furnishes each candidate’s ordinal finishing position in his/her race, and so we could pivot table a party’s average race position, as it were:

Row Labels: Party

Values: Place (average, formatted to two decimals, sorted Largest to Smallest)

I get in excerpt:

[Screenshot: ukvote2]

Lots to be learned here, not the least of which is something of the 129 parties wanting to stand up and be counted, including the World Peace Through Song and the Birthday parties, two entrants that might do well to consider coalition. And Plaid Cymru means the Party of Wales, intending no fashion commentary. Note as well the top-ranked Speaker party comprising precisely one candidate, House Speaker John Bercow, a Conservative who assumed this singular affiliation in consequence of his role (and I know no more about this curiosity than what I’ve just told you).

But Speaker Bercow’s one-case universe won’t satisfy your inner pollster, and so by dragging Place into Values again – this time summarized by Count, in order to tally candidate numbers – you’ll see (again the screen shot is fractional):

[Screenshot: ukvote3]

Here the relationships get a bit more definitive. The Scottish National Party’s (SNP) near-perfect 1.05 builds itself on the back of its Scotland-specific candidate cohort of 59, 56 of whom actually won. And we see the fuller contingents of the Conservatives and Labour exhibit pretty comparable place averages, even as their respective MP numbers stand at 330 and 232.
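A quick AVERAGEIF can likewise spot-check any single party’s mean finishing position without the pivot table – sketched here on the assumption that Party and Place occupy columns H and I respectively (adjust to your sheet):

=AVERAGEIF(H:H,"Plaid Cymru",I:I)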

Those misshapen place-average/MP fractions get all the more gnarled when you slip Votes into the Values area as well and show them as % of Column Total (you’ll need to format these to five decimal places in order to bring the percentages for small parties to light). The perennially controversial UKIP made much of its one-member victory, an isolated triumph emerging in spite of the party’s 12.6% of the nationwide vote. Its calls for proportional representation likely won’t be heard – though even Labour, with 83% of the overall Conservative vote but only 70% of the Tories’ MP total, might want to think twice about the idea, too.

But the on-top Conservatives, with 50.7% of all MPs and but 36.8% of the vote, won’t even think about it once. And the SNP, with 9% of the MPs and 4.7% of the UK electorate -and only 49.9% of the actual vote in Scotland itself, by the way – won’t either.

Making Up For Lost Time: London Tube Lateness Data

22 May

If you haven’t heard it before – and if you’re espying this post in Sydney or Sofia you probably haven’t – you’re hearing it here:  ride the London underground 15 minutes longer than the ride is supposed to take and you get your money back.

In fact, I’d allow that a sizeable crew of Londoners haven’t heard that before either, but it’s true, and if you doubt me look here. Indeed – your intrepid correspondent has collected twice, a pair of lucrative postscripts to some down time I experienced down under.

It’s all true; but how do the powers that be know you’re on the up-and-up when you press your cash-back claim? They know because they’re watching your Oyster card, your ticket through the transit system that means to record your every point of departure and disembarkation, and whether you really were stuck on that Northern Line idling at Tooting Bec on the 17th at 2:12 pm (yes, editor, there is such a stop).

And that means in turn that all the trips of all the descenders into the underground have to be summed and reckoned, with much of that tallying brought to us here:

[Workbook: Copy of tfl-tube-performance]

The dates in the Key Trends sheet kick off in cell A2 from the second reporting period of the 2011-12 financial year, and don’t be flummoxed by the 02 at the head of the cell. That isn’t February, but rather the second of the fiscal year’s 13 28-day reporting intervals, a curious baker’s dozen that I suspect has something to do with 364’s divisibility by 13, and the cooperative absence of any underground service at all on December 25. But the apportionments aren’t quite that neat. In fact the day total for 2012-13 comes to 365 (note the 29 in cell C26, probably a leap-year emendation), while C27 registers a 27-day interval, and C39 and C52 have inherited 30-day stretches.

In any case, Key Trends gives us more to think about than period day distributions – for one thing, that arpeggio of cascading comments, in striking but largely illegible array:

[Screenshot: late1]

Apart from bidding us to think about how extended textual reflections might fitly reconcile themselves to the data to which they’re annexed, clarity calls for a most simple remedial step – just click the Show All Comments button in the Comments button group (Review tab), and make the comments disappear. To recall any comment in particular, click on its cell and try Show/Hide Comment, in the singular. Why the sheet insisted on its in-your-face comment revelation is a matter to put before Transport for London.

But once you do get to the data you’ll need to run through a number of think-throughs in order to decide what exactly it is you want to learn from them. Consider for starters the Number of Lost Customer Hours entries. These could be divided by the corresponding Excess Journey Times, thus returning the apparent number of passengers experiencing a tardy ride (of course in view of the 15-minute rule most of these won’t be compensated in any event).  You’ll also need to multiply the Lost Hours by 60 so that you’ll be dividing minutes by minutes.  If we command the J column, then, and title it Number Late, we could enter in J2:

=F2*60/I2

(Note: no parentheses required here; Excel’s order of operations will attend to the math.) That expression gives me 23,609,932.35 or so latecomers, a rather large number to be sure, but one of course that needs be juxtaposed to the total rider figure for any given 28-day period. And to proceed here we need to hazard an educated guess. Transport for London puts the annual tube traffic nowadays at around 1.265 billion riders a year. Giving leap years their due, divide that number by 364.25 (remember December 25 takes the day off) and a day’s worth of tube travel yields an average of 3,472,889.5 passengers, which in turn needs to be multiplied by the Days in Period numbers in C. Simplest tack: enter that average in a free cell and name it daily, then slide into the K column, call the prospective field Pct. Late, and post, in K2:

=J2/(daily*C2)

(This time you do need the parentheses. Note in addition that the 3.4 million daily average is a current one, and as such may artificially, if slightly, depress lateness percentages for earlier years. The determined researcher could doubtless recover the actual year-specific totals.)

Copy down K and observe the notable flux among the percentages, particularly the spikes in rows 38 and 41. Those quantum leaps owe their king-sizes to job actions, and here you would profit from consulting the attached explanatory comments.

How you’d choose to evaluate these lateness proportions, which tend to oscillate across the 20-30% band (though the variation is real – observe the 14.88% for the period ending October 13 2012 in row 20), is perhaps a matter for the glass half empty/full disputants. A rough-hewn null hypothesis might predict that as many rides would pull into their rider-intended stops late as early, by some margin. A minute-late ride is late, after all, and so a 25% aggregate lateness measure might not fairly serve as a full-on indictment of the underground service. Still, late is late, once the baseline, expected lengths of journeys are decided.

But another simple but trenchant measure of punctuality offers itself from the data. We could divide the number of customer hours lost by the Operated Kilometres (as Transport for London would spell it) counted in the G column; that is, the accumulated distance travelled by all trains in the period. It stands to reason that, all things being equal, greater distances traveled associate with more lost hours, and because we have all the data we need here – unlike the Pct. Late metric that works with the same daily passenger average throughout – this little fraction may be all the more incisive. We could title the L column something like Ratio of Hrs Lost to KM, simply divide the value in F by the one in G, and copy down.
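In cell terms the new field couldn’t be plainer; in L2, given the lost hours in F and the kilometres in G:

=F2/G2

Copy that down L and you’re done.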

You’ll also note the matrix of lost customer hours in its eponymous tab, an enhanced compilation of data pushing further back in time to the 2003-4 financial year and treated to an off-the-shelf conditional format. Just one question here: the matrix provides lost customer hours for the first reporting period of 2011-12 – even as the listings for that year in the Key Trends sheet only start at period 2.

Is that what they mean by lost customer hours?
