
England-Wales Birth Names, 2013, Part 1: A Length-wise look

21 Aug

There are two kinds of spreadsheets, incarnating two states of readiness – the first, all dressed up with nowhere to go but right before your got-to-know gaze – and the second, an unprocessed, unpretty, whole-fibered object whose monotonic, pedal-pointed rows ask you to hold on tight to your mouse and do something with them.

State number 2 conceals a dare: if you want to really understand what’s going on with the data, you need to bring something to the show.  Stare at a sheet rattling 200,000 rows in your face and try to make some sense out of it without busting a move at the data; there is a difference, after all, between a reader and a user.

Whew – I feel better for having said that, and while I’m stoked let’s click through the brace of workbooks migrated to the public domain by the UK Office for National Statistics, counting the gender-parsed top 100 birth names in 2013 for England and Wales.

The lists made some news in England last week, but even though we’ve worked with baby name data before (e.g., my April 10 and October 24, 2013 posts) there are additional considerations to be broached. Look, for example, at the Table 1 – Top 100 girls, E&W tab in the girl-specific book (in excerpt):

 ons1

Pretty? Maybe, but look again. Why, I feel professionally impelled to ask, are the 100 names split down the middle, thus forging what are in effect two parameters comprising precisely the same kind of information, along with a doubling-over of every other field? Even a sit-back reader might ask the same question; and for those who want to actually do some business here, the sheet as it stands just can’t and won’t do, and that’s quite apart from the impertinent blank row atop row 8, which should be made impermanent. And while I’m at it, be advised that the Change Since… numbers attending the names are textual to a fare-thee-well, with authentically texted number signs in lieu of the numerically formatted kind, the occasional asterisk, and all kinds of superfluous spaces in there too. (I’m leaving the Change data aside for the purposes of this discussion, but for the record one could reclaim their arithmetic standing through a regimen of finds-and-replaces and TRIM functions.)

And given the presentational morass above here’s what I do, in the interests of accessibility and maximized analytical puissance:  Copy each set of Rank, Name, and Count columns from the Top 100 England and Wales tabs on both the boys and girls workbooks all into a new sheet, and then open two columns to which gender and country identifiers are to be coded, e.g.

ons2

(I’ve also invested the copied data with the workbook default Calibri 11-point font for consistency’s sake.) You should wind up with 402 names (and not 400, because of some same-ranked names).

Now you can get back in touch with your inner user.  Remembering of course that the data enroll only the top 100 baby names into their global denominator, we could look for example at the inter-country relation between names, understanding that overall, Welsh babies account for about 5.3% of all births (the ratios are almost identical between boys and girls). Try something like this:

Row Labels: Name

Column Labels: Country

Values: Number (by % of Row Total)

Filter: Gender

In excerpt and filtering for Girls I get:

ons3

 

 

You should be looking here for Welsh proportions departing in either direction from around 5.3%; thus Brooke features relatively more often among Wales births, but then look at Alexis, its 33 babies all born in Wales.
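The pivot table’s % of Row Total arithmetic is simple enough to restate outside Excel. A minimal Python sketch, using Alexis’ all-Wales 33 births from above and an invented pair of counts for Brooke:

```python
# % of Row Total, restated: each country's count divided by the name's total.
# The Brooke figures are hypothetical; Alexis' 33 all-Wales births are cited
# in the post.
counts = {
    "Brooke": {"England": 260, "Wales": 40},   # invented counts
    "Alexis": {"England": 0, "Wales": 33},
}

def wales_share(name):
    row = counts[name]
    return row["Wales"] / (row["England"] + row["Wales"])

print(f"{wales_share('Alexis'):.1%}")  # → 100.0%
```

Any name whose Wales share drifts well above (or below) the 5.3% baseline is the kind of departure worth flagging.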

Now substitute Rank for Number in the Values area (by Sum):

ons4

 

Amelia’s trans-national appeal is clear, but you’ll note significant rank disparities up and down the results.

Now think about this: might a notable difference between average name lengths obtain between gender and/or country? It might, but you’ll be quick to call the absence of a name-length metric among the source data to my attention, and you’ll be right to do so. How, then, could name-length differentials be appraised?

That question admits more than one answer, and the most elemental reply would have us assign a LEN formula alongside each name, total the accumulation of them all, and divide it by the number of names, presumably under the steam of a COUNTA. But that won’t work, even if you’re happy with the idea. It won’t work because, by having copied the England and Wales names from their associated sheets, many names appear twice, and as such enforce a small but measurable skew upon the calculation. The alternative here, then: bang out a pivot table, and assuming you’ve brought that Length field into the data fold:

Row Labels: Name

Values: Length (Average)

The table naturally reports each name once inside the Row Label column, because that’s what Row Labels do – return every item in the Labelled field uniquely. The necessary reversion to the Average operation above owes to the fact that, even as Amelia appears but once among the Row Labels, Values defaults to a sum of her two name lengths in the original data set – the one for England, the other for Wales, or a summed length of 12. And now you can play through the formula proposed in the previous paragraph (and if you do, I’d keep away from those GETPIVOTDATA references that’ll beset the expression if you click on the cell references; I’d just type the refs in standard mode).
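If you’d rather see the skew spelled out than take it on faith, here’s a toy Python sketch of the problem and of the pivot table’s fix – a cross-listed Amelia, averaged naively and then per unique name:

```python
# Why averaging LEN down the combined sheet skews: cross-listed names count
# twice. Toy data – Amelia appears in both the England and Wales rows.
names = ["Amelia", "Amelia", "Alys"]

naive = sum(len(n) for n in names) / len(names)       # Amelia counted twice
per_unique = {n: len(n) for n in names}               # one entry per name,
deduped = sum(per_unique.values()) / len(per_unique)  # as Row Labels would do

print(round(naive, 2), deduped)  # → 5.33 5.0
```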

That should work, at least mathematically. But you may be happier with a weighted assay of the name lengths, one that commensurately honors the variable numeric contribution of the names. After all, average the lengths of Abigail – the choice for 1191 girls – and Alys, the name adopted by 59 sets of parents, and it comes to 5.5. But is that how you want the average to be understood? And if you don’t, and you’ve been won over by the prospect of a weighted average, can that end be pursued with one formula, and without bothering to institute a Length field in the data set?

The answer is yes, and the formula – at least one such formula – looks something like this:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

 (assuming the Names occupy D, with the number of births moving in next door in E).  And so it seems to me that we need a part two; think about what you see above. But remember you can always speak with me during office hours. I think they’re Wednesdays 11-12:30 and by appointment.

 

 

UK Names, 1905: The Reign of William and Mary

10 Apr

Call it Sociology Light, but birth names and their incidence say something about a culture’s here and now, and its theres and thens; and a table of the 100 most prominent names conferred upon newborns in the UK, circa 1905, and lined up with a like array from 2012 must be at least halfway instructive, it seems to me.

The data, broken out by gender and put together by ancestry.co.uk and put in a downloadable place by the Guardian’s DataBlog, are here, too:

 Data on baby names

Once you relieve the boys headers of their wrap-text trammels (note the odd typeface discrepancies up there as well; and I’d have formatted the Change field in percent style, lest readers think all those negative numbers communicate absolute values – a Change of -1 suggests a name-occurrence reduction of 100%, though in some cases this rounds off what is in fact something less than that), the names numbers prove downright interesting. Tack a simple sum formula to the base of the No. and Rate columns for both genders and we learn, among other things, that the 100 1905 names contributed 240,289 of all boys’ and 197,042 of all girls’ birth names that year, compared to 2012’s 96,534 and 56,002 respective aggregates. The Rate sums (that is, the number of times the top-100 names appeared among every 1,000 births) for 1905 stand at a remarkable 508.14 and 431.73, meaning that the boys’ 100, at least, identified more than half of the male babies that year. By decided contrast, the 2012 figures came to 257.86 and 157.60. But then, we’re not comparing like for like.

And that’s because like for like would require us to call in the top 100 names for 2012, a distribution furnished not at all by the data before us. All we have here are the counts for the 2012 births of just those names that cracked the top 100 in 1905. We can, however, download the 2012 names for both genders here in separate workbooks, on that page’s second and third data links (use the names on Table 6 for both sets; these array the names down a solitary column and exhibit all the birth names bestowed upon at least three children, thus counting off a top 6,200 for boys, and over 7,700 names for girls. We’ll want to access all the names, for reasons I hope to justify shortly.) What these data don’t afford us, however, is rate data – only the absolute birth numbers; but as we’ll see, we should be able to draw the rate inferences nonetheless.

So what do we want to do here? First, copy and paste the 2012 names (all of them, as I’ve advised) in distinct columns for both boys and girls into a new sheet in our workbook, and separate these by at least a few columns. Name the range of names (names only) for boys Boys, and for girls Girls (I’ve let my imagination run riot here). Then return to the original 1905 Boys sheet and divide the number of 2012 names for William in D1 by its rate in G1. The yield – 374.39 – can be adopted as a more-or-less constant, signifying the number of actual name-bearing births in 2012 accounting for 1 out of 1,000 births that year (more or less, because of a rounding off of the rate numbers down the column, I think). In other words, about 374 actual male births in 2012 compute to 1 per 1,000.

Then alongside the most numerous name in the 2012 boys’ range – Harry – divide his 7,168 total by the 374.39 (presumably in the column to the right of the actual birth numbers). I get 19.15 (per 1,000); then copy down the column. Add the outcomes for 1 through 100 and I get 542.23 – an even mightier hegemony for the top 100 boy names in 2012 than held for 1905. Move to the Girls range and try the same tack. I’m using a rate constant of 355 here, in the interests of again looking past the variation inflicted by rounding off. Dividing the girls’ names by the 355 and summing the rates, I get 444.83 – again, a heightened share for the 2012 top 100.
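The rate logic above reduces to one division, and bears a quick sanity check in Python; Harry’s 7,168 births and the 374.39 constant are taken straight from the text:

```python
# The per-1,000 rate logic: a constant (births per rate point) derived from
# one known count/rate pair, then applied to any other count. Harry's 7,168
# births and the 374.39 constant come straight from the post.
constant = 374.39      # 2012 male births equating to 1 per 1,000
harry_births = 7168

harry_rate = harry_births / constant
print(round(harry_rate, 2))  # → 19.15
```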

And there’s something else we can do – we can go on to figure the 2012 rankings for the 1905 names.

It works like this. First, and I’m sparing you the trial and error, it appears as if all the boys’ and girls’ names in their native Table 6 sheets in the 2012 workbooks have been set upon by an old nemesis – the superfluous space. Because we’ve already copied these names to that separate sheet in the 1905 workbook (to the ranges we’ve called Boys and Girls), apply the TRIM function (see the March 27 post, for example) there in some nearby columns, and Paste Special > Values atop the original names. Then delete the columns bearing the TRIM formulas.

Then do something simple. Cut and paste the 2012 ranking numbers (in the sheet containing the Boys and Girls ranges) to the next free rightward column, which should flank the rate numbers we calculated earlier. That is, you should see something like this for Harry:

 na1

Then scurry back to the original Boys sheet. In the next available column – mine is H – write the following:

=VLOOKUP(B2,Sheet1!B$2:E$6195,4,FALSE)

where the Boys range in my case holds down B2:E6195 in my Sheet1. Thus we see that William – the boys’ name frontrunner in 1905 – turns up in the 9th position in the Boys range:

 na2

Norman, on the other hand, has plummeted from 26 to 4805 (note that many names in 2012 share the 4805 position; they’re the ones with three babies each). And now you can do the same for the girls (but be prepared for some #N/A messages, these pointing to those 1905 names which no tots in 2012 received, e.g., Gladys.) And there’s something about Mary – number one in 1905, down to 241 107 years later.
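For the curious, an exact-match VLOOKUP behaves much like a plain dictionary lookup, with the missing key playing the part of #N/A. A sketch, using the ranks quoted above (the dictionary is a toy stand-in for the real Table 6 range):

```python
# The exact-match VLOOKUP as a dictionary lookup: each 2012 name maps to its
# rank; an absent name surfaces as #N/A. Ranks are the ones quoted in the
# post; the dictionary is a stand-in for the full range.
rank_2012 = {"Harry": 1, "William": 9, "Norman": 4805}

def lookup_rank(name):
    # FALSE-style exact matching: no nearest-match fallback
    return rank_2012.get(name, "#N/A")

print(lookup_rank("William"), lookup_rank("Gladys"))  # → 9 #N/A
```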

So what’s in a name? Not much, perhaps; but names? Squint, and you can see the zeitgeist moving.

Vacation time this coming week – WordPress is a most enlightened employer.

 

 

English-Wales Name Data, Part 2: Sum Interesting Formulas

28 Aug

My bookmark tucked its ribbon here:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Last week’s post, you recall, paused at the above teaser, a solitary, unexplained but efficacious formula for coaxing a weighted-average length for 2013’s top 100 boys’ and girls’ birth names from the data. Now I need to reward your patience by explaining.

A weighted average, after all, insists that a proper, proportioned due be paid every name’s incidence across the birth cohort, and the formula in our inbox appears to do just that. It enlists the SUMPRODUCT function to the task, a predominantly simple mechanism that in the first instance works this way:

Suppose I’m presented with this range of transactions in A1:C5:

(It’s clear I have no idea what these comestibles really cost.) What I want to know, of course, is what the bill comes to in toto, and the standard means toward that end is pretty evident: enter a bank of formulas, each multiplying every Units Sold by its allied Cost per unit, and add the four purchases.  But SUMPRODUCT lets us unify the process:

=SUMPRODUCT(B2:B5,C2:C5)

That is: simply enter the paired ranges, or arrays, as they’re known in these parts, and SUMPRODUCT does the rest – it multiplies each Units Sold by its collinear Cost per unit – B2 by C2, B3 by C3, etc. and then sums them all to boot.
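Since the A1:C5 table itself didn’t survive the trip to this page, here’s the same multiply-and-sum logic in Python, with invented units and costs standing in for B2:B5 and C2:C5:

```python
# SUMPRODUCT in miniature: pair each Units Sold with its Cost per unit,
# multiply row-wise, and sum. The grocery figures are invented stand-ins
# for the missing A1:C5 table.
units = [3, 2, 5, 1]             # B2:B5
cost = [0.50, 1.25, 0.30, 2.00]  # C2:C5

total = sum(u * c for u, c in zip(units, cost))  # =SUMPRODUCT(B2:B5,C2:C5)
print(total)  # → 7.5
```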

But what are we hoping to do? Again, we’re aspiring to learn the weighted average length of all the birth names in our workbook, and SUMPRODUCT will consummate that hope – provided we do a bit of a syntactical rethink.

Observe that nowhere among our five extant England-Wales-name workbook fields will you find name length data, and we even politely declined the offer of the suitable, supplementary Length field that was tendered in the previous post. Again, we want to see if we can figure average name length without a length field’s well-meaning intercession.

So get back to our bookmarked SUMPRODUCT, which is, and is not, essentially coterminous with the unassuming apples and oranges computation. You’ll note first of all that both SUMPRODUCTs reference two arrays, but our average-length-finder couches the first of these with a LEN, one that stuffs a 402-row range inside its bulging parentheses, even as our previous meet-ups with LEN sized the contents of but one cell per formula. And the very fact that we can grant LEN this superimposition atop an array reference at all is, to me at least, imperfectly obvious. But you can. We also need to understand in turn that SUMPRODUCT is an off-the-shelf embodiment of an array formula, a class of expressions that power through multiple calculations in but one go – a capability that in fact was made clear by the more rudimentary fruit-and-veg example, when you think about it – and you should (for an instructive introduction to array formulas look here).

Thus the LEN(D2:D403) excerpt measures the length of each name in the D column – and then multiplies each length by the corresponding number of children bearing that name (reported in the E column) – and then adds all those results…and divides that grand total by the sum of all the births. And that’s a weighted average. Here’s the formula again:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Thus the calculated length of D2 is multiplied by the associated number of births in E2, as are the like values in D3 and E3, and so on, all the way down to D403 and E403. These are then all added, turning out an aggregate length in characters. That number – which happens to be 2,238,510 – next submits to the divisor comprising the number of all births – 343,022; and 2,238,510/343,022 confers a weighted name-length average of 6.52585. Got that?
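To make the arithmetic concrete without the full 402-row range, here’s the same weighted average in Python, using only the Abigail and Alys figures from last week:

```python
# Weighted vs. plain average name length: each length is weighted by its
# birth count, and the product-sum is divided by total births. Just two
# names here; the real ranges hold 402 rows.
names = ["Abigail", "Alys"]
births = [1191, 59]

weighted = sum(len(n) * b for n, b in zip(names, births)) / sum(births)
plain = sum(len(n) for n in names) / len(names)

print(round(plain, 2), round(weighted, 2))  # → 5.5 6.86
```

The 1,191 Abigails pull the weighted figure well above the naive 5.5, which is the whole point of weighting.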

Of course it makes sense – it has to, or else let me be the first signatory to your class action suit against Microsoft. But array formulas do require a think, because the kind of syntactical sense they espouse doesn’t always tally with standard user expectations.

But I for one was perplexed by a more generic thought-provoker, one that brought me back in touch with that ancient mathematical fundament, the order of operations. It turns out, for example, that the SUMPRODUCT under our microscope could have been alternatively written this way:

=SUMPRODUCT(LEN(D2:D403),E2:E403)/SUM(E2:E403)

Look closely, because that variant hasn’t slithered out of a photocopier. It is different from our earlier take, in virtue of the parenthesis (or bracket, depending on your IP address) right-flanking the E2:E403 segment before the divisor symbol. Here, then

 

LEN(D2:D403),E2:E403

 

are regarded of a piece, and indeed, the entire SUMPRODUCT formula ends there:

 

=SUMPRODUCT(LEN(D2:D403),E2:E403)

 

and the /SUM(E2:E403) simply tacks itself discretely to the above taking the SUMPRODUCT result further, but from the outside, as it were.

 

But everything you see in our original SUMPRODUCT belongs to it; nothing there is external to the function as it is written. Here, SUM(E2:E403) appears to button itself to E2:E403 alone, and then divide it; but apart from the fact that that fraction doesn’t seem to make much sense, it isn’t what’s happening anyway. Consider the difference between

 

=6*(4/2) and =(6*4)/2

 

Guess what – there is none. Either way you get 12.
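And if you’d like the identity confirmed beyond 6s and 4s, a quick sketch with toy lengths and counts shows the two SUMPRODUCT orderings agree:

```python
# The two SUMPRODUCT orderings, checked against each other: divide the
# weights inside the product-sum, or divide the whole product-sum after –
# either way, the same weighted average (toy lengths and counts).
lengths = [7, 4, 6]
counts = [1191, 59, 300]
total = sum(counts)

inside = sum(l * (c / total) for l, c in zip(lengths, counts))
outside = sum(l * c for l, c in zip(lengths, counts)) / total

print(abs(inside - outside) < 1e-9)  # → True
```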

 

P.S. If you’re a name devotee, the US Social Security site Popular Baby Names page issues a free pass to an enormous holding of birth name data dating back to 1883 in easy-to-convert text file mode, apparently listing every first name given to at least five babies each year. And they have state-by-state breakouts, too.

Je M’Appelle Abbott: Comparing Paris and New York Name Data, Part 2

24 Oct

Now that our data infrastructure has been firmed and its component fields, including the ones forged to our specifications, have been welded into place, we can begin to think about how this all might be put to good analytical use. As observed last week, each field is fetchingly substantive and groomed to contribute to the business of learning about New York-Paris naming practices. And had New York allowed its data to stretch across multiple years, over and above its monadic 2009, another field for our delectation would’ve sweetened the mix.

But let’s work with what we have. Your data right now should track somewhere along these (grid)lines:

 nyp1

And that refinished data set cues a return to the question with which I closed out last week’s post. While our intent of course is to frame a series of comparisons between New York and Parisian names, the numbers as they stand simply don’t directly support any like-for-like valuations, simply because the New York data comprise 2.7 times more births (as opposed to actual names – a point over which we need to mull a bit later) than those in the French capital. One could, one supposes, multiply each Paris record by the 2.7 differential and as such normalize their numbers, after a fashion; but force-feeding the data with that kind of virtual numeric parity would mar the presentation prohibitively, I think. Reminding readers that the New York birth totals emblazon the actual numbers, and at the same time begging their imaginative indulgence over the Paris figures, throws too many balls in the air – particularly when the 2.7 multiplier sprays decimal points all through the Paris totals. Do we really want a reader to know that 869.4 Louises were born to Paris mothers in 2009? Some rhetorical questions were meant to be answered – and the answer to this one is: No.

What we can do, of course, is proportion the data, by fractioning say, the ratio of A-initialed names to the respective New York-Paris wholes. This too, normalizes the data, but after a different fashion. For example, we could try out this pivot table:

Row Labels: Initial

Column Labels: City

Values: Initial (again, by Count, and % of Column total)

Report Filter: Gender

I get:

 nyp2

(N.B. One need only apply the % of Column Total option here once. The NY and Paris values are exactly that – plural items or values in the selfsame City field; the two cities don’t stand alone as discrete fields.)

There are comparative differences in there (i.e., the J and M initials), but quite a few distributional resemblances, too. And if you’re wondering about the absolute numbers conducing toward these results, throw Initial into the Values area again and comply with the default Count operation:

 nyp3

And of course, all of the above could be filtered for gender, whose field is peopled by a data curiosity. Some of the Paris names have been spooked by the X factor – that letter having stolen into their Gender fields, presumably in virtue of some documentary/administrative complication. There are but five such names, however – Noha, Felicite, Keziah, Jessy, and Alexy, all of which are sufficiently indeterminate to beg the question of how gender is registered by the French.

Now we need to amplify an issue to which we earlier devoted a parenthetical aside. The name data before us can in fact be addressed in two modes: first, as a collocation of names, each considered a solitary, once-occurrent datum – the tack we’ve taken until now – or, second, as the sum of all the names borne by the babies born in 2009, a very different conceptual and numeric understanding. Thus by substituting the Number for the Initial field in the Values area for the above pivot table (and remaining here with the default Sum operation), you’ll realize this breakout:

 nyp4

(Yes, feel free to rename the fields if you want to.) The numbers now are much larger, naturally, but by-and-large they jibe with the Initial Count data, and that’s probably none too surprising. And of course all of the above could be filtered by gender, once we look past the five Xs. Thus we see, for example, that a remarkable 19.26% of all New York males received a name beginning with J, even as that letter starts off but 5.16% of the girl names.

Now substitute Length for Number, summarize by average, and format to say, two decimal places (here you need only recruit Length once for the Values area):

nyp5

The variation is not stupendous, and overall New York names average 5.84 characters to Paris’ 5.78 – a virtual draw (note again, however, that these averages treat the names as one-time iterations, that is, as per the first method we described above). Resorting to Plan B, in which the aggregate length for all names of all babies (i.e., name Number times Length, all totaled, broken out by each city, and divided by the Number total for each city) does the work instead, I wind up with a 5.89-5.69 New York-Paris average name-length split. For the record, you can bring those results about via these expressions:

=SUMIF(D2:D2834,"NY",G2:G2834)/SUMIF(D2:D2834,"NY",B2:B2834)

=SUMIF(D2:D2834,"Paris",G2:G2834)/SUMIF(D2:D2834,"Paris",B2:B2834)

(Begging forgiveness, a detailing of the workings of the above will take us a digression too far here. But see, for example, this discussion: http://www.techonthenet.com/excel/formulas/sumif.php)
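For what it’s worth, the pair of SUMIF quotients reduces to a grouped divide – per city, summed characters over summed babies. A sketch with invented rows of (city, Number, Number × Length):

```python
# The SUMIF quotients as a grouped divide: per city, total characters across
# all babies over total babies. Rows are (city, Number, Number * Length),
# with invented values standing in for the real columns.
rows = [
    ("NY", 120, 720),     # 120 babies bearing a 6-letter name
    ("NY", 80, 320),      # 80 babies, 4 letters
    ("Paris", 50, 350),   # 50 babies, 7 letters
    ("Paris", 150, 750),  # 150 babies, 5 letters
]

def weighted_length(city):
    chars = sum(r[2] for r in rows if r[0] == city)   # SUMIF on the products
    babies = sum(r[1] for r in rows if r[0] == city)  # SUMIF on Number
    return chars / babies

print(weighted_length("NY"), weighted_length("Paris"))  # → 5.2 5.5
```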

One additional finding that I, for one, attended with some surprise emerged from a simple question about the names shared by New York and Paris, i.e., how many names appear in both cities’ birth rosters? The answer can be pivot tabled thusly:

Row Labels: Name

Values: Name (Count)

And sorted descendingly by the Values column.

What interests us here, put simply, is the number of names evaluating to a count of 2 – meaning of course that these are the ones which find their way into the birth data of both cities. My pivot table counts 2360 individual names, of which only 446 – or about 18.9% – occurred to parents in both cities. I would have thought that the selection overlap, as it were, for these two Western countries would have canopied far more names. Anthropologists, I sense a post-doc opportunity in the making.
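The count-of-2 trick is, at bottom, a set intersection; a toy sketch with made-up name lists makes the point:

```python
# The names-in-both-cities count as a set intersection: pivot table items
# with a count of 2 are exactly the names both lists contain. Toy lists.
ny = {"Olivia", "Louise", "Jan", "Kevin"}
paris = {"Louise", "Kevin", "Camille"}

shared = ny & paris
overlap = len(shared) / len(ny | paris)  # the post's ~18.9%, in miniature

print(sorted(shared), overlap)  # → ['Kevin', 'Louise'] 0.4
```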

But these data also blow the whistle on two data quirks. A number of Paris names – e.g., Kevin, Sekou – appear twice in the original Open Data Paris download, and in association with the same gender (doubles dot the New York data, too, but these are gender-parsed). By the lights of pivot table protocols these redundant mentions are merely flesh wounds, because squeezing repeated instances of the same item into one, and only one, row or column label appearance is precisely what pivot tables are cut out to do. Still, the duplicated names – which were first brought to my attention by a student – are worth asking about, or at least wondering about.

The other quirk is even quirkier. When I ran the above names-shared pivot table, its upper tier read as follows:

nyp6

Now, pivot tables should immediately sort Row and Column Label entries; but the first four labels above clearly aren’t defaulting.  I’ll admit to a thick, initial measure of puzzlement over these out-of-sorts entries, until the light bulb shined a few watts on me. Apart from qualifying as bona fide baby names, Jan, April, May, and June happen to populate Excel’s custom sort lists of the months and their three-lettered abbreviations (see File > Options > Advanced > the Edit Custom Lists button). As such, they’re indeed earmarked to rise to the top of any sort, at least by default (and note that the four month/names sort themselves in chronological order). But if you want the members of this quartet to assume their conventionally sorted place, initiate the pivot table via Insert > Pivot Table, and immediately thereafter click PivotTable Tools > PivotTable button group > Options > the Totals & Filters tab, and deselect Use Custom Lists when sorting.

But alright, mes amis; you’ve worked hard enough today – I’ll spring for the croissants.

The Skinny on UK Obesity Data: Work to Do

8 Feb

The reported levelling of the rates of childhood obesity in the UK has naturally been accounted a good thing (for the article that made the news in full, consult this URL), although for dieters who like their glass half empty the news should perhaps be imbibed with slow, deliberate sips.  After all, the study’s abstract prefigures its findings with this bit of curbed enthusiasm:

“More than a third of UK children are overweight or obese, but the prevalence of overweight and obesity may have stabilized between 2004 and 2013.”

It’s possible, then, that the discovered equilibration has more to do with some saturation effect that has already visited its full-on impact upon the cohort of obesity-vulnerable youngsters, but I take no credit for the worthiness of that conjecture, only blame for the conjecture itself.

Be that as it may, the National Child Measurement Programme (NCMP) has something to say about the development as well, here for the years 2007/8 through 2012/13, in the form of this workbook, available for download here:

http://www.noo.org.uk/visualisation

(Click the Download link for Child Obesity and excess weight prevalence by Clinical Commissioning Group.)

The book breaks out its populations by both “excess weight” and “obesity” measures (the former standing as a superset of the latter) for two school-year age cohorts, what the UK calls Reception (ages 4-5) and Year 6 (10-11). (Look here for the BMI formulas, both for kilograms and pounds; the workbook’s NOTES tab sketches additional methodological background. This study’s sample sizes also operate within the confidence interval limits limned in the Excess weight and Obese tabs. Remember again that our spreadsheet doesn’t invoke the same data compiled in the journal piece. Remember too to review the Notes tab’s explication of data cells marked “s”.)
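For reference, the two BMI conventions those formulas encode – metric, and the pounds-and-inches version with its customary 703 factor – look like this:

```python
# BMI in both conventions: weight over height squared, with the customary
# 703 multiplier when the inputs arrive in pounds and inches.
def bmi_metric(kg, metres):
    return kg / metres ** 2

def bmi_imperial(pounds, inches):
    return 703 * pounds / inches ** 2

print(round(bmi_metric(70, 1.75), 1))  # → 22.9
```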

Now to those data, about which there is a good deal to say. Or, correction: now to the data organization.  First, and perhaps most prominent, are the fields. We’ve seen this before, and the question begs reiteration: why are data which, for all analytical intents and purposes, belong to the same parameter – e.g. Reception – stationed in a plurality of columns? The resulting barrier to analysis is formidable, though not insurmountable; but surmounting requires us to cut and paste kindred data to the same field, copy and properly align the CCG (Clinical Commissioning Group) identifiers down columns A and B via a duly-diligent copy and paste, and make room for a year field in which that differentiating datum needs to be copied down, too. And you’ll probably need to cut-and-paste the Year 6 data, too, and presumably to a worksheet all its own, the better to disentangle these from the Reception figures, though that larger positional issue needs to be thought through – because in fact the workbook draws itself around two pairs of variables – the Reception/Year 6 axis, and the Excess weight/Obese binary. (It could at the same time be possible to jam all of these into a single grand sheet, by introducing, say, new School Year and Weight Category fields, thus lending themselves to additional breakouts along those lines, but I’m holding my jury out on this unifying tack.) The larger point, though: if you’re serious about doing something serious with the data, you’ll need to think about all these prior somethings, too, before your analysis proceeds.

But quite apart from those necessary considerations you’ll also have to do something about the superabundance of merged cells in rows 1 through 4. These can’t coexist with your pivot-tabling intentions, and neither can the blank row

obesity1

that prohibitively separates those potential field headers from the data. The simplest rejoinder, of course: move the putative field names above down a row, but only after you’ve made sure that the now-unwelcome text entries still higher above these perch at least one blank row away from what you’re envisioning as the finalized data set (and if you’re going this far you’ll find that, once unmerged, the cell now invested with the “Number measured” header, for example, has been bumped two rows above its eventual place atop the dataset). And, for crowningly good measure, you’ll likewise have to unmerge the sheet’s title wrapped across the A1 supercell. (I also don’t see the immediate pertinence to our data of the 32,000 LSOA-CCG lookup records in that so-named tab, apart from the “The analysis uses a best-fit 2001 LSOA to 2013 CCG lookup created by PHE” declaration in the Notes tab. LSOA abbreviates lower layer super output area; for more, look here.)

To repeat – if you’re not prepared to do the reconstitutive work schematized above I don’t think you’ll be able to take the data very far; and why the analysts and/or story-seekers should be made to steel themselves for a rough ride across this obstacle course makes for one very good rhetorical question, though of course we’ve asked it before.

In fact, though, the data look pretty interesting, if now untamed. Your mission – should you choose to accept it – is to domesticize it all.

But don’t worry – the workbook won’t self-destruct.

Air Time, Part 1: UK Home Office Flight Data

12 Dec

I seem to have gotten on a vehicular jag of late – first bicycles, then cars, and now jets, more particularly ones that deliver functionaries at the UK Home Office on their appointed, and perhaps sometimes disappointed, rounds, rowed and columned here (for 2011):

 Home_Office_Air_Travel_Data_2011

The spreadsheet is perhaps admirably straightforward. I was preparing to dismiss the Departure field in column C for its at-first-blush redundant character, appearing to roll nothing but “UK” entries across its cells, but in fact 79 entries in the field offer up something other than UK, e.g., EEA (the European Economic Area, an organizational composite enrolling European Union and European Free Trade Association members), and Non EEA embarkation points.

And therein is promoted a measure of mystery. The data’s departure and arrival points are drawn in the most general terms, and one could properly wonder if that manner of indefiniteness means to divert the reader from exactly where the Home Office has been taking itself. That could be, but an enterprising journo could perform a bit of triangulation by correlating departure date, airline, and flight fares in order to educate a guess about the airports associated with those departures and/or destinations, and all with information unambiguously and publicly there in the workbook. Nothing treasonous there – I think. But if I’m wrong about that, I need to finalize that reservation at the Ecuadorian embassy.

Be that as it may, taxpayers may want to know how fares fare, and those totals are most easily pivot tabled, broken out first by say, month:

Row Labels: Departure_2011

Values:  Paid_Fare (format as you wish)

Paid_Fare (again, this time as Count). You should see:

air1
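For readers who would rather script the exercise, the same breakout can be sketched in pandas. The field names (Departure_2011, Paid_Fare) follow the workbook as described above; the rows here are invented stand-ins, not the Home Office figures:

```python
import pandas as pd

# Toy stand-in for the Home Office sheet; Departure_2011 and Paid_Fare
# are the workbook's (assumed) field names, the values are fabricated.
trips = pd.DataFrame({
    "Departure_2011": pd.to_datetime(
        ["2011-03-02", "2011-03-15", "2011-11-01", "2011-11-20"]),
    "Paid_Fare": [300.0, 220.0, 150.0, 140.0],
})

# Sum and Count of Paid_Fare by month, as in the pivot table
by_month = (trips
            .groupby(trips["Departure_2011"].dt.month)["Paid_Fare"]
            .agg(["sum", "count"]))

# The per-trip average an Excel calculated field resists: total / count
by_month["per_trip"] = by_month["sum"] / by_month["count"]
```

With the real 2011 extract in place of the toy frame, the per_trip column would yield the monthly averages directly, no calculated field required.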

Considerable inter-month variation obtains; how that squares with the rhythms and syncopations of policy-making remains to be considered. Note that March hands in the dearest expense report, but trip count honors go to November, at 1128, or 37.6 a day. Those disparities put March’s per-trip average at £260.41, and that for the eleventh month at £144.10 (I couldn’t output the averages via a calculated field, but I’m still thinking about why not and how the why might be articulated), begging more than a few questions in turn. For one – rather than having traveled farther distances on average, is it possible that March’s passengers were more likely to have taken off from a seat in one of the upper classes?

That question is easily answered:

Row Labels: Departure_2011

Column Labels: Ticket_Class_Description

Values:  Ticket_Class_Description (PivotTable Tools >  Options > Show Values As > % of Row Total)

Click On for Columns Only among the Grand Totals options in the Design > Layout button group.

 air2
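Under the hood that pivot is a row-normalized cross-tabulation, and it has a one-line pandas equivalent. Ticket_Class_Description is the workbook’s field name; the class mix below is invented for illustration:

```python
import pandas as pd

# Hypothetical class mix; only the field name mirrors the workbook
trips = pd.DataFrame({
    "Month": [3, 3, 3, 11, 11],
    "Ticket_Class_Description": ["Business", "Economy", "Economy",
                                 "Economy", "Economy"],
})

# % of Row Total: each class's share of its month's flight count
share = pd.crosstab(trips["Month"],
                    trips["Ticket_Class_Description"],
                    normalize="index")
```

Each row of `share` sums to 1, matching Excel’s % of Row Total display.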

March Business Class flights blurt well above the 7.48% per-month average (that average, by the way, proportions all business class to all flights), but cannot explain the better part of that month’s elevated fare-per-trip. But try swapping Paid_Fare for Ticket_Class_Description in the Values area (by Average):

air3
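That Average operation maps onto a mean-aggregating pivot; a minimal sketch, again under the invented-data caveat (the field names are the workbook’s, the fares are not):

```python
import pandas as pd

# Invented fares; only the field names mirror the workbook
trips = pd.DataFrame({
    "Month": [3, 3, 11, 11],
    "Ticket_Class_Description": ["Business", "Economy",
                                 "Business", "Economy"],
    "Paid_Fare": [990.0, 120.0, 400.0, 140.0],
})

# Average fare per month and ticket class (the pivot's Average op)
avg = trips.pivot_table(index="Month",
                        columns="Ticket_Class_Description",
                        values="Paid_Fare",
                        aggfunc="mean")
```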

That adds a jot of definitiveness to the query. March’s Business Class-ruling £993.28, far higher up the axis than any other month save April, appears to account for its dilated overall fare average. Note in addition the very substantial January-May Business Class averages and the commensurate fall-offs, timed with June.

If we next substitute Destination for Ticket_Class_Description (and all those underscore-connected field names suggest a data inception pointing back to an application other than Excel), work with the Count operation in the Values field and show the values as % of Row Total (turning off Grand Totals for rows if necessary):

air4

It’s easy to be struck by the Home Office’s near-absolute moratorium on non-UK-destined flights after May. Indeed, no such sorties toward EEA nations streaked down any runway after that month. Of course I can’t explain the EEA front-loading in 2011; it would be prohibitively difficult to point to an Office strategic deliberation to fly to only certain venues, and only then; one would think the commanding political exigencies would goad the flight patterns instead, if there is a “pattern”. We’ll need to pass this one over to you, Mr. and Ms. Journo.

We can begin to prepare for landing here by breaking out flights by airline, one of the parameters that again might hint at actual flight destinations:

Row Labels: Supplier_Name

Values:  Paid_Fare (Count)

Paid_Fare (Sum)

(In excerpt:)

 air5
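The airline breakout, too, is a single grouped aggregation in pandas terms. Supplier_Name and Paid_Fare are the workbook’s field names; the carriers and fares below are fabricated for the sketch:

```python
import pandas as pd

# Hypothetical manifest; only the field names mirror the workbook
trips = pd.DataFrame({
    "Supplier_Name": ["British Airways", "British Airways",
                      "Flybe", "BMI"],
    "Paid_Fare": [300.0, 500.0, 120.0, 210.0],
})

# Count and Sum of Paid_Fare per airline, biggest spend first
by_airline = (trips.groupby("Supplier_Name")["Paid_Fare"]
              .agg(["count", "sum"])
              .sort_values("sum", ascending=False))
```

Sorting by the summed fares surfaces the dominant carriers at the top, mirroring what the pivot excerpt shows.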


55 airlines figure in the manifest, and these data require a bit of study as well. Note the unsurprising hegemony of the British-based British Airways and Flybe (along with BMI and BMI Baby, neither currently trading, as they say here).  Again, exactly how these plot against destinations might take a bit of reflection, or more than a bit.

So fasten your seat belt.