Marathon Data, Part 1: A Matter of Time(s)

24 Apr

Slow and steady wins the race?  Bromidic license aside, that can’t be true, can it? When I last looked, it was fast that was copping the money and the medals – but it’s the second adjective that might just hold up to scrutiny.

And I’m directing my scrutiny to some marathon data, the inspiration and workbook data flashing from runner/triathlete Peter Whent and his blog post on the 2012 London marathon, which placed the matter before my mind’s eye.  Peter’s concern is with the problem of steadiness, denoted here as the comparative timings of a runner’s first and second halves of the 26-mile race, and with an expressed interest in the number of entrants managing a negative split – that is, a second half run more speedily than the first. His inquiry confirmed the expected – that runners overwhelmingly (about 96%) stepped through their latter 13 miles more haltingly than their former. And the data, which Peter has kindly afforded you and me, is here:

 London2012 marathon results

(Note, by the way, that the second-half splits in Peter’s workbook were returned via a subtractive inference, e.g., the Finish Times minus the (first) Half times.)

All of which points our sneakers toward a corollary question: do split margins covary with overall times? That is – and sight unseen I’d promote the idea – are elite runners more likely to equilibrate their halves than the long-distance hoi polloi? Superior at pacing and strategizing their course, I’d assume that occupants of the world-class stratum would endorse, and act upon, the aerobic virtues of matched halves. Or could it be that first-to-second-half ratios – as opposed, of course, to actual times – might be found to incline toward something like a constant proportion?

Needless to say that’s an empirical question, and given the data we can make some tracks toward an answer.

First, you’ll note that a good many of the 42,000-plus runners pulled up somewhere short of the finish line; you’ll know them by their blank Finish Time cells, and because they have no splits to report we should sort the runners by Finish Time (A to Z or Smallest to Largest, depending on the cell in which you’ve clicked), thus casting the 6,000 or so non-finishers to the data’s ground floor. Then introduce a blank row between rows 36673 and 36674, thereby striking the uninhabited cells from the active data. Remembering that we want to affiliate split disparities to aggregate time, next squeeze a new column between K (the Finish Time field) and L, name that upstart field something like Split Pct Difference and enter, in what is now L2:


Simply put, by subtracting a runner’s second-half time from his/her first half and dividing that difference by the first-half time we deliver a percent change across the two halves, and here I get


Which is not the kind of number you were expecting. And that surprise has been sprung by a formatting disjunction.

Recall that, as with dates, time entries are nothing but numbers. Thus 12:00, or noon, is really .5, the quantity metering one-half of a day. Look harder at 8:00 (am), then, and you get .333333, and so on. But because the Split Pct Difference field has inherited the Time format of the next-door Finish Time, we seem to have ticked off seven minutes and 43 seconds, or what is really .54% of a 1440-minute day. And indeed, when we reformat our field into Percent Style and unpack two decimal places for the value in L2, 0.54% resolves before us.
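A quick sketch (in Python, purely illustrative) of the two readings of the L2 value – the same number rendered as a day-fraction time and as a percent:

```python
# Excel stores times as fractions of a 24-hour day (86,400 seconds),
# so a ratio of about .0054 masquerades as 0:07:43 under a Time format.
ratio = 463 / 86400          # the L2 result, as Excel actually stores it

# Read under the inherited Time format (hours:minutes:seconds of a day):
secs = round(ratio * 86400)
hms = f"{secs // 3600}:{secs % 3600 // 60:02d}:{secs % 60:02d}"

# Read under Percent Style, two decimal places:
pct = round(ratio * 100, 2)

print(hms, pct)
```

Same stored value both times; only the number format differs.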

But there’s a bit more neatening up to do. I noticed a handful of #DIV/0! gatecrashers glowering in the Split Pct Difference cells, their scruffy appearances reflecting the absences of split times in their records. If you sort Split Pct Difference smallest to largest, those seven dissident records plunge to the bottom, whereupon another blank row, this one interlarded between rows 36666 and 36667, will send them packing.

Now, at last, we can attend to our question. First, we can try this pivot table out:

Row Labels: Place Overall (grouped by, say, 1000)

Values: Split Pct Difference (Average, and Number Formatted as Percentage to two decimals)

In excerpt, I get:


What we find is a fairly persuasive association between runners’ finish position and what could be termed their split discrepancy. The closer to the front, as a rule, the closer the timings of the two halves. Expressed in straightforward correlational terms, e.g.


The association comes to .442 – again fairly pronounced, at least by the lights of social science.
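That .442 presumably issued from Excel’s CORREL; the arithmetic, with invented stand-in numbers rather than the marathon data, runs like this:

```python
# Pearson correlation by hand; these (place, split-difference) pairs are
# invented stand-ins, not the London 2012 figures.
place = [1, 2, 3, 4, 5, 6]
split_pct = [0.01, 0.02, 0.015, 0.04, 0.05, 0.045]

n = len(place)
mx, my = sum(place) / n, sum(split_pct) / n
cov = sum((x - mx) * (y - my) for x, y in zip(place, split_pct))
sx = sum((x - mx) ** 2 for x in place) ** 0.5
sy = sum((y - my) ** 2 for y in split_pct) ** 0.5

r = cov / (sx * sy)   # what CORREL computes
print(round(r, 3))
```

With real data you’d hand CORREL the Place Overall and Split Pct Difference ranges and let it do the same sums.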

We also need to observe that our necessary analytical confinement to runners who actually completed the marathon imposes a skew of sorts upon the data, and the absence of a forthright gender-discriminating field complicates the relationship as well, but there may be a workaround here that I hope to discuss next week.

We can now lumber toward today’s finish line by adding that the pivot table above expresses the splits/Place Overall connection in ordinal terms; that is, we’ve weighed a runner’s splits against his/her finishing place in the competition. We could alternatively measure splits against grouped actual times, and we hope to do that next week. Just be advised there’s a pivot table-specific issue that needs to be understood and confronted before the data can do what we want them to do. In the meantime, do a few more sprints and break for lunch.

UK Names, 1905: The Reign of William and Mary

10 Apr

Call it Sociology Light, but birth names and their incidence say something about a culture’s here and now, and its theres and thens; and a table of the 100 most prominent names conferred upon newborns in the UK, circa 1905, and lined up with a like array from 2012 must be at least halfway instructive, it seems to me.

The data, broken out by gender and put together – and put in a downloadable place – by the Guardian’s DataBlog, are here, too:

 Data on baby names

Once you relieve the boys’ headers of their wrap-text trammels (and note the odd typeface discrepancies up there as well; I’d also have formatted the Change field in percent style, lest readers think all those negative numbers communicate absolute values. You’ll note in this regard that a Change of -1 suggests a name-occurrence reduction of 100%, but in some cases this rounds off what is in fact something less than that), the names numbers prove downright interesting. Tack a simple sum formula to the base of the No. and Rate columns for both genders and we learn, among other things, that the 100 1905 names contributed 240,289 of all boys’ and 197,042 of all girls’ birth names that year, compared to 2012’s 96,534 and 56,002 respective aggregates. The Rate sums (that is, the number of times the top-100 names appeared among every 1,000 births) for 1905 stand at a remarkable 508.14 and 431.73, meaning that the boys’ 100, at least, identified more than half of the male babies that year. By decided contrast, the 2012 figures came to 257.86 and 157.6. But then, we’re not comparing like for like.

And that’s because like for like would require us to call in the top 100 names for 2012, a distribution furnished not at all by the data before us. All we have here are the counts for the 2012 births of just those names that cracked the top 100 in 1905. We can, however, download the 2012 names for both genders here in separate workbooks, on that page’s second and third data links (use the names on Table 6 for both sets; these array the names down a solitary column and exhibit all the birth names bestowed upon at least three children, thus counting off a top 6,200 for boys, and over 7,700 names for girls. We’ll want to access all the names, for reasons I hope to justify shortly.) What these data don’t afford us, however, is rate data – only the absolute birth numbers; but as we can see, we should be able to draw the rate inferences nonetheless.

So what do we want to do here? First, copy and paste the 2012 names (all of them, as I’ve advised) in distinct columns for both boys and girls into a new sheet in our workbook, and separate these by at least a few columns. Name the range of names (names only) for boys Boys, and for girls Girls (I’ve let my imagination run riot here). Then return to the original 1905 Boys sheet and divide the number of 2012 names for William in D1 by its rate in G1. The yield – 374.39 – can be adopted as a more-or-less constant, signifying the number of actual name-bearing births in 2012 accounting for 1 out of 1,000 births that year (more or less, because of a rounding off of the rate numbers down the column, I think). In other words, about 374 actual male births in 2012 compute to 1 per 1,000.

Then alongside the most numerous name in the 2012 boys’ range – Harry – divide his 7,168 total by the 374.39 (presumably in the column to the right of the actual birth numbers). I get 19.15 (per 1,000); then copy down the column. Add the outcomes for 1 through 100 and I get 542.23 – an even mightier hegemony for the top 100 boy names in 2012 than held for 1905. Move to the Girls range and try the same tack. I’m using a rate constant of 355 here, in the interests of again looking past the variation inflicted by rounding off. Dividing the girls’ names by the 355 and summing the rates, I get 444.83 – again, a heightened share for the 2012 top 100.
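Harry’s division is worth retracing in a few lines (the 7,168 births and the 374.39 constant are the post’s figures):

```python
# Figures from the post: 374.39 actual 2012 male births equate to a
# rate of 1 per 1,000; Harry claimed 7,168 births that year.
births_per_unit_rate = 374.39
harry_births = 7168

harry_rate = harry_births / births_per_unit_rate   # per 1,000 births
print(round(harry_rate, 2))
```

Copying that quotient down the column and summing the top 100 delivers the 542.23 aggregate rate.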

And there’s something else we can do – we can go on to figure the 2012 rankings for the 1905 names.

It works like this. First, and I’m sparing you the trial and error, it appears as if all the boys’ and girls’ names in their native Table 6 sheets in the 2012 workbooks have been set upon by an old nemesis – the superfluous space. Because we’ve already copied these names to that separate sheet in the 1905 workbook (to the ranges we’ve called Boys and Girls), apply the TRIM function (see the March 27 post, for example) there in some nearby columns, and Paste Special > Values the results atop the original names. Then delete the columns bearing the TRIM formulas.
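For the record, TRIM pares leading and trailing spaces and collapses interior runs to single spaces; a rough functional mimic, with invented names:

```python
def excel_trim(s):
    """Rough mimic of Excel's TRIM: strip the ends, collapse interior
    runs of whitespace to single spaces."""
    return " ".join(s.split())

# Invented specimens of the superfluous-space nemesis:
print(excel_trim("  William "), "|", excel_trim("Mary   Ann  "))
```

(Strictly, Excel’s TRIM concerns itself only with the ordinary space character; the mimic above is a bit more sweeping.)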

Then do something simple. Cut and paste the 2012 ranking numbers (in the sheet containing the Boys and Girls ranges) to the next free rightward column, which should flank the rate numbers we calculated earlier. That is, you should see something like this for Harry:


Then scurry back to the original Boys sheet. In the next available column – mine is H – write the following:


Where the Boys range in my case holds down B2:E6195 in my Sheet1. Thus we see that William – the boys’ name frontrunner in 1905 – turns up in the 9th position in the Boys range:


Norman, on the other hand, has plummeted from 26 to 4805 (note that many names in 2012 share the 4805 position; they’re the ones with three babies each). And now you can do the same for the girls (but be prepared for some #N/A messages, these pointing to those 1905 names which no tots in 2012 received, e.g., Gladys). And there’s something about Mary – number one in 1905, down to 241 some 107 years later.
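The lookup behaves, in effect, like a keyed retrieval that falls back to #N/A when a 1905 name has vanished; the rankings below are the ones quoted above:

```python
# 2012 rankings quoted in the post (William 9th, Norman 4805th); Gladys,
# a 1905 girls' name no 2012 baby received, draws Excel's #N/A.
rank_2012 = {"William": 9, "Norman": 4805}

def lookup(name):
    """What the H-column lookup formula returns for one 1905 name."""
    return rank_2012.get(name, "#N/A")

print(lookup("William"), lookup("Norman"), lookup("Gladys"))
```

The real formula, of course, consults the full Boys or Girls range rather than this two-name excerpt.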

So what’s in a name? Not much, perhaps; but names? Squint, and you can see the zeitgeist moving.

Vacation time this coming week – WordPress is a most enlightened employer.



My Little Town: London Real Estate Data

3 Apr

Location, location, location; and no, it’s not the real estate that’s got me waxing bromidic, it’s the spreadsheets, of course, and those recurring second thoughts about how their data are conveyed, and exactly how and where they’re…located.

Recurring, indeed, at least on these WordPressed pages; and apropos that perennial restiveness let us pack up and move to the Average House Prices Borough workbook on offer at the site, and learn a little about that city’s combustible and ever-newsworthy real estate sector, and its no-less-combustible spreadsheet. It’s coming to your neighbourhood here:


The data here clamber about eleven scarily similar sheets, strewing price and sales volume numbers across the 33 London boroughs (they’re counting the interstitial City of London as a borough here). And while the data are readable, are they usable? That is, can you do something with them?

A loaded question, replete with half-full answers, because of course the answer depends. Take, for example, the identically-structured Median and Mean Annual sheets, asking after their receptivity to pivot tabling. My answer: iffy. The sheets, you will note, drape the data in year-per-field mode, e.g.:


And that peculiar positioning, as I took some pains to urge upon you in my August 22 and 29, 2013 posts, clogs the pivot tabling enterprise, and so if you want to effect the kind of data transposition I described there (but won’t belabor) here, you’ll need to:

  • Delete the vacant row 2.
  • Consider this: because the restructuring espoused in the August posts trisects the data into Row, Column, and Value fields only (which in turn are cast into a three-legged – that is, three-field – Excel Table), and because it is invariably the leftmost field in a data set that resets itself in Row terms, the Area field’s text data will assimilate into the Column field, even though its borough names are not of a piece with the numeric year data. What you could do, then, is simply delete the not-terribly-meaningful Code field, thus vesting Area – now the leftmost of the fields – with Row status.
  • Earmark only A1:S34 for the pivot table-eligible makeover, that is, just the London borough-identifying rows. The summatory data beneath row 34 communicate a different kind of data.

You could do much the same with the Mean Quarterly sheet, but its double-headed fields – Q1, Q2, etc., supervened by the year-referencing 2010, 2011, etc., in the merged cells in row 1 – won’t work. Delete row 1 and rewrite the Quarterly heads to a differentiating Q12010, Q22010, and so on. (You’d need to do something similar with the Upper Quartile Quarterly, which for all the world sounds like a scholarly journal. Here, though, you’d have to pad the 1, 2, 3, 4 quarterly headers with the year data instead.)
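In miniature, the restructuring amounts to un-pivoting the year-per-field grid into one record per borough-year pair (the boroughs and prices here are invented):

```python
# Invented borough/price grid in year-per-field ("wide") shape.
wide = {
    "Camden": {"2010": 350000, "2011": 370000},
    "Bexley": {"2010": 200000, "2011": 210000},
}

# One record per borough-year pair: the three-field Row/Column/Value Table.
long = [
    {"Row": area, "Column": year, "Value": price}
    for area, years in wide.items()
    for year, price in years.items()
]
print(len(long))
```

Each (borough, year) cell becomes its own row, which is exactly the shape a pivot table can digest.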

It’s also occurred to me that once you’ve submitted the data to the above process you could march on and develop pivot-table breakouts along Inner-Outer London lines, as per the aggregated Inner-Outer mentions in the lower reaches of most of the sheets, and which won’t appear in the pivot tables we’ve commended above. (For a detailing of the Inner-Outer borough opposition see this piece. There is, by the way, a dispute about where the borough of Newham goes.) So once you’ve forced the data into the Row, Column, Value triumvirate you can then add a fourth field (say, called I/O, and that’s not a computer throughput allusion) stocked with borough Inner-Outer identifiers.

To head in that direction you first need to hammer together a lookup table, the data for which I’ve carried off from this bit of real estate:



Greenwich I
Hackney I
Hammersmith and Fulham I
Islington I
Royal Borough of Kensington and Chelsea I
Lambeth I
Lewisham I
Southwark I
Tower Hamlets I
Wandsworth I
Westminster I
City of London I
Barking and Dagenham O
Barnet O
Bexley O
Brent O
Bromley O
Croydon O
Ealing O
Enfield O
Haringey O
Harrow O
Havering O
Hillingdon O
Hounslow O
Kingston upon Thames O
Merton O
Newham O
Redbridge O
Richmond upon Thames O
Sutton O
Waltham Forest O


You’ll then need to edit the borough names here so that they correspond to those in the workbook, e.g., trim Royal Borough of Kensington and Chelsea, and keep an eye on Hammersmith and Fulham. (Best practice: copy one instance of each of those two from the Row data into the lookup table.) Copy the table somewhere into the workbook, and call it, say, InnerOuter. Then once you’ve put Row, Column, and Value in place, say with the Median Annual data, write this formula in D2 (after naming this new table column I/O):


Tables being tables, the formula automatically writes itself down the D column (the [@Row] syntactical bit is how Tables refer to the value in the Row field on the formula’s own row, instead of the standard A2, for example. They’re called structured references, and aren’t quite a day at the beach; see a Microsoft discussion here).
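In effect the D2 formula performs a lookup (presumably a VLOOKUP) of each [@Row] borough against the two-column InnerOuter table; in Python terms, with an excerpt of the lookup table above:

```python
# An excerpt of the InnerOuter lookup table from the post.
inner_outer = {"Greenwich": "I", "Hackney": "I", "Bexley": "O", "Newham": "O"}

def io_for(borough):
    """What the lookup formula returns for one Row value; a borough
    missing from the table would surface as #N/A in Excel."""
    return inner_outer.get(borough, "#N/A")

print(io_for("Greenwich"), io_for("Newham"))
```

That #N/A fallback is precisely why the borough spellings in the lookup table must match the workbook’s to the letter.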

Once you’ve fused the data thusly, e.g.:


(the formatting is basic Table defaulting at work)

You can pivot table this take, for one:

Row Labels: Column

Column Labels: I/O

Values: Value (Average, and formatted as you see here)


Remember those are median house prices cross-tabbed by year and Inner/Outer values.

I live in Outer London, by the way. Is that why Outer prices are so much lower?

$5,000,000,000 Worth of Salaries: The Maryland State Worker Data

27 Mar

Transparency, it’s wonderful; but when the sun trains its rays on government data full on, sometimes the red faces you see out there aren’t sunburned, but rather embrowned with embarrassment.

Point your TomTom toward the sovereign state of Maryland somewhere in America’s Mid-Atlantic in order to see what I mean. For some time now, the state has made its employee salary data available to inquiring minds – and by employee name. And that means if your neighbor’s, or friend’s, or cousin’s check has the governor’s name on it, you know exactly how much they make. Will you ever look at them the same way again?

In any event, all that illumination, or incrimination, shines right here, on the Baltimore Sun’s site:

(Click the “For this data in CSV form, click here.” link.)

The workbook will run roughshod over 12.8 MB of your hard drive, but that’s what 135,000 employee records will do to a file.

In any case, once there, you’ll find the data quality pretty good, but not flawless. For one thing, the first two data rows clearly aren’t fit for purpose; row 3 in particular seems to have played host to some manner of dummy data, as all its cells report some variant of the 123456789 sequence. You need then to delete these mislaid rows, though you probably didn’t need me to tell you as much. I’d also probably delete the A column (field-named orig-sheet), which appears to share sheet-identifying codes that can’t imaginably advance our narrative intents.

But once you get past the quotidian but necessary column auto fits, it gets a little more exciting. Interested as I was in Maryland’s salaries for its university employees, I pushed the On button for a pivot table, beginning with something like this:

Filter: organization-name (select the STATE UNIVERSITIES & COLLEGES item)

Row Labels: organization-subtitle


What’s wrong with this picture? Hint: We’ve seen it before, and now we’ve seen it again. But before the suspense gets intolerable, I’ll let on: it’s those two entries for Towson University – in plain, apparent violation of the pivot table dictum that field items shipped to the Row Label area can be listed only once. And in fact, the dictum holds; note the discrepant lengths of the Towsons, intimating loudly that some of the data here have been superfluously spaced. And that’s true – and for the quick fix, insert a temporary column in the data alongside the organization-subtitle field, and enter


(Your reference here would be F2 if you had deleted that A column.) Copy the above expression down the column, copy all these results, click G2, and aim a Paste Special > Values at the cell, thus imprinting the whole column with hard-coded space-lopped data. And then delete the temporary column, which can now be honorably discharged.

And while we’re at it, note the two entries for what is surely the same Frostburg State University. These variant spellings can be brought into line by selecting the G column, and issuing something on the order of this Find and Replace:


 Refresh the pivot table as it stands, and the Row Labels reappear, duly rectified.

And given the workbook’s potpourri of fields there stand to be any number of interesting investigative possibilities in there. For example, why not try to associate salary with what Maryland calls eod, or entry on duty (which I take to mean employee start date; some field headings are elaborated here, though I had to do a site search for the meaning of eod. I’m also not quite sure about Samuel Esposito, eod dated April 21, 1921 – soon to celebrate his 93rd year of service to the state, along with the coterie of staffers with eods dating back to the 50s):

Row Labels: eod

Values: ytd-gross (Average)

(The site observes that “The ‘YTD’ refers to the time specified in the database query by the state, not to the current date.” (quoted from the linked location above). I for one am construing the data, perhaps mistakenly, as the effective yearly totals for calendar 2012. After all, it’s now 2014, and the yearly results should be in by now. But note as well, by the way, the small packet of 15 records bearing 2013 dates, all but one of them time-stamped January. Because some of these report overtime earnings, I’m assuming these are live, actually-compiled data, their post-2012 eod notwithstanding.)

Be that as it may, over 10,000 dates stake the Row Labels area in our table, and so some grouping might be the parsimonious thing to do here. But my Group Selection button lets me know I Cannot Group that Selection. Why? Sort the dates and you’ll see why; six of the records date their eod 1900-01-00, a dummy, text placeholder that as a consequence stymies any attempt at grouping (all presumptively grouped data have to be up to the grouping task). I’d take the relatively easy way out here; click among those ersatz dates, sort A to Z (if you click an actual date instead in order to start the sort you’d sort by Oldest to Newest instead), and beam a row insertion between 135722 and 135723, thus casting the stand-in dates away from the data source. Then you can try, after resetting the operative pivot table range to $A$1:$P$135722 (via PivotTable Tools > Options > Change Data Source in the Data button group; again, the P may be lettered differently in your table if you’ve deleted some fields):

Row Labels: eod (group, say by year)

Values: ytd-gross (Average, currency format)

Ytd-gross (Count)


(That’s an excerpt, of course.) The correlation between year and average salary: -.584, which is not insensible; all things being equal, the longer the tenure – that is, the “smaller” the starting year – the higher the earnings. (Note: if you essay the correlation by invoking the numbers in the A and B columns of the pivot table itself, that is,


you’ll be slapped with a #DIV/0! message for your troubles. That’s because, even though the dates lining the Row Labels are full-fledged numeric data, they revert to a temporary label standing when they’re shunted to Row Labels. To realize the correlation you’ll have to copy the pivot data elsewhere.)
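In miniature, the cull-then-group routine above might look like this (dates and salaries invented, save the 1900-01-00 placeholder):

```python
# Invented records: (eod date string, ytd gross). The 1900-01-00 rows
# stand in for the six text placeholders that block date grouping.
records = [
    ("1900-01-00", 0),
    ("1965-07-01", 70000),
    ("1980-03-15", 60000),
    ("1995-11-30", 50000),
    ("2010-06-01", 40000),
]

# Cull the placeholders, then average ytd gross by eod year.
real = [(int(d[:4]), g) for d, g in records if d != "1900-01-00"]
by_year = {}
for year, gross in real:
    by_year.setdefault(year, []).append(gross)
avg = {y: sum(v) / len(v) for y, v in by_year.items()}
print(avg)
```

With the placeholders gone, every remaining value is a genuine date, and the pivot table’s Group Selection stops complaining.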

But whatever you do, just remember: the next time you visit Maryland, be nice to the 135,728 red-faced citizens there trying hard to keep a low profile. Don’t let them know what you know.



Left-Handed Complement: A Look at (Some More) Data on Handedness, Pt. 3

20 Mar

OK then; so the Left Hander Club survey data have to relent at last to the null hypothesis – that left-hander births accumulate in their own time, owing no statistical fealty to the months in which they come to term. If that be so, it might then behoove us to ask about alternative data sets which might enable us to pose the birth month/handedness question anew.

And I think I’ve found one. The Sean Lahman baseball database, a free (but he’ll be happy to receive your contribution) downloadable must-have for devotees of the national pastime (and please indulge that Yankee jingoistic blurt), has just the parameters we want on offer, in its master-listing of what purports to be just about every player who ever whacked the dirt off his cleats for a major-league team. It’s here:

 handedness – Lahman data

And the data are large, if not big; close to 18,000 player names and cells packed with the biographical bits a baseball fan hankers for, among these player height, weight, birthplace, and birth month, and two handedness fields – one for the players’ batting orientation, the other for the hand with which they threw, or throw, a ball.

The Lahman population, of course, is exceedingly specialized, and as such is assuredly skewed. For one thing, all its members are male, a gender constant which in fact probably serves the data well, given our intentions for them. And names here are most obviously not self-selected; neither Sean Lahman nor the players themselves had the minutest foreknowledge about the research interest that brought us to them.

Now there’s that multiple-handedness-fields matter to which I pointed a paragraph ago. The data record what I’m calling the players’ batting orientation in one of the fields, and their throwing hand in the other, and it is throwing that stands as the more primeval or “natural” proclivity. In fact 6.6% of the players for whom the data are available have received the code B in the Bats field, that letter standing for a player who can hit either right or left-handedly, depending in turn on the throwing hand of the pitcher whom they’re currently facing (I’ll short-shrift the technicalities, but it’s easier for a left-handed hitter to contend with a right-handed pitcher and vice versa). It’s clear that these players – known in the trade as switch-hitters – have tutored themselves into this duality, but I can’t think of any player who has wilfully re-schooled his throwing orientation.  Hence, I’d work with the throwing hand data.

But again, let’s not take our eye off the ball. We’re looking toward the Lahman list in order to learn something about handedness and its affinities – if any – with birth month, and it appears to me that its entries – in virtue of their number, and of the population’s unawareness of our plans for the data – are better suited to the task than those of the Left Handers Club survey. So let’s look.

Try this pivot table:

Row Labels:  birthMonth

Column Labels: throws

Values: throws (Count; the data are textual). Click PivotTable Tools > Show Values As > % of Row Total.

After filtering the 800 or so blank records (a process set in motion by clicking both Filter buttons – the ones alongside the Column Labels and Row Labels legends), I get


And another skew issue implores our attention here. If lefthanders amount to about 10% of the general American population (although ballplayers in the US are increasingly foreign-born), it’s clear that major-league baseball players, on the other hand (felicitous bit of word play there), are about twice as left-inclined as everyone else. But again, that tilt, incontestable as it is, isn’t quite the issue. Our concern, again, is with inter-month percentages; and the player numbers above, quite apart from their departure from nationwide percentages, don’t appear to rise to the standard levels of significance. That is, we seem to have come face-to-face with the null hypothesis again – and we’ve blinked.
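For the record, % of Row Total just divides each cell by its row’s sum; with invented counts for a single birth-month row:

```python
# Invented throws counts for one birth-month row of the pivot table.
counts = {"L": 180, "R": 820}

row_total = sum(counts.values())
pct_of_row = {hand: n / row_total * 100 for hand, n in counts.items()}
print(pct_of_row)
```

Each row thus sums to 100%, which is what lets us compare the lefty share month against month irrespective of how many players each month contributes.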

The overrepresentation of lefthanders here seems to be due to the belief – not entirely groundless – that lefthanded hitters are advantaged, in view of the preponderance of right-handed pitchers in the game (though that edge appears to be blunting; see this citation in the Wall Street Journal of my little study of the left-right batter data). But that is a recruitment issue, one that speaks not at all to the birth-month question.

It is noteworthy, if only in a subsidiary sense, that distributions of handedness seem to have demonstrated some longitudinal movement, as per this pivot table:

Row Labels:  birthYear (grouped in intervals of 10 years)

Column Labels: throws

Values: throws (Count)


Again, the fairly dogged ascending slope in lefthander percentages owes its shape to influences other than birth month (the apparent fall-back in the 1990-1999 lefthanded numbers may be nothing but an artifact of the smallness of its bin – 106 names).

So if you ask me – though I can’t recall that you did – birth months and left-handedness have next to nothing in common. And I’ll dare to say the same about birth months and right-handedness.

Left-Handed Complement: A Look at Data on Handedness, Part 2

13 Mar

Before I screw up the temerity to forward a new and different set of handedness data in your direction, a few residual thoughts on the data under our current consideration might be worth thinking through.

The Lefthander Birthdates Analysis sheet charts a number of data series in the interests of promoting birth-total comparisons, including therein a Flat series that presumably draws a baseline of birth-month expectations defaulting to a “chance”-driven birth distribution. That series simply divides 100% by 12, thus granting an 8.3% birth share to each month; but of course the variable day allotment of the months should manufacture a set of variable fractions instead. Each month should have submitted its days to a divisor of 365.25, the approximate average duration of each year (there’s that leap year in there); and so the 31-day months would chance-predict an 8.49% birth proportion, 30-day entries would figure to 8.21%, and the 28.25-day February predicts 7.73%. It thus seems to me, then, that the baseline is a wavy one. (Note also that the chart’s country birth percentages emanate from governmentally-compiled, not Lefthanders Club survey data).
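Those day-weighted baseline figures check out readily (365.25-day year and 28.25-day February, as above):

```python
# Expected birth share per month if births fell strictly by day count.
AVG_YEAR = 365.25   # leap years averaged in

share = {days: round(days / AVG_YEAR * 100, 2) for days in (31, 30, 28.25)}
print(share)   # 31-day months, 30-day months, and February
```

A flat 100%/12 line, by contrast, would hand every month 8.33% and flatten away those differences.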

Moreover, I would allow that the Daily Analysis sheet – a breakout of respondent birth dates by each of a year’s potential 366 days – could have, and probably should have, been entrusted to a pivot table:

Row Labels:  Date (Grouped by days only)

Values: Date (count)

Because these Daily Analysis dates seem to have been appropriated from the text-populated MonthDay field in the birthday-dates master sheet, the days can’t be grouped, for example.

In any case, the Left Hander numbers declare an August plurality, though consecutive months plot a rather jittery trajectory (this is my pivot table):


And what could all of the above mean? Remember the survey’s raison d’etre: to associate the incidence of left-handedness (however that trait might be denoted) with certain birth months. Granted, I’m not a neurologist, but if handedness is in fact month-bound I’m at a bit of a loss as to how to reconcile the oscillation in handedness percentages across adjacent months, and on the face of it the notion that manual preferences should somehow exhibit seasonality sounds like something of a reach – with either hand. Do we want to run with the proposition that children born in late August experience a stepped-up gestational likelihood they’ll emerge lefthanded relative, say, to babies born in the first week of September? And if so, why?

But there are additional considerations. Is the Lefthander Club survey sample skewed? Probably. After all, its respondents were self-selectees, the sort of volunteer-led cohort that bodes badly for samples. But why, on the other hand, should self-selection even matter here? Why should a lefthanded respondent’s decision to respond somehow tie itself to the month in which she or he was born? Skewed samples can cause all kinds of problems, but only when the skew pushes directly at the theme with which the survey is concerned. Asking a readership to complete a questionnaire on political party affiliation might inspire the members of one party to tick those boxes more eagerly than the other. Indeed, a given readership itself might very well be skewed (think of the Literary Digest 1936 presidential survey debacle) to begin with. But what selection contaminants could befoul a handedness-by-month survey? Why should the August-born be prepared to assert their birth month with a conspicuous, invidious ardor? What’s in it for them?

The point is that not every variable impacts every equation, and the default premise grounding the lefthander survey – that respondents have no particular reason to extol this or that birth month, along with other conceivable sources of skewedness – for example, a greater willingness of women to participate – shouldn’t alloy the birth-month numbers.

And so we might be ready to declare the survey sample functionally random. But there is a complication, apart from the necessary considerations of sample size (and it does appear as if the monthly variations are statistically significant, at least at first glance), one that was brought to my attention by Chris McManus, an authority on handedness who teaches at University College London (he’s also cited by the Left-Handers Club newsletter).

If you return to the Daily Analysis sheet, row 226 looms large:

That’s August 13th – which happens to be International Left Handers Day, the day appearing on the birth certificates of 1.04% of the American respondents and .94% of the British submitters. This, then, appears to exemplify a case of self-selection that does contribute to a meaningful skew, and Professor McManus adds that once the August effect is chased from the data the remaining months fail to comport significantly with handedness.

But I still haven’t said anything about that alternative data set. I think we need a third installment.

Left-Handed Complement: A Look at Data on Handedness, Part 1

6 Mar

I’m left-handed, sort of, predilection enough to qualify myself for a subscription to the Left-Handers Club newsletter, a miscellany of products and disquisitions about the minority group whose members belong to all the other ones, too. (OK; I’m an in-betweener – writes, eats, and dribbles a basketball with his left hand, throws, bowls, and actually shoots a basketball with his right.)

It was a recent piece in the newsletter that raised at least one of my eyebrows (can’t recall if it was the left or right one), and got me to thinking about possible rejoinders to its central claim – namely, that handedness somehow associates itself with birth month. A survey of its left-leaning readers, respondents gathered via self-selection, measured notable left-handed disproportions for August and July, with smaller precipitates stacking in January and February. Moreover, the survey has nicely packaged its data in spreadsheet form, complete with none other than a pivot table to season the analysis. Let’s devote this installment to the workbook itself, after which we hope to think about the results and position a supplementary, alternative data set against the thesis. You can grab the workbook with either hand here:

 Lefthander Birthdates

Note first, and for what it’s worth, the Line field numbering the 3000+ survey records in apparent chronological sequence of submission, as counted off in the neighboring Entry field. Again for what it’s worth, note that the sequence is not unrelievedly consecutive; dozens of gaps poke at it throughout, and so the 3148 records top out in the Line field at number 3348. Does any of this matter? Probably not here, but it could, depending on where you’d want to take the data. If, for example, you wanted to group the records by Line number in bins of 100 each, you’d pull up short of 100 in nearly every bin. (I’m not sure what you’d do with the Entry data themselves and their five survey submission days flanked by the 9th and 13th of January, but that’s a quibble.)

More than a quibble, though, could be registered over the piecing together of the DATEVALUE expressions in the Date field. We spent some time on DATEVALUE in my February 21, 2013 post, but that introductory note doesn’t speak to the data contingencies placed before us here.

DATEVALUE remixes a collection of (basically) text data into actual, quantity-bearing dates, but it’s quirkier than I would have supposed. For example(s):



=DATEVALUE("Aug/6/2013")

=DATEVALUE("6/Aug/2013")  (as per European date syntax)

will both yield 41492, the numeric identity of August 6, 2013. Format the container cell in Date mode and the date obtains. (One of those aforementioned quirks even allows you to enter Augu and the like in the expression without impairment, so long as the month name invokes at least its first three characters. The names whip up date values even if the attending year and day data are numeric, and not text; see the discussion immediately below in that regard. I told you it was quirky.)
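If you want to verify that 41492 outside the spreadsheet, an Excel serial number is just a day count from the 1900 date system’s effective epoch of December 30, 1899; a quick Python sketch (assuming that standard epoch):

```python
from datetime import date

# Excel (1900 date system) stores dates as serial numbers counted from an
# effective epoch of 1899-12-30, which folds in Excel's fictitious
# February 29, 1900. On that count, August 6, 2013 lands on 41492.
def excel_serial(d: date) -> int:
    return (d - date(1899, 12, 30)).days

print(excel_serial(date(2013, 8, 6)))  # 41492
```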

The birthdate workbook, however, concatenates, or combines, three cell references instead, e.g. cell H9:


The references signify Day, Month, and Year respectively (as per the Left-Handers organization’s UK-based date protocol). The ampersands join all the references into a single expression, along with the hyphens sidled by quotes in our case. The result is captured in H9.


But the formula raises some syntactical questions. For one thing, the + sign grazing E9 is superfluous – and so, for that matter, are the quote-surrounded hyphens. Had the formula read


7/25/1958 would have worked itself out just the same. Edit the formula that way if you want to see what I mean. (Note by the way that the hyphens turn into slashes, because the H-column cells had received this format:


in which the slashes rule.)

Now check out the formulas in the J and K columns, each directing a VLOOKUP to a table in the Countries tab. The plan here is to additionally locate each respondent’s country of birth inside what are labelled the Zone and Hemisphere rubrics, and deliver these to the J and K columns. Note the interjection in all cases of the FALSE term (or argument, as it’s officially called), e.g. cell J9:


FALSE is a compensatory means for negotiating a lookup table whose first column – the one comprising the values to be looked up (be they text or numeric) – isn’t sorted ascendingly. FALSE means that the lookup is to confine itself precisely to the contents of that first column, and not entertain any value that squeezes between two rungs in the column. Thus you need FALSE in order to work with this lookup table of students and their grades:












Once you sort column one, though, you can drop FALSE from the formula – truly (with the caveat that an approximate lookup returns the nearest smaller entry, rather than #N/A, should the looked-up value be missing from the column).

Thus if we sort the first column in the data set on the Countries sheet – and then, in the service of additional formulaic understatement, name the range there (let’s call it Regions), our lookup statement slims down to something like this (in J9, for instance):



The expressions in the Hemisphere field imitate the above formula, save one departure – the 2 is replaced by a 3, the column number in which Hemisphere info awaits looking up. But are the lookups as we found them “wrong”, then? Not quite – they all issue irreproachably correct results, after all. But formula efficiency is better served by the rewrite.
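Stripped of spreadsheet particulars, the FALSE-versus-no-FALSE distinction can be mocked up in a few lines of Python (a sketch only; the country rows below are invented for illustration, not drawn from the Countries tab):

```python
from bisect import bisect_right

# A toy stand-in for the Countries lookup table: the first column holds
# the looked-up values, the next two play the Zone and Hemisphere roles.
# (These rows are illustrative, not taken from the workbook.)
table = [
    ("Australia", "Oceania",  "Southern"),
    ("Canada",    "Americas", "Northern"),
    ("UK",        "Europe",   "Northern"),
]

def vlookup_exact(value, table, col):
    """VLOOKUP with FALSE: only a precise first-column match will do."""
    for row in table:
        if row[0] == value:
            return row[col]
    return "#N/A"

def vlookup_approx(value, table, col):
    """VLOOKUP with TRUE or omitted: presumes a first column sorted
    ascending, and settles for the largest entry <= the lookup value."""
    keys = [row[0] for row in table]
    i = bisect_right(keys, value) - 1
    return table[i][col] if i >= 0 else "#N/A"

print(vlookup_exact("UK", table, 1))       # Europe
print(vlookup_approx("UK", table, 1))      # Europe - same answer, sorted table
# The catch in dropping FALSE: a value that isn't there doesn't come back
# as #N/A; it silently matches the nearest smaller entry instead.
print(vlookup_approx("France", table, 1))  # Americas (Canada's row)
```

Which is why approximate lookups only behave when every looked-up value actually appears in that sorted first column.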

Next, push over to cell M9:



What’s this formula striving to do? This: reap four-character yields down the M column by concatenating the respective contents of any given row’s I and E cells (in that order). Thus M9 returns 0725 – ultimately =I9&E9 – but because I9 offers but one character, a 7, the formula tacks on a zero (remember that the result is textual and possesses no direct quantitative standing) in order to pad the yield to four characters; and should the day value in E likewise be single-digited, the formula rolls in a zero there, too.
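The four-character scheme boils down to a pad-then-concatenate maneuver; here’s a Python sketch of the idea (not the workbook’s actual formula):

```python
def pad_concat(month, day):
    # The M-column idea in miniature: pad each piece to two characters,
    # then string them together into a four-character text yield.
    return str(month).zfill(2) + str(day).zfill(2)

print(pad_concat(7, 25))   # "0725"
print(pad_concat(12, 3))   # "1203"
```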

But again, the formula could be pared, first by formatting both the E and I columns thusly:



As advertised, you’re looking at a custom number format comprising a two-digit 00 code. That code supplies a 0 alongside any single-digit number, but if the number already features two digits, the format simply leaves the value alone.

Once the E and I values are so formatted the data will take on this appearance:

Then get back to M9 and enter:


And copy down the column. I get, in excerpt:

But that isn’t working, is it – not if we demand four-digit entries in each and every cell. It isn’t working because the & sign, which concatenates values in specified cells, demotes the values into text, and the 00 format only attaches to bona fide values (keep in mind that combining a 7 and a 25 here isn’t an act of addition; rather, it’s stringing the 7 and 25 together). The way out? Try this:


And the 725 acquires numeric status. Then format the M column:



Four zeros this time, because you’re aiming for four-digit numbers.
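The text-versus-number round trip can likewise be rehearsed in Python, with its format function standing in for the 0000 custom code (a sketch; on the Excel side the coercion might be an added 0 or a double unary minus):

```python
# Concatenation (Excel's &) yields text: "7" & "25" is the string "725",
# which a 0000 number format can't touch.
month, day = 7, 25
as_text = str(month) + str(day)   # "725" - text, so the format ignores it
as_number = int(as_text)          # 725  - now a bona fide value again
print(format(as_number, "04d"))   # "0725", the 0000 format's handiwork
```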

Got all that? This stuff is likely to be on the mid-term.

