Remember the Olympics, that quadrennial competition between the world’s elite corporate sponsors? McDonald’s-by-the-Thames, Lebron and Kobe on the same team, scandal on the badminton court, wholesome sublimated paint-ball nationalism (you know, my guy ran faster than your guy, especially if you live in Jamaica) and all that?
It was in all the papers. In any case, it seemed to me that the topical concern of a previous post (August 24) – namely, the modal incidence of August-born baseball players in the major leagues – could be naturally transposed to the Olympic context. Does birth month in any way broaden or constrict an athlete’s (Summer) Olympic prospects, or is the correlation spurious, or scant?
Postscript and Prelude
But before we put the lens to the data, you recall that the August birth-month skew among ballplayers is typically ascribed to the July 31 birthday cut-off for kids enrolling in any given year’s Little League cohort. That closing date makes the August-born oldest in their admitting class, so to speak, and thus bigger stronger, more adroit in the baseball arts, and thus possessive of a sustained developmental edge.
In this regard, and you may have missed it, reader James Zhang’s September 9 comment thumb-tacked to the baseball birth-month post asks a most salient, purloined-letter of a question: what if the August birth-month differential could be more properly laid not to Little League registration policies, but rather to the simple demographic possibility that more children are born in August?
A very good question, Mr. Zhang, and it turns out there’s something to it. By peeling off birth data from a melange of sources on the net (including Vital Statistics of the US) and grafting these to equivalent-year stats from Sean Lahman’s baseball database, I got, by way of a sample:
Birth Month |
US August Births |
Baseball August Births |
1941 |
9.13% |
11.21% |
1950 |
9.15% |
14.39% |
1960 |
9.21% |
12.23% |
1965 |
8.93% |
10.34% |
1970 |
8.86% |
7.78% |
1975 |
8.90% |
11.11% |
The baseball numbers aren’t enormous (averaging about 168 players per year), but the aggregate effect is pretty evident, particularly when keyed to the 8.48% “chance”-driven expectation for August births. The conclusion then is multivariate: both national and ballplayer August births better chance likelihoods, but it is the baseball effect that pulls farther away from default expectations. In any case, a birth-month tug does seem to be in force here, something to be kept in mind.
But what then about the Olympics and its vastly more heterodox, Little League-less athlete pool? What sort of birth-month association, if any, marks the data?
Let’s see. Our data source comprises the Guardian’s spreadsheet of Olympic 2012 medal winners saved to Excel and made available here:
All London 2012 athletes and medal data
(N.B. These medal data are incomplete, the workbook having been compiled before event outcomes were realized in the Games’ last three days, e.g., basketball and the marathon. I think it would be a reach to insist that the deficit degrades our analysis, but in any event John Burn-Murdoch, who works on such projects for the Guardian, tells me the completed rendition should be out there soon. For a 2008 census of National Basketball Association player birth-months see
http://statsheet.com/blog/applying-outliers-does-birth-month-matter-in-basketball ).
Some Notes About the Book
Before we broach our question – whether a birth-month effect in any way correlates with Olympic medal success – a few words about the workbook’s design would be instructive.
- Note the coupling of first and last athlete names in column A – not the by-the-book way of handling such data, particularly if you need to sort these by last name. That task can be dealt with even as the data stand, but the task can get messy. On the other hand of course, the names are extraneous to our purposes right now – but maybe not next time.
- The Age field could be regarded as an ever-so-slight redundancy, given the companion Date of Birth field and the latter’s enabler role in helping derive real-time athlete ages. Example: if you want to learn A Lam Shin’s right-now age, replace the 25 in cell C2 and write
=(TODAY()-H2)/365.25 (the denominator represents the average number of days per year).
Her age result – 25.97 as of today – will change daily. Then copy down the C column. Just remember to send Ms. Shin a birthday card. (Of course, this formula can fall victim to its own volatility. View the workbook in ten years and you probably won’t need to know that she’s now 36.)
- Moreover the Age Group field, impressed into Row Label service in Pivot Table 10, comprises text data, and as such cannot be further manipulated absent a concerted round of hoop-jumping. If in fact you want to group the athletes by age tranches you’re far better advised to substitute the Age field and dice the data with the Pivot Table’s Group Selection option.
- By sounding an unrelieved drone of YESes in its cells, the Medal Winner? field consigns itself to dispensability, teach us nothing. By definition, the workbook means to catalogue none other than medal winners. On the other hand of course, if you have no plans either to use these data or present the field to readers you can merely ignore it.
- Note the G, S, and B fields (Gold, Silver, and Bronze) and their fractional medal representations, the Guardian’s attempt to align overall country medal totals with the grand total of actual athlete medallists, an obeisance to team-based competitions. Thus Croatia’s Valent Sinkovic in A932 comes away with .25 of a silver medal, having competed in the Men’s Quadruple Sculls Rowing event. These apportionments do make a certain sense, but other analytical necessities may bid you to award him one, indivisible silver.
And Something Else…
Now here’s the real problem, to which you may have already altered yourself. Remember that we want to see if athlete birth-month data stack higher for this or that month, and if so, why. But unlike the Lahman baseball records that informed our baseball birth-month post, no immediate month data avails in the Olympic workbook. However, Excel’s MONTH function enables us to do what we want, and very simply:
=MONTH(cell containing date).
Thus
=MONTH(A2)
will return the value 9 (September) for Ms. Shin. Once in place, we can copy that expression down a column, and we’re done – maybe.
But look at cell H11; the date for Mr. Zielinksi exhibits a left alignment that intimates, without quite proving, that H11 contains a text entry, which simply can’t comport with MONTH, which is looking for numerical data. Remember that dates, beneath all the formatting rouge and lipstick, are numbers (see the August 20 post), and while it’s quite possible and perfectly legal to left-align a number, the anomaly in H11 hints otherwise. To clinch the point, click in any blank cell and type
=H11*2
Don’t you just love error messages?
I can’t explain why these text entries infiltrated the larger complement of workable dates but they did, and to dangle a preposition, it’s something we need to deal with. Before we break out birth months we need to see to it that all our data-to-be in fact qualify as months.
One possibility: we could sort the DOB column by Oldest to Newest and eye the bottom of the record stack. I see 21 text-formatted, faux-date entries; and while we could return to the insert-a-blank-row-above-these-records expedient (something I’ve explained in earlier posts, including August 30), thus estranging the bad apples from the usable data (they only contribute about 2% of all records, after all), we don’t have to. The reality is that we can retrieve the months from these date pretenders, and here’s how I’d to it.
Insert a column to the right of DOB and title it Birth Month. Select the new column and select Number in the drop-down menu in the Number button group:
We’ve taken that step to insure that our results here look like numbers, and not dates. Then in cell I2 write:
=IF(ISTEXT(H2),VALUE(MID(H2,4,2)),MONTH(H2))
OK – you’re entitled to an explanation of this rococo expression, though you’ve probably managed to make some fledgling sense out of it already. The ISTEXT function, written simply
=ISTEXT(cell reference)
is invariably soldered to an IF statement, and inspects the referenced cell for its data type. If the cell comprises text, one thing happens; if it’s other than text (e.g., a number), something else happens. In our case we stipulate that if the cell evinces a text content, Excel will proceed to extract the month from the cell via the nested MID function (how that happens will be detailed in a moment); if the entry is not text, that is a number, then the formula will apply MONTH to the cell’s date/number.
And how does MID work? Its three elements, or arguments as they’re known in the trade, do the following, respectively
- Identify the cell in which MID will perform its work (H2 above)
- Cites the position of the character at which MID will begin to extract characters
- Declares how many characters will be appropriated from that inception point.
Thus if I type
SPREADSHEET
in say, cell A7, and enter
=MID(A7,3,4)
in A8, I should realize the character sequence READ.
Thus for our text-date data, those data to which we want to direct MID, we see that the month segment of text-bearing cells always situates itself in characters 4 and 5 – a good thing, because had single-digit months (January through September) been conveyed in single-digit terms (9 instead of 09) we’d have had to wrap our hands around a much stickier wicket; we’d have to have written a formula that sometimes extracts 1, and sometimes extracts 2, month characters.
(Note also that the text-date cells are expressed in European date format, in which day of month precedes month, even as the other data appear in American, month-first style).
And finally, we need to brace MID with the VALUE function because our results would otherwise remain text, and we want the resulting month extraction to hold numeric status. VALUE simply mutates a number formatted as text into its quantitative equivalent:
=VALUE(A7)
would realize the value 7, had you entered a text-formatted 7 in the cell.
Now that you’re panting from all that heavy lifting, kick back and smoke ‘em if you’ve got ‘em, because the rest should be a comparative day at the beach. Copy the formula you’ve inscribed (and you need all those parentheses in place) in I2 down the I column and you should be treated to a medley of values ranging from 1 to 12, something like this:
(Ignore the decimal points if you see them; they don’t matter.) Next, let’s pivot table the months. Boot up a table and
Drag the Birth Month field to the Row Labels area.
Drag Birth Month into the Values area. Click in the Values area, and in turn click PivotTable Tools > Options > Summarize Values As > Count (the data have defaulted to Sum simply because they are numeric). You’ll see:
I’ll take a wild guess and allow that the largest birth month happens to be…August, the only one stepping up to three figures. Note in addition the no-less-striking, precipitous fall-off in the succeeding months.
Now remain in that field and click Options (if necessary) > Show Values As > % of Column Total (again, I’m directing you through the Excel 2010 interface). You should see:
August rules – again – but of course a satisfactory accounting awaits.
Now there is some evidence for a global August birth predominance, but a closer look is clearly in order. A UN international, country-by-country birth-month spreadsheet (click the small download link, but there’s some hoop-jumping required here in order to whip the data into shape) shows a July-September birth skew, but you’d have to toil to reconcile Olympic countries and team size with these data.
Note as well that if you swing the Country field into the Report Filter area and click on United States of America (athlete total here: 120) you’ll get
Still another win for August, albeit for a relatively small universe. Another remarkable stat: filter for People’s Republic of China and you’ll behold a 23.68% birth contribution from January (athlete total: 76), an extraordinary outlier I’ll leave to the Sinologists.
Sure there are other pivot table permutations to be crunched (you could break out for gender, for example, and sport), but the principal research remit is already out there: tracking and explaining the birth-month curve. An artifact of fertility trends, or the consequence of subtle, worldwide athlete recruitment protocols – or a bit of both?
Well, there’s your assignment, and I know you can handle it. It’s why, after all, you make the big money.
Leave a Reply