The Summer Games, 2012: The Birth-Month Question Redux

13 Sep

St. Pancras Station, London, Aug 8.

Remember the Olympics, that quadrennial competition between the world’s elite corporate sponsors? McDonald’s-by-the-Thames, Lebron and Kobe on the same team, scandal on the badminton court, wholesome sublimated paint-ball nationalism (you know, my guy ran faster than your guy, especially if you live in Jamaica) and all that?

It was in all the papers. In any case, it seemed to me that the topical concern of a previous post (August 24) – namely, the modal incidence of August-born baseball players in the major leagues – could be naturally transposed to the Olympic context. Does birth month in any way broaden or constrict an athlete’s (Summer) Olympic prospects, or is the correlation spurious, or scant?

Postscript and Prelude

But before we put the lens to the data, you’ll recall that the August birth-month skew among ballplayers is typically ascribed to the July 31 birthday cut-off for kids enrolling in any given year’s Little League cohort. That closing date makes the August-born oldest in their admitting class, so to speak, and thus bigger, stronger, more adroit in the baseball arts, and thus possessed of a sustained developmental edge.

In this regard, and you may have missed it, reader James Zhang’s September 9 comment thumb-tacked to the baseball birth-month post asks a most salient, purloined-letter of a question: what if the August birth-month differential could be more properly laid not to Little League registration policies, but rather to the simple demographic possibility that more children are born in August?

A very good question, Mr. Zhang, and it turns out there’s something to it. By peeling off birth data from a melange of sources on the net (including Vital Statistics of the US) and grafting these to equivalent-year stats from Sean Lahman’s baseball database, I got, by way of a sample:

Birth Year    US August Births    Baseball August Births
1941          9.13%               11.21%
1950          9.15%               14.39%
1960          9.21%               12.23%
1965          8.93%               10.34%
1970          8.86%               7.78%
1975          8.90%               11.11%

The baseball numbers aren’t enormous (averaging about 168 players per year), but the aggregate effect is pretty evident, particularly when keyed to the 8.49% “chance”-driven expectation for August births. The conclusion then is multivariate: both national and ballplayer August births beat chance likelihoods, but it is the baseball effect that pulls farther away from default expectations. In any case, a birth-month tug does seem to be in force here, something to be kept in mind.
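If you’d rather check that arithmetic than take my word for it, here’s a minimal Python sketch (Python is my choice for illustration; nothing in the workbooks requires it). The percentages are the ones in the table above; the 365.25-day average year is my assumption about how the chance figure is figured.

```python
# August's "chance" share of births: 31 days out of an average-length year.
AUGUST_DAYS = 31
DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included

chance_pct = round(100 * AUGUST_DAYS / DAYS_PER_YEAR, 2)

# (birth year, US August births %, baseball August births %) from the table above
rows = [
    (1941, 9.13, 11.21),
    (1950, 9.15, 14.39),
    (1960, 9.21, 12.23),
    (1965, 8.93, 10.34),
    (1970, 8.86, 7.78),
    (1975, 8.90, 11.11),
]

# How far each figure sits above (or below) the chance expectation
for year, us_pct, mlb_pct in rows:
    print(year, round(us_pct - chance_pct, 2), round(mlb_pct - chance_pct, 2))

# Simple, unweighted averages across the six sampled years
us_avg = round(sum(r[1] for r in rows) / len(rows), 2)
mlb_avg = round(sum(r[2] for r in rows) / len(rows), 2)
print(chance_pct, us_avg, mlb_avg)
```

Note that the averages are unweighted, so a thin baseball year counts as much as a fat one; weighting by player counts would need the per-year totals, which the table doesn’t carry.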

But what then about the Olympics and its vastly more heterodox, Little League-less athlete pool? What sort of birth-month association, if any, marks the data?

Let’s see. Our data source comprises the Guardian’s spreadsheet of Olympic 2012 medal winners saved to Excel and made available here:

All London 2012 athletes and medal data

(N.B. These medal data are incomplete, the workbook having been compiled before event outcomes were realized in the Games’ last three days, e.g., basketball and the marathon. I think it would be a reach to insist that the deficit degrades our analysis, but in any event John Burn-Murdoch, who works on such projects for the Guardian, tells me the completed rendition should be out there soon. For a 2008 census of National Basketball Association player birth-months see

http://statsheet.com/blog/applying-outliers-does-birth-month-matter-in-basketball ).

Some Notes About the Book

Before we broach our question – whether a birth-month effect in any way correlates with Olympic medal success – a few words about the workbook’s design would be instructive.

  • Note the coupling of first and last athlete names in column A – not the by-the-book way of handling such data, particularly if you need to sort these by last name.  That task can be dealt with even as the data stand, but the task can get messy. On the other hand of course, the names are extraneous to our purposes right now – but maybe not next time.
  • The Age field could be regarded as an ever-so-slight redundancy, given the companion Date of Birth field and the latter’s enabler role in helping derive real-time athlete ages. Example: if you want to learn A Lam Shin’s right-now age, replace the 25 in cell C2 and write

=(TODAY()-H2)/365.25  (the denominator represents the average number of days per year).

Her age result – 25.97 as of today – will change daily. Then copy down the C column. Just remember to send Ms. Shin a birthday card. (Of course, this formula can fall victim to its own volatility. View the workbook in ten years and you probably won’t need to know that she’s now 36.)

  • Moreover the Age Group field, impressed into Row Label service in Pivot Table 10, comprises text data, and as such cannot be further manipulated absent a concerted round of hoop-jumping. If in fact you want to group the athletes by age tranches you’re far better advised to substitute the Age field and dice the data with the Pivot Table’s Group Selection option.
  • By sounding an unrelieved drone of YESes in its cells, the Medal Winner? field consigns itself to dispensability, teaching us nothing. By definition, the workbook means to catalogue none other than medal winners. On the other hand of course, if you have no plans either to use these data or present the field to readers you can merely ignore it.
  • Note the G, S, and B fields (Gold, Silver, and Bronze) and their fractional medal representations, the Guardian’s attempt to align overall country medal totals with the grand total of actual athlete medallists, an obeisance to team-based competitions. Thus Croatia’s Valent Sinkovic in A932 comes away with .25 of a silver medal, having competed in the Men’s Quadruple Sculls Rowing event.  These apportionments do make a certain sense, but other analytical necessities may bid you to award him one, indivisible silver.
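The age formula in the notes above, =(TODAY()-H2)/365.25, translates readily enough outside Excel. What follows is a rough Python equivalent, offered as a sketch only; the function name and the sample birth date are hypothetical and don’t come from the workbook.

```python
from datetime import date

def age_in_years(dob: date, as_of: date) -> float:
    """Mirror =(TODAY()-H2)/365.25: elapsed days over the average year length."""
    return (as_of - dob).days / 365.25

# Hypothetical birth date, for illustration only (not taken from the workbook)
sample_dob = date(1990, 1, 1)
print(round(age_in_years(sample_dob, date.today()), 2))
```

Passing the as-of date explicitly, instead of always reading today’s date, is one way around the volatility problem: fix it at the Games’ dates and the ages stop drifting.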

And Something Else…

Now here’s the real problem, to which you may have already alerted yourself. Remember that we want to see if athlete birth-month data stack higher for this or that month, and if so, why. But unlike the Lahman baseball records that informed our baseball birth-month post, no immediate month data avails in the Olympic workbook. However, Excel’s MONTH function enables us to do what we want, and very simply:

=MONTH(cell containing date).

Thus

=MONTH(H2)

will return the value 9 (September) for Ms. Shin. Once in place, we can copy that expression down a column, and we’re done – maybe.

But look at cell H11; the date for Mr. Zielinksi exhibits a left alignment that intimates, without quite proving, that H11 contains a text entry, which simply can’t comport with MONTH, which is looking for numerical data. Remember that dates, beneath all the formatting rouge and lipstick, are numbers (see the August 20 post), and while it’s quite possible and perfectly legal to left-align a number, the anomaly in H11 hints otherwise. To clinch the point, click in any blank cell and type

=H11*2

Don’t you just love error messages?

I can’t explain why these text entries infiltrated the larger complement of workable dates but they did, and to dangle a preposition, it’s something we need to deal with. Before we break out birth months we need to see to it that all our data-to-be in fact qualify as months.

One possibility: we could sort the DOB column by Oldest to Newest and eye the bottom of the record stack. I see 21 text-formatted, faux-date entries; and while we could return to the insert-a-blank-row-above-these-records expedient (something I’ve explained in earlier posts, including August 30), thus estranging the bad apples from the usable data (they only contribute about 2% of all records, after all), we don’t have to. The reality is that we can retrieve the months from these date pretenders, and here’s how I’d do it.

Insert a column to the right of DOB and title it Birth Month. Select the new column and select Number in the drop-down menu in the Number button group:

We’ve taken that step to ensure that our results here look like numbers, and not dates. Then in cell I2 write:

=IF(ISTEXT(H2),VALUE(MID(H2,4,2)),MONTH(H2))

OK – you’re entitled to an explanation of this rococo expression, though you’ve probably managed to make some fledgling sense out of it already. The ISTEXT function, written simply

=ISTEXT(cell reference)

is invariably soldered to an IF statement, and inspects the referenced cell for its data type. If the cell comprises text, one thing happens; if it’s other than text (e.g., a number), something else happens. In our case we stipulate that if the cell evinces a text content, Excel will proceed to extract the month from the cell via the nested MID function (how that happens will be detailed in a moment); if the entry is not text, that is a number, then the formula will apply MONTH to the cell’s date/number.

And how does MID work? Its three elements, or arguments as they’re known in the trade, do the following, respectively:

  1. Identify the cell in which MID will perform its work (H2 above)
  2. Cite the position of the character at which MID will begin to extract characters
  3. Declare how many characters will be appropriated from that inception point.

Thus if I type

SPREADSHEET

in say, cell A7, and enter

=MID(A7,3,4)

in A8, I should realize the character sequence READ.
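For anyone who thinks in code and not cells, MID’s behavior can be approximated with a string slice. This is a small Python sketch, not Excel’s own implementation; the helper name mid is mine.

```python
def mid(text: str, start: int, length: int) -> str:
    """Rough stand-in for Excel's MID: start is 1-based, length counts characters."""
    return text[start - 1 : start - 1 + length]

print(mid("SPREADSHEET", 3, 4))  # READ
```

The only wrinkle is the off-by-one: Excel counts characters from 1, Python from 0, hence the start - 1.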

Thus for our text-date data, those data to which we want to direct MID, we see that the month segment of  text-bearing cells always situates itself in characters 4 and 5 – a good thing, because had single-digit months (January through September) been conveyed in single-digit terms (9 instead of 09) we’d have had to wrap our hands around a much stickier wicket; we’d have to have written a formula that sometimes extracts 1, and sometimes extracts 2, month characters.

(Note also that the text-date cells are expressed in European date format, in which day of month precedes month, even as the other data appear in American, month-first style).

And finally, we need to brace MID with the VALUE function because our results would otherwise remain text, and we want the resulting month extraction to hold numeric status.  VALUE simply mutates a number formatted as text into its quantitative equivalent:

=VALUE(A7)

would realize the value 7, had you entered a text-formatted 7 in the cell.
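Pulling the pieces together, the logic of =IF(ISTEXT(H2),VALUE(MID(H2,4,2)),MONTH(H2)) can be sketched in Python as follows. The sample entries are hypothetical, and the sketch assumes what the post assumes: text “dates” are day-first with a two-digit month in characters 4 and 5.

```python
from datetime import date

def birth_month(cell) -> int:
    """Sketch of =IF(ISTEXT(H2),VALUE(MID(H2,4,2)),MONTH(H2))."""
    if isinstance(cell, str):       # ISTEXT: a date pretender stored as text
        return int(cell[3:5])       # VALUE(MID(cell,4,2)): characters 4 and 5, as a number
    return cell.month               # MONTH: a genuine date

# Hypothetical entries: one real date, one day-first text "date"
print(birth_month(date(1988, 9, 14)))
print(birth_month("14/09/1988"))
```

Both calls should agree on September, which is the whole point: two data types, one month column.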

Now that you’re panting from all that heavy lifting, kick back and smoke ‘em if you’ve got ‘em, because the rest should be a comparative day at the beach. Copy the formula you’ve inscribed (and you need all those parentheses in place) in I2 down the I column and you should be treated to a medley of values ranging from 1 to 12, something like this:

(Ignore the decimal points if you see them; they don’t matter.) Next, let’s pivot table the months. Boot up a table and

Drag the Birth Month field to the Row Labels area.

Drag Birth Month into the Values area. Click in the Values area, and in turn click PivotTable Tools > Options > Summarize Values By > Count (the data have defaulted to Sum simply because they are numeric). You’ll see:

I’ll take a wild guess and allow that the largest birth month happens to be…August, the only one stepping up to three figures. Note in addition the no-less-striking, precipitous fall-off in the succeeding months.

Now remain in that field and click Options (if necessary) > Show Values As > % of Column Total (again, I’m directing you through the Excel 2010 interface). You should see:

August rules – again – but of course a satisfactory accounting awaits.
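The two pivot table views above, Count and % of Column Total, boil down to a tally and a division. Here’s a minimal Python sketch of that breakout; the month values are made up for illustration and are not the Olympic data.

```python
from collections import Counter

def month_breakdown(months):
    """Return {month: (count, percent of total)}, like Count and % of Column Total."""
    counts = Counter(months)
    total = sum(counts.values())
    return {m: (counts[m], round(100 * counts[m] / total, 2)) for m in sorted(counts)}

# Hypothetical month values, standing in for the Birth Month column
sample = [8, 8, 1, 3, 8, 12, 1, 8]
for month, (count, pct) in month_breakdown(sample).items():
    print(month, count, pct)
```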

Now there is some evidence for a global August birth predominance, but a closer look is clearly in order. A UN international, country-by-country birth-month spreadsheet (click the small download link, but there’s some hoop-jumping required here in order to whip the data into shape) shows a July-September birth skew, but you’d have to toil to reconcile Olympic countries and team size with these data.

Note as well that if you swing the Country field into the Report Filter area and click on United States of America (athlete total here: 120) you’ll get

Still another win for August, albeit for a relatively small universe. Another remarkable stat: filter for People’s Republic of China and you’ll behold a 23.68% birth contribution from January (athlete total: 76), an extraordinary outlier I’ll leave to the Sinologists.

Sure there are other pivot table permutations to be crunched (you could break out for gender, for example, and sport), but the principal research remit is already out there: tracking and explaining the birth-month curve. An artifact of fertility trends, or the consequence of subtle, worldwide athlete recruitment protocols – or a bit of both?

Well, there’s your assignment, and I know you can handle it. It’s why, after all, you make the big money.

Birth Month and Tennis Rankings: Part 1

23 Dec

We’ve batted this ball around before, but those hacks were taken on other fields. Still, a recent (UK) Times piece by Daniel Finkelstein on birth order and its association with soccer players’ ascent to the British Premiership league returned the analytical ball to me on a different court – in this case the one earmarked for tennis.

We’ve looked at tennis, too, but with a consideration of country and age-driven breakouts of men’s tennis players – not their birth months. So I booked some time on the tennisabstract site and its current, online-sortable rankings of the male of the species, which you can copy and paste from here.

The rankings seem current indeed, by the way; an ascendant Andy Murray in the pole position attests to their recency. In search of some deep background on the matter, I Googled my way into the menstennisforums site, and its precedent discussion of the birth-month-rankings relationship (you need to join the forum, by the way; a free enrollment entitles you to limited access to its holdings). In this connection a Taiwanese contributor screen-shot this birth-month-rankings distribution for 2014 player-rankings data:

tennis1

 We see that the birth months of all ranked players skew heavily toward the first half of the year, and rather discernibly, though occupants of the top-100 exhibit a far evener natal distribution, among that far smaller sample (if in fact the cohort can be permissibly understood as a sample. A sample of what, after all?) Yet 54% of the top 500 present a first-half birth certificate, as do 55% of top-1000 position holders. The proportion for all 2221 ranked players: 56%. Something, then, seems to be at work. So what about 2016 data?

That sounds like a question we could answer. But before we give it a try, a pre-question of sorts could be posed of the activity: does it pay to bother? If the 2014 data above have been faithfully compiled – and they probably have – would much interpretational gain be realized by another look at the men’s rankings, but two years later? With a player cohort exceeding 2000, would statistical sense be served by recounting the birth month distributions?

Well, they said Clinton would win, too. Distributions change, and testing the data anew – which after all are not wholly coterminous with 2014’s player pool – is worth the try, especially since we’ve budgeted for the project (a bit of blog humor, that was).

So let’s see, starting with this pivot table (note: 13 players have no birth dates to report, and are to be filtered away throughout):

Rows: DOB (grouped by Months only)

Values: DOB (Count, then % of Running Total In, this against the DOB baseline, the only one undergirding the pivot table). Turn Grand Totals off, too.

I get:

tennis2

The running totals’ month-by-month accumulation indeed emulates the 2014 56-44 first/second-half yearly breakout, along with the respective monthly contributions to the whole. No surprises, then – but replication does have its place.
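The % of Running Total In calculation is just a cumulative sum divided by the grand total. A hedged Python sketch follows, with invented month values in place of the tennisabstract data; the function name is mine.

```python
from collections import Counter
from itertools import accumulate

def running_pct_by_month(months):
    """Cumulative share of players by birth month, January through December."""
    counts = Counter(months)
    per_month = [counts.get(m, 0) for m in range(1, 13)]
    total = sum(per_month)
    return [round(100 * running / total, 2) for running in accumulate(per_month)]

# Hypothetical months; the sixth entry of the result is the January-June share
sample = [1, 2, 2, 5, 6, 7, 9, 11]
cumulative = running_pct_by_month(sample)
print(cumulative[5], cumulative[11])
```

Reading off the June entry gives the first-half share directly, which is the figure the 56-44 comparison turns on.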

And how do our month distributions compare with the 2014 top 100, 500, and 1000? We can start by dragging Rank into the Columns area and grouping its values into bins of 100, retaining the running total effect. Isolating the first bin in the screen shot, I get:

tennis3

 Here, and unlike the 2014 figures, the first/second-half differential breaks 59-41%, comporting with the rankings’ overarching tendency, although again, of course the universe of 100 players will not mollify a statistician.

For the birth-month distribution for the top 500, group the rankings by that interval:

tennis4

Pretty much more of the same. Then group by 1000:

tennis5

The approximate 56-44 weighting runs through the data and its several granularities; and remember that the third, 2001-3000 bin, comprises only 65 players.

Now what if we isolate the contingent from the US? We’ve learned in a previous post about the August birth-month effect that seems to prefigure the career prospects of baseball players from that country. First, in view of the likely diminished US-specific aggregate that’ll sprinkle just a few numbers across the rankings I’ll remove Rank from the table, introduce a Slicer for Country and click USA, and restore Grand Totals. I’ll also tap DOB a second time for Values duty, one instance to convey the straight sums, the other to record that running column percentage. Here I get:

 tennis6

Note first of all that only 164 Americans appear among the 2087 ranked players, around 7.9% of them all, even as that proportion leads all nations. Second we see that no Jan-Jun differential obtains for the US, though the 23 Americans born in October could perhaps be wondered about.

But the global birth-month disparity holds, and as such calls for an accounting. Tennis players, after all, are among the most international of sporting populations, the rankings admitting players from 98 countries. The simple but yet-to-be-substantiated hypothesis would maintain that January 1 cut-off dates for age-specific tennis youth programs advantage older players, but that’s an early surmise. (Note by the way that UN birth data by month across the 1967-2015 period reveal no January-June skew.)

First conclusion: more work needs to be done here. And while we’re at it, think about Michael Grant, an American ranked 836 and born in 1956, having earned his highest rank of 96 in…1979. Well done, Mr. Grant, I’d say – and he was born in February.

But what about women players? Good question.

Team USA Stats, Part 2: Some Data Gymnastics

4 Sep

You can’t tell the players without a scorecard, they’ll tell you in the States, and you can’t tell the data without the formulas.

You’ve heard more memorable pronouncements than that opener, I’ll grant, but that less-than-bromidic avowal above makes sense. We saw in the last post how the Team USA height, weight, and birth state data threw more than a few curves at us, and baseball isn’t even an Olympic sport any more (it’s coming back in 2020, though); fail to straighten out the data curves and your analysis will be straitened.

Now that line’s a few sights more memorable, but I do go on. Our next round of data inspection takes us through the DOB (date of birth) field, and we want to get this one right, too. Quality control here starts by aiming a COUNT function at the rows J2:J559 that populate DOB. Since dates are numbers, and since COUNT only acknowledges a range’s duly numeric data, we should aspire here to a count of 558, or one date per cell. But my COUNT totals 520, a shortfall that exposes 38 dates manqué that, facades aside, are nothing but text.
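The COUNT test amounts to asking how many entries are real dates and how many merely look the part. In Python terms, and purely as a sketch with a hypothetical three-cell column, it might go like this:

```python
from datetime import date

def count_numeric_dates(cells):
    """Like COUNT on a date column: tally real dates, skip text look-alikes."""
    return sum(1 for c in cells if isinstance(c, date))

# Hypothetical column: two genuine dates and one text impostor
column = [date(1989, 12, 14), "12/14/1989", date(1992, 11, 18)]
real = count_numeric_dates(column)
print(real, len(column) - real)
```

The second number printed is the shortfall, the analogue of the 38 dates manqué.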

Our COUNT is all the more incisive in view of the fact that every entry in the J column really does look like an unassailable date. But now that we’ve made ourselves aware of the turncoats among the loyal numerics, we’d do well, if possible, to rehabilitate them into the real things too. Here’s a pretty painless way: commandeer the next free column (it could be S, if you’ve left the formulas from last week’s exercises in R alone; if so, you’ll have to format S in Date terms), title it Birth Date, and enter, in row 2:

=IF(ISTEXT(J2),DATEVALUE(J2),J2)

And copy it down. The formula asks if the entry in J2 is textual. If it is, the DATEVALUE function takes over – a rather useful means of turning pure text such as 12/14/1989 (if that expression has been formatted as text) into 12/14/1989, the date version. If the entry in J is an authentic date, on the other hand, the formula simply returns the cell entry as is.

Surprise – my eminently sensible tip doesn’t always work. Copy the formula and you’ll be treated to nine #VALUE!-laden cells; a first review of these implicates an old nemesis – a superfluous space, in the source field e.g.:

olyp1

DATEVALUE can’t handle that textual intrusion, and so this refinement:

=IF(ISTEXT(J4),DATEVALUE(TRIM(J4)),J4)

Looks right, because TRIM’s job is to make superfluous spaces unwelcome.
But that revision doesn’t work either. Puzzled but intrigued, I went for the next move: a copying of one of the problem entries in J into Word, wherein I turned on the Show/Hide feature (in Home > Paragraph) that uncovers normally unseen codes. I saw:

olyp6

Look closely and you’ll detect that special character to the immediate left of Word’s paragraph symbol. That Lilliputian circle appears to signal a non-breaking space, and apart from any conjecture about what and why it’s doing there we know one thing: it isn’t a superfluous space, and thus won’t be trimmed.

Thus again, the fix here may be short on elegance but long on common sense: edit out the circle from the nine problem entries in J (in fact some of the cells require two taps of Backspace in order to rid them of their codes. If you sort the dates by Largest to Smallest by first clicking on an actual date all the #VALUE! errors will cluster at the bottom).
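If hand-editing isn’t to your taste, the same clean-up can be sketched in code: remove the non-breaking space (character 160) explicitly, trim ordinary spaces, then convert. The Python below is an illustration under my own assumptions, chiefly that the text dates are month-first like 12/14/1989; the function name is hypothetical.

```python
from datetime import date, datetime

NBSP = "\u00a0"  # the non-breaking space that ordinary trimming leaves alone

def to_date(cell):
    """Sketch of IF(ISTEXT(J2),DATEVALUE(...),J2), with stray characters removed first."""
    if isinstance(cell, date):
        return cell                                        # already a real date
    cleaned = cell.replace(NBSP, "").strip(" ")            # drop NBSPs, then plain spaces
    return datetime.strptime(cleaned, "%m/%d/%Y").date()   # assumes month-first text

print(to_date("12/14/1989" + NBSP))
```

An entry like 11/181992 would still fail the conversion, and rightly so; that one needs its slash restored by hand.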

And that works, leaving us with yet one last irritation – the birth date of Shooting team member Daniel Lowe, which reads

11/181992

No need to get fancy here – just enter that slash.

Now, I think, you can go about your analytical business, e.g., breaking out athletes by birth month. You may recall my consideration, in a piece on 2012 Olympic data, of the alleged August effect, in which American athlete births in that month appeared to significantly depart from a chance prediction. Let’s see what the current team data tell us, via a pivot table:

Rows: Birth Date (grouped by Month)

Values: Birth Date (Count)

Birth Date (again, by % of Column Total)

(We don’t need Grand Totals here). I get:

olyp2

Here August – generally the most fecund month in the United States – shares the modal figure with June and March, its proportion substantially smaller than August’s 2012-team contribution. The numbers here simply suggest no special birth skew impacting the US complement, at least for this Olympics.

We now can also calculate each athlete’s age in conjunction with the most able assistance of the nifty and unsung YEARFRAC function. Enter the Olympics’ start date – August 5, 2016 – in any available cell, name the cell start, and proceed to column T, or whichever’s next available on your sheet. Name it Age and in row 2 try (assuming the corrected dates lie in the R column):

=YEARFRAC(R2,start)

YEARFRAC calculates the distance in years between the two dates on either side of its comma. Thus, cell-reference four-gold-medalist Katie Ledecky’s birthday – March 17, 1997 – in YEARFRAC, and with the start date cell holding down the second argument you get 19.38, Ledecky’s age in years on day one of the Olympics (note that you can’t actually enter a bare 3/17/1997 in the function, because Excel will read that entry as arithmetic, 3 divided by 17 divided by 1997, and not as a date. You need to either reference the cell bearing that date or enter 35506, the date’s native numeric equivalent).
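For the curious, the 19.38 can be reproduced by hand. When its optional basis argument is omitted, YEARFRAC uses the US 30/360 day count, which treats every month as 30 days. The Python below is a simplified sketch of that convention: it leaves out the end-of-February adjustments, so treat it as an approximation and not a drop-in replacement.

```python
from datetime import date

def yearfrac_30_360(start: date, end: date) -> float:
    """Simplified US 30/360 year fraction (no end-of-February adjustments)."""
    d1, d2 = start.day, end.day
    if d1 == 31:
        d1 = 30                  # a start date on the 31st counts as the 30th
    if d2 == 31 and d1 == 30:
        d2 = 30                  # likewise the end date, when the start is the 30th
    days = (end.year - start.year) * 360 + (end.month - start.month) * 30 + (d2 - d1)
    return days / 360

# Birthday and Games start date as given in the post
ledecky_age = yearfrac_30_360(date(1997, 3, 17), date(2016, 8, 5))
print(round(ledecky_age, 2))
```

The arithmetic: 19 years, 5 months less 12 days, or 6,978 thirty-day-month days over 360.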

Copy down the column and this pivot table beckons:

Rows: Sport

But guess what…

olyp5

Yep, it’s that superfluous space thing again, this time practicing its mischief on four records among the Track and Field data. The simplest repair in this case, as it turns out: select the Sport field and run a Find and Replace at the column, finding Track and Field[space] and replacing it with Track and Field. That works, because in this case each of the errant four has incurred one space.
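The same repair, sketched in Python for comparison: strip the whitespace from every label before grouping, so that near-duplicates collapse into one. The list of sports here is a hypothetical stand-in for the Sport field.

```python
def normalize_labels(labels):
    """Strip leading/trailing whitespace so 'Track and Field ' matches 'Track and Field'."""
    return [label.strip() for label in labels]

# Hypothetical Sport entries, one carrying a trailing space
sports = ["Track and Field", "Track and Field ", "Boxing"]
print(sorted(set(normalize_labels(sports))))
```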

Now introduce the Age field to Values (Average, formatted to two decimals). Bring back Age a second time, now exhibiting Count sans decimals. If you sort the results Largest to Smallest you’ll see the 12-member equestrian team holding down the age-senior position, with Team USA’s eight boxers computing to a lowest-age 20.75.

We could also correlate average athlete weight by event, an association which might drum up some less-than-obvious numbers, e.g.

Rows: Sport

Columns: Gender

Value: Weight (Average, formatted to two decimals)

I get:

olyp4

Of course the per-team numbers are small, but they make for interesting reading, particularly the respective by-sport gender disparities (and note some absent teams among the men’s delegation).

I was surprised by the greater average weights of the two basketball teams measured against their rugby colleagues, even if the latter is (officially) the contact sport. And I did a double-take when I caught up with the respective boxing team weights; women boxers outweigh their male teammates by an average of 18 pounds. But here we’ve been thrown a sampling curve – the six male pugilists are concentrated in the lower weight divisions, even as the women – comprising exactly two boxers – weigh 132 and 165 pounds.

Eek  – there was a lot of hard work to do in there; I think I deserve a podium finish for this one.

Birth Months and Budding Ballplayers: The Little League Thesis Revisited

24 Aug

It’s August – dog days, the vamp-til-ready for the fall, the month in which the school year that impends begins to fill children across the land with unnameable dread.

But August brings with it a perquisite, too: if your next career move points you toward a stint in baseball’s major leagues, August is the month for you.

By that bit of vocational counsel I’m directing you to a truth – which can hardly be held to be self-evident – disclosed by studies that corroborate a small but palpable disproportion among major leaguers born in August (along with a Freakonomics post citing birth-month advantages turning up in an array of sports).

The standard accounting for the August baseball edge goes something like this: Little League teams enforce a July 31 birthdate cut-off for any season’s eligibility, and so boys (and girls nowadays) born in August – the oldest and likely the more physically prepossessing of their peers – continually dominate, and through the experience are thus better positioned to ultimately win that call-up to the majors.

Truth to be told, my contrarian self resisted the claim. As the Rabbis say, it appeared to my impoverished mind that the performance differential stoked by so scant a datum as birth month couldn’t materially pump up, or by extension depress, a player’s employment prospects. But what do I know? It seems as if the numbers respectfully disagree with me; so let’s then see what they might mean, and where else we might take them.

My data spring from an Excel workbook I’ve adapted from the storied, freely-downloadable baseball player database managed by Sean Lahman (who won’t turn away contributions, by the way), one of the go-to sites out there if you’re into these kinds of things (it purports to contain, after all, statistics for every season played by every major leaguer since 1871).

The workbook about to unfurl on your screen reports basic demographic information about the players, and Lahman was kind enough to jam a birthMonth field into the mix, exempting us from the chore of wheedling months from full birth dates (not necessarily a big deal in any event, via the MONTH function, similar to the WEEKDAY we encountered in a previous post). You can get the workbook here:

Ballplayer demographics

But before we fashion the pivot table that’ll help us replicate the birth-month claims, time first to put some spreadsheet into spreadsheet journalism. Some of the records before you are missing birth month data (primarily from pre-1900 players); and as these can’t help us here, I sorted the worksheet by birth month,  having the effect of consigning the vacant birth month cells to the very bottom of the ream of records (this should happen whether you sort by smallest to largest or in the other direction). I then inserted a new row in 17454, thus severing the blank birth month records from the usable remainder. So why not simply delete these inert records, then? Because I may want to recover them for some different analytical purpose later, and should that eventuality present itself I need only delete row 17454, and restore the records below it to the larger body of data.

In any case, now we can insert a pivot table and slide in these fields:

Drag birthMonth to the Row Labels area.

Drag birthMonth again to the Values area (we’re in effect breaking out birthMonth by itself). Change the Sum operation to Count if necessary (click anywhere in Sum of birthMonth, click the PivotTable Tools tab at the top of the screen, and then click Options > Summarize Values By > Count).

Drag birthCountry to the Report Filter, and why? It’s because we want to confine our scrutiny to American-born players (at least initially) – the ones most likely to have played in Little League. Click the Filter arrow, and select USA (tip: you can accelerate that trip to the lower reaches of the alphabet by typing U). You should see:

Note the quite discernible August margin (month 8). To concretize this outcome, click anywhere in Count of birthMonth, then click Show Values As (assuming you haven’t strayed from the Options button group) > % of Column Total. Now you’ll see:

That’s fairly definitive (remember our universe comprises 17,000+ records, so questions about statistical significance should be safely pre-empted). Note in addition that the next largest birth cohorts populate September and October, findings which would appear to comport with the theory, too, as would the shortfalls in April-June (even the diminutive February tops May and June). Only the uptick in July seems anomalous, its offspring the youngest per the Little League’s July 31 threshold. One wonders if some Little League affiliates hold to a June 30 demarcation instead – something to research.

In any case it all seems pretty confirmatory, particularly if for comparison’s sake you apply the filter to a different country – say the Dominican Republic, a nascent demographic power in baseball’s workforce. Replace USA with D.R. in the filter (that’s how the country is represented here) and you’ll see:

(Universe size here, by the way – 542 players).

Not terribly much pattern in this case, but what we do see is that August doesn’t own the modal representation here – October does, and by a lot (a predominance in its own right that might justify investigation). August’s 8.86% here doesn’t distance itself very far from the chance expectation of 8.49% peculiar to any 31-day month. (Of course you can now filter for any country, and if you want to assess birth months against the actual number of players contributed by that country, drag birthMonth a second time into the Values area and do the Summarize Values By > Count thing. Note then that both Value fields will sport the Count of birthMonth rubric, but the first of these will have been subjected to that special % of Column Total tweak. I’m sure, by the way, that my co-residents in London will be pleased to learn about the 34 major leaguers born in the UK.)

OK – this is all interesting and instructive, but something tells me we’ve withheld a term from the larger equation. After all, explanations of the August syndrome invest in the Little League eligibility rule, as it were – that July 31 cut-off. But the Little League didn’t debut until 1939, and hadn’t widened its ambit beyond the state of Pennsylvania until 1947. And it wasn’t until around 1949 that the Little League idea went viral, at last affording countless parents nationwide the opportunity to live the American dream – the chance to act like a lunatic in public without reprisal.

The corollary point then, is that large numbers of major league players never joined a Little League, and never even had the chance to; and given that proviso, we need to reassemble the percentages along Little League/pre-Little League lines, and then see how the numbers break.

So let’s proceed. First, click back on USA in the filter and drag the birthYear field into the Row Labels area, stacking it atop the birthMonth field already in place, because we want Year to serve as the superordinate breakout field as it were, such that years break out into months, as you’ll see:

(By the way – the Compact Form Report Layout is perhaps the most legible one at this point. To introduce it to the pivot table click PivotTable Tools > Design tab > Report Layout > Compact Form.)

Then click on any year and click PivotTable Tools (if necessary) > Options > Group Selection. In the resulting window type 1820 in the Starting at field, 1936 in Ending at, and 117 in By.

OK – I sense a need to explain myself here. In order to demarcate pre and post-Little League major league players, we need to identify the start year from which boys (they would be boys, here) began to have a fighting chance to sign on to a Little League team. If the League went big-time around 1949, I’m estimating (at least for starters) that 12-year-olds – that is, those kids born in 1937 – could have plausibly made themselves available to suit up. That’s an educated guess, of course, but it’ll do for now. In light of that conjecture, I’m thus interested in grouping players born before 1937 – hence the two years entered above. The 117 counts the number of years spanning 1820 and 1936, entered here because we want the ensuing pivot table results to bundle all 117 years into a  unitary total (you’ll see what that means momentarily).

Click OK, and you should see:

No, that’s not what we want to see, because Excel continues to aggregate data for all the years, bunching the post-1936 birth years into a residual category that completely confounds the outcome we’re seeking. But by clicking the filter down arrow hard by the Row Labels title and ticking >1937 off, we get

Well, that’s interesting too – because an August birth-month differential persists, not as pronouncedly as for the post-1936 years, to be sure, but it’s there.
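The grouping-and-filtering maneuver above is, at bottom, a filter on birth year followed by a month tally. Here’s a minimal Python sketch of the idea; the (birthYear, birthMonth) pairs are invented for illustration and aren’t drawn from the Lahman data, and the function name is mine.

```python
from collections import Counter

def month_shares(players, first_year, last_year):
    """Percent of players per birth month, for birth years first_year..last_year."""
    months = [m for (y, m) in players if first_year <= y <= last_year]
    counts = Counter(months)
    return {m: round(100 * n / len(months), 2) for m, n in sorted(counts.items())}

# Hypothetical (birthYear, birthMonth) pairs, not drawn from the Lahman data
players = [(1901, 8), (1915, 10), (1936, 8), (1936, 2), (1950, 8), (1975, 1)]
print(month_shares(players, 1820, 1936))
```

Changing the last_year argument plays the same role as retyping the Ending at year in the Grouping window.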

Of course you can fool around with different Ending At years in the Grouping window to varied effect. When I group for birth years between 1820 (the oldest recorded birth year in the Lahman database) and 1900, the aggregates break rather differently and incline toward birth month parity:

But if I establish the upper year limit at 1920 – and virtually no boy born that year or earlier played in Little League – August still pulls into second place behind October, at 9.18% (the continuing prominence of October in these snapshots might be worth pursuing as well). Explanations anyone?

Well, you can think about all this over the weekend. As for me, I’m thinking about my brother – pretty decent Little Leaguer, and born in…August. He became a neurologist, heads a division at the FDA, gets his name in the Times now and then. But yo, bro’ – maybe you should get back into the batting cage and start taking your hacks – your next career move awaits.