Sociology Section Enrollments, Part 2: Collective Behaviors

16 Oct

Curious but methodical: that’s how I described the American Sociological Association’s section membership spreadsheet in the previous post, calling attention first of all to the sheet’s functional header row in 2, which manages to ignore the data resting above it in row 1; and in that connection I owned up to my own curiosity about row 1’s resistance to any sort applied to the data set beneath it. It turns out that curiosity was apparently misplaced. If, for example, you proceed to sort this data set:


Its numbers will sort properly – for whatever reason – even if you click among the data in the top row, even as that attainment seems at least slightly curious too, at least to me (for a résumé of some of the vagaries of sorting, see this post by Excel ace Allen Wyatt).

And while I’m mid-course correcting, you’ll recall the supposition I entered in the previous post – namely, that ASA members are confined to a maximum of three section enrollments, a deduction I prised from the enrollment averages springing from the race data workbook. It turns out, however, that I was wrong again; my misunderstanding was righted by ASA Director of Academic and Professional Affairs Teresa Ciabattari, who tipped me to the fact that no enrollment limitations are imposed on members. But that very open-endedness raises the question of why per-member enrollments nevertheless hover around the 2 mark. Academics are busy people, of course, and their collective professional experience may have centered the normal curve around that number. What isn’t here is a frequency distribution of sorts, one that would count the numbers of members who affiliated with this or that many sections.

And I learned something else not entirely extraneous to the above: that section memberships must be paid for, the costs of which are interestingly variable. I was thus bidden to wonder if the dues might be negatively, or even positively, associated with section enrollment numbers (the most expensive: $52, for the Community and Urban Sociology section, a fee that at the same time grants members a subscription to the City & Community quarterly).

They aren’t, apparently. The correlation between section dues and their enrollment numbers comes to .127, a value that surely won’t quicken the heart of a sociologist, but at least tells us that money isn’t everything.
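For anyone who’d like to re-derive such a figure outside the spreadsheet, here is a minimal Pearson-correlation sketch in Python; the dues and enrollment figures below are invented stand-ins, not the actual ASA columns.

```python
# A hand-rolled Pearson correlation, so a figure like the .127 above can
# be recomputed from any pair of columns. The sample data are invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

dues = [52, 10, 12, 25, 18, 30]               # hypothetical section dues
enrollments = [500, 300, 450, 700, 350, 600]  # hypothetical enrollments
print(round(pearson(dues, enrollments), 3))
```

Excel’s CORREL function performs the same computation in a single cell.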

But back to the curious/methodical theme. Consider the organization of the membership-by-race worksheet as it was conceived by the ASA:


Now suppose you’ve been entrusted with the raw membership data in their primeval, pre-spreadsheet form, along with a remit to put the numbers to some good analytical use. As such, you need to bang together a sheet that serves your purposes and only yours, and not the interests of a general readership that intends nothing more than a breezy scan of the numbers. What, then, would your sheet look like?

Here’s the form that functional recreation might assume: First, I’d efface the gray banding that could discolor a piquant conditional formatting of the cells. Next, I’d decouple all those merged cells in row 1 and see to it that each and every column/field was properly headed, e.g. in excerpt:


And by extension, I’d instate Row 1 as the data set’s header.

But these steps toward a remake of the worksheet conceal a false start – because if I’m shaping the sheet for myself and not for the general reader, I’ll thus decide that I don’t need the section % fields in the data set at all, headers notwithstanding. Once I’ve subjected the set to a proper restructuring on my terms, I can tap into the Show Values As > % of Row Total option and derive section percentages by race when needed; and if I don’t need to know those percentages, then I don’t need the section percentage fields either.
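The % of Row Total derivation can be mimicked outside Excel as well; here is a pandas sketch, with invented section counts standing in for the real fields.

```python
import pandas as pd

# Derive "% of Row Total" figures from raw counts, the way a pivot
# table's Show Values As option would. All numbers here are invented.
counts = pd.DataFrame(
    {"White": [700, 300], "Black or African American": [90, 60]},
    index=["Sex & Gender", "Political Sociology"],
)
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_pct.round(2))  # each row now sums to 100
```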

So on behalf of a prudent minimalism I’d delete all the section % columns, and by the same token I’d delete the Membership field; its aggregates can likewise be derived in a pivot table from the individual race section numbers.

But a question needs to be asked about row 2, the sheet’s erstwhile header. That row records ASA membership totals by racial category, and those data are not of a piece with the holdings in the following 52 rows. Row 2 enumerates the actual totals of ASA members by claimed racial identity, but they do not inflict a double-count upon the sheet’s fields just the same. Because members can enlist in as many sections as they wish, the totals for the eight race fields won’t add up to the figures in 2. Thus, the enrollment numbers in row 2 possess a discrete, non-duplicating character, but at the same time they don’t belong with their companion rows. On balance, then, I’d say the row has to go.

And for something like the coup de grâce – remember, I’m earmarking this data set chiefly for myself – I’d run the records through the Get & Transform routine described here, thereby unpivoting (blame Microsoft for that word; I take no responsibility) the columns into this most pivot-table-friendly arrangement, in excerpt:


And of course I can rename the Attribute field Race, or whatever suits me (but if I’m working in the privacy of my own room and left to my own analytical devices, I may not even bother to rename the field).
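For readers working outside Excel, the same unpivoting can be sketched with pandas’ melt; the column names and counts below are hypothetical stand-ins for the ASA sheet’s fields.

```python
import pandas as pd

# Unpivot the wide race columns into a three-field, pivot-friendly
# layout -- the pandas analogue of Get & Transform's Unpivot Columns.
wide = pd.DataFrame({
    "Section": ["Sex & Gender", "Political Sociology"],
    "White": [700, 300],
    "Black or African American": [90, 60],
})
long = wide.melt(id_vars="Section", var_name="Race", value_name="Members")
print(long)
```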

And once I’ve gotten this far, my three fields can do almost all of the work of the original sheet design, and some more besides, for example:


That is, here I rank section enrollment preferences by race (the ranks read down the column), and thus learn that Sex & Gender proves the most popular section for White sociologists, while Racial & Ethnic Minorities stands in the first position for both African American and Multiple Selected ASA members. And for the 1,162 members who, for whatever statistical or preferential reason, slip into the Missing category, Political Sociology tops the enrollments. And again, the Get & Transform rendition of the data set enables the user to tee up a Slicer, through which race- or section-specific data – whose items are all now domiciled in a single field – can be neatly filtered.
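That per-race ranking has a compact equivalent in pandas’ groupby rank, sketched here with invented figures.

```python
import pandas as pd

# Rank section popularity within each race, as the pivot's ranks do.
# Sections and membership counts are invented for the sketch.
long = pd.DataFrame({
    "Race": ["White", "White", "Black", "Black"],
    "Section": ["Sex & Gender", "Political Sociology"] * 2,
    "Members": [1099, 500, 60, 90],
})
long["Rank"] = (long.groupby("Race")["Members"]
                    .rank(ascending=False)
                    .astype(int))
print(long.sort_values(["Race", "Rank"]))
```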

But the larger point is that even if my revamp does nothing more than emulate the potential functionality of the sheet I downloaded from the ASA site, I’ve achieved with three fields what the original needed 18 to do. That’s prudently minimalist, isn’t it?


Sociology Section Enrollments, Part 1: Collective Behaviors

5 Oct

I try to keep this close to the vest but the firewalls have ears, and in any event, I seem to have leaked this datum into my About link elsewhere on this page in a moment of inexplicable indiscretion, and it’s even true, besides – yes, I have a doctorate in sociology.

There – I’ve manned up and told the truth, and I haven’t even been nominated for the Supreme Court. Mea culpa. But there’s more than head-bowed contrition to the confession; it also explains why I was moved to have a look at the spreadsheet devoted to the section choices plied by members of the American Sociological Association (ASA) nestled into the Scatterplot blog. You can devote yourself to it here:

Copy of 2017_section_membership_by_race

In fact, the workbook cross-references the items of two parameters – the sub-disciplinary section enrollment preferences of ASA sectarians, aligned by what Scatterplot calls race. ASA members were asked to supply the latter information here:


You’ll note the liberty granted members to self-identify by up to two backgrounds, which “ASA translates…into 8 mutually exclusive categories: Black or African American, Asian or Asian American, Hispanic/Latino(a), Native, White, Other, Multiple, Missing.” (Note that the Asian designation denotes East Asians and not denizens of the Indian subcontinent, as that category is understood in the United Kingdom.) The Multiple (or Multiple Selected) rubric thus conceals the paired backgrounds self-imputed by some members, and as such it’s impossible here to know which two were selected in any instance and how often, though we do know the total multi-identity cohort contributes 4.97% of all ASA members. We also don’t know who populates the Other affiliation, a status characterizing another 3.38% of respondents.

Moreover, blog-post author Dan Hirschman of Brown University adds in a comment that “…the Other/Multiple/Missing categories are non-trivial for most sections (ranging from about 7-20%)”, an important, but not incapacitating, constraint on the data. In this connection, the legend on row 56 (which you may eventually want to delete; it’s a little too close to the actual data for comfort) observes that any cell value falling below five has been reassigned to the Missing field (we’ve seen the five limit before in a similar connection, e.g. this post); thus the 27-member Native American contingent totalled in cell I2 reports only 19 section enrollments, all of which do exceed five.  And while of course an N of 27 may beat back all claims to statistical significance in any case, the far more considerable Other, Multiple, and Missing cohorts leave us to yet ask exactly who is being counted there.

But notwithstanding those cautions material findings do await, emerging from the worksheet even as it stands. The sheet has been curiously but methodically designed; and we could pointedly aim both of those adverbs at the de facto header, installed in row 2.  Its contents are not typically the stuff of which headers are made; the entry 11551 doesn’t smack of grade-A header material, after all.

But it appears row 2 holds down its header status because the sheet designers wanted its numeric, aggregating entries to remain atop the sheet when the remaining data are sorted, and because row 1 merges its cells in order to center its race designations over two columns each – one conveying absolute membership totals, the other, each category’s percentage of the whole. But I’ve been unable to determine how the sort invariably staves off and excludes header 1. Suggestions are welcome.

But before we propose any revisions to the sheet, we can glean a number of insights right now. First, and perhaps most obviously, the aggregate section totals in the B column can be sorted highest to lowest (remember that B2 heads the field), after which you’ll learn that the Sex & Gender section is the best-attended, attracting 1,099, or 9.5%, of the ASA fold. The least popular? The Ethnomethodology and Conversational Analysis unit (honk if you like indexical expressions), whose name length stands in elegantly inverse proportion to its following.

But don’t get unduly confused by the numbers. Add the section enrollments and you get 26,628, per the members’ prerogative to enlist in multiple sections (probably no more than three). Divide 26,628 by the actual ASA membership of 11,551 and we see that the typical constituent has signed into 2.31 sections.
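The division, spelled out in a few lines of Python using the totals quoted above:

```python
# The per-member average, from the post's own totals.
total_enrollments = 26_628
total_members = 11_551
avg = total_enrollments / total_members
print(round(avg, 2))  # -> 2.31
```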

Proceeding with that understanding, we can simply (somewhere) compute the average number of member section enrollments by race, e.g. for African-American members:


(It’s C2 in which the ASA African-American membership in toto – 748 – is recorded.) That fraction comes to 1.95 section affiliations per member, and after conducting parallel calculations for the other races this comparison resolves:

African-American 1.95
Asian/Asian American 2.35
Hispanic/Latino(a) 2.48
Native American .70
White 2.28
Other 2.17
Multiple Selected 2.52
Missing 2.11

(Again, the data for Native Americans are obstructively small.) I’m not sure what the above variation, and variation there is, suggests; that question is better put to a…sociologist, e.g. why black practitioners of the discipline opt for fewer section associations than the bearers of other identities.

The section data lend themselves to a number of other looks, including Dan Hirschman’s charted assays of races and their enrollment percentages by ASA section, e.g.:


(Note that each chart sorts the X-axis data items – the section names – highest to lowest, that is, by a race’s extent of participation; and as such, the items are variably situated on the axes.)

It occurred to me at the same time that a spreadsheet-specific, conditionally-formatted framing of the data could add appreciable value to the analysis. For example, we could color any section enrollment figure exceeding a race’s overall ASA membership percentage.

We can naturally and alphabetically enough begin with the African American data. Select the percentages in D3:D54 (note that the formulas within multiply the authentic percentages by 100, souping up the results by a couple of orders of magnitude – 5.27, after all, isn’t .0527). With that selection in place, plug into Conditional Formatting and enter the following formula:


Note I’ve nominated orange as the if-true cell fill color, and you need the dollar sign in order to properly compare all the African-American enrollments with the aggregate 6.48 in D2. And because we’ve already selected D3:D54, all those cells will receive the format.
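The logical test behind that conditional format can be restated as a plain computation; here is a Python sketch, where the 6.48 overall share comes from cell D2 and the section percentages are invented apart from the 5.27 mentioned above.

```python
# Flag any section percentage exceeding the race's overall ASA share --
# the same test the conditional-formatting formula performs.
overall_share = 6.48  # African-American share of all ASA members (D2)
section_pcts = {      # invented, apart from the 5.27 figure
    "Sex & Gender": 5.27,
    "Racial & Ethnic Minorities": 21.40,  # hypothetical figure
}
flagged = {name: pct for name, pct in section_pcts.items()
           if pct > overall_share}
print(flagged)
```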

You can next leave the D3:D54 selection in place and apply the range’s format to the other percentage fields across the data set via a Format Painter reproduction. If you double-click Format Painter you’ll be able to instill the format immediately by clicking F2, H2, J2, L2, N2, P2, and R2.

When the clicking stops I get, in excerpt:


Here of course, and unlike the Scatterplot charts, the section names remain in place; and I would submit there’s lots there to think about.

Cool. Now does anyone have Max Weber’s email address? I’d like to show this to him.

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 3

21 Sep

Call it a footnote, addendum, postscript, or rather call it for what it is – unvarnished forgetfulness – but my longitudinal appraisals of Olympic athlete heights and weights last post walked right past the most obvious case – basketball.

What other sport, after all, jams its goal ten feet above terra firma, and so casts its net – pun intended, heh heh – for the tall and majestically wing-spanned? Jump on this pivot table, then:

Rows: Year

Columns: Sex

Values: Height (formatted to two decimals)

Weight (format as above)

Slicer: Sport > Basketball

Noting that women’s basketball took off in 1976, I get, again in excerpt:


You’ll observe that, dating from the 1976 baseline, the heights of both women and men divulge a five-centimeter rise (about two inches), more or less; but scroll back among the men’s data to the 1936 inception point and the end-to-end differential spikes to almost 18 cm, or about 7 inches. Male basketballers have also gotten about 41 pounds heavier across the same interval; measured from the 1976-2016 gender-shared interval, the net average increase for men comes to around 25 pounds, and 11 pounds for women. And for what it’s worth, the average woman hoopster nowadays is taller than the typical male competitor in 1936.

And while I was browsing, it occurred to me that at least one other sport likes its entrants long and tall – volleyball, introduced to both genders in 1964. Slice for volleyball and:


The average 1964-2016 height boost for women and men: a notable, nearly-identical 12-plus centimeters.

Now you’ll recall that Part 1 of this trilogy issued a clarification about the data set; namely the fact that many athletes are recorded multiple times in a given Games, if they competed in a plurality of different events in any year. I allowed then that the reasons for counting the same entrants repeatedly in the various pivot tables I compiled could be defended, and I proceeded from that conceptual springboard.

But that policy could be reframed and executed anew. If in fact unique Olympic appearances of athletes by Games could be shaken out – that is, if we counted but one instance of each athlete per Games – would the age and gender distributions I reported in Part 1 depart materially from the ones I could collate here?

Let’s see. First, we could shear the redundant athlete records from the data set via the stalwart Remove Duplicates routine, selecting the records by their ID and Year fields; that decision should preserve one iteration of an athlete for each Games (if you want to give this revamped data set a go, you may want to save the workbook under a new name). Going ahead, that run-through discharged more than 80,000 now-excess records from the set, leaving a mere 187,000 or so behind to submit to a round of pivot tables replicating the batch I minted in Part 1 (in the interests of presentational lucidity I won’t paste the Part 1 tables here; they’re still back there whence they came, though).
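The same dedupe can be sketched in pandas for anyone replaying the exercise outside Excel; the rows here are invented miniatures of the Kaggle records.

```python
import pandas as pd

# Keep one record per athlete per Games, keyed on ID and Year -- the
# pandas analogue of Remove Duplicates on those two fields.
athletes = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2],
    "Year":  [2016, 2016, 2012, 2016, 2016],
    "Event": ["100m", "200m", "100m", "Marathon", "10000m"],
})
unique = athletes.drop_duplicates(subset=["ID", "Year"])
print(len(unique))  # -> 3: athlete 1 twice (2016, 2012), athlete 2 once
```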

I began with a breakout of average athlete ages by gender and sport:

Rows: Sport

Columns: Sex

Values: Age (average, formatted to two decimals)

I get, this time (in excerpt):


The differences are very small and indeed in most cases simply not there at all, and for a good reason: most athletes tried their hands and feet at one event per Games. Small differences do attend the Diving averages, because some divers essayed multiple events in the sport (remember Sport defines the general category beneath which the various events are gathered). You can’t see the figures above, but even the age averages for swimming – a sport in which entrants often plunge into several pools – are near-equivalent, at 19.48 and 21.38 for the entire data set and 19.53 and 21.51 for the unique compendium.

Since the numbers appear to match so thoroughgoingly let’s move to but one more comparison, the alignment of age averages by gender and year of Games. Replace Sport with Year and I get now in excerpt:


You’ll see again that the divergences are largely minute, though the unique average here for women in the 1932 games – 26.99 – distances itself appreciably from Part 1’s 29.22 (the respective counts for the two tables: 369 women in the full data set, a significantly trimmer 222 in the unique-record rendition).

But the redundant-record entanglement snares us anew if we move to answer perhaps the most straightforward of the questions we could ask of the data: How many medals has each country managed to win, and by which type? (One preparatory briefing before we advance toward our set of totals – you’ll note the outpouring of NAs flooding the Medal column; but here that classic placeholder in lieu of an unavailable datum means, more meaningfully, that the record in question belongs to an athlete who failed to win a medal.)

But the medal question’s defining detail consists of the means by which the country-by-country medal counts are to be totaled. Consider an obvious case: the United States men’s basketball team brought back the gold medal in the 2016 Games – but that victory, for standard reporting purposes, comprises exactly one medal. Yet the Kaggle data set marks the gold attainment for each of the 12 members of the team, thus unleashing a massive, prohibitive overcount of the medal numbers for this and any other sport contested by teams of any size. That sort of capitulation to the data as they stand can’t work, of course, and thus calls for another dial-up of Remove Duplicates (I’m assuming you’re working with the original, full data set), this time selecting for Year, Event (not Sport, each one of which again spans several events), and Medal, because we want to sieve out but one medal disposition for each event outcome. Implementing the routine rid the data of over 246,000 redundant records this time, per the selection criteria.
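The collapse-to-one-medal step, sketched in pandas with invented rows (a three-man stand-in for each squad):

```python
import pandas as pd

# Keep a single row per (Year, Event, Medal) so a team medal counts
# once, however many athletes shared it.
medals = pd.DataFrame({
    "NOC":   ["USA"] * 3 + ["ESP"] * 3,
    "Year":  [2016] * 6,
    "Event": ["Basketball Men's Basketball"] * 6,
    "Medal": ["Gold"] * 3 + ["Silver"] * 3,
})
one_per_outcome = medals.drop_duplicates(subset=["Year", "Event", "Medal"])
print(len(one_per_outcome))  # -> 2: the gold and the silver, once apiece
```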

Following that massive jettison, we can apply the remaindered records to this pivot table:

Rows: NOC

Columns: Medal

Values: Medal

(Note: NOC, or National Olympic Committee, assigns a three-character abbreviation to each participating nation, the legend to which may be found here. I’ve applied NOC, and not Team, to the table, because the latter parameter also admits, for whatever reason, the names of actual teams that fielded contingents at the Games, e.g. the Bagatelle Polo Club Paris).

In view of the preemptive steps we carried out above, the table should now properly cross-count medal victories by country, e.g. in a pinched excerpt:


It’s starting to look intelligible, but the article isn’t quite finished just the same. For one thing – and we’ve seen this before – the curious sort priority accorded MAR and MON, or Morocco and Monaco in full, needs to be explained. Those countries are stacked atop the column because Excel reads their codes as March and Monday, date data that the pivot table insists on sorting first; and in fact a manual A-Z sort won’t restore them to their proper alphabetical place, because they’ll continue to retain what is in effect their date character, and so force themselves again to the head of the list. If you really want Morocco and Monaco to assume their orderly mid-column positions you’ll need to right-click each name and apply the Move option, selecting Down repeatedly until each is bumped into the berth you assign them.

And Move gets to do its thing again for the column entries, because you’ll want Gold – the winners’ precious metal of choice – to station itself at the head of the row, followed by Silver and Bronze. As for NA, I’d filter it out – because Remove Duplicates has preserved but one instance of the non-winners for each and every event and year and that’s misleading of course, because most athletes won’t win any event, and there’s a lot of them.

And if you’re still with me you can sort the values by Grand Totals Largest to Smallest, realizing this hierarchy, in excerpt:


The United States wins gold for most golds and all medals, and by a margin far more extensive than I would have supposed. URS is a legacy reference to the former Soviet Union, but note the simultaneous tally assembled for Russia.

And I bet you didn’t know that Luxembourg has won four gold medals – including two for Mixed Painting, as many as the United States.

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 2

6 Sep

Have Olympic athletes gotten taller and heavier? Common sense, the hypothesis builder of first, but not last, resort, submits a yes. After all, competitors in other sports seem to have bulked up and out; so why wouldn’t entrants to the global quadrennial enlarge as well?

The conjecture makes sense, common or otherwise, to me, but fortunately more persuasive corroboration is at hand. The Kaggle Olympic athlete data set features height and weight data, thereby adjudging my sense-making a false start.

Nevertheless, a few important caveats need be pronounced upon the data. First, a notably large portion of the height and weight entries – around 60,000 in both cases, or around 22-23% of all athlete records – are missing, and I haven’t paced my way through the data to learn if those ciphers are systematically distributed (and by the way, about 9,500 age records, or 3.5%, report no ages either). Second, one must take some care to think about the like-for-like proviso, in view of the turnover of events appearing in this, but not that, Olympic year. There may be little analytical gain, for example, in according a place to the weights and heights of the combined 3,578 aspirants in the Art Competitions last conducted in 1948. One assumes their vital statistics contributed little to their medal fortunes, though one never knows (and since you asked, their average weight came to 165.64 pounds, with a mean height reaching to about 5’9”); but because the demographics for those pastimes simply aren’t there in the later Games, their confounding effects must be considered, and probably discarded.

In addition, because boxing necessarily cordons its contestants by weight classes, its 6,047 contestants should be expected to evince little variation along that parameter across Olympics, and they don’t (they haven’t gotten much taller, either).

With those cautions in mind, investigative prudence might justify a concentrated look at those durable sports expected to win a continual place in the Games. We could then proceed to pivot tables buffered by a Sport Slicer that would afford a cleaner look at the question, one sport at a time.

If we’re satisfied with that recommendation, any pivot table that would spring therefrom sounds like a rather straightforward proposition, but grouping Olympic years by a chosen interval – say a 12-year tranche that would parse three Olympics at a time – imposes a small presentational disquiet.

My initial intent, for example, was to group the years into bins of 12 in the interests of averaging heights and weights for three Olympics at a time (but recall that the eventual rethreading of Winter Games into alternative even-numbered years would step up the number for any 12-year stretch). If, then, we move Year into Rows and frame the 12-year grouping, we’ll see:


Now that bit of consolidation works, but note the disclosure above of some years in which no Games were held (e.g. 1907), and which in fact don’t even appear in the data set. But the pivot table insists on interpolating those years in order to preserve the grouping’s consecutive character, however misleading to the reader. But remember that Row Labels can be edited for display purposes, that is to say you can rewrite the above spans 1896-1904, 1908-1916, etc.
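The grouping arithmetic itself is simple enough to state directly; here is a small Python sketch of the 12-year binning (the labels, like the pivot table’s, can of course be rewritten afterwards):

```python
# Assign each Olympic year to a 12-year bin, as the pivot grouping does.
def bin_label(year, start=1896, width=12):
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"

for y in [1896, 1900, 1904, 1908, 1912]:
    print(y, bin_label(y))
```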

On the other hand, because the data set counts a manageably grand total of 35 modern Olympics, we may be happy to leave the years as we find them in their singular, ungrouped state. If so, we can begin this way:

Rows: Year

Values: Height

Weight (both computing Average and formatted to two decimals)

Columns: Sex

Slicer: Sport

And turn off Grand Totals.

With all those material variables in play – Year, Sex, Height and Weight – the pivot table may tend toward a certain untidiness and hyper-density. Truncating the Average of… labels to something like Avg Height should sharpen its legibility.

Because I’ve delimited my interest to those mainstream sports likely to have been fielded across most or all Olympics, I can aim my first Slicer tick at Gymnastics, mustering these outcomes in excerpt:


Note, my fellow Americans, that heights and weights are expressed in centimeters and kilograms, respectively. (To restate the former in inches you’d need to multiply each entry by .393701, one formulaic way or another. Poundage is expressed by multiplying a weight by 2.20462; I had long thought the increment was exactly 2.2.) As for the #DIV/0! messages besetting the table that allude to sliced fields containing no data, these can be air-brushed out of sight via the PivotTable Tools > Analyze > Options > Options > Layout & Format tab and leaving the relevant field below blank:


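The metric-to-imperial conversions mentioned above reduce to two constants; a quick sketch:

```python
# Standard conversion factors: 1 cm = 0.393701 in, 1 kg = 2.20462 lb.
def cm_to_in(cm):
    return cm * 0.393701

def kg_to_lb(kg):
    return kg * 2.20462

print(round(cm_to_in(180), 1))  # a 180 cm athlete, in inches -> 70.9
print(round(kg_to_lb(75), 1))   # a 75 kg athlete, in pounds -> 165.3
```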
Returning to the gymnasts, note the decided lightening of women gymnasts from 1948 through 1980, an average weight reduction, so to speak, from 57 to 41.42 kg, or well over 30 pounds. Indeed, women gymnasts were on the average 5.89 kg, or 13 pounds, lighter in the 1980 Games than in the 1976 competition; but the poundage reverts upward again from 1980.

Explaining the variations requires something more than a top-of-the-head, knee-jerk, or any other such anatomically-driven response. Remember for one thing that many women gymnasts are counted multiple times among the data, and so their weights are submitted to the averages as often. Secondly, a survey of the particular gymnastic events in force for any given Games would be in order as well (the Gymnastic heading spreads over a variety of events), as new or abandoned events could have skewed the totals. In addition, the boycott by 66 nations of the 1980 summer Games may have likewise wrenched the average down.

On the other hand, the weight trajectory for male gymnasts is considerably smoother, tossing but one more complication into the mix.

We could next slice for swimming, another Olympic perennial. I get, in excerpt:


Here we see both men and women getting taller and heavier, with the paired slopes lifting upwards determinedly. That male average weights have burgeoned 16 pounds between 1972 and 2016 and the male-female weight differential has gradually ascended surely means something, though again some deep thinking about conclusions would be in point.

For a look at a winter sport, let’s slice for speed skating, opened to women in 1960:


Here the fluctuations are gentler, perhaps remarkably so, along with a small but measurable closing of the gender weight differential. You’ll also note that gymnasts tend to be shorter than both swimmers and speed skaters. But swimmers and skaters need to get to their finish lines sooner.

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 1

26 Aug

In the interests of enhancing my medal prospects, I’m lobbying hard to have freestyle array formula writing approved as an Olympic sport. Preposterous and self-serving, you say? Well you’re surely correct on the second count, but a scan across the events that have, at one time or another, qualified for Olympic standing – all remembered for you in the Olympic athlete data set shelved on the Kaggle web site (sign up for free) – will reestablish my credibility. The set unfurls 270,000 entry records throughout its 21 MB of athletes who’ve mined for the gold and other precious metals and the events in which they hoped to find them, and those events make for a motley collection indeed.

Who’s in training nowadays for the Art Competitions Mixed Literature event, for example? Well, no one, in actuality, as the competitions for the best sport-themed literary submissions were last conducted in 1948, the year when all four art event categories – art, architecture, music, and literature – were finally edited out of the Olympics. But the Kaggle data set also records all the events of which you’ve heard, too, and its medley of columns provides for a host of interesting findings.

Start with a primordial consideration: the relation of Olympic events to their competitors’ average age and gender. Here a methodological caution need be sounded. Because athletes very often compete in different events and different Olympics, a coarsened look at average ages will of course come to admit the same athletes many times. While it would be possible, on the one hand, to account for athletes uniquely via a Remove Duplicates routine, that recourse would leave the data set with one record per athlete – and only one of his/her ages, a reduction a sight too far. It seems to me, rather, that a plural counting of the athletes – each record of which is nevertheless unique to an athlete, age, and event – would work here.

That’s my take anyway, but you’re welcome to download the data and subject them to your own assumptions. In any case, I’d start the analysis with this pivot table:

Rows: Sport

Columns: Sex

Values: Age (average, formatted to two decimals)

(Note that Sport is the superordinate rubric beneath which particular events are then subsumed. Thus the sport Swimming spreads its umbrella above 55 different events, for example.)
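The pivot specification above has a direct pandas counterpart; here is a sketch with a few invented athlete rows.

```python
import pandas as pd

# Average age by Sport and Sex -- the pandas rendering of the pivot
# table described above. Rows are invented for the sketch.
df = pd.DataFrame({
    "Sport": ["Swimming", "Swimming", "Sailing", "Sailing"],
    "Sex":   ["M", "F", "M", "F"],
    "Age":   [21, 19, 35, 28],
})
table = df.pivot_table(index="Sport", columns="Sex",
                       values="Age", aggfunc="mean").round(2)
print(table)
```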

I get, in excerpt:


Yes, Aeronautics was an Olympic sport – once, in 1936, including a gliding event, though apparently only men took part (I don’t know, however, if the sport was gender-specific). There’s lots of sporting exotica in there (Basque Pelota – an assortment of racquet sports – was featured as a demonstration sport in 1924, 1968, and 1992 but had bona fide standing in the 1900 Games), but a close look evinces a very general age parity across genders, if not sports. We do see, and for what it’s worth, that male competitors in sailing and shooting are significantly older (and the numbers are pretty sizable here, if you retool the averages into counts); and it’s probably noteworthy that the men are nearly two years older than the women in the very global Swimming sport, in which the men-women participant numbers stand at 13,345 and 9,850 respectively.

And what about average age by gender by Olympic year, and gender distributions (keep in mind that until 1992 the Winter and Summer games were conducted in the same year)? Substitute Year for Sport and I get in excerpt:


The strikingly higher averages for the 1932 games need to be explored; Wikipedia points out that the economic privations wrought by the Depression pared the 1928 athlete complement by half four years later. It nevertheless remains to be understood why those who did make their way to Los Angeles, where the games were contested, were the older ones. Perhaps they had to pay their own way, and could better afford the trip.

An additional curiosity is the age trough bottoming for women in the 1972 games (again, the numbers above reflect both Games that year). The screen shot clips these data, but in fact the women’s average age of 25.57 for the 2016 (Summer) games pushes nearly five years higher than that for the Games 44 years earlier. Explanations, anyone? And you’ll note the far flatter arc for men’s ages across the same span.

In connection with the above, you can also drill down the numbers by Season, the heading beneath which the Summer and Winter games are distinguished. Slide Season under Year and you’ll see, in part:


Again, a close look is called for here. The pronounced Summer/Winter women’s disparity in the selfsame 1932 competitions may be reconcilable in part by the grand total of 22 female participants in the latter Games, a figure not likely to gladden a statistician’s heart. Yet the impressive men’s Summer margin for that year of more than eight years is founded upon a more workable base of 2,622 and 330 athletes, the latter count compiled for the Winter Games. As for the watershed 1972, the women’s Summer-Winter averages come to 20.53 and 21.79. And if you’re interested in the overall, undifferentiated aggregate gender-age averages, they look like this: Women – 23.73, Men – 25.56. Of course, those computations have controlled for nothing (e.g. event category), so to speak, but probably mean something just the same. And the total number of entries by gender (remembering that these gather many instances of the same athlete): Women – 74,522, Men – 196,594.

Nevertheless, those numbers should and could be associated with the years of the competition. The obvious intimation here is that women’s rates of Olympic involvement could have been expected to rise. Thus, we could proceed:

Rows: Year (grouped in tranches of five years)

Columns: Sex

Values: Sex (count, shown as % of Row Total; turn off Grand Totals, which must invariably yield 100%).
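That grouped-Year, % of Row Total maneuver can be sketched in Python too. The entries below are invented stand-ins for the real (year, sex) pairs:

```python
# A sketch of the grouped-Year / % of Row Total pivot: bucket each entry's
# year into a five-year tranche, then express each gender's count as a
# share of that tranche's row total. Sample entries are invented.
from collections import defaultdict

entries = [(1900, "M"), (1900, "M"), (1900, "F"),
           (2016, "M"), (2016, "F"), (2016, "F")]

def gender_share(entries, width=5):
    counts = defaultdict(lambda: defaultdict(int))  # tranche -> sex -> n
    for year, sex in entries:
        tranche = year - (year % width)   # e.g. 1932 -> 1930
        counts[tranche][sex] += 1
    shares = {}
    for tranche, by_sex in counts.items():
        total = sum(by_sex.values())
        shares[tranche] = {s: round(100 * n / total, 1)
                           for s, n in by_sex.items()}
    return shares

print(gender_share(entries))
```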

I get:


Not particularly shocking, but a detailed confirmation of the “obvious” can’t hurt just the same. The proportion of female Olympic athletes has about doubled in the past 40 years, a development heavily abetted by the expansion of events open to women; committed to primitive chart mode, the gender curves look like this:



And now I need to get back to that letter I’m sending to Nike, the one requesting sponsorship for my array-formula training. Look – it’s either that or crowdfunding.

Data Stories: NY City Property Sales

14 Aug

You’re wondering what to do with that discretionary $2,000,000 you have lying around – you know, that sheaf of bills that’s been angling your pillow up to untenable heights. In the interests of a good night’s sleep, then, how about dropping those two big ones on some New York real estate?

It sounds like a prudent bet; even the president of the United States has been known to commit some rocks to the idea, after all, and if that example doesn’t inspire you, you may find yourself on the wrong end of an all-caps tweet from you-know-who. So now that you’re properly inspired and keen to know what you can get for your seven digits, you’ll want to consult the data set of New York’s property sales for the year February 1, 2017 through January 31, 2018 – 60,295 of them, or 165 a day – brought to you by the folks at Enigma public data, who’ve joined the sales records from a pair of City agencies. You can download the set over there, once you’ve signed into their site (for free).

Shift the default Save As format from CSV over to standard Excel and you’ll wind up with a 13 MB workbook – not a prohibitively weighty file, but with 62 fields that pull all the way into column BJ you may find yourself grabbing onto the scroll bar with prohibitive pique. You may thus want to abolish fields not likely to serve your analysis, starting with borough_code in Column A; you’d doubtless prefer to work instead with the five actual borough names filed in the curiously named borough_code_definition field in B. Other candidates for excision, at least in my view and at least for starters: community_district_definition in column Z, school_district_definition, and city_council_district_definition. All three in effect reiterate the New York neighborhood names promulgated in the neighborhood field, which was, at least at the outset, housed in column B. Those curtailments kick back a megabyte to my hard drive, and every million counts, after all.

We can next move on to some standard, obvious need-to-knows, e.g., the number of properties sold in each of the city’s boroughs, understanding at the same time that obvious questions don’t always yield obvious answers. We could try this:

Rows: borough_code_definition

Values: borough_code_definition
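Outside Excel, this count-by-borough pivot amounts to a simple tally; a sketch, with a handful of invented rows standing in for the real 60,295:

```python
# The count-by-borough pivot reduces to a simple tally. The borough
# names are real; the rows themselves are invented placeholders.
from collections import Counter

boroughs = ["Brooklyn", "Queens", "Brooklyn", "Manhattan", "Brooklyn"]
print(Counter(boroughs))
```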

I get:


You may be surprised by the smallish sales total for Manhattan, as I was. But the boroughs feature different kinds of properties in different proportions. Drag borough_code_definition into Columns and substitute building_class_category_code_definition in Rows, and you’ll get the picture, which looks something like this in excerpt:


Observe the One Family Dwellings entries, and their deeply understated Manhattan representation. While these standard residential homes contribute few jags to the island’s storied skyline they loom rather larger among the byways of New York’s more suburban boroughs.

But what about property values? We could plausibly guess that Manhattan sales will lift themselves atop the listings, even as a thorough appraisal would control for building class, but for a preliminary assay:

Rows:  borough_code_definition

Values: sale_price (average, formatted to two decimals with the comma)

That yields:


Our guess is ratified, and unsurprisingly. Now augment the Values area with these two fields, both formatted as above: gross_square_feet and floors_number_of:


Manhattan properties are clearly larger – and higher (note, however, the blanks in the floor field). A 19-floor building can have a penthouse; the top floor of a three-floored one is just called the third floor.

But it next occurred to me that those averages may have absorbed multiple apartment purchases in the same properties, a possibility that could skew the numbers – provided one is concerned to isolate unique parcels.

With that thought in mind I ran a Remove Duplicates routine on the data, identifying address and zip code as sources of potential redundancy (zip code, in the event properties in two different boroughs had the same address, a possibility). 12,251 records were thus uprooted, after which I reran the above pivot table. This time I got:


I was startled by the Manhattan decrement, so much so that I initiated the table twice for verification. Sort the floor numbers largest to smallest in the original data set, however, and you’ll indeed discover repeated sales in the same, very tall properties. Note as well the heightened average sale price for the borough, a figure that needs to be considered carefully; after all, if a single property enjoyed multiple sales at conceivably various price points, only one of them remains here.
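The deduplication step above can be sketched as follows: keep the first record seen for each address-zip pair, much as Excel’s Remove Duplicates does. The sample sales are invented:

```python
# A sketch of the Remove Duplicates logic: retain the first record seen
# for each (address, zip) pair. The sample records are invented.
def dedupe(records):
    seen = set()
    kept = []
    for rec in records:
        key = (rec["address"], rec["zip"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

sales = [
    {"address": "1 Main St", "zip": "10001", "price": 900000},
    {"address": "1 Main St", "zip": "10001", "price": 750000},  # same parcel
    {"address": "1 Main St", "zip": "11201", "price": 650000},  # other zip
]
print(len(dedupe(sales)))  # the repeated parcel collapses to one record
```

Note the caveat flagged above: only the first sale per parcel survives, so averages run on the deduped set reflect an arbitrary one of each parcel’s conceivably various price points.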

And since the data span precisely twelve months, a look at sales by month might prove instructive (I’m back in the original, full dataset here). But here, again, we’ve run across – or afoul of – an Enigma protocol: the decision to portray the dates stored in the sale_date field in textual terms, e.g.


In fact, because all the dates possess equal character length, a formula such as =VALUE(LEFT(W2,10)) could be put to the task, refashioning the entries into authentic values that could then be formatted in date terms. But a slightly swifter if less elegant workaround might be the thing to do here: since each entry in sale_date bears the suffix T00:00:00 (for whatever reason), you can select the sale_date column and target the data with a Replace All Find and Replace routine, e.g.


And that restores quantitative standing to the field’s labels. Subtract T00:00:00 from 2017-02-01T00:00:00, and you’re left with 2017-02-01, a usable date.  And it seems to work all 60,295 times.
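Both rescue routes are easy to mimic in Python, for the curious; 2017-02-01T00:00:00 is the example from the text:

```python
# Two equivalent rescues for the textual sale_date values: strip the
# constant T00:00:00 suffix (the Find and Replace route), or take the
# first ten characters and parse them (the =VALUE(LEFT(...,10)) route).
from datetime import date

raw = "2017-02-01T00:00:00"

stripped = raw.replace("T00:00:00", "")   # Find and Replace analogue
parsed = date.fromisoformat(raw[:10])     # LEFT(...,10) analogue

print(stripped, parsed.month)
```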

And once that deed is done, try this:

Rows: sale_date (grouped by Month)

Values: sale_date (count).

sale_date (again, count > % of Column Total)

(Note that January data emanate from 2018.)

I get:


Note the conspicuous shortfall of sales in January, which again, is the most recent of the months. Apart from that discrepancy I’m not sure if a pattern could be properly declared, though the tycoons among you may have something to say on that count.
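For completeness, the month-by-month tally and its % of Column Total companion can be sketched like so; the dates here are invented stand-ins for the cleaned sale_date values:

```python
# A sketch of the by-month pivot: count sales per month, then express
# each month's count as a share of all sales. Dates are invented.
from collections import Counter
from datetime import date

sale_dates = [date(2017, 2, 1), date(2017, 2, 15),
              date(2017, 6, 3), date(2018, 1, 20)]

by_month = Counter((d.year, d.month) for d in sale_dates)
total = sum(by_month.values())
shares = {m: round(100 * n / total, 1) for m, n in by_month.items()}
print(by_month, shares)
```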

Now to take the analysis slightly off-piste, what about the average antiquity of the properties sold, that is, years of construction? The information is there, courtesy of the year_built field. Here’s one look at the question:

Rows:  borough_code_definition

Values: year_built (average, to two decimals, sans comma)

I get:


Note the relative recency of the Manhattan structures, befitting the borough’s protean impulses. Yet the collective provenance of Staten Island’s builds is far newer, a likely signifier of its relatively late-in-coming appeal as a residential habitat. By extension, the notably older properties of the Bronx appear to reflect that area’s comparative loss of appeal – but being New York, watch that space.

But bear in mind that the year_built data calls for a measure of review. Sort the field Lowest to Highest, and the commercial garages on 7th Avenue in Manhattan’s Chelsea district rise to the top. If in fact the facilities were completed in the reported year 1111 – about 800 years before the advent of the car – we need to commend the owners for their foresight.

Also clearly wrong is the attribution for the property in sort position 2 – the two-story home hammered together at 509 Seguine Avenue in Staten Island. Its 1680 time stamp should qualify it for landmark status, but a couple of web sites let us know that the building in fact became available in 1920.

In addition, some pause should be given to the 28 sales of properties dated 1800, most of these ascribed to a batch of condos on offer at 365 Bridge Street in Brooklyn Heights – an elevator building. In fact, it seems it was built in 1929.

And I thought the elevators were autographed by Mr. Otis himself.
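Anomalies like those invite a quick sanity screen before any averaging; here’s one sketch, with an 1850 cutoff that is my own arbitrary threshold, not anything the data set prescribes (the sample properties echo the cases above, with invented field layout):

```python
# A quick sanity screen for year_built: flag any reported construction
# year predating an (arbitrary, my own) cutoff of 1850.
def suspect_years(records, floor=1850):
    return [r for r in records if r["year_built"] < floor]

props = [
    {"address": "7th Ave garage",  "year_built": 1111},
    {"address": "509 Seguine Ave", "year_built": 1680},
    {"address": "365 Bridge St",   "year_built": 1929},
]
print(suspect_years(props))  # the two anachronisms surface
```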

World Cup Data: Goal Seeking – and Finding

31 Jul

Like American presidential elections, the World Cup doesn’t seem to end; the four-year interregnum separating those two events seems ever more ostensible; and because some prognosticators have instated Brazil as winners of the 2022 Cup it may already be time to wonder if the tournament should be held at all.

But my latest information is that it’s all systems go, and anyway, Qatar is lovely in November; so in the interests of limning some helpful deep backgrounding of the picture, you may want to download and kick around the data set of all World Cup match outcomes extending through the 2014 go-round, niched here in Kaggle’s repository of holdings.

The set records all 836 matches contested from the Cup’s inception in 1930 through the immediately previous competition, in relatively self-evident fields, with the exceptions of the Round and Match IDs in columns Q and R. Kaggle maintains that the identifiers are unique, but the Round IDs exhibit a qualified uniqueness, i.e., they appear to signify a certain stage in the tournament (e.g. the semi-final) that by definition would have been reached by multiple teams bearing the same ID. And the Match IDs display curiously variable lengths, suggesting a mid-data shift in their coding protocol. The 2014 matches, for example, sport nine-digit identifiers; the 1998 identifiers, by contrast, run to just four characters.

More troublesome is the small but measurable number of redundant game records, signaled by shared match ids. A Remove Duplicates routine earmarking MatchID as the offending field discovered 16 doubled records, which were promptly shown a red card.

Once you’ve stretched all the columns via the requisite auto fit, you can begin to think about what’s interesting in here. What, for example, about putative home field advantage? That vaunted residential edge is something of a legal fiction here; the first record, for example, names France as the home team and Mexico as the visitors, in a 1930 game set in host country Uruguay. But that merely nominal imputation spices the question: might even a desultory home-team standing impact game outcomes?

Let’s see. Any formula presuming to calculate win percentages needs to reckon with a common soccer/football eventuality – the fact that many games culminate in a draw. As such, we can take over next-available column U, title it Home Win Pct., and enter in U2:


That simple expression means to ask: if the goal entry in G exceeds the figure in H – that is, if the home team wins – then enter 1 in the appropriate U cell. If, however, the values in G and H are identical – signifying a draw – then assign .5 to the cell, the standard evaluation of an outcome in which each team in effect arrogates half a win. Once we copy the formula down U, we can simply add all the numbers and divide the total by 836, the number of records in the data set (remember we deleted 16 of them). The result: a home-team win percentage of 68.42, a disproportion that piques the question as to exactly how home teams are nominated.
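In Python terms, the U-column scoring rule and the resulting percentage look something like this (the scorelines are invented):

```python
# The U-column scoring rule transplanted to Python: 1 for a home win,
# .5 for a draw, 0 for a loss; the win percentage is just the average.
def home_score(home_goals, away_goals):
    if home_goals > away_goals:
        return 1.0
    if home_goals == away_goals:
        return 0.5
    return 0.0

# Invented scorelines, as (home goals, away goals):
games = [(4, 1), (2, 2), (0, 3), (1, 0)]
scores = [home_score(h, a) for h, a in games]
pct = round(100 * sum(scores) / len(scores), 2)
print(pct)
```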

For example: in the 1930 debut Cup, Argentina received home-team standing for four of its five matches, its sole “away” status assigned to its role in the final, which it lost to the authentic home team, Uruguay. Mexico, on the other hand, competed under the away rubric for all three of its games that year. And the home team in 1930 – however denoted – won all 18 matches during the tournament.

Explanations notwithstanding – though they do need to be pursued – we can calculate Cup-by-Cup home-team winning percentages via a rather neat deployment of the AVERAGEIFS function.

First, enter a bare section of the spreadsheet and list the years in which the Cup was held, starting with 1930 and coming to a stop at 2014 (I’m commencing in Y3). Once you type 1934, of course, you can drag the remaining years down their column by autofilling their four-year interval, remembering, however, that the Cup was suspended in 1942 and 1946. Then name the Year field in A yr, name the Home Win Pct. field in U win, and enter in Y3:


And copy down the Y column.

How does simply averaging the win data – which, after all, comprise either a 1, a .5, or a 0 – forward our objective? Contemplate this example: a team winning two games and losing one receives 1, 1, and 0 points for its exertions. Average the three values and the resulting .6667 returns the winning percentage for two wins and one loss.
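The AVERAGEIFS maneuver – average the win scores whose year matches a target year – can be sketched as follows, with invented values standing in for the real yr and win ranges:

```python
# An AVERAGEIFS analogue: average the 1/.5/0 win scores within a given
# Cup year. The yr and win lists mirror the named ranges in the post,
# but the values here are invented.
yr  = [1930, 1930, 1934]
win = [1.0,  0.5,  0.0]

def average_if(years, scores, target):
    vals = [s for y, s in zip(years, scores) if y == target]
    return sum(vals) / len(vals)

print(average_if(yr, win, 1930))
```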

If we’re happy with that understanding and then proceed to format the results in percentage terms, I get:


It is clear that somewhere, perhaps in the 70s, the idea of a home team underwent a rethink; I’m not sure what drove the apparent definitional overhaul, but it seems to have been put into place (for a possible partial accounting see this discussion). We even see an away-team edge gained in the 2010 Cup. I’m happy to entertain surmises about these disparities.

In any case, what about goals – e.g., have their outputs ebbed or surged across the Cups? If we want to figure a winning-to-losing team metric, say the average winning and losing goal total – or really, the average score  – by game by Cup, we’ll have to improvise, because those data aren’t expressed in existing fields. A couple of simple formulas should be able to answer our question, however. I’ve moved into column V, called it Win Goals, and jotted in V2:


That expression simply declares that if the goal total in G exceeds the one in the corresponding H cell, then return the value in G; otherwise report the number in H. If a game was drawn the logical test will not have been met, of course, but no matter; since in such a case the G and H figures are identical it matters not which one the formula returns.

I next head into column W, label it Lose Goals, and write what is in effect the flip side of the above formula in W2:


Both formulas are copied down their respective columns, of course, and conduce toward this pivot table:

Rows: Year

Values: Win Goals (average, formatted to two decimals)

Lose Goals (same treatment as above)
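Put another way, Win Goals and Lose Goals are simply the max and min of the two goal totals; a quick sketch:

```python
# The V and W formulas amount to max and min of the two goal totals
# (in a draw both return the shared figure, as noted above).
def win_goals(home, away):
    return max(home, away)

def lose_goals(home, away):
    return min(home, away)

print(win_goals(4, 2), lose_goals(4, 2), win_goals(1, 1))
```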

I get:


The marked downturn in goal scoring is associated with the recency of the Cups; indeed, the overall winning-game average of 2.18 goals was last attained in the 1970 tournament, and the average victory margin of three goals in the 1954 contests exceeds the average per-game combined goal total for the last 14 Cups. Average winning margin for all games: 1.51 goals.

And let’s see VAR verify that .51 goal.