New York City Restaurant Inspection Data: Tips Included

9 Jan

Expressing an interest in getting a bite to eat in New York calls for a bit of narrowing down. You will need to get a little more specific about your preferences, in light of the 26,000 or so restaurants in the city happy to fill your mouth, and evacuate your wallet.

Indeed – advising your New York crew “Let’s go to a restaurant” reminds me of the woman who stood in front of me in a Starbucks and requested a cup of coffee, the kind of order that drives baristas into a hand-crafted frenzy.

But once you’ve finally sat yourselves down you may want to learn a little more about what exactly it is you’ve gotten yourself into – literally – and the restaurant inspection data reorganized by the Enigma public data site may go a ways toward telling you more than you wanted to know. (The data are free, but Enigma wants you to sign up first. And remember: the lower the inspection score, the more salubrious the restaurant.)

I say “reorganized” – although Enigma will tell you they’ve “curated” the data – because the inspection outcomes have presumably been culled from New York’s remarkably near-real-time and far larger official data set, available on the city’s open data site (and Enigma’s too, though their version is three months old). The revision opens an interesting point of entry, then, to an understanding of how one organization’s data have been re-presented by another.

In what, then, does Enigma’s remake of the original data consist? For one thing, they’ve proposed to distill the source data set down to a unique entry for each restaurant (keep that stratagem in mind), each of which, after all, has been subjected to several inspections. By way of verification I aimed a Remove Duplicates check at the camis field comprising restaurant ids, and came away with but six redundancies – not too bad for a compendium of nearly 25,000 records.
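If you’d rather run that duplicate check off-spreadsheet, a pandas sketch of the same idea follows. The camis field name comes from the source data; the sample records themselves are hypothetical:

```python
import pandas as pd

# Hypothetical miniature of the inspection set: camis ids with a couple of repeats
df = pd.DataFrame({
    "camis": [30075445, 30112340, 30191841, 30112340, 40356018, 30075445],
    "dba":   ["A", "B", "C", "B", "D", "A"],
})

# Excel's Remove Duplicates, aimed at the camis field alone
deduped = df.drop_duplicates(subset="camis", keep="first")

n_redundancies = len(df) - len(deduped)  # rows discarded as duplicates
```

With the real file you’d point `subset` at camis alone, just as the Remove Duplicates dialog was pointed at that single field.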

And once having completed that chore we can run a simple but revealing pivot-tabled census of New York’s eateries by borough:

Rows: boro

Values: boro (count)

boro (again, by % of Column Total)

I get:


No one will be surprised by Manhattan’s restaurant plurality, though it should be added that the residential populations of both Brooklyn and Queens far exceed that of the storied island. In addition, keep in mind that the endless turnover of restaurants (the Quora article linked above declares an annual restaurant closure rate of 26%, though that assertion should probably be researched) turns the count into an implacably moving target.
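The borough census could likewise be sketched outside Excel; the boro field name is from the dataset, the sample rows here hypothetical:

```python
import pandas as pd

# Hypothetical restaurant records; boro is the field name in the source data
df = pd.DataFrame({"boro": ["MANHATTAN"] * 5 + ["BROOKLYN"] * 3 + ["QUEENS"] * 2})

counts = df["boro"].value_counts()                 # boro (count)
shares = df["boro"].value_counts(normalize=True)   # boro, by % of column total
```

The `normalize=True` call plays the part of the pivot table’s Show Values As > % of Column Total.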

And for another thing, the Enigma set has padded the progenitor data with each restaurant’s geo-coordinates (latitude-longitude), thus priming a mapping capability. But they’ve also, befitting one of Enigma’s enigmatic apparent first principles, reformatted the inspection dates into text mode.

And Enigma’s alternate take has also put the scissors to some of the set’s original fields. The Critical Flag field – naming restaurants that incurred what the Department of Health and Mental Hygiene terms critical violations, “…those most likely to contribute to food-borne illness” – is gone, and I’m not sure why. Those data sound like something you’d want to know about, and analyze.

But there’s a pointedly more serious issue besetting the data that I haven’t quite figured out. Because Enigma determined to squeeze the data into a one-record-per-restaurant yield, it had to decide exactly which record would be earmarked for retention; and common analytical sense would commend the latest such record, conveying the current inspection standing for each restaurant. But it appears that Enigma hasn’t always nominated the latest record. A spot comparison of the records across the two datasets turned up some Enigma selections that predate more current inspections for the same restaurant in the official New York workbook. And if those kinds of discrepancies riddle the Enigma data, then we need to wonder about the decision rule that authorized their inclusion – and I don’t know what it is. What would an aggregate averaging of inspection scores purport to say, if some of the scores have been superseded by newer ones? (My emailed query to Enigma about the matter remains unanswered as of this writing.)

Moreover, because the one-record stipulation is in force, Enigma was impelled to collapse disparate violation codes in that eponymous field. The very first record, for example, for the Morris Park Bake Shop, reports two violations coded 10F and 8C, both filed on May 11, 2018. But New York’s precedent dataset has assigned a distinct record to each of the two, easing a pivot table breakout by code.

And those code consolidations – an ineluctable follow-on of the one-record-per-restaurant decision – probably explain, in turn, Enigma’s omission of the original Violation Description field. Boxing multiple violations in the space of one cell might confound legibility for both researchers and readers, and so Enigma likely concluded the whole field was best expurgated – at a price of course, because now we don’t know what the violation codes mean.

Now to be fair, Enigma also furnishes a worksheet-housed directory of those codes, which makes for a most serviceable lookup array; but the multiple-code cell structure of its inspection data makes for an exceedingly messy prospect for 24,000-plus lookup values, which must be individuated somehow.

But all these cogitations have given me the munchies. Where do you want to eat? You want Chinese? Fine – that pares the choices to around 2,400. Your treat.

Walking Across the Brooklyn Bridge: Stepping Through the Data, Part 2

24 Dec

The decision to walk across the Brooklyn Bridge is a distinctly multivariate one, even if the internal equation that sets the walk in motion doesn’t chalk its terms on the walker’s psychic blackboard.

That preamble isn’t nearly as high-falutin’ as it sounds. Nearly all social activities negotiate trade-offs between these and those alternatives, and a promenade over the bridge is no different. We’ve already observed the great, and expected, variation in bridge crossings by hour of the day in the previous post, and we could next consider the impact – if it’s proper to think about the matter in those causal terms – of month of the year on journey distribution (remember that our data record bridge crossings from October 1, 2017 through July 31 of this year).

That objective calls for this straightforward pivot table:

Rows: hour_beginning (grouped by Year and Month. You need to put both of those grouping parameters in place in order to properly sequence the months, which straddle parts of two years.)

Values: Sum

Sum (again, here by % of Column Total)

I get:


The differentials are formidable, for me surprisingly so. One would have expected bridge foot traffic to crest in the summer, but a July-to-January walker ratio of 2.8 comes as a surprise, at least to me (remember again that the above totals compute one-way trips). It’s clear that meteorology has a lot to do with the decision to press ahead on the bridge, in addition to, or in conjunction with, chronology, i.e. the hour of day and day of the week. What we can’t know from the findings is whether the walkers had to get from Brooklyn to Manhattan or vice versa and chose to walk, or whether the trips were wholly discretionary.

And would one expect a spot of rain to discourage walkers? One suspects as much, of course, but confirmation or denial should be but a few clicks away. We could, for example, write a simple CORREL formula to associate precipitation with pedestrian turnout, provided we understand what it is we’re correlating. Here we need to remind ourselves that because in the previous post we subjected the pedestrian data source to a Get & Transform routine that replicated it, that copy rolled out twice as many rows as we found in the original, assigning a record each to the Towards Manhattan and Towards Brooklyn hourly totals. As a result each hourly precipitation figure is counted twice there, and so simplicity would have us look at rainfall data in the original dataset, if you still have it. If you do, this CORREL expression, which assesses precipitation by hour:


Delivers a figure of -.0093, or a rather trifling fit between rain/snow and the determination to walk the bridge. Now that doesn’t look or sound right; but that perception is my way of saying it doesn’t comport with my commonsensical first guess.

But the correlation is “right”, in light of the manner in which I’ve framed the relationship. Because the formula considers precipitation with hourly pedestrian totals, most of the rainfall entries are overwhelmingly minute, and indeed, in over 4800 cases – almost two-thirds of all the hourly readings – amount to zero. The correlation appears to capitulate to what is, in effect, a host of unrelated walk/rainfall pairs.

But if you correlate walk numbers with aggregate rainfall by entire days the numbers read very differently. Continuing to work with the original dataset, try this pivot table:

Rows: hour_beginning (grouped by Days)

Values: Pedestrians

Precipitation (both sum)

(Note that the row labels naturally nominate 1-Jan as the first entry, even as that date isn’t really the earliest. Remember the demonstration project got underway on October 1, 2017. But chronological order – really a lowest-to-highest numeric sort – is in no way a correlational necessity.)

Running a correlation on the above outcomes I get an association of -.333, which makes my common sense feel better about itself. That is, as calibrated here, rain “affects” pedestrian turnout to a fairly appreciable extent – the more rain, the fewer walkers, more or less. Again the (negative) correlation reflects the precipitation aggregated by days, not hours. Indeed – just 47 of the recorded 304 days report no precipitation at all.
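The day-level regrouping and its correlation can be mimicked in pandas. A minimal sketch, with hypothetical hourly data contrived so that rainier days log fewer walkers (the Pedestrians and Precipitation field names come from the dataset):

```python
import pandas as pd

# Hypothetical hourly records across four days
hours = pd.date_range("2017-10-01", periods=96, freq="h")
df = pd.DataFrame({"hour_beginning": hours})
# Contrived pattern: even-numbered days are wet, and wet hours are quieter
df["Precipitation"] = [0.1 if h.day % 2 == 0 else 0.0 for h in df["hour_beginning"]]
df["Pedestrians"] = [200 - 800 * p for p in df["Precipitation"]]

# The pivot table's grouping by Days, restated as a groupby-sum
daily = df.groupby(df["hour_beginning"].dt.date)[["Pedestrians", "Precipitation"]].sum()

# CORREL's equivalent, now run on daily aggregates rather than diluted hourly pairs
r = daily["Pedestrians"].corr(daily["Precipitation"])
```

Because the toy data were built that way, the daily correlation here comes out strongly negative; the real dataset’s -.333 is of course milder, but the aggregation logic is the same.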

And how does temperature figure in the decision to traverse the bridge? Again working with the original data set (and not the pivot table), in which each hourly instance appears once, we can rewrite the correlation, this time introducing the temperature field, which I have in the G column:


I get .391, another persuasive, if partial, relationship. With higher temperatures comes stepped-up foot traffic – to a degree, pun intended – but that finding induces a couple of hesitations. For one thing, the Fahrenheit system to which the temperatures are here committed is a famously arbitrary one – the temps aren’t keyed to an absolute zero. And so it occurred to me that a second correlation, this one redrawn with the temperatures pitched in Centigrade mode, might crunch out a different result. That statistical hunch had me open a new temporary column (in my case in H), in which I refigured the temps with the cooperative CONVERT function, e.g.


Copying down H and reprising the CORREL formula, this time with the C and H column ranges in tow, I wound up with… .391, at one with the first result, at least if you’re happy with a three-decimal round-off, and I think I am.

But in fact the two .391s depart from one another by an infinitesimal sliver. The first, associating walk totals with temperatures expressed in Fahrenheit, comes to .390815536. That correlation with the temperatures in Centigrade (Celsius) calculates to .390959393. Presumably the tiny shift in numeric gravity wrought by the respective measurement systems accounts for the difference, about which few are likely to care, to be sure. But that discrepancy does mean I need to learn more about the workings of correlations.
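One reasoning step worth adding, hedged as my own gloss rather than the post’s: the Pearson coefficient is invariant under any positive linear rescaling of either variable, and the conversion C = (F - 32) * 5/9 is exactly such a rescaling, so in principle the two correlations should agree not merely to three decimals but exactly. A numpy check on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
fahrenheit = rng.uniform(20, 95, size=500)               # hypothetical hourly temps
pedestrians = 50 + 3 * fahrenheit + rng.normal(0, 40, size=500)

celsius = (fahrenheit - 32) * 5 / 9                      # CONVERT's F-to-C arithmetic

r_f = np.corrcoef(pedestrians, fahrenheit)[0, 1]
r_c = np.corrcoef(pedestrians, celsius)[0, 1]
# The two coefficients agree to within floating-point noise
```

A discrepancy as large as the fifth decimal, then, likely points to something mundane in the worksheet – a rounded or stray cell, or a range mismatch – rather than to the measurement scale itself.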

The other caution about the correlation, whichever one chooses, asks about its linearity. While we could reasonably anticipate a swell in bridge crossings as the mercury ascends, it’s entirely possible, on the other hand, that pedestrian activity could be inhibited by temperatures forbiddingly high – in the 90s, for example.

And that conjecture could be submitted to a pivot table (small note: two rows among the data record no temperatures), e.g., assuming again we’ve remained with the original dataset, which features each temperature only once:

Rows: temperature (grouped, say, in bins of five degrees)

Values: Pedestrians (sum)

Pedestrians (count)

I get:


(The blanks reference the two empty temperature cells, and can be filtered out.)

The pedestrian count in effect totals the number of days populating each grouped temperature bin. After having filtered the blanks, move into the next-available D column – a space external to the pivot table – and enter in D4:


Round to two decimal points and copy down D (I don’t think a calculated field can muster this result). I get:


We’ve disclosed a strong if imperfect (and unsurprising) upward association between pedestrian hour averages and temperature. But the highest reading – 89-94 degrees – does seem to drive a pull-back in traffic. Note in addition the leap in hourly crossings from the 74-78 to 79-83 bins, as if 80 degrees or so lifts the inclination to walk to its tipping point.
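The five-degree binning and the D-column sum-over-count arithmetic translate naturally into pandas. A minimal sketch on hypothetical hourly readings (the field names mirror the dataset’s; the numbers do not):

```python
import pandas as pd

# Hypothetical hourly records: temperature and pedestrian count
df = pd.DataFrame({
    "temperature": [40, 42, 44, 60, 61, 63, 80, 81, 92, 93],
    "Pedestrians": [100, 120, 110, 300, 320, 310, 700, 720, 500, 480],
})

# The pivot table's grouping in bins of five degrees, restated with cut
bins = pd.cut(df["temperature"], bins=range(40, 100, 5), right=False)
grouped = df.groupby(bins, observed=True)["Pedestrians"].agg(["sum", "count"])

# The D-column formula: sum divided by count, i.e. average walkers per hour per bin
grouped["avg_per_hour"] = (grouped["sum"] / grouped["count"]).round(2)
```

Here the contrived numbers climb through the bins and then sag in the 90s, echoing the pull-back the real data disclose at the top of the scale.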

So there. Didn’t I tell you the decision to lace up those high-heeled sneakers was multivariate?

P.S. In response to my previous post’s curiosity about a few missing Towards Brooklyn/Manhattan data, New York’s Department of Transportation wrote me that the empty entries might be attributable to a weather-induced snarl.

Walking Across the Brooklyn Bridge: Stepping Through the Data, Part 1

11 Dec

Ask New Yorkers what they were doing at 4 o’clock in the morning on the night of October 12, 2017 and they’ll tell you they a) don’t remember or b) won’t answer on the advice of counsel. But no matter – we know exactly what one intrepid iconoclast was doing at that hour – walking towards Manhattan on the Brooklyn Bridge.

That solitary – and courageous – wayfarer will be happy to know we have no more information about him, or her; all we know is that he – or she – was sighted and duly recorded by the Brooklyn Bridge Automated Pedestrian Demonstration Project, an initiative of New York’s Department of Transportation.

Squirreling an electronic counter somewhere on the Manhattan-side approach to the storied bridge the project gathered footfall data for both directions through the October 1 2017-July 31 2018 span (I’ll own up to the pun) and walked them over to New York’s open data site here (click the Export button on the screen’s far right and tick the relevant CSV option).

Our solo nightwalker presages a dataset that is nothing if not intriguing, even as it exhibits a few organizational curiosities. Apart from column A’s all-but-standard need for a widening via auto-fit (its dates/times are authentically numeric, though) you’ll join me in declaring the contents of column B – comprising 7,296 citations of the phrase Brooklyn Bridge – slightly superfluous, and so eminently dispensable.

And much the same could be offered about the lat and long fields, each of whose cells delivers the same coordinate, presumably positioning the Brooklyn Bridge in its earthly locus. So too, the Location1 field restates its data all the way down, and in textual terms, no less. We’ve seen this sort of thing in any number of open datasets, and it’s proper to wonder why. One assumes some industry-wide download routine batches out these relentless uniformities, but whatever the accounting, the data they produce aren’t needed and could be either deleted or ignored.

And there’s another corrective that merits a claim on our attentions. The numbers in the Pedestrian field for November 9 at 7 and 8 PM read 411 and 344 respectively, but the companion data in the Towards Manhattan and Towards Brooklyn cells – which when summed should equal the Pedestrian figures – report nothing but zeroes. And in view of the fact that the pedestrian numbers for November 9 at 6 and at 9 PM read 455 and 300, it seems clear that the counts for 7 and 8 could not have amounted to nothing at all. I broached the discrepancy via emails to both the Department of Transportation and the New York open data site, but have yet to hear from either. For the moment, we have to proceed with four empty cells.

And there’s something else, a data-organizational failing that could, and should, be righted, and one we’ve encountered in previous posts. Because the Towards Manhattan and Towards Brooklyn fields host data that are really of a piece and should be treated unitarily (the easier to calculate the percent of pedestrians by direction and date/time, for example) they should migrate their holdings to a single parameter (which I’ll call Direction), via the Get & Transform Data routine with which we’ve recombined other datasets. Start by selecting both Towards Manhattan and Towards Brooklyn and continue per these instructions.


The Get & Transform alternate take then also frees us to delete the Pedestrian field, because the new amalgamated Direction brings along with it a Value (I’ll rename it Sum) column that totals the Brooklyn/Manhattan numeric data by each date/time entry, thus superseding Pedestrian.  Of course the new dataset (presenting itself in table form) comprises twice as many records as its progenitor, because each Towards Manhattan and Brooklyn notation now populates the new Direction field with an individual record instead of the previous parallel fields populating the same record; but that trade-off is well worth the price.
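For readers who work outside Excel, Get & Transform’s unpivot corresponds to pandas’ melt. A minimal sketch with a hypothetical two-hour excerpt:

```python
import pandas as pd

# Hypothetical excerpt of the hourly counts before the reshaping
df = pd.DataFrame({
    "hour_beginning": pd.to_datetime(["2017-10-01 00:00", "2017-10-01 01:00"]),
    "Towards Manhattan": [30, 12],
    "Towards Brooklyn": [25, 9],
})

# The unpivot: the two parallel fields collapse into Direction,
# their numbers into Sum, doubling the record count as described above
tidy = df.melt(id_vars="hour_beginning",
               value_vars=["Towards Manhattan", "Towards Brooklyn"],
               var_name="Direction", value_name="Sum")
```

Note how the reshaped frame carries twice the rows of its progenitor, the very trade-off the post accepts.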

Once that extended preliminary chore completes its work we can try to make some sense of all that fancy footwork on the bridge. We could start by pivot tabling pedestrian totals by hour of the day:

Rows: hour_beginning (group by Hours only)

Columns: Direction

Values: Sum

I get:


The two obvious attention-getters here are the sizable, but largely predictable, fluctuations in foot traffic by hour, and the direction of that movement. Note the grand totals as well: over 5 million unidirectional trips launched across the ten-month project period – about 16,500 a day, or approximately 8,200 discrete, carless (don’t read careless) New Yorkers opting for the scenic route over the East River, assuming of course they walked the walk both ways. And if you’re wondering about 4AM, we find around 5.6 average crossings ventured at that hour – even though that hardy coterie probably could have gotten a seat on the subway then, too.

And consider the literal back and forth of the walkers’ directional proportions. 3PM (that is, the hour running through 3:59) attracted the most pedestrians, closely flanked by 2 and 4PM; and though I doubt the reasons for the hours’ appeal are particularly mysterious, I don’t know what they are. An optimum time for a mid-afternoon stroll? A cadre of workers perambulating home from their early shift, or setting out on their way to a late one? I doubt it, but research on the matter awaits.

And what of the 8AM hour, in which walks toward Brooklyn far outnumber trips to Manhattan? I would have thought – wrongly – that the press of rush hour would have drawn Brooklynites toward Manhattan in predominating numbers, but we’ll have to look elsewhere for an explanation. But by 9AM the flow reverses, rather pronouncedly and curiously.

Now the above table might read more crisply by removing the grand totals for rows (while retaining them for columns) and spinning the numbers through the Show Values As > % of Row Total grinder (remember that this prospect has been enabled by the Get & Transform protocol):


Thus we see 58% of the 8AM traffic heading toward Brooklyn, with a pull-back to 44.71% by the next hour – a dramatic, and analytically provocative, reversal.

And what about pedestrian accumulations by day of the week? Common sense allows that weekend totals should be the greater, and that conjecture dovetails with the truth. But extolling common sense won’t do much for your byline; you’ll need to substantiate the finding with a wee bit more precision. Start by introducing a new column to the immediate right of hour_beginning, call it Weekday and enter in what should be B2:


(If you click on A2 in order to emplace that cell reference in the formula, Excel will respond with a table-structured reference; but the result will be identical.)

And because the dataset has assumed table form, the formula will instantly copy itself down the B column.
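The Weekday derivation can be sketched in pandas as well. One wrinkle, noted here as my own gloss: Excel’s WEEKDAY defaults to Sunday = 1 through Saturday = 7, while pandas’ dayofweek runs Monday = 0 through Sunday = 6, so a remap is needed to match the table’s labels (the timestamps below are hypothetical):

```python
import pandas as pd

# Hypothetical date/time stamps; October 1, 2017 was a Sunday
stamps = pd.to_datetime(["2017-10-01 04:00", "2017-10-02 04:00", "2017-10-07 04:00"])
df = pd.DataFrame({"hour_beginning": stamps})

# Remap pandas' Monday=0 convention onto Excel WEEKDAY's Sunday=1..Saturday=7
df["Weekday"] = (df["hour_beginning"].dt.dayofweek + 1) % 7 + 1
```

With that column in place, a groupby on Weekday reproduces the pivot table’s day-of-week totals.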

Then try:

Rows: Weekday

Values: Sum

Sum (again, this time % of Column Total)

I get:


(You’ll want to rename the headers. Remember as well that 1 signifies Sunday, 7 Saturday.)

Common sense prevails, though I hadn’t foretold Saturday’s substantial edge over Sunday.

But nor did I expect to find someone on the Brooklyn Bridge at 4 in the morning. I want to see the selfie.



Top 100 Toy and Game Manufacturers: Playing Around With the Data

28 Nov

A worksheet is a big thing. If Excel savant Francis Hayes has read his surveyor’s map properly, he’s discovered that a sheet’s 17 billion cells (give or take a few hundred million) mark out a square mile’s worth of territory, and that’s an awful lot of lawn to mow. And while your dataset isn’t likely to push its lot of fields into the three-lettered column nether (column AAA is a sheet’s 703rd), it’s nice, as they used to say about the Sunday New York Times, to know it’s all there.

But that gigantic tract of cells means there’s enormous room for a spreadsheet’s design to assume this or that conformation, a reflection that’s cued by a look at a Statista workbook that lists and details the corporate skinny for planet Earth’s top 100 toy and game companies. It’s here:

Top_100_Toys&Games – Statista

We can agree that the Toplist sheet is iconoclastically organized; it has assigned five discrete datasets to its space, each reporting on a different corporate parameter and extending the sheet’s overall reach to column AX. By itself, of course, that layout program is perfectly legal, but it controverts the conventional wisdom that would commend each dataset to a sheet all its own.

On the other hand, one would be entitled to ask exactly what’s ultimately “wrong” with the all-to-one-sheet scheme. One answer could point to a compromising of navigational ease across the sheet; we’re accustomed to finding our data waiting for us somewhere in the upper reaches of the A column, and only one among our quintet can answer to that description, of course. Here, you’ll need to do your share of scrolling if you want to view the other four. (Another note: appearances to the contrary, the datasets are just that, and not the tables a first look at the set might suggest. The banded-row effects – a default staple of tables – were here hued by a series of conditional formats that assign contrasting colors to odd and even-numbered rows respectively.)

But I’d agree, on the other hand, that the navigational objection could be quashed as less-than-substantive, and merely presentational. But there are other demurrals that could be aimed at the datasets, ones that might prove a jot more incriminating.

For example, you’ll note that the actual names of the 100 firms appear only in column B in the first dataset along with their rankings in A. Yet the actual ranking determinant – presumably the revenue-declaring “latest value” field in the AG column – finds itself in the fourth dataset only.

Moreover, you’ve probably also observed the filter buttons wedged between the datasets in the empty columns that separate them, buttons that of course have nothing to filter and thus provoke the necessary follow-on question: why are they there at all?


The answer, it seems to me, insinuates a likely pre-history of the datasets: that the five were originally one, and that in the interests of thematic reorganization the designer chose to chip the data into the fractionated wholes we’re viewing now. By way of proof, try drawing up a primitive dataset comprising three fields. Turn on the filter buttons (Data ribbon > Filter in the Sort & Filter button group) and proceed to interpolate a new column between any two of the existing columns. You should find yet another filter button topping the new, completely empty column/field, even as you leave it untitled.

So that’s what Excel does, affirming an operational reality that goes some way toward proving my point – that the data in the Toplist tab were originally of a piece, split only later into the five sets about which Statista wants us to know. And indeed – the very placement of all five in the same sheet corroborates the point.

What advantage, then, redounds to the breakup of the data? As intimated above, I suspect Statista wanted to sharpen the data’s readability and focus by subdividing them under the headings featured in each set’s upper-left-corner cell (e.g. Rank, End of Fiscal Year, etc.). And by the way, those headings – in row 4 – will not obstruct a sorting of the data, which will rather, and properly, draw their upper boundary at row 5. We met up with a similar curiosity in one of my posts on the American Sociological Association data. And pivot tables will likewise recruit their field names from the entries in row 5, which means that something will have to be done about A1, in which the name is represented by a dash.

But the presentational gains realized by the data partitions cloud before the losses in functionality. If, for example, one wanted to group employee counts by manufacturer ranking you’d be stopped at the door, because those counts, in the third dataset, aren’t accompanied by the rankings there. That information is exclusive to the first set, and the two sets can’t be subjected to a relational query – even should they be converted into bona fide, prerequisite tables – because datasets one and three share no field.

Such is the problem, but the way out seems simple – delete the corridors of empty columns and reunite all the data into one grand, overarching monolith of a set. Remember that you’re an analyst here, not an aesthetician; you want the data to behave themselves and cooperate with your investigative intentions; looks come in a distinct second.

But that consolidative act doesn’t conclude the remake. You’ll likely want to attend to the fields topped by year figures – there are four 2017s, for example, renamed successively by Excel 20171, 20172, 20173, and 20174 in the interests of differentiation. I’m also not sure why the fields reserved for 2018 data are in place here at all, as they’re unpopulated and could be deleted. While we’re at it, we could ask what analytical utility devolves upon what was the fifth dataset, Revenue: Reported Currency (in millions); its income figures, expressed in country-specific denominations, thwart a like-for-like comparison. Thus one could tempt oneself to delete that erstwhile fifth set, save for the fact that Reported Currency registers time-sensitive equivalences to the dollar. Thus, for example, tenth-ranked Bandai Namco Holdings, a Japanese concern, reported a 2017-to-2016 earnings ratio in dollars of 1.19. The proportion for the same years expressed in its reported currency comes to 1.08. Is that discrepancy worth pursuing? Maybe.

In any event, once you’ve fused the datasets into the greater whole you’ll be able to assess employee force size as grouped by the ranks of firms, say in bins of 5, e.g.

Rows: Rank (grouped by the interval 5)

Values: latest value (the one in column U, that is. Column AD bears an identical heading, at least by default. Average, formatted to two decimals with commas. And true, latest value here might not edify a reader.)

I get:


Among other things, we learn that employee complements don’t correlate neatly with ranks (which are derived from revenue), and surprise – the list descends to number 108.

But maybe that’s a top 100, accounting for an inflationary quarter.

Atlantic Ocean Hurricanes: Brain-storming the Data, Part 2

13 Nov

Obvious questions, as we learned in the previous post, don’t always facilitate obvious answers; and if you’re seeking additional cases in point, consider this up-front question we could put to our hurricane-data worksheet: has the average length of storms, measured in days, moved about across the 167 years of data filling the sheet?

By way of a first consideration, what we know won’t answer the question is a dividing of the sheet’s 50,000-plus records by the 1848 individual storms we’ve identified. The quotient that emerges – about 27 – can’t propose itself as the storm duration average in days, because each storm elicited multiple observations of its movement and progress. The very first storm in the data set, coded AL0111851, triggered 14 observations across its four-day span. And its daily observation average of 3.5 tells us that the number of recorded observations wasn’t constant across days; and so while 27 represents the average number of observations performed per storm, that figure cannot be divided by some unvarying value to yield an average day count.

We’re also roadblocked from an imaginable alternative for calculating the average duration. We can’t subtract each storm’s start from its finish date, at least not directly, because the pre-1900 storms informing the list resist Excel’s best date-formatting efforts. Enter a date antedating January 1, 1900, and the notation is forced into text mode, i.e. a non-numeric entity that can’t be added, subtracted, and the like.

That famous limitation doesn’t close the door on the task, however, and I can think of a couple ways of poking my foot beneath the transom before it slams. The more elegant of the two rides the idea that a unique count of the dates associated with each storm is coterminous with its duration.

For example, storm AL0111851’s 14 entries post four different dates to the Observation_Date field in column E. (We need to tap into Observation_Date and not the dates vested in the Date field in A, because those latter data are visited by times as well, and thus confound the search for unique dates alone.) Four different dates – assuming they’re consecutive for all the storms, a not imprudent assumption – must signal a storm of four days’ duration.

That the pre-1900 entries among the data don’t qualify as genuine, grade-A certified dates doesn’t matter here. All that concerns us is our capacity for culling one instance of each date, its format notwithstanding.

And to broadcast that possibility to our screens we can once again make productive use of a pivot table’s Distinct Count operation, mobilized by a tick of the Add this data to the Data Model box:


Set the pivot table in motion and organize it thusly:

Rows: Year

Values: Observation_Date (Distinct Count)

Storm_ID (Distinct Count)

I get, in excerpt:


The table thus grabs only one instance of each date and storm ID. From here, however, it appears as if we need to supplement the results with some simple but external formulas, because a calculated field invoking distinct counts – in which the annual counts of the observation dates might be divided by the yearly count of storm IDs – isn’t available to us.

We can opt for a none-too-graceful workaround, then, by entering this extra-pivot table formula in D3 (entering the title Average in D2), lining itself up with the first data row in the pivot table:


Format the result to two decimals and copy down D (through the year 2017; omit the Grand Total row).
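The year-level version of the arithmetic – distinct dates divided by distinct storms, the work of the extra-pivot D-column formula – looks like this in pandas (all rows hypothetical):

```python
import pandas as pd

# Hypothetical records: Year, Storm_ID, Observation_Date (text dates are fine here)
df = pd.DataFrame({
    "Year": [1851] * 6 + [1852] * 4,
    "Storm_ID": ["A1", "A1", "A1", "A2", "A2", "A2", "B1", "B1", "B2", "B2"],
    "Observation_Date": ["0625", "0625", "0626", "0701", "0702", "0703",
                         "0810", "0811", "0815", "0815"],
})

per_year = df.groupby("Year").agg(
    dates=("Observation_Date", "nunique"),
    storms=("Storm_ID", "nunique"),
)
# The D-column formula: distinct dates divided by distinct storms
per_year["Average"] = (per_year["dates"] / per_year["storms"]).round(2)
```

One hedge worth stating: the method assumes no two storms in the same year share an observation date; where they overlap, the yearly distinct-date count would undercount the summed durations slightly.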

We could also, however, replace Year with the overarching Grouped Year field we had extemporized in the previous post. Try that, and delete the now-excess formulas in D that descend beneath the final 2010 grouped year.

I get:


Of course the Average field abuts, but cannot enroll in, the pivot table, and as such can’t be party to a pivot chart. But by selecting A3:A20 and D3:D20 with the cooperation of the Ctrl key, which lets you select the non-contiguous ranges, you can insert a scatter chart with straight lines and markers and power up this chart:


You can right-click the horizontal axis, click Format Axis, and set the Major units interval for 10 as you see above, a decision that will disclose all the grouped years into the chart (note that the scatter chart seems to be preferred here because it honors the numeric status of the X axis data, here the grouped years; a line chart engineers a default treatment of X-axis data as labels. Look here for an instructive discussion of the issue. It’s true, on the other hand, that a conventional line chart could be put into place in this case as well, because the grouped years happen to be equally spaced and so would be indifferent to their label status; but you’d need to edit the axis in the Select Data dialogue box in less-than-obvious ways).

The chart delineates a dip of sorts in average storm duration across time, and so dashes a lurking, layman's speculation of mine – that the upheavals wrought by global warming would have served to prolong storm lengths. But ok – I guess that's why we look at the evidence. (Again, of course, we assume that storm measurement criteria and instrumentation enter the equation as constants across the 167 years, a premise that could be vetted.)

Now for a next task, we could examine the average maximum wind velocities across the grouped years, again with an eye toward any material change. Because the job appears to require a distinctly stepwise solution, in which the wind maximum for each storm needs to be calculated and then followed by an average of the maxima by years, I don’t think a pivot table can deliver an immediate, conclusive result. Here’s what I’d do, then: commission this pivot table:

Rows: Grouped Year


Values: Observation_Max_Wind

That table looks something like this:


I’d next turn the above results into a data set of my own, whose records could be plowed back into a second pivot table. I’d thus

  1. Redesign the pivot table into Tabular mode and likewise tick Repeat All Item Labels (both options stored in the PivotTable Tools > Design > Report Layout button in the Layout button group). Turn off Grand Totals.
  2. Click anywhere inside the above table, press Ctrl-A to select it in its entirety, and perform a simple Copy > Paste Values upon itself.

You’re left with an unprepossessing data set, divested of all its pivot table appurtenances. But we then proceed to pivot table the data anew:

Rows: Grouped Year

Values: Observation_Max_Wind (Average)


And if we’re charting the above we’re forced to return to a conventional line chart, because you can’t describe a scatter chart from a pivot table, and Excel will duly inform you. Proceeding with the line chart option, then, and indulging in some by-the-book tweaks, something like this emerges:


The plummeting of maximum speeds commencing with the 1960 interval surely demands a closer look, particularly in view of the subsequent movement upwards approximating toward pre-1960 averages. How is the dip to be explained? Observational error or methodological rethink? Actual diminution in velocities?

I don’t know, but just don’t call this post long-winded.

Atlantic Ocean Hurricanes: Brain-storming the Data, Part 1

26 Oct

A certain Nobel laureate once contended you don’t have to be a weatherman to know which way the wind is blowing; but the man from Hibbing, Minnesota didn’t track the loci of hurricanes, a matter about which some meteorological expertise will probably help you make deeper sense of the spreadsheet devoted to gale-force systems churning across the Atlantic Basin (i.e. the Atlantic Ocean, according to Wikipedia). You can whisk the sheet into your laptop by freely enrolling in

With content drawn from the US National Hurricane Center, the workbook delivers more than 50,000 rows of information about the storms, extending its archival reach back to 1851 and carrying through to November of last year. What this retrospective suggests, but doesn't prove, is the functional comparability of the data – that mid-19th-century observational techniques were the equal of contemporary measures, particularly the readings of wind velocities that feature in the data set (see this resume of the early methodologies). But assuming here that the numbers tally properly, let's proceed.

For starters we can ask an obvious question, namely, the number of hurricanes distributed by years, a stat that might have something to say about the effects of global warming. But it is here that a less-than-obvious step-back is required, because we need to make sure we know what it is we’re meaning to count.

To wit: those 50,000 records don't certify a like number of hurricanes; in fact, the data set comprises but 1,848 of those, an incongruity explained by the multiple observations to which the storms were subjected, at what appear to be six-hour intervals.

Moreover, a look through the Observation_Storm_Type field in column H (whose items are decoded by the sheet's data dictionary) lets us know that in fact ten different storm types populate the set. They're not all hurricanes, then, and so each needs to be counted discretely.

With those preconditions in mind and in hand (and once you've attended to the usual column auto-fit) we still want to tug the storm years from the records, perhaps most easily via this formula applied to the Observation_Date data in E, after titling the empty field in M Year:

=VALUE(LEFT(E2,4))
(The little-known =--(LEFT(E2,4)) will also work; the paired minus signs coerce the text to a number.) You'll note the E entries press together, in text form, year, month, and day data, e.g.


Thus the four-character extraction effected by LEFT delivers a textual result (that's what LEFT does); couching the expression in VALUE confers numeric standing on the output, something you'll want if you need to group the years, for example. Then copy the formula down the M column.
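For readers following along outside Excel, the same extract-and-coerce move is a one-liner; a Python equivalent of the formula:

```python
def year_of(obs_date: str) -> int:
    """Equivalent of =VALUE(LEFT(E2,4)): take the first four characters
    of the packed yyyymmdd string and return them as a number."""
    return int(obs_date[:4])

print(year_of("18510625"))  # -> 1851
```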

Remember that we’re seeking to break out storm counts by year; but because each storm enters the data set multiple times in virtue of their repeated observations, we’re reintroduced to a problem that greeted us a few posts ago: the mandate to realize a distinct, one-time count of each storm.

We reviewed that technique in that earlier post, and once you’ve nailed it down you can proceed to this pivot table:

Rows: Year

Columns: Observation_Storm_Type

Values: Storm_ID (remember these have been subjected to the Distinct Count operation)

I get, in excerpt (I’ve scrolled to the later years):


Even a perfunctory scroll down the years' grand totals will disclose an upswing in overall storm activity, though the slope is far from resolute. Note in addition the first appearances in 1967 of codes SD and SS, both species of what the data call subtropical cyclones, separated by wind velocities above and below 34 knots.
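The distinct counting behind each cell of that table can be restated outside Excel; a Python sketch, with invented records standing in for the real rows:

```python
from collections import defaultdict

def distinct_storms(records):
    """records: (year, storm_type, storm_id) tuples, one per observation.
    Returns {(year, storm_type): distinct storm count} -- what the pivot
    table's Distinct Count of Storm_ID reports in each cell."""
    cells = defaultdict(set)
    for year, stype, sid in records:
        cells[(year, stype)].add(sid)
    return {k: len(v) for k, v in cells.items()}

# one invented hurricane observed twice, plus one subtropical depression
recs = [
    (1967, "HU", "AL011967"), (1967, "HU", "AL011967"),
    (1967, "SD", "AL021967"),
]
print(distinct_storms(recs))  # -> {(1967, 'HU'): 1, (1967, 'SD'): 1}
```

The set does the deduplicating, so repeated observations of the same storm count only once.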

Now in view of the 117 years of storm data spilling down the timeline you may be inspired to group the years, in the service of a fetching concision. That intention sounds honorable, save for a surprise reality check: you can't group data once you've treated the data to a Distinct Count. You just can't get there from here, a peculiar obstruction that redefines the Distinct Count option in tradeoff terms; you'll be able to return unique instances of records, but at the cost of a forfeited grouping potential (see this discussion on the conundrum).

But a grouping Plan B awaits, as suggested by the link immediately above. We could fashion a helper column of sorts – i.e. a new field – that rounds off the values in question, poising them for a de facto grouping.

Yes, Plan B requires explanation. This is the idea: if, for example, you determined to group the years above into bundles of 10 (e.g. 1850-59, 1860-69), then if all the years contributing to a given bundle could be redefined, or rounded, to the bundle’s first year – say if 1852, 1856, and 1859 were coarsened into a temporary 1850 – then only one year in the tranche, 1850, would remain for grouping. But of course, a single year doesn’t have to be grouped, and so a data set full of 1850s, 1860s, 1870s, etc. is now readied for a conventional, non-grouped table.

Got that? And the formulaic means for bringing this faux grouping about is supplied by the FLOOR function, which identifies a value closest to, but not greater than, a specified number that’s at the same time divisible by a selected number. Thus

=FLOOR(1858,10)
would yield 1850, because that value is both closest to 1858 without exceeding it, and divisible by 10.

Put that understanding in place and you can name column N Grouped Year or some such, enter in N2

=FLOOR(M2,10)
and copy down the column.
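And FLOOR's arithmetic is nothing more than remainder subtraction; a Python equivalent, assuming positive values and a positive significance, as we have here:

```python
def floor_to(value: int, significance: int) -> int:
    """Equivalent of =FLOOR(value, significance) for positive inputs:
    the largest multiple of significance not exceeding value."""
    return value - value % significance

print(floor_to(1858, 10))  # -> 1850
```

Run down a column of years, every member of a decade collapses to the decade's first year – the faux grouping.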

Now this pivot table should do the job, more or less:

Rows: Grouped Year

Columns: Observation_Storm_Type

Values: Storm_ID (Count)

I get:


We see that the row labels portray a de facto grouping; and remember that if you want to more closely emulate the pivot table's default, hyphenated grouping spans, it's perfectly proper to submit the labels to a manual edit, e.g.


But you’ve noticed a problem, as I did – eventually. The storm-type values populating each row simply don’t comport with the grand totals (and you may have noted much the same in the first screen shot), and it’s the grand totals that are correct. I think this very real discrepancy owes to the fact that the same storms often experience a change in their type across the observations, and as such are counted multiple times (see this related explication).

Moreover, if you double-click any grand total Excel will disgorge up to the first 1000 records within any grouping, in spite of the number reported in the Totals. That is, the double-click that typically divulges just those individual records aggregated by the total seems here beholden to some invariant deep-structural process, one that doesn’t reflect record count directly.

In any case it’s clear the above tables aren’t presentationally acceptable, and you may have to content yourself with this simpler depiction:


Simpler, but I still don’t understand what each storm type means.

You see, Mr. Dylan, sometimes you need to be a weatherman.

Sociology Section Enrollments, Part 2: Collective Behaviors

16 Oct

Curious but methodical – that's how I quotably described the spreadsheet of the American Sociological Association's section membership data, calling attention first of all to the sheet's functional header row in 2, which manages to ignore the data resting atop it in 1; and in that connection I owned up to my own curiosity about row 1's resistance to any sort applied to the data set beneath it. It turns out that curiosity was apparently misplaced. If, for example, you proceed to sort this data set:


Its numbers will sort properly – for whatever reason – even if you click among the data in the top row, even as that attainment seems at least slightly curious too, at least to me (for a resume of some of the vagaries of sorting see this post by Excel ace Allen Wyatt).

And while I’m mid-course correcting, you’ll recall the supposition I entered in the previous post – namely, that ASA members are confined to a maximum of three selection enrollments, a deduction I prised from the enrollment averages springing from the race data workbook. It turns out, however, that I was wrong again; my misunderstanding was righted by ASA Director of Academic and Professional Affairs Teresa Ciabattari, who tipped me to the fact that no enrollment limitations are imposed on members. But that very open-endedness rouses the question why per-member enrollments thus hover around the 2 mark. Academics are busy people, of course, and their collective professional experience may have centered the normal curve around that number. What isn’t here is a frequency distribution of sorts, one that would count the numbers of members who affiliated with this or that many sections.

And I learned something else not entirely extraneous to the above: that section memberships must be paid for, the costs of which are interestingly variable. I was thus bidden to wonder if the dues might be negatively, or even positively, associated with section enrollment numbers (the most expensive: $52, for the Community and Urban Sociology section, a fee that at the same time grants members a subscription to the City & Community quarterly).

They aren’t, apparently. The correlation between section dues and their numbers comes to .127, a value that surely won’t quicken the heart of a sociologist, but at least tells us that money isn’t everything.

But back to the curious/methodical theme. Consider the organization of the membership-by-race worksheet as it was conceived by the ASA:


Now suppose you’re been entrusted with the raw membership data in their primeval, pre-spreadsheet form, along with a remit to put the numbers to some good analytical use. As such, you need to bang together a sheet that serves your purposes and only yours, and not the interests of a general readership who intends nothing more than a breezy scan of the numbers. What, then, would your sheet look like?

Here’s the form that functional recreation might assume: First, I’d efface the gray banding that could discolor a piquant conditional formatting of the cells. Next, I’d decouple all those merged cells in row 1 and see to it that each and every column/field was properly headed, e.g. in excerpt:


And by extension, I’d instate Row 1 as the data set’s header.

But these steps toward a remake of the worksheet conceal a false start – because if I'm shaping the sheet for myself and not for the general reader, I'll thus decide that I don't need the section % fields in the data set at all, headers notwithstanding. Once I've subjected the set to a proper restructuring on my terms, I can tap into the Show Values As > % of Row Total option and derive section percentages by race when needed; and if I don't need to know those percentages, then I don't need the section percentage fields either.
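The % of Row Total computation is, after all, mere division by the row's sum, which is exactly why the stored percentage columns can be spared; a Python sketch with a hypothetical section row:

```python
def pct_of_row_total(row):
    """What Show Values As > % of Row Total computes: each count as a
    share of the row's sum, returned as percentages to two decimals."""
    total = sum(row.values())
    return {k: round(100 * v / total, 2) for k, v in row.items()}

# hypothetical counts by race for one section
row = {"White": 60, "Black": 25, "Asian": 15}
print(pct_of_row_total(row))  # -> {'White': 60.0, 'Black': 25.0, 'Asian': 15.0}
```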

So on behalf of a prudent minimalism I’d delete all the section % columns, and by the same token I’d delete the Membership field; its aggregates can likewise be derived in a pivot table from the individual race section numbers.

But a question needs to be asked about row 2, the sheet’s erstwhile header. That row records ASA membership totals by racial category, and those data are not of a piece with the holdings in the following 52 rows. Row 2 enumerates the actual totals of ASA members by claimed racial identity, but they do not inflict a double-count upon the sheet’s fields just the same. Because members can enlist in as many sections as they wish, the totals for the eight race fields won’t add up to the figures in 2. Thus, the enrollment numbers in row 2 possess a discrete, non-duplicating character, but at the same time they don’t belong with their companion rows. On balance, then, I’d say the row has to go.

And for something like the coup de grâce – remember, I'm earmarking this data set chiefly for myself – I'd run the records through the Get & Transform routine described here, thereby unpivoting (blame Microsoft for that word; I take no responsibility) the columns into this most pivot-table-friendly arrangement, in excerpt:


And of course I can rename the Attribute field Race, or whatever suits me (but if I’m working in the privacy of my own room and left to my own analytical devices, I may not even bother to rename the field).
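For the curious, the unpivot itself is a simple reshaping – each wide row becomes one long row per value column; a Python sketch of the transformation Get & Transform performs, with an invented record:

```python
def unpivot(rows, id_field):
    """Melt wide records into (id, attribute, value) triples -- the long
    layout that Unpivot Columns produces in Get & Transform."""
    long_rows = []
    for row in rows:
        key = row[id_field]
        for field, value in row.items():
            if field != id_field:
                long_rows.append((key, field, value))
    return long_rows

# one invented wide row: a section plus one column per race
wide = [{"Section": "Sex & Gender", "White": 700, "Black": 90}]
print(unpivot(wide, "Section"))
# -> [('Sex & Gender', 'White', 700), ('Sex & Gender', 'Black', 90)]
```

Three fields out the other end – Section, Attribute, Value – however many race columns went in.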

And once I’ve gotten this far, my three fields can do almost all of the work of the original sheet design, and some more besides, for example:


That is, here I rank section enrollment preferences by race (the ranks read down the column), and thus learn that Sex & Gender proves the most popular section for White sociologists, while Racial & Ethnic Minorities stands in the first position for both African American and Multiple Selected ASA members. And for the 1162 members who, for whatever statistical or preferential reason, slip into the Missing category, Political Sociology tops the enrollments. And again, the Get & Transform rendition of the data set enables the user to tee up a Slicer, through which race- or section-specific data – whose items are all now domiciled in a single field – can be neatly filtered.

But the larger point is that even if my revamp does nothing more than emulate the potential functionality of the sheet I download from the ASA site, I’ve achieved in three fields what it took the original 18. That’s prudently minimalist, isn’t it?

Sociology Section Enrollments, Part 1: Collective Behaviors

5 Oct

I try to keep this close to the vest but the firewalls have ears, and in any event, I seem to have leaked this datum into my About link elsewhere on this page in a moment of inexplicable indiscretion, and it’s even true, besides – yes, I have a doctorate in sociology.

There – I’ve manned up and told the truth, and I haven’t even been nominated for the Supreme Court. Mea culpa. But there’s more than head-bowed contrition to the confession; it also explains why I was moved to have a look at the spreadsheet devoted to the section choices plied by members of the American Sociological Association (ASA) nestled into the Scatterplot blog. You can devote yourself to it here:

Copy of 2017_section_membership_by_race

In fact, the workbook cross-references the items of two parameters – the sub-disciplinary section enrollment preferences of ASA sectarians, aligned by what Scatterplot calls race. ASA members were asked to supply the latter information here:


You’ll note the liberty granted members to self-identify by up to two backgrounds, which “ASA translates…into 8 mutually exclusive categories: Black or African American, Asian or Asian American, Hispanic/Latino(a), Native, White, Other, Multiple, Missing.” (Note that the Asian designation denotes Orientals and not denizens of the Indian Peninsula, as that category is understood in the United Kingdom.) The Multiple (or Multiple Selected) rubric thus conceals the paired backgrounds self-imputed by some members, and as such it’s impossible here to know which two were selected in any instance and how often, though we do know the total multi-identity cohort contributes 4.97% of all ASA members. We also don’t know who populates the Other affiliation, a status characterizing another 3.38% of respondents.

Moreover, blog-post author Dan Hirschman of Brown University adds in a comment that “…the Other/Multiple/Missing categories are non-trivial for most sections (ranging from about 7-20%)”, an important, but not incapacitating, constraint on the data. In this connection, the legend on row 56 (which you may eventually want to delete; it’s a little too close to the actual data for comfort) observes that any cell value falling below five has been reassigned to the Missing field (we’ve seen the five limit before in a similar connection, e.g. this post); thus the 27-member Native American contingent totalled in cell I2 reports only 19 section enrollments, all of which do exceed five.  And while of course an N of 27 may beat back all claims to statistical significance in any case, the far more considerable Other, Multiple, and Missing cohorts leave us to yet ask exactly who is being counted there.

But notwithstanding those cautions material findings do await, emerging from the worksheet even as it stands. The sheet has been curiously but methodically designed; and we could pointedly aim both of those adverbs at the de facto header, installed in row 2.  Its contents are not typically the stuff of which headers are made; the entry 11551 doesn’t smack of grade-A header material, after all.

But it appears row 2 holds down its header status because the sheet designers wanted its numeric, aggregating entries to remain atop the sheet when the remaining data are sorted, and because row 1 merges its cells in order to center its race designations over two columns each – one conveying absolute membership totals, the other, each category’s percentage of the whole. But I’ve been unable to determine how the sort invariably staves off and excludes header 1. Suggestions are welcome.

But before we propose any revisionisms for the sheet we can learn a number of insights right now. First, and perhaps most obviously, the aggregate section totals in the B column can be sorted highest to lowest (remember that B2 heads the field), after which you’ll learn that the Sex & Gender section is the best-attended, attracting 1099, or 9.5%, of the ASA fold. The least popular? The Ethnomethodology and Conversational Analysis unit (honk if you like indexical expressions), whose name length stands in elegantly inverse proportion to its following.

But don’t get unduly confused by the numbers. Add the section enrollments and you get 26,628, per the members’ prerogative to enlist in multiple sections (probably no more than three). Divide 26,628 by the ASA actual membership of 11,551 and we see that the typical constituent has signed into 2.27 sections.

Proceeding with that understanding, we can simply (somewhere) compute the average number of member section enrollments by race, e.g. for African-American members:


(It’s C2 in which the ASA African-American membership in toto – 748 – is recorded.) That fraction comes to 1.95 section affiliations per member, and after conducting parallel calculations for the other races this comparison resolves:

African-American 1.95
Asian/Asian American 2.35
Hispanic/Latino(a) 2.48
Native American .70
White 2.28
Other 2.17
Multiple Selected 2.52
Missing 2.11

(Again, the data for Native Americans are obstructively small.) I’m not sure what the above variation, and variation there is, suggests; that question is better put to a…sociologist, e.g. why black disciplinarians opt for fewer section associations than the bearers of other identities.

The section data lend themselves to a number of other looks, including Dan Hirschman’s charted assays of races and their enrollment percentages by ASA section, e.g.:


(Note that each chart sorts the X-axis data items – the section names – highest to lowest, that is, by a race’s extent of participation; and as such, the items are variably situated on the axes.)

It occurred to me at the same time that a spreadsheet-specific, conditionally-formatted framing of the data could add appreciable value to the analysis. For example, we could color any section enrollment figure exceeding a race’s overall ASA membership percentage.

We can naturally and alphabetically enough begin with the African American data. Select the percentages in D3:D54 (note the formulas within, which multiply the authentic percentages by 100, souping up the results by a couple of orders of magnitude – 5.27, after all, isn’t .0527). With that selection in place, plug into Conditional Formatting and enter the following formula:

=D3>$D$2
Note I’ve nominated orange as the if-true cell fill color. and you need the dollar sign in order to properly compare all the African-American enrollments with the aggregate 6.48 in D2. And because we’ve already selected D3:D54, all those cells will receive the format.

You can next leave the D3:D54 selection in place and apply the range’s format to the other percentage fields across the data set via a Format Painter reproduction. If you double-click Format Painter you’ll be able to instill the format immediately by clicking F2, H2, J2, L2, N2, P2, and R2.
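The conditional test itself – does a section's percentage for a race exceed that race's overall ASA share? – reduces to a simple comparison; a Python sketch with invented percentages (the 6.48 echoes the aggregate in D2):

```python
def flag_sections(section_pcts, overall_pct):
    """Return the sections whose enrollment percentage for a race exceeds
    that race's overall ASA membership percentage -- the cells the
    conditional format would paint orange."""
    return [s for s, pct in section_pcts.items() if pct > overall_pct]

# hypothetical section percentages for one race
pcts = {"Racial & Ethnic Minorities": 18.2, "Theory": 4.1}
print(flag_sections(pcts, 6.48))  # -> ['Racial & Ethnic Minorities']
```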

When the clicking stops I get, in excerpt:


Here of course, and unlike the Scatterplot charts, the section names remain in place; and I would submit there’s lots there to think about.

Cool. Now does anyone have Max Weber’s email address? I’d like to show this to him.

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 3

21 Sep

Call it a footnote, addendum, postscript, or rather call it for what it is – unvarnished forgetfulness – but my longitudinal appraisals of Olympic athlete heights and weights last post walked right past the most obvious case – basketball.

What other sport, after all, jams its goal ten feet above terra firma, and so casts its net – pun intended, heh heh – for the tall and majestically wing-spanned? Jump on this pivot table, then:

Rows: Year

Columns: Sex

Values: Height (formatted to two decimals)

Weight (format as above)

Slicer: Sport > Basketball

Nothing that women’s basketball took off in 1976 I get, again in excerpt:


You’ll observe that, dating from the 1976 baseline, the heights of both women and men divulge a five-centimeter rise (about two inches), more or less; but scroll back among the men’s data to the 1936 inception point and the end-to-end differential spikes to almost 18 cm, or about 7.5 inches. Male basketballers have also gotten about 41 pounds heavier across the same interval; measured from the 1976-2016 gender-shared interval, the net average increase for men comes to around 25 pounds, and 11 pounds for women. And for what it’s worth, the average women hoopster nowadays is taller than the typical male competitor in 1936.

And while I was browsing, it occurred to me that at least one other sport likes its entrants long and tall – volleyball, introduced to both genders in 1964. Slice for volleyball and:


The average 1964-2016 height boost for women and men: a notable, nearly-identical 12-plus centimeters.

Now you’ll recall that Part 1 of this trilogy issued a clarification about the data set; namely the fact that many athletes are recorded multiple times in a given Games, if they competed in a plurality of different events in any year. I allowed then that the reasons for counting the same entrants repeatedly in the various pivot tables I compiled could be defended, and I proceeded from that conceptual springboard.

But that policy could be reframed and executed anew. If in fact unique Olympic appearances of athletes by Games could be shaken out – that is, if we counted but one instance of each athlete per Games – would the age and gender distributions I reported in Part 1 depart materially from the ones I could collate here?

Let’s see. First, we could shear the redundant athlete records from the data set via the stalwart Remove Duplicates routine, selecting the records by their ID and Year fields; that decision should preserve one iteration of an athlete for each Game (if you want to give this revamped data set a good you may want to save the workbook under a new name). Going ahead, that run-through discharged more than 80,000 now-excess records from the set, leaving a mere 187,000 or so behind to submit to a round of pivot tables replicating the batch I minted in Part 1 (in the interests of presentational lucidity I won’t paste the Part 1 tables here; they’re still back in there whence they came, though).

I began with a breakout of average athlete ages by gender and sport:

Rows: Sport

Columns: Sex

Values: Age (average, formatted to two decimals)

I get, this time (in excerpt):


The differences are very small and indeed in most cases simply not there at all, and for a good reason: most athletes tried their hands and feet at one event per Games. Small differences do attend the Diving averages, because some divers essayed multiple events in the sport (remember Sport defines the general category beneath which the various events are gathered). You can’t see the figures above, but even the age averages for swimming – a sport in which entrants often plunge into several pools – are near-equivalent, at 19.48 and 21.38 for the entire data set and 19.53 and 21.51 for the unique compendium.

Since the numbers appear to match so thoroughgoingly let’s move to but one more comparison, the alignment of age averages by gender and year of Games. Replace Sport with Year and I get now in excerpt:


You’ll see again that the divergences are largely minute, though we the unique average here for women in the 1932 games – 26.99 – distances itself appreciably from Part 1’s 29.22 (the respective counts for the two tables: 369 women in the full data set, a significantly trimmer 222 in the unique-record rendition).

But the redundant-record entanglement snares us anew if we move to answer perhaps the most straightforward of the questions we could ask of the data: How many medals has each country managed to win, and by which type? (One preparatory briefing before we advance toward our set of totals – you’ll note the outpouring of NAs flooding the Medal column; but here that classic placeholder in lieu of an unavailable datum means, more meaningfully, that the record in question belongs to an athlete who failed to win a medal.)

But the medal question’s defining detail consists of the means by which the country-by-country medal counts are to be totaled. Consider an obvious case: the United States men’s basketball team brought back the gold medal in the 2016 Games – but that victory, for standard reporting purposes, comprises exactly one medal. Yet the Kaggle data set marks the gold attainment for each of the 12 members of the team, thus unleashing a massive, prohibitive overcount of the medal numbers for this and any other sport contested by teams of any size. That sort of capitulation to the data as they stand can’t work, of course, and thus calls for another dial-up of Remove Duplicates (I’m assuming you’re working with the original, full data set), this time selecting for Year, Event (not Sport, each one of which again spans several events), and Medal, because we want to sieve but one of medal dispositions for each event outcome. Implementing the routine rid the data of over 246,000 redundant records this time, per the selection criteria.

Following that massive jettison, we can apply the remaindered records to this pivot table:

Rows: NOC

Columns: Medal

Values: Medal

(Note: NOC, or National Olympic Committee, assigns a three-character abbreviation to each participating nation, the legend to which may be found here. I’ve applied NOC, and not Team, to the table, because the latter parameter also admits, for whatever reason, the names of actual teams that fielded contingents at the Games, e.g. the Bagatelle Polo Club Paris).

In view of the preemptive steps we carried out above, the table should now properly cross-count medal victories by country, e.g. in a pinched excerpt:


It’s starting to look intelligible, but the article isn’t quite finished just the same. For one thing – and we’ve seen this before – the curious sort priority accorded MAR and MON, or Morocco and Monaco in full, needs to be explained. Those countries are stacked atop the column because Excel reads their codes rather as March and Monday, date data that the pivot table insists on sorting first; and in fact a manual A-Z sort won’t restore them to their proper alphabetical place because they’ll continue to retain what is in effect their mathematical character, and so force themselves again to the head of the list. If you really want Morocco and Monaco to assume their orderly mid-column positions you’ll need to right-click each name and apply the Move option, selecting Down repeatedly until each is bumped into the berth you assign them.

And Move gets to do its thing again for the column entries, because you’ll want Gold – the winners’ precious metal of choice – to station itself at the head of the row, followed by Silver and Bronze. As for NA, I’d filter it out – because Remove Duplicates has preserved but one instance of the non-winners for each and every event and year, and that’s misleading of course, because most athletes won’t win any event, and there are a lot of them.

And if you’re still with me you can sort the values by Grand Totals Largest to Smallest, realizing this hierarchy, in excerpt:


The United States wins gold for most golds and all medals, and by a margin far more extensive than I would have supposed. URS is a legacy reference to the former Soviet Union, but note the simultaneous tally assembled for Russia.

And I bet you didn’t know that Luxembourg has won four gold medals – including two for Mixed Painting, as many as the United States.

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 2

6 Sep

Have Olympic athletes gotten taller and heavier? Common sense, the hypothesis builder of first, but not last, resort, submits a yes. After all, competitors in other sports seem to have bulked up and out; so why wouldn’t entrants to the global quadrennial enlarge as well?

The conjecture makes sense, common or otherwise, to me, but fortunately more persuasive corroboration is at hand. The Kaggle Olympic athlete data set features height and weight data, thereby adjudging my sense-making a false start.

Nevertheless, a few important caveats need be pronounced upon the data. First, a notably large portion of the height and weight entries – around 60,000 in both cases, or around 22-23% of all athlete records – are missing, and I haven’t paced my way through the data to learn if those ciphers are systematically distributed (and by the way, about 9,500 age records, or 3.5%, report no ages either). Second, one must take some care to think about the like-for-like proviso, in view of the turnover of events appearing in this, but not that, Olympic year. There may be little analytical gain, for example, in according a place to the weights and heights of the combined 3,578 aspirants in the Art Competitions last conducted in 1948. One assumes their vital statistics contributed little to their medal fortunes, though one never knows (and since you asked, their average weight came to 165.64 pounds, with a mean height reaching to about 5’9”); but because the demographics for those pastimes simply aren’t there in the later Games their confounding effects must be considered, and probably discarded.

In addition, because boxing necessarily cordons its contestants by weight classes, its 6,047 contestants should be expected to evince little variation along that parameter across Olympics, and they don’t (they haven’t gotten much taller, either).
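The boxing claim can be checked the same way: if weight classes really pin the field down, the per-Games spread of weights should barely move across Olympics. A hedged sketch on invented rows (real work would filter athlete_events.csv to Sport == "Boxing" first):

```python
import pandas as pd

# Toy rows shaped like the Kaggle file's boxing subset; values are illustrative.
boxing = pd.DataFrame({
    "Year":   [1972, 1972, 1972, 2016, 2016, 2016],
    "Weight": [51.0, 60.0, 75.0, 52.0, 60.0, 75.0],
})

# A flat standard deviation from Games to Games would corroborate the
# little-variation expectation.
spread = boxing.groupby("Year")["Weight"].agg(["mean", "std"]).round(2)
print(spread)
```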

With those cautions in mind, investigative prudence might justify a concentrated look at those durable sports expected to win a continual place in the Games. We could then proceed to pivot tables buffered by a Sport Slicer that would afford a cleaner look at the question, one sport at a time.

If we’re satisfied with that recommendation, any pivot table that would spring therefrom sounds like a rather straightforward proposition, but grouping Olympic years by a chosen interval – say, a 12-year tranche that would parse three Olympics at a time – imposes a small presentational disquiet.

My initial intent, for example, was to group the years into bins of 12 in the interests of averaging heights and weights for three Olympics at a time (but recall that the eventual rethreading of the Winter Games into alternate even-numbered years would step up the number for any 12-year stretch). If, then, we move Year into Rows and frame the 12-year grouping, we’ll see:


Now that bit of consolidation works, but note the disclosure above of some years in which no Games were held (e.g. 1907), years which in fact don’t even appear in the data set. The pivot table insists on interpolating those years in order to preserve the grouping’s consecutive character, however misleading to the reader. But remember that Row Labels can be edited for display purposes; that is, you can rewrite the above spans as 1896-1904, 1908-1916, etc.
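Readers who would rather sidestep the phantom years altogether can compute the same 12-year tranches outside the pivot table, by flooring each year to the start of its bin; only years actually present in the data receive a label. A pandas sketch on a short, illustrative run of years:

```python
import pandas as pd

# A few Olympic years actually present in the data -- no 1907, no wartime Games.
years = pd.Series([1896, 1900, 1904, 1908, 1912, 1920, 1924])

# Floor each year to the start of its 12-year tranche; because we label only
# the years that occur, no phantom interpolated years can appear.
tranche = (years - 1896) // 12 * 12 + 1896
labels = tranche.astype(str) + "-" + (tranche + 11).astype(str)
print(labels.unique())
```

The spans come out as 1896-1907, 1908-1919 and so on; as with the pivot table’s Row Labels, they could be relabeled 1896-1904, 1908-1916, etc. for display.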

On the other hand, because the data set counts a manageably grand total of 35 modern Olympics, we may be happy to leave the years as we find them in their singular, ungrouped state. If so, we can begin this way:

Rows: Year

Values: Height

Weight (both computing Average and formatted to two decimals)

Columns: Sex

Slicer: Sport

And turn off Grand Totals.
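For the spreadsheet-averse, that field layout translates almost word for word into pandas’ pivot_table – index for Rows, columns for Columns, a boolean filter standing in for the Slicer, and margins=False for the suppressed Grand Totals. The rows below are invented stand-ins shaped like the Kaggle file:

```python
import pandas as pd

# Illustrative rows only; real work would read athlete_events.csv.
athletes = pd.DataFrame({
    "Year":   [1976, 1976, 1980, 1980],
    "Sex":    ["M", "F", "M", "F"],
    "Sport":  ["Gymnastics"] * 4,
    "Height": [168.0, 154.0, 169.0, 150.0],
    "Weight": [62.0, 47.31, 62.0, 41.42],
})

# Slicer: Sport -> boolean filter; Rows: Year; Columns: Sex;
# Values: average Height and Weight, two decimals; Grand Totals off.
gym = athletes[athletes["Sport"] == "Gymnastics"]
table = gym.pivot_table(index="Year", columns="Sex",
                        values=["Height", "Weight"],
                        aggfunc="mean", margins=False).round(2)
print(table)
```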

With all those material variables in play – Year, Sex, Height and Weight – the pivot table may tend toward a certain untidiness and hyper-density. Truncating the Average of… labels to something like Avg Height should sharpen its legibility.

Because I’ve delimited my interest to those mainstream sports likely to have been fielded across most or all Olympics, I can aim my first Slicer tick at Gymnastics, mustering these outcomes in excerpt:


Note, my fellow Americans, that heights and weights are expressed in centimeters and kilograms, respectively. (To restate the former in inches you’d need to multiply each entry by 0.393701, one formulaic way or another. Poundage is had by multiplying a weight by 2.20462; I had long thought the factor was exactly 2.2.) As for the #DIV/0! messages besetting the table – these allude to sliced fields containing no data – they can be air-brushed out of sight via the PivotTable Tools > Analyze > Options > Layout & Format tab, ticking For error values show and leaving its accompanying field blank:


Returning to the gymnasts, note the decided lightening of women gymnasts from 1948 through 1980, an average weight reduction, so to speak, from 57 to 41.42 kg, or well over 30 pounds. Indeed, women gymnasts were on the average 5.89 kg, or 13 pounds, lighter in the 1980 Games than in the 1976 competition; but the poundage reverts upward again from 1980.
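The metric-to-US arithmetic sprinkled through these paragraphs can be packaged once and reused; the factors are the standard ones (one centimeter is 0.393701 inches, one kilogram 2.20462 pounds):

```python
# Standard metric-to-US conversion factors.
CM_TO_IN = 0.393701
KG_TO_LB = 2.20462

def to_inches(cm):
    """Restate a height in centimeters as inches, to one decimal."""
    return round(cm * CM_TO_IN, 1)

def to_pounds(kg):
    """Restate a weight in kilograms as pounds, to one decimal."""
    return round(kg * KG_TO_LB, 1)

print(to_inches(175))    # about 5'9"
print(to_pounds(75.13))  # the art competitors' average, in pounds
```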

Explaining the variations requires something more than a top-of-the-head, knee-jerk, or any other such anatomically-driven response. Remember for one thing that many women gymnasts are counted multiple times among the data, and so their weights are submitted to the averages as often. Secondly, a survey of the particular gymnastic events in force for any given Games would be in order as well (the Gymnastic heading spreads over a variety of events), as new or abandoned events could have skewed the totals. In addition, the boycott by 66 nations of the 1980 summer Games may have likewise wrenched the average down.

On the other hand, the weight trajectory for male gymnasts is considerably smoother, tossing but one more complication into the mix.

We could next slice for swimming, another Olympic perennial. I get, in excerpt:


Here we see both men and women getting taller and heavier, with the paired slopes lifting upwards determinedly. That male average weights have burgeoned 16 pounds between 1972 and 2016, and that the male-female weight differential has gradually ascended, surely means something, though again some deep thinking about conclusions would be in order.

For a look at a winter sport, let’s slice for speed skating, opened to women in 1960:


Here the fluctuations are gentler, perhaps remarkably so, along with a small but measurable closing of the gender weight differential. You’ll also note that gymnasts tend to be shorter than both swimmers and speed skaters. But swimmers and skaters need to get to their finish lines sooner.