Campaign Contribution Data, Part 3: More New York and National Numbers

11 Oct

Just one, or perhaps 390,000, more things about the New York Presidential contribution data: might not the analysis be buoyed by a counting of monies apportioned by contributors’ occupations? It might; start with

Row: contbr_occupation

But stop there and you’ll already understand. Prepare to gaze upon 12,355 uniquely labelled vocation descriptions – but it isn’t their formidable heft that’s the rub. What’s adulterating the mix – again – are the assorted inconsistencies among the labels, and the way they impede the prospects for any follow-through across the data. What, for example, are we to do with these cognate employments:


Of course we’ve beheld this complication before, but the problem set here is perhaps slightly different. For one thing, the occupations have in all likelihood been furnished by the position holders themselves, and so we should properly withhold “blame” from the spreadsheet compiler. And with self-description comes classificatory idiosyncrasy, the bane of the grouping enterprise. Thus what editorial decisions are we to make about these kindred callings?


Does the analyst subject all of the above to a unifying, equivalent proxy occupation? There are 12,000 of those determinations to make here, and you’ll have to determine in turn if you have the time and/or money to devote to the challenge. The answer might be no.

In any event, if you’re enjoying the confusion and the complexity and want to go national, pop open the sheet detailing overall top fundraising periods in election years (it’s the very last Excel link on the page). The compendium, recording candidate receipts and contributions (the former is a superset of all monies including public funding, and hence is very often the greater sum; in 29 cases, however, contributions exceed receipts, often by a very small differential – see the notes on rows 3664-5) for presidential contests tracking back to 1980 and including the current tilt, comprises an assortment of parameters that seem designed to inspire unease. Again, throw the Presidential Candidate field into a pivot table’s Rows area and survey the ensuing list. How about


(Yes, the two are one and the same.)

Or more famously:


The point won’t be belaboured further (note, on the other hand, that the Bush, George and Bush, George W entries indeed signify father and son, and are meaningfully distinct).
Then replace Presidential Candidate with Report Type:


First, I don’t know what all those markers mean – e.g. 10P and 12C. Second, you do know what the redundant M6, M7, and M8s mean – uninvited spaces that have to go (the M data’s less-than-fully-numerical sort order follows from the fact, of course, that they aren’t numeric at all; the M demotes these to label status, and if you want M10-12 to queue up after M9 you’ll have to reorder them yourself).
Note as well that the M, or month, items advance their chronological references by an increment of one. That is, M3 reports contributions for February, and M12 counts monies for November. December data, then, are anomalously assigned YE, which I take to denote year end – but in fact if you filter all the data set entries for the YE’s, numerous pre-December 1 Coverage Start Dates loom, along with some Coverage End Dates that pull into January. Anyone have the FEC phone number?

Third, I’m disquieted by the potential apples-and-oranges mishmash of monthly and quarterly contributions that spill across the entries. Turn to the data and autofilter the receipt data for, say, William Jefferson Clinton (you remember him). You’ll see in excerpt:


On the other hand, you’ll learn that the above receipts emanate from different Clinton election committees, and so if the quarterly and monthly numbers properly and strictly line up with different organizational entities – that is, the totals can be deemed mutually exclusive, and we’re likewise satisfied that we’ve alleviated any name discrepancies – then we should at least be able to aggregate a candidate’s intake by year:

Rows: Report Year

Slicer: Presidential Candidate

Thus Hillary Clinton’s numbers read:


Remember of course the 2016 totals come to rest at August 31.
If you want to play safe, you could at least globally array receipts by election year:

Rows: Report Year

Values: Total Receipts

I get:


(It appears that all dates are authentically numeric, by the way.) Remember that the 2016 total tops at August 31 (really September 20 for one failed candidate, Lindsey Graham); thus with two-thirds of the compiled year having elapsed, the current total projects linearly to around 1.044 billion dollars, a small downsizing from the two previous presidential years; but it may be reasonable to suppose that a late-in-the-game contribution rush could hoist the final numbers above and beyond. The significant contraction of the 2012 sum from 2008’s aggregate warrants study, however. In any case that latter year seems to have incited a quantum leap in campaign inflow. And as for those small non-Presidential-year 1990 and 1998 totals, they were earmarked to hopefuls Willie Carter (something of a perennial candidate, apparently) and Frank Starr, respectively, and about them I know nothing more – nor am I certain about the reasons for those years’ inclusion. A sidebar awaits.
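(To spell out the arithmetic behind that projection: with eight of twelve months on the books, the linear estimate simply multiplies the extant sum by 12/8, or 1.5 – and working backward from the 1.044 billion, the compiled 2016 receipts thus far would figure to roughly 696 million dollars.)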

And if you were wondering, Santa Claus raised $180.00 in 2012.

And I’m on vacation now; I just can’t find WordPress’s out-of-office reply.


Campaign Contribution Data, Part 2: More New York Numbers

30 Sep

So far, the NY Election contribution data have been pleasingly mild-mannered: no scruffy labels gate-crashing the contribution and date numbers, and a candidate name field in which each and every hopeful is decorously spelled the same way.

I don’t mean to sound ungrateful, but I’m bothered by those enormous unitemized receipts reported to the worksheet by the Clinton campaign. That’s not an ethical cavil, mind you, but a data-management reservation. You’ll recall that about half of all of the Democrat’s take in the state (as of July 31) has been crowded into nine massive coffers, each of course comprising one record; and those composites enter the worksheet as what are in effect subtotals (though they’re not, of course, subtotaling the other records in the set; they’re adding contributions quite external to the worksheet) – and for pivot tabling and other purposes, subtotals and individuated records in a data set don’t mix.

As suggested in the previous post, I’m not sure that any worksheet-derivable patch for the problem looms. There’s no immediate way of knowing how many Clinton sectaries have collectively added to the unitemized millions; a call to the Clinton campaign might come away with an answer, and it might not.

In light of the above you’ll recall we originally pivot-tabled an average Clinton contribution of $489.72, that figure fed and distended in large measure by those giant unitemized batches. What one could do, perhaps, is subtract their $41,815,649 from the Clinton total and divide what’s left on the sheet by the 165,504 other contributions on the books, on the assumption – quite clearly an arguable one – that the unitemized gifts approximate the others in typical size. Pursue this idea, and the resulting $231.78 more than halves the original Clinton average – for what it’s worth.

Other breakouts await, for example a cross-sectioning of contributions by candidate as well as New York zip code. Of course you’ll want to know which zip code references which place; look here for a code-by-county identification (though I’m not sure about the View Maps link aligned with each code. A click of the link for the 11390 zip in Queens drew up for me a Google Map of a region of France that, so far as I know, can’t be reached by subway).

At base, American zip codes describe themselves in five-digit terms, but you’ll see that most of the codes attached to donor provenances in the contbr_zip field (in column G) have widened to nine digits; those extra numbers comprise zip code extensions that further specify some territorial position within the ambit of the superordinate, five-digit code area. But since a useful aggregation of the contributions by codes would want to lift the analysis to the base five-digit stratum, we need to disengage the first five numbers from the lengthier entries in contbr_zip. That sounds like a job for the LEFT function; thus in your next available column (mine is T; we claimed S last week for formula purposes), enter Zip Code in row 1 and enter, one row beneath:


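A minimal candidate, assuming per the above that the codes live in G and we’re writing in T2:

=LEFT(G2,5)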
Copy down the column and perform a Copy > Paste Values upon the results; and now what? Here’s what: we need to convince ourselves that each and every cell in T in fact returns the five characters we seek, and one means of ascertainment is this array formula, written more-or-less anywhere:


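One serviceable rendering, assuming the five-character extracts run down T2:T391514:

=SUM(IF(LEN(T2:T391514)=5,1,0))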
That is, let the array formula record a 1 for each cell in the range above whose length comes to 5, and then count all those 1s (remember that, being a customized array formula, you need to finalize the formula process by pressing Ctrl-Shift-Enter). My total of 391367 falls 146 short of the field row total, pointing to that many flawed entries. If you run this one-field pivot table:
Rows: Zip Code

You’ll see on top:


And those, needless to say, aren’t American zip codes.

In order to pinpoint those contrary 146 you can roll zip code into the Values area as well, whereby each code’s incidence will be counted, e.g.


Double-click right atop the 3 count for ‘1136, for example (yes, there’s a delinquent apostrophe in there) and all the records bearing that code will present themselves in a new sheet. We see here that all three instances are associated with Josh Rosenthal of Fresh Meadows in Queens. These are typos, of course, and a quick consult with Google tells us that the zip code for Fresh Meadows is in fact 11365. Aiming a find and replace at the data in T, by exchanging 11365 for every ‘1136, will restore Mr. Rosenthal to his proper geographical niche. On the other hand, the two “0” ersatz codes we see above both belong to Carlene Wahl of New York, and her generic city location will prove rather more problematic. If we want to learn her real-world zip code we’d thus need to track down Ms. Wahl and her address one way or another – if possible.

You get the idea here. We see that some, but not necessarily all, of the mis-coded 146 should be agreeable to some standard rectification; and one supposes the disagreeable ones could be filtered out of any table, if we had to.

In any case, and acknowledging the above sticking point that adheres to but .037% of all contributions, we could proceed with this table, which again organizes contributions by candidate and zip code:

Rows: Zip Code

Columns: cand_nm

Values: contb_receipt_amt

The reading here is dense, of course, but you’ll at least note the large Clinton sums piled up in zip codes beginning 100, all of which are sited in Manhattan. But look at her results for code 10185: 41,820,648.76. 10185 is likewise a Manhattan code, and here it references the address at which all those unitemized contributions were gathered.

Sound familiar, no? Kind of, because that gigantic sum, administered by the Hillary Victory Fund, has doubtless processed donations from individuals who live all over the place – zip codes unknown, and possibly unknowable.

And as a matter of fact, how do we know that all of those contributors even live in New York?

Campaign Contribution Data, Part 1: The New York Numbers

20 Sep

Read any good campaign financing reports lately? OK – it’s precisely that kind of question that explains why you cross the street when you see me coming your way.

Social skills aren’t my strong point, but I won’t be deterred. For those who want and need to know about who it is who’s paying for America’s presidential candidates this year – i.e., Clinton, Trump, and everyone else, past and present, click into the Federal Election Commission’s site and you’ve come to the right place. All kinds of spreadsheets in there aim to give the taxpayers their money’s worth, including a state-by-state drill-down of contributions, among which of course are the New York data. It’s a big state and a big sheet, so please download it via the above link and its ensuing reference:


The sheet’s 391,513 records log candidate contributions harking back to October, 2013, taking the activity up to August 1 of this year (really July 31; the one August donation on the books notches a donation of $0, and I’m not sure what that entry means to tell us), a short while after both the major party conventions came and went. But 3,204 of these mark refunded (or redesignated) gifts whose figures are expressed as negative numbers, and which assumedly have to be squared with some equivalently positive sum. But before we slide that and other attention-begging data issues beneath the microscope, rest assured that all the monetary sums in column J – contb_receipt_amt – and dates in K – contb_receipt_dt – are uniformly numeric. Cool – 783,026 cells you don’t have to worry about. And contbr_st in F is wholly ignorable; because the sheet as a matter of definition marks contributions issuing from New York, the NY entry is the only one you’ll find in the field. But you will have to auto-fit a number of the dataset’s columns, rather an expectable necessity.

Of course the data’s essential investigative question asks how much money each candidate has accumulated; and the answer should be straightforwardly realized, via a pivot table. But before the table happens something need be done about those 3,204 negative values. It seems to me that their manner of disposition here could be argued about, but I hold that they should be simply ignored – because they stand for actual contributions that, for whatever reason, were recalled, e.g. a campaign shutdown (I’m assuming none of these were voided for reasons of impropriety), and so they should be accounted in the candidates’ sums. These were monies that were raised, after all.

Again, that position could be contested, but I’ll go with it for the time being; and if I’m then to ignore the negative contributions I can appropriate next-available-column S, call it Contribution, and enter in S2:


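Something on this order should do, with the contribution amounts in J as before:

=IF(J2>0,J2,"")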
We’re simply stipulating that if the actual contribution in the corresponding J cell exceeds zero, use it; otherwise, assign a double-quote result, which will be neither tallied by the COUNT function nor acknowledged at all by AVERAGE (it will be noted by COUNTA, however).

Copy the formula down S and then proceed:

Rows: cand_nm

Values: Contribution (Sum, formatted to two decimals and Use 1000 separator, or comma. Note the comma doesn’t punctuate the very 1000 alongside the checkbox.)
Contribution (again, here Show Values As > % of Column Total)

I get:


We see Hillary Clinton’s extraordinary sway over the New York donor cohort (and no, I haven’t heard of all those candidates), her receipts trumping the Donald’s 40 times over. Her margin over erstwhile rival Bernie Sanders is more than nine-fold, though of course the latter’s campaign exchanged the hat for the towel some time ago. What’s also nice about the data, as it were, is the steadfast uniqueness of the candidate names in Rows. It’s a pleasingly no-tweak list: no faux duplicates among the 24 aspirants, no alternate names, no misspellings.

Having gotten this far you may want to break the data out additionally by date, a determination that asks a presentational decision or two of us. First I’d whisk the second entry of contb_receipt_amt out of the table, because the data subdivisions about to be instituted by the date parameter would inflict something of a mess on two sets of side-by-side values.

Next we – or you – need to do something about date granularity. If you install contb_receipt_dt in Columns and group its data by Years and Months, 27 columns of numbers pull across your screen (in addition to Grand Totals), an unfolding that might or might not serve readability and/or your analytic designs terribly well. Group by Years and Quarters, on the other hand, and the table retracts to 10 columns. Note, though, that the final item here – Qtr3 for 2016 – comprises but one month, July.

Whatever you decide, you won’t be able to miss Clinton’s recent lift-off in her bank account (and pitched to a smaller magnitude, that of Trump), as well as the reddening of the books of failed candidates, their negative numbers presumably signalling monies returned in the aftermath of campaign shut-downs.
You may also want to pursue a corollary interest in the number of contributions and the per-candidate average. Now that we’ve assumed our stance on the negative-number issue, we can pivot-table thusly:

Rows: cand_nm

Values: Contribution (Count, no decimals)

Contribution (again, by Average, two decimals, 1000 separator)

I get:


Some remarkable disclosures emerge. We see that contributions to the Sanders campaign actually outnumbered Clinton’s (as of July 31), but their average amounts to but one-tenth of the nominee’s per-donor outlay, a most interesting and meaningful bit of stratification.

But not so fast. If you sort the contributions you’ll find nine seven- (or eight-) figure Clinton contributions described as Unitemized; presumably these mighty aggregates enfold a great many individual gifts, and have the effect of sky-rocketing Clinton’s apparent average. In fact, those nine total over $41,000,000 – more than half the Clinton total – and mincing these into their discrete donations would likely roll up a Clinton contributor count that would overwhelm the Sanders figure – and ruthlessly depress the Clinton per-gift average at the same time. There’s nothing in the worksheet, however, that can tell us more about those nine behemoths. Some deep background, please. Note Donald Trump’s low average, along with the 10,000+ contributions to all candidates amounting to precisely $2,700 each, an apparent obeisance to the maximum permissible contribution of an individual to a candidate’s committee.

All very interesting. But there’s more to be said about the New York numbers – and the data aren’t always no-tweak.

US Patents: Inventing an Angle on the Data

12 Sep

Humans are nothing if not a resourceful lot; and one telling record of the human will to innovate continues to be written by the United States Patent Bureau, whose tally of certified inspirations submitted by both American states and other countries makes itself available here:

You’ll note the site wants us to know that “…a user may import this report file into spreadsheet software such as Microsoft Excel to facilitate printing.” Could be, but we’re not instructed how that import might be carried out, apart from the decidedly public-domain select-copy-and-paste to which I’ve resorted here:


You’ll recognize that the State/Country data in column A has been ordered to wrap its text in awkward places; switching off the Wrap Text effect in conjunction with a double-clicked auto-fit of A should set things straight, literally. You should also delete the wholly blank B column, because it’s…blank.

Now before you get to work you may want to flit across the aggregated data in rows 2-4. Here we see the striking and determined ascendancy of international patents, surmounting the US total for the first time in 2008 and continuing to pull away from the host country’s numbers ever since. That means something, of course, perhaps speaking more to the alacrity of knowledge development worldwide than any becalming of the imagination in the States. After all, patents in the US boomed 45% across the 2008-2015 frame, too.

But while those are most interesting data indeed, they don’t need us to discover them; they’ve already been compiled. We, however, may want to work with the state and country details arrayed below by pivot tabling them, and if we do, a round or two of preliminary decision-making needs to be thought through.

First, you’ll note the two rows separating the US state data (and these include American territories such as Puerto Rico as well as Washington DC, the country’s capital that doesn’t quite hold state status) from those of the other countries, a segregational practice of which you’ll probably approve. States are states and countries are countries, after all, and any admixture of the two might confuse the analysis – maybe. There might be some profit in comparing patents from California with those issuing from Poland, but that prospect leaves me hesitant, though I suspect there’s a minority view to be propounded here.

But assuming you’re pleased with the status quo you’ll need to copy the header row in 1 to rows 7 and 67, thus topping the two data sets appropriately. Next – if we’re pivot tabling – I’d delete the data in the All Years field, because keeping them in their place will risk a double-count of the patent numbers once the tables do their aggregating thing. Left as is, All Years will be regarded as just another year, and tabulated accordingly.

Next, you could treat each row to a Sparkline line chart, seeing to it that you identify C8:P65 as the data range, thus seeing to it in turn that (what is now) the B column data – bearing pre-2002 sums of patents dating across the 1963-2001 span – have been excluded from the line construction. 39 years’ worth of patents can’t be made to comport with the year-by-year counts that adjoin them, it seems to me. Select the Q8:Q65 location range, and your OK will plot some interesting, and very far from uniform, trajectories down the range, e.g. Delaware’s one-year dive in patent production last year, or South Carolina’s 140% splurge between 2008 and 2015. Overall, the lines certainly describe a cross-state, upward pull in patent yields, but again the anomalies, e.g. Louisiana,


could be deemed at least slightly provocative. Remember of course that Sparklines’ conformations are drawn relative to their ranges’ values; South Carolina’s spike cannot be matched in absolute terms to California’s, for example, and the very small patent numbers for the US territories need be understood as such, whatever their Sparklines look like.

Those qualifications understood, we could likewise apply Sparklines to the country data. Note that Czechoslovakia’s string of zeroes (following its pre-2002 2121) reflects that country’s reorganization into the Czech Republic and Slovakia, both of which do present substantive patent numbers. The country slopes are remarkably comparable, though some curious investigator might want to do some wondering about Italy’s slump in 2007 and its subsequent bounce back.

And now I’d like to perform a bit of a back-track. Earlier I had reflexively begun to prepare the patent data for some pivot tabling refinements, e.g., by offering that the empty B column should be made to disappear, and by filing the usual brief for a data-set row header. But we – or I – need to ask: given the nature of the rows comprising the data set(s), what might a pivot table really teach us? We need to remind ourselves that pivot tables are, in the first instance, instruments of aggregation, and there doesn’t appear to be much among the state/territory data to aggregate – no recurring, labelled entries there to submit themselves for consolidation, at least not as the data presently stand.

It might be possible, for example, to append a Region field to the country patent data by recalling that B column we had so unblinkingly eliminated, and devising a set of items meant to recur – e.g., Europe, Africa, Oceania, etc. – but that device would impose other problems on the set, those hinted at in my discussion of dataset reconstruction (which perhaps could be reviewed if you’ve decided to pivot table the data anyway) that I’ve cited many times, including now.

So given the data as we’ve received them, and appreciating that rows 1-4 have already done some of the aggregating work for us, the Sparkline strategy might stand as among the most trenchant services we could do for the data – though that superlative leaves me hesitant, too.

But if we’re Sparklining, there’s another subtle point to file away in your presentational rolodex that needs to be entertained. I had earlier advised that the erratic line breakages inflicted on the labels in the A column by the sheet’s Wrap Text effect should be smoothed away by simply turning off the wraps; and in fact that’s a near-imperative move for the Sparklines. That’s because the wrapped text feature by definition heightens the rows in whose text it wraps; and because, and unlike conventional Excel charts, Sparklines are cell-bound, they’re going to be heightened too. And the resulting distortion will skew the relation of the X to the Y axis.

I like that observation. Can I at least copyright it?

Team USA Stats, Part 2: Some Data Gymnastics

4 Sep

You can’t tell the players without a scorecard, they’ll tell you in the States, and you can’t tell the data without the formulas.

You’ve heard more memorable pronouncements than that opener, I’ll grant, but that less-than-bromidic avowal above makes sense. We saw in the last post how the Team USA height, weight, and birth state data threw more than a few curves at us, and baseball isn’t even an Olympic sport any more (it’s coming back in 2020, though); fail to straighten out the data curves and your analysis will be straitened.

Now that line’s a few sights more memorable, but I do go on. Our next round of data inspection takes us through the DOB (date of birth) field, and we want to get this one right, too. Quality control here starts by aiming a COUNT function at the J2:J559 range that populates DOB. Since dates are numbers, and since COUNT only acknowledges a range’s duly numeric data, we should aspire here to a count of 558, or one date per cell. But my COUNT totals 520, a shortfall that exposes 38 dates manqué that, facades aside, are nothing but text.
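That check, in formula form, written anywhere free:

=COUNT(J2:J559)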

Our COUNT is all the more incisive in view of the fact that every entry in the J column really does look like an unassailable date. But now that we’ve made ourselves aware of the turncoats among the loyal numerics, we’d do well, if possible, to rehabilitate them into the real things too. Here’s a pretty painless way: commandeer the next free column (it could be S, if you’ve left the formulas from last week’s exercises in R alone; if so, you’ll have to format S in Date terms), title it Birth Date, and enter, in row 2:


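A sketch of that formula, assuming we’ve landed in S2 and the DOB entries remain in J:

=IF(ISTEXT(J2),DATEVALUE(J2),J2)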
And copy it down. The formula asks if the entry in J2 is textual. If it is, the DATEVALUE function – a rather useful transformative means – turns pure text such as 12/14/1989 (where that expression has been formatted as text) into 12/14/1989, the date version. If the entry in J is an authentic date, on the other hand, the formula simply returns the cell entry as is.

Surprise – my eminently sensible tip doesn’t always work. Copy the formula and you’ll be treated to nine #VALUE!-laden cells; a first review of these implicates an old nemesis – a superfluous space in the source field, e.g.:


DATEVALUE can’t handle that textual intrusion, and so this refinement:

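That is, something along these lines:

=IF(ISTEXT(J2),DATEVALUE(TRIM(J2)),J2)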

Looks right, because TRIM’s job is to make superfluous spaces unwelcome.
But that revision doesn’t work either. Puzzled but intrigued, I went for the next move: a copying of one of the problem entries in J into Word, wherein I turned on the Show/Hide feature (in Home > Paragraph) that uncovers normally unseen codes. I saw:


Look closely and you’ll detect that special character to the immediate left of Word’s paragraph symbol. That Lilliputian circle appears to signal a non-breaking space, and apart from any conjecture about what and why it’s doing there we know one thing: it isn’t a superfluous space, and thus won’t be trimmed.

Thus again, the fix here may be short on elegance but long on common sense: edit out the circle from the nine problem entries in J (in fact some of the cells require two taps of Backspace in order to rid them of their codes. If you sort the dates by Largest to Smallest by first clicking on an actual date all the #VALUE! errors will cluster at the bottom).

And that works, leaving us with yet one last irritation – the birth date of Shooting team member Daniel Lowe, which reads


No need to get fancy here – just enter that slash.

Now, I think, you can go about your analytical business, e.g., breaking out athletes by birth month. You may recall my consideration, in a piece on 2012 Olympic data, of the alleged August effect, in which American athlete births in that month appeared to significantly depart from a chance prediction. Let’s see what the current team data tell us, via a pivot table:

Rows: Birth Date (grouped by Month)

Values: Birth Date (Count)

Birth Date (again, by % of Column Total)

(We don’t need Grand Totals here.) I get:


Here August – generally the most fecund month in the United States – shares the modal figure with June and March, its proportion substantially smaller than August’s 2012-team contribution. The numbers here simply suggest no special birth skew impacting the US complement, at least for this Olympics.

We now can also calculate each athlete’s age in conjunction with the most able assistance of the nifty and unsung YEARFRAC function. Enter the Olympics’ start date – August 5, 2016 – in any available cell, name the cell start, and proceed to column T, or whichever’s next available on your sheet. Name it Age and in row 2 try (assuming the corrected dates lie in the R column):

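For instance, with the corrected dates in R and that start cell in place:

=YEARFRAC(R2,start)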

YEARFRAC calculates the distance in years between the two dates on either side of its comma. Thus, cell-reference four-gold-medalist Katie Ledecky’s birthday – March 17, 1997 – in YEARFRAC, and with the start date cell holding down the second argument you get 19.38, Ledecky’s age in years on day one of the Olympics (note that you can’t actually enter 3/17/1997 in the function, because YEARFRAC will treat the entry as text. You need to either reference the cell bearing that date or enter 35506, the date’s native numeric equivalent).

Copy down the column and this pivot table beckons:

Rows: Sport

But guess what…


Yep, it’s that superfluous space thing again, this time practicing its mischief on four records among the Track and Field data. The simplest repair in this case, as it turns out: select the Sport field and run a Find and Replace at the column, finding Track and Field[space] and replacing it with Track and Field. That works, because in this case each of the errant four has incurred but one space.

Now introduce the Age field to Values (Average, formatted to two decimals). Bring back Age a second time, now exhibiting Count sans decimals. If you sort the results Largest to Smallest you’ll see the 12-member equestrian team holding down the age-senior position, with Team USA’s eight boxers computing to a lowest-age 20.75.

We could also correlate average athlete weight by event, an association which might drum up some less-than-obvious numbers, e.g.

Rows: Sport

Columns: Gender

Values: Weight (Average, formatted to two decimals)

I get:


Of course the per-team numbers are small, but they make for interesting reading, particularly the respective by-sport gender disparities (and note some absent teams among the men’s delegation).

I was surprised by the greater average weights of the two basketball teams measured against their rugby colleagues, even if the latter is (officially) the contact sport. And I did a double-take when I caught up with the respective boxing team weights; women boxers outweigh their male teammates by an average of 18 pounds. But here we’ve been thrown a sampling curve – the six male pugilists are concentrated in the lower weight divisions, even as the women – comprising exactly two boxers – weigh 132 and 165 pounds.

Eek – there was a lot of hard work to do in there; I think I deserve a podium finish for this one.

Team USA Stats, Part 1: Some Data Gymnastics

26 Aug

Now that Super Mario has bored his way back to Tokyo, let us praise Great Britain’s mighty Olympic team, and its world’s best 67 medals, followed by the United States and its 121.

Don’t read that twice – just say hello to the New Math. Here in England, where the media put the mallets to their collective tympani for all-Team GB all the time, one’s ear had to be pressed very close to the radio for news about any athletic glory redounding to anyone else.

But ok. Two weeks of harmless sporting jingoism does wonders for the commonweal, one supposes, and so now I can tell my co-residents here that, glory aside, the United States Olympic team has something the British contingent doesn’t: a spreadsheet about its members, available worldwide here:

Just click the Sortable Roster link.

The workbook’s name could be asked about for starters, because properly structured, any data set should be agreeable to sorting. You’ll also take note of the cell borders sectioning off respective sport (team) rosters, demarcations that no longer demarcate once one takes the workbook up on its offer and actually sorts the data by say, Last Name or Height. Because the borders will remain exactly where they were drawn – even when the sorts reorder the records – they’ll now be bordering near-random assortments of athletes.

But now to the data. The Team USA site lets us know that 124 of the team’s 558 members, about 22%, are California-born, an impressive disproportion over and above the state’s 12% contribution to the American demographic whole. If we want to break team representation out by all states, then, a pretty straightforward pivot table should be up to that task:

Rows: Birth State

Values: Birth State (count)

Straightforward, but problematic, e.g. this excerpt:


We’ve seen this before, and now we’re seeing it again. The Olympics may encourage diversity, but promoting disparate spellings of the same state name is grounds for a DQ, at least in this event. Note the pairs of Calif., Colo. and Fla. in the screen shot, a spate of duplications (and there are others in there) inundated by superfluous spaces. Note as well the misspelled Cailf., and it seems that full attention hasn’t been paid to the business of getting the data in shape.

But that’s where we come in. First we can sprint over to column R, the free column alongside the SCHOOL/COLLEGE field. The rows in R seem to have been formatted as Text, and so I clicked the R heading and redefined the column in Number terms. Then in R2 I entered, simply:

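Presumably a bare TRIM, assuming the Birth State entries hold down column L:

=TRIM(L2)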

And copied it down R, selected and copied those results, and pasted their values atop the L entries. (Having discharged that remit you could then go ahead and delete the contents of R.)

That standard corrective works, so far as it goes, but it won’t respell Cailf. That kind of repair might require a record-by-record edit that could make washing your socks seem exciting by comparison, though I for one would opt for the former activity (and discrepancies notwithstanding, I also get just 113 Californians, 111 if you break the residences out by the Current State field instead. I’m also not really sure what distinguishes Hometown State from either the Birth or Current State identifiers). But if you do need to know about team-member state origins (and non-American birthplaces as well), this kind of work just has to be done. Sorry.

And what about athlete weights, a numeric metric that could be productively associated with sport, height, and gender, and perhaps even date of birth? Don’t be disconcerted by the left alignments, but here too we meet up with an issue – namely the more than 50 weights that sport (ok – pun intended) their values in text format, tending to cluster among the Rugby, Golf, and Equestrian members, by the way. But this gremlin is easily sent on its way; sort the field by largest to smallest, thus packing all the text data at the very top of the field. Select the problem data in I2:I56 and click the attendant exclamation-pointed notification:


Click Convert to Number, and the weights acquire real, measurable poundage (note the weight and height for gold-medal swimmer Ryan Held are missing).

But what about the Height data? The metaphor may grate, but the entries here are squarely interstitial, purporting quantitative information in wholly textual mode. As expressed here, 5’11” is nothing but text; if you want that datum to assume a useably numeric form this recommendation asks you to convey the above height in its cell as 511 instead, and impose a custom format upon it that interposes those apostrophes between the “feet” and “inch” parameters. Either way the entry is really 511, and that value may not work with your aggregating intentions. Another tip would have you enter a height in inches – in our case 71 – and formulaically dice the number into a foot/inch appearance, which again nevertheless ships the data in text status.

In any case, we need to deal with the data as we have them, and I’d allow the simplest intention is to get these labels into numeric mode, i.e. inch readings. In that connection, I’d return to column R, title it Height in Inches or some such, and enter in R2:


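Reconstructing from the translation below, the formula would read something like:

=VALUE(LEFT(H2,1))*12+VALUE(MID(H2,3,LEN(H2)-3))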
To translate: the formula commences its work by detaching the first character in H2 – a 5 (I’m working with the default arraying of athlete records here, the first of which posts a height of 5’11”), and ascribes a numeric value to it via VALUE, supported by the given that all foot-heights should comprise one digit. That result is next multiplied by 12, yielding 60 inches thus far. I then isolate the 11 in 5’11” by applying a MID function to the task. The LEN(H2)-3 argument that registers the number of characters MID is to extract from the entry in H2 reflects the fact that any entry in the H column should consist of either 4 or 5 characters, e.g., 5’11” or 5’6”. Subtract 3 from either count and you come away with either 1 or 2 – the number of characters MID needs to pull from the entry in order to capture its inch value. Thus in our case we can add 60 and 11, culminating in 71 inches for the archer Brady Ellison. Copy the formula down R and eliminate the decimals, and our heights should be ready for the next round of analytical moves.

Almost. It seems my post-copy vetting of the height-in-inches data in R reports more than a dozen #VALUE! notifications – because some of the heights in the H column look like gymnast Kiana Eide’s 5’3, or indoor volleyballer Thomas Jaeschke’s 6-6. Neither reveals an inch punctuation, and Jaeschke’s height buys into a different notation altogether; and my formula can’t handle those discrepancies.

So it’s time for a Plan B. First run this find-and-replace on the heights in H:


(That is, replace the inch quotes with nothing.) That pre-formulaic fix should eliminate all the inch punctuations, directly exposing the inch numbers to the right of the cell. Then in R2 write:


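A sketch of the revision, on the same assumptions as before:

=VALUE(LEFT(H2,1))*12+VALUE(RIGHT(H2,LEN(H2)-2))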
What’s changed here is the latter half of the expression, which now splits 1 or 2 inch characters from the right of the cell, depending on the single or two-character length of the inch totals. Copy this one down R and we should be in business.

Not. Two utterly obstinate athletes, field hockey aspirant Jill Witmer and soccer teammate Lindsey Horan, feature a single apostrophe beside their inch figure, a minuscule disparity that defeats my best efforts at a global formula rewrite – along with the data-less Ryan Held. Here discretion trumps valor – I’d just delete the incorrigible apostrophes and Held’s #VALUE! message, and take it from there. Now I have real heights.

Ms. Witmer – or whoever entered her data – sure is playing hockey with my fields.

The College Transcript: Downloading an Upgrade

19 Aug

It’s homely, to be sure, but if you want to go somewhere you gotta have one. And no – I’m not talking about your passport photo but your college transcript, an obstinately prosaic but indispensable means of entrée to your next academic or career step.

The transcript – an enumeration of a student’s courses and performances gathering into what we Yanks call the Grade Point Average (GPA) – has undergone a rethink of late. A piece this past February trained its lens on a number of initiatives aiming to drill qualitative depth into the transcript’s tale, sinking some analytic teeth into its default, alphabetically-driven narrative by linking its grades to students’ work and detailed progress toward a degree.

And that got me to thinking: if it’s depth we’re seeking, why not endeavour to learn something more from the numbers and the letters by re-presenting the transcript as a…spreadsheet?

It makes perfect sense to me, though you’d expect me to say that. But after all: submit a transcript to some standard tweaks and you wind up with a dataset, one suitable for sorting, pivot tabling, formulaic manipulation, and charting, too. And once the tweaking stops, the transcript can treat its readers to a round of different, edifying looks at the data – and that’s what I call depth, too.

Transcripts aren’t things of beauty, but they sport no small amount of variation just the same. So to understand what we’re working with, more or less, take a look at this one – the transcript of Mr. Ron Nelson, who made his academic record available here:


In the interests of exposition, I’ve subjected the baseline transcript above to a round of fictional retakes that of course don’t represent Mr. Nelson’s actual attainments (for one thing, his record dates back nearly 25 years). A few of the points that call for renovative scrutiny, then: First, note the blank column coming between the Course No. and Course Title columns, an excess which must, for spreadsheet purposes, be curtailed. Second, the multi-columned iterations of Course Titles and associated grades need be cinched into a single field. Third, the academic term headings (e.g. Spring Semester 1991), and TERM TOTALS and CUMULATIVE TOTALS lines have to be sent elsewhere; they report information that simply isn’t of a piece with the grade/grade point records that the dataset should comprise.

Next, if you’re new to the GPA you need to know how that defining metric is figured. While of course variation again abounds, the textbook illustration looks something like this: Class grades are typically assigned along an A-to-D continuum along with a failing F, in what are in effect quantified decrements of a third of a point, e.g., A, A-, B+, B, etc. In the typical system an A earns 4 points, an A- 3.67, a B+ 3.33, and so on (the D- grade is rarely offered, by the way). An F naturally awards no points.

Each grade-point achievement is in turn multiplied by the number of credits any given course grants, resulting in what are usually called quality points. Thus a B- grade in a three-credit class yields 8.01 quality points – 2.67 times 3. An A in a four-credit course evaluates to 16 quality points, or 4 times 4. The GPA, then, divides the sum of quality points by the sum of credits achieved. Thus this set of grades:


Works out to a GPA of 2.92.
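(The actual grade set doesn’t reproduce here, but a hypothetical one makes the same point: four three-credit courses graded A, B+, B-, and C- earn 12, 9.99, 8.01, and 5.01 quality points respectively – 35.01 in all – and 35.01 divided by 12 credits rounds to 2.92.)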

It’s pretty intelligible, but with a proviso. The GPA must incorporate the number of credits associated with a failing grade into its denominator, and so these performances:


Calculate to a GPA of 2.33. But the 15 credits recorded above really only bestow 12 usable credits upon the student, and that dual count needs to be remembered.
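(Extending the hypothetical above: tack a three-credit F onto those four courses and the quality points hold at 35.01, but the denominator swells to 15 – and 35.01 divided by 15 computes to 2.33, even as only 12 of the credits actually count.)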

With that extended preamble noted, my spreadsheet-engineered transcript demo (spreadscript?) presents itself for your consideration here:

Transcript demo

In excerpt, the sheet looks like this:


Note the paired Term and Date columns; though one might be moved to declare the former field superfluous, it seems to me that its textual Spr/Fall entries could enable a pivot table broken out by seasonality, i.e., GPAs by all Spring and Fall courses across the student’s academic career. The Date field, on the other hand, is duly numeric, thus lending itself to chronological resorting should the current sequence of records be ordered by some other field. And the grades have been visually differentiated via conditional formats.

The Credits total in the upper right of the screen shot reflects a necessary bypassing of the F grade for Music 101 per our earlier discussion (the grades are stored in the H column), and is realized by this formula:


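A sketch of that formula, with Grades and Credits standing in as hypothetical names for the actual grade and credit-hour ranges:

=SUMIF(Grades,"<>F",Credits)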
The SUMIF here is instructed to ignore any F in the F column via the “not” operator bound to the formula’s criterion. Note the quotes SUMIF requires when an operator qualifies the criterion. The GPA, on the other hand, divides the quality point total by all 112 credits (you will have noted that the spreadsheet outputs the quality points in H via a lookup array range-named gr in Q1:R10. And in the interests of simplicity I’ve let subsidiary course events and their codes, e.g., class withdrawals and incompletes, go unattended).

Now the data become amenable to pivot tabling and other assessments. For example, if we want to break out GPAs by term we can try:

Rows: Date (You’ll want to ungroup these, if you’re working in release 2016)

Values: Hours/Credits (Sum)

Quality Points (Sum, rounded to two decimals)

Because we need to total the date-specific quality points and divide these by the respective-date credit totals, a calculated field must be implemented, e.g.

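In the Insert Calculated Field dialog, that is, name the field GPA and define it along these lines (field names per the table above):

='Quality Points'/'Hours/Credits'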

Click OK, again round off to two decimals, and you should see:


Once the GPA field is put in place you can, for example, break out credit accumulations by Discipline, or subject, by replacing Date with Discipline:


Or try a frequency analysis of credit totals by grade:

Row: Grade

Values: Hours/Credits (Sum)


(Note: because of the priorities with which Excel sorts text characters, grades accompanied by the + symbol initially appear at the bottom of any letter sort, e.g., you’ll initially see B, B-, and B+. You’ll need to right-click the B+ and select Move > Move “B+” up twice. And of course the same adjustment should be applied to C+.)

Of course these outcomes could be charted, e.g.


And if you are interested in seasonality:

Rows: Term

Values: Hours/Credits

Quality Points (Both Sum, and both rounded to two decimals)



(By the way, you’re not duty-bound to earmark Hours/Credits and Quality Points for the table if you want to display GPA at the same time. Once constructed, GPA becomes available in its own right, and need not be accompanied by its contributory fields.)
And all these and other reads on the data could be assigned to a dashboard, too.

Thus the transcript-as-spreadsheet could break new presentational ground, supplementing the row-by-row recitation of subjects and scores that students and recipient institutions currently face, with a suppler way around the data. They could even be made unofficially available to students themselves via download, empowering the spreadsheet-savvy among them to map and understand their grades in novel ways (one trusts that no one’s accepting a transcript bearing a student’s email address).

But is it a thing of beauty? Maybe not, but don’t you like the colors?