Time (and Record) Management: OECD Data

25 Jul

The OECD clearly has a lot of time on its hands – after all, think how long it takes to say The Organisation for Economic Co-operation and Development. But time likewise figures in its research remit, here keying a survey of the time management habits of denizens of its member nations (at least most of them, along with representations from China, India, and South Africa, designated OECD partners), and recapitulated here:



The data – and it’s the Total sheet I’m reviewing – spread out some neat cross-national takes on how respondents apportion their daily round, normalized for each nation’s activities to the day’s 1440 minutes (and I assume normalization means reconciling the inter-national variance in activity definitions; see the Activity category tab). You’ll be interested to know, for example, that South Africans appear to be able to devote 562 minutes a day or night – nine hours and 22 minutes – to sleeping, and that while earmarking but 463 minutes to that circadian necessity we’re left to conclude that the Japanese and Norwegians seem to be drinking the most coffee. It is also noteworthy that respondents from Turkey – the only preponderantly Muslim country in the national roster – spend the most time in religious activity (row 32), at 19 minutes (South Africans come in one minute behind). It’s all interesting, but I can’t resist the geekish aside to the effect that the row 32 numbers are variously positioned – with some middle-aligned, (along the vertical axis), and others right (and you might very well also want to pump up the 8-point text prevailing throughout).

But at the same time the numbers urge a familiar question upon the plenary: For whom is the spreadsheet intended – a public of passive, if interested, readers, or that A-Team of caffeinated deep thinkers who want nothing better than to do something with the data? If you’re siding with the latter insurgents, and want to act upon what it is you’re viewing here with your canteen of slicers (and Slicers) and dicers, then the data – and we’ve seen this before – have to be reshaped. That’s a question we’ve asked in so many words before, and it requires asking here as well.

Let’s see. For starters, we need to do something about row 3, harboring as it does survey date identifiers impersonating as a row of data. And a similar reprisal need be practiced upon those purple subtotal rows, which should be deleted. Leaving them in place and proceeding to drop the data as they stand into a pivot table will do nothing less than double any summed results, and yes, we’ve seen this before (e.g., my January 24, 2013 post).

On the other hand stripping those purple rows will also deprive the sheet of their associated headings, e.g., Paid Work or Study, Unpaid Work. But again, rows of data should be of a piece; insinuating rows of titles into the average daily minutes is tantamount to playing checkers, when in fact we’re playing chess. Moreover, the Men & Wom field in the A column sufficiently identifies the activities subject to the minutes-per-day estimates. But there’s more to be said about this.

But before we expound the larger issue betokened above, swing over to column AC and its swath of computations averaging the numbers lining up to their immediate left. Try cell AC4:


And try explaining it. Note all the references in the expression are alphabetically contiguous; and given its cellular continuity how do we defend the formulation above, when we have


at our disposal? I can’t answer my own question.

And indeed – those averages ignore the three partner nations on the other side of AC, thus plunking the wrong kind of data amid the time-estimate numbers. And the blank AD column has to go.

But there’s that larger issue, reprising an earlier question, the one I asked and attempted to resolve in my August 22, 2013 post. The Total sheet confers field status upon tufts of data that should more rightly conceive of themselves as items relenting to a larger, governing field. Thus, for example, the field-defined member countries in the sheet should be made to submit to the controlling aegis of a Country field. It seems to me, then, that a reconstructed data set would wheel out records looking something like this:


Again, umbrella fields on the order of Gender or Country are far mightier enablers of the kinds of grouping and ordering feats that pivot tables perform, and well justify the necessary surcharge the additional data entry would exact (again, turn back to August 22).

For example, the data reforms I’m (again) proposing would free a pivot tabler to rank national time outlays by any activity, by grabbing the Rank Largest to Smallest marble from the Show Values As bag of alternatives. The point is that Rank Largest to Smallest can do what it does only among items planted in the same field – and not between data settled in different ones.

So if you’re wearing the A-Team colors, why not reorganize the data here and shop it to the OECD – if you have the time?

Hire Education: U.S. Tuition Stats, Part 2

17 Jul

Now that last week’s spade work on the U.S. university tuition data has softened the ground for the next set of field trips (don’t worry; the puns and the metaphors stop here) we can proceed.

What, then, about some pivot-tabled, statewide breakouts of the tuition numbers? Start straightforwardly:

Row Labels:  State

Values: 2012-13 Tuition and fees (count)

2012013 Tuition and fees (again, this time by Average, restyled by currency format). Then sort this column highest to lowest. In excerpt I get


(Don’t rub your eyes at the math, or at the smallness of the text, for that matter. The 59 “States” grants a visitor’s permit here to various U.S. territories, e.g. the Virgin Islands and Puerto Rico, with its 98 institutions).

Rhode Island, Vermont, and Massachusetts, three states jigsawed into the New England region in the U.S. northeast, head the sort. But because tuitions ebb and flow by sector type that parameter needs a hearing. Drag away 2012-13 Tuition and fees (count) lift Sector name into Column Labels, and sort it all by the 4-year, private not-for-profit item. I get (again, in part):


This time four New England states – Rhode Island, Massachusetts, Connecticut, and Vermont – spearhead the list (trust me, if you can’t make them out up there). Old states (they’re all part of the original 13), old money, or something like that.

Then recalculate the data by Count, restoring the numbers to their General format. Note the refurbished results continue to hold to the 4-year, private not-for-profit sort, only this time hoisting New York to the top. Now show the values as % of Row Total. Don’t move away from the 4-year column and scroll down to Rhode Island and its 66.67% proportion, Vermont’s 68.42% and Massachusetts’s 60.00%. (Note: you can’t here sort the percent data by largest to smallest, because the item “really” continues to comprise the university count. % of Row Total merely masks the actual data as it was brought to the table.) Compare those states’ shares to those of other states, and the overall tuition averages can be more revealingly understood as a creature of the distributions of institution types across states.

But our workbook contains other data, too. The NetPrice worksheet rolls out real-world cost figures, performing subtractions from the stated tuitions as wrought by various modes of student assistance (again, review the Introduction tab for definitional enlightenment). You’ll need to note, though, that NetPrice carries 2011-12 outlays, and as such can’t truly be compared to the Tuitions sheet data, circa 2012-13. Note in addition the 130+ plus schools for which no net price data avail, and a smaller number of truant institutions – about 70 – showing blanks in their Percent receiving grant aid cells.  (And don’t overlook those four schools flouting negative net prices, meaning they’re paying you to go there.) And I can’t quite tell if the net prices we see here typify only those students who have actually received aid, or all the attendees at a school – even the ones who’ve footed their entire bill.

Nevertheless, there’s a lot to work with, starting, for example, with a breakout of the percent of students receiving aid by sector, along with average net sector price:

Row Labels:  Sector name

Values: Percent receiving grant aid 2011-12 (average, with appropriate formatting)

2011-12 Net price (Average, and suitable formatting)


An impressively large proportion of all students, then, receive some manner of outright aid; (the Net price description in the workbook’s Introduction sheet suggests the numbers above denote granted, as opposed to loaned, monies), with 4-year public institution enrollees the least likely to win a subsidy, perhaps because their tuitions are relatively low-ended to begin with. Still, students at the less than 2-year private not-for-profits and public school receive aid more often, and yet pay less tuition. And remember again that if you’re keen to compare net with initial tuitions, you’ll have to somehow ford the reporting-year divide distancing the Tuition and Netprice sheets.

Then simply sort the net prices from largest to smallest, a move that levitates the Aviator College of Aeronautical Science and Technology in Florida to the top floor with its net price of $71,492. You’ll remember the list-topping Landmark College from last week’s post, but that school asserts to its superlative standing only among its 4-year not-for-profit peers. The 2-year, for-profit Aviator College is the dearest of the dear; and what’s odd is that its 2012-13 tuition – presumably the figure assessed before any financial relief comes to the rescue – comes to $63,073, curiously lower. Of course, the numbers emanate from different school years to be sure, but a deeper look seems in order here, and while you’re at it, check out California’s American University of Health Sciences, the net price runner-up at $66,743. Its tuition for 2012-13: a nickel-and-diming $30,700. Explanations, anyone?

And for another set of thought-provokers, be mindful of the NetPriceChange sheet’s two comparison years, 2009-10 and 2011-12. It’s not unreasonable to wonder about that missing year in between, along with the sheet’s roster of 6843 schools, far outpacing the 4269 with which we worked in the earlier sheets. The respective sector populations by sheet:




It’s sector 9 – the Less than 2-year, private for-profit institutions niche – that pulls far away from the same-numbered Tuition/Net Price entry . Someone needs to account for the vastly different counts here. And one of those 2-year schools, by the way, is Florida’s Eternity Cosmetology School,

But with a name like that, are you sure they didn’t mean Cosmology?




Hire Education: U.S. Tuition Stats, Part 1

11 Jul

You’ll have a grand time at the American university of your choice – about 30 grand, that is, and that’s for only one year’s worth of food fights, and all those other required credits. That promo is presented on behalf of the nation’s universities, who want you, or a loved one, or Uncle Sam himself, to burn that plastic until it starts to melt.

And you won’t feel any better when I tell you that my college stay was basically free, courtesy of New York’s City University and its long-since-superannuated tuition policy.  But you’re likely to cheer up when you hear the update, though: that my alma mater, none other than Queens College, nowadays drops an invoice of about $5,900 on the notebooked laps of its enrolees, at least as of the 2012-13 year.  I know that because I’ve called up the data from the Department of Education’s www.collegecost.ed.gov/catc/ site and its College Affordability and Transparency List, which I’m affording to you here:

 Copy of CATClists2012

You’ll see among other things that the workbook’s sheets report different data for different years, so you’ve been notified and keep that in mind, and you will want to inspect the Introduction tab for some important definitional background on key terms. In any event, the data are pretty serviceable, though they could stand a tweak or two.

Let’s begin with the eponymous Tuition sheet, totalling the annual costs for 4269 sundry institutions (can you get a football scholarship to the Professional Golfers Career College?) in 2012-13. I’d adjudge the code-bearing Sector field in column A surplus to requirements, as the actual rubrics for which the codes stand next door in the Sector Name field in B.

More problematic, though, is the data-obstructive fact that some schools are identically named – and while some of these are sister institutions (e.g., St. John’s College in Maryland and New Mexico), those kinships trouble the data just the same. Cropping the records with the ever-handy Remove Duplicates tool (by looking out for duplicates in the Name of institution field only, but not saving the reduced data set here; we merely want to know the very fact of redundancies) made 58 records disappear, but again we do want to retain all the nominal twins; they’re unique schools, after all.

And even as want to keep these same-names in our fold, we nevertheless need to somehow impart a measure of identificatory uniqueness to each and every school. The surest bet would be this: head on over to the J column, the data set’s first empty corridor, call it Unique Name, and enter in J2:


That unpretentious string concatenation assimilates each school’s exclusive UnitID number to its name, e.g.  University of Pittsburgh-Pittsburgh Campus-215293.

That works, but all those numbers aren’t pretty. Since the great bulk of school names are one-of-kind, we’d just as soon work with those, and string the UnitID only to those schools that need further distinction. I’d thus try this expression instead:


That is: if a school name surfaces more than once, call in that UnitID appendix. Otherwise, leave the name as we find it.  Then concretize the results with Copy > Paste Special > Values.

Now here’s something else. Note the List A and List E fields in columns H and I, identifying tuitions (see the Variable values tab for elaboration) landing in the upper 5% or lower 10% of all schools, respectively. Schools in fulfillment of either benchmark are coded 1, with a 0 reserved for non-qualifiers. I find the columns curiously duplicative; it seems to me that the two discriminations could be made to co-exist in one field, by assigning 5-percenters an A and the lower 10-percent B, for example, after which a pair of simple COUNTIF expressions would be able to  number both cohorts.

But either way, I’m confused. If you simply sum the the code-numbers in column H – the one ranking the upper 5% tuition charges – you’ll indeed realize216, or 5.06% of the school population. But filter the schools in receipt of that defining 1 and you’ll find a good many that aren’t remotely to be numbered among the top 5%, e.g. Minnesota’s Rainy River Community College, offering to teach you want you need to know you for a svelte $5,323 a year.

But that’s when I had my pint-sized Eureka moment. It seems as if the 5% metric redlines schools only within their sector, a decision rule I confirmed by some additional filtering. Thus Rainy River, a sector- 4, 2-year public (that is, governmentally-operated, as opposed to the British usage of the modifier) establishment, does in fact “merit” its high-end credential – relative to its institutional peers.

And once that insight strobes across our collective mindset, we can completely forego the sheet’s native Top 5 and Bottom 10% convolutions and pivot-table our way toward much the same. Try:

Row Labels: Unique Name

Values: 2012-13 Tuition and fees (Sum)

Report Filter: Sector Name

Once in place, select say, the 4-year, private not-for-profit sector. Then right-click somewhere in Row Labels and proceed toward Filter > Top 10… and continue:


You may need to sort the tuition results descendingly, and you won’t really need the Grant Total. I get:



And no, I hadn’t heard of Landmark College, a Vermont-based institution wholly dedicated to students with various learning disabilities. I was also slightly, but only slightly, surprised to find Columbia University ensconced in second position, reflective perhaps of New York’s bumped-up cost of living. (If you’re wondering, Harvard asks for a bargain $39,966.) And note the two St. John’s Colleges, differentiated by their IDs.

And now you can play around with the permutations – filtering the various sectors for top and bottom tuition rankings.  And if you’re looking for perspective, consider the objurgations flung at the British government for lifting yearly university tuitions to  a maximum of £9000 (about $15,300) for a three-year, degree-culminating regimen. All told, in other words, that’s about $46,800. But that’s what Columbia charges for a year  - times four.

So going to college? Start looking for a loved one – now.

TWDOCS: The Comeback

3 Jul

I still don’t know what Google wants to happen when it asks, or tells, me that I’m Feeling Lucky, but either way I’m not sure if it was luck or an inertially-driven pair of hands that fell upon my crumb-dappled keyboard and tapped twdocs.com into my URL bar – again. I’d been engaged in the Sisyphean pursuit of the incommunicado site for some time, hopefully expounding its error-messaging links for some sign that something else was going to happen next.

And so on and so on until this week, when force of habit yet auto-piloted me back to twdocs’s still-listed address, whereupon the next click… made something actually happen. A spreadsheet, bearing the results of a Twitter keyword search, the very object I had requested vainly from twdocs lo these many months, offered itself to me in the form of a long-lost, happily-attended prompt. Talk about wow factor; viva inertia, I say.

I’m not asking any questions. Twdocs is back, at least for now, streaming near-real time, key-worded tweets your way in their countable, measurable, eminently usable rows.  We’ve been to the site and done that, in October and November 2012, before twdocs went on its unexplained hiatus. That’s when I posted a couple of looks at the site, endeavoring, among other things, to monitor US election-day tweets for mentions of Obama and Romney.

The site is indeed back, and while the keyword search instructions set down in my 2012 posts seem unchanged, a few adjustments have worked their way into the user’s marching orders, starting with the maximum tweet-requisition limit of 800, down from the earlier 1200. Does that pullback matter? I suspect it doesn’t, because in any case you’re not going to net every Tweet out there on a trending theme – even with a 1200-tweet max – and so again, it’s tweet velocity instead – the numbers of tweets per some or other unit of time – that’s going to matter.

And something else. Twdocs now reports its date/time-field data this way:

Thu Jul 03 10:27:05 +0000 2014

But in 2012 that date would have looked like this:

Thu, 03 Jul 2014 10:27:05 +0000

The problem isn’t with the Euro to American date format shift, but rather with the +0000, having usurped the 2014′s previous date-hugging pride of placement. Because in both cases above the dates dial into the text format, I had originally called for a Text-to-Columns maneuver that would salvage real, quantified dates from the expressions; and that measure should continue to work, provided you engage the Do not import column (skip) option with all due savvy; but it’s occurred to me that a simpler, perhaps fleeter move could do much the same. Simply select the time data in their column (that should be B) and boot up two rapid-fire find and replaces, the first searching for the three-character day abbreviation in force, and the blank space following those three characters, e.g.

Thu[blank space]

And replace it with nothing.

The blank space is integral here, as it too stands as a bona fide character that, if ignored, would fail to renovate the entries’ text format. Then find the +0000s, and again replace them with nothing. What’s left in the cells now should submit themselves as quantifiable dates and times to your inspection. (Note that if a twdocs keyword search returns results tweeted through several different days, as might typify a keyword that appears but occasionally across the twittersphere and hence takes its time to accumulate, you may have to run the three-character find-and-replace more than once, in order to replace each day.)

Next, my early twdocs posts promoted a means for timing a given keyword’s velocity (again, October 18) that prevailed upon the SEARCH function and a copy-down-the-column device . Here I’ll advertise what I take to be a more parsimonious take on that chore, using as example a current search for the word Wimbledon, a timely reference, I think .

First, you need to calculate the duration across which the tweets were dashed off. Click in a blank cell – say N1, and name it d for duration, or whatever. Then name the newly-quantified date/time field in B t for time (I’m all for parsimony in range naming, too). Range-name the data in Column A – the one storing the actual tweets – text. Then back in N1 enter:


That figures the fraction of an hour in effect within which the tweets were received.  Then enter Wimbledon in L4 (the cells have of course been chosen in the interests of choosing something, somewhere). But why am I doing that, you’re sure to wonder. After all, the original search put to twdocs was for that very word; why look for Wimbledon again inside the spreadsheet? Good question – but we do need the additional Wimbledon search, in order to tighten the imprecision of twdocs’ own initial foray for the word. As I observed in the earlier posts, not every record retrieved by the site will necessarily contain the desired term (I told you I wasn’t asking any questions), and so we need to press another search for Wimbledon, this one under our own steam. Thus enter in M4:


Note the syntactical demands COUNTIF makes of us here. We’re looking for a word nested in a larger collocation of words – the battery of all our tweets – and so COUNTIF wants us to string-concatenate the word, in effect here:


(and COUNTIF is case-insensitve; you could lower-case the W).

Once COUNTIF is understood its outcome is divided by d, rounded off say to two decimals, and the instances of Wimbledon per hour are duly cited. I got 7334.40, by the way. Look for Wimbledon now and your number should of course be different.

Once that’s working, you can enter as many supplementary search terms as you wish beneath Wimbledon in the L column, copying the associated formula for each term to develop a per-hour read for each.

Of course this latter strategy draws a subset from a subset. Once we’ve asked twdocs to find every instance of Wimbledon and then go on to interrogate those data for say, Murray or Federer, we’re really in effect hunting for tweets that feature both the words Wimbledon and Murray. But it’s impossible to know how many Murray-laden tweets are out there – the ones, that is, that don’t have Wimbledon as well. Thus if we’re trolling for all Murrays or Federers in the first instance, we’d need to put that question back to twdocs.

And if you want to itemize all the tweets containing both Wimbledon and Murray, after you’ve already called up Wimbledon from twdocs, you can count on COUNTIFS, the multi-criteria extension on COUNTIF. Assuming the search terms Wimbledon and Murray claim cells L4 and L5, for example, you’d enter:


 That should work. Now will the ballboy please get me my towel?

Course Work: Harvard/MIT MOOC Data, Part 3

26 Jun

In the matter of spreadsheet construction what’s worse: a man without a country, or a man with two of them? Go country-less, and your cell turns up blank; have two countries, on the other hand, and you get two records, and that’s a problem.

A problem, at least, if you’re aiming to do something simple, i.e., breaking out the Harvard/MIT MOOC student data by their country distributions.

By way of review, you may recollect that we’d copied and pasted the pivot-tabled source EdX data into a couple of new data sets all their own (see Part 1). One of these comprised nothing but Student ID and country (or userid_DI and final_cc_cname_DI as EdX’s calls them), this one been set aside because a very substantial number of those students with multiple class enrollments were, for whatever reason, identified with more than one country, e.g.,


That’s because some of those with multiple course involvements show multiple international identities, either because their several IP connections reported as much, or because they volunteered them.

Those first two records in the shot above – the second of which contains nothing in the first field – stand for the same student, who’s been tracked both to Canada and the United Kingdom, a duality which could be laid to the modal IP locations pinpointed during her accesses of the two classes. And remember as well that neither country may signify a student’s country of citizenship in any event. (Her second user id isn’t there because the data have been pasted from a foundational pivot table, which by default casts duplicate field entries into its default, streamlined outline format.)

And what that means first of all that the MOOC data can nail down student locations, more or less, but can’t assert their nationalities.  And it means second of all that we’re yet left with the student double-country problem, for which no elegant resolution seems to obtain. If we run a pivot table across the data in the interests of a country breakout we’ll wind up with more countries than students, as about 19,800 students seem to have been sited twice. On the other hand, my preliminary backs-and-forths with the records find that about 17,500 of these “excess” country data declare Unknown/Other for that second country, an unusable geopolitical unit in any case, leaving us in turn with about 2,300 actual nation redundancies. Because that surplus amounts to but .5% of the whole they could perhaps be triaged – that is, pulled away – from the data, and written off as something like an acceptable loss without much in the way of a sampling compromise.

Moreover, an even bulkier sheaf of records is similarly beset – the subset of single-enrollment students who report an Unknown/Other country, too. So let’s roll out the simplest, if avowedly inelegant, option – pivot tabling the data, and filtering out the blanks along with the Unknown/Others.

And for pivot table honchos like you that’s simple indeed:

Row Labels: final_cc_cname_DI (filter (blank) and Unknown/Other)

Column Labels: final_cc_cname_DI (Count, of necessity)

final_cc_cname_DI (% of Column Total)

Turn off Grand Totals, too.



Surprises? I think so. I for one am surprised by the understated American presence, less than a third of all the MOOC participants. India’s clear second ranking is noteworthy, though not incommensurate with that country’s size and economic efflorescence. From there on the distributions shrink, disperse, and smooth, apart perhaps from the 7.07% contributed by what’s called Other Europe.

Now we can return to the original EdX data and break out course enrollments by country. Here we’re not bothered by the student multi-registration issue because each course can and should be considered in its own right.

Try this table, then:

Row Labels: final_cc_cname_DI

Column Labels: course_id

Values: final_cc_cname_DI (% of Column Total)

And filter out Unknown/Other for the Row Labels

I get (in excerpt; the table is wide and deep):


That may not play very well on your screen, but try that one at home. You’ll note the interesting inter-class, inter-national variation streaming across the table, for example MIT’s two-section Circuits and Electronics representation from India – close to a third of all students there, a proportion that far outnumbers the American complement. Yet Indian contributes but 7.26% of Harvard’s Ancient Greek Hero devotees, even as the US checks in with 45.65% of the registrations. The cultural variation is nothing if not provocative.

Now glide gender into the Report Filter and click f only. Check it out – India’s women occupy 58 and 68% of all the virtual seats claimed by women in the two Circuits and Electronics classes. Turn off % of Column Total by opting for No Calculation and the absolute Indian women’s total comes to about 2700 for the classes. Click m alone in the Report Filter and the parallel Indian male aggregate for Circuits and Electronics yields about 14,000. Put another way, five times more Indian males prefer Circuits and Electronics than their female countrywomen do, but Indian women rule the cross-national female contingent.

It’s all relative then – but then, who’s to say all those Indian IP addresses were plugged into by Indians anyway?

Course Work: Harvard/MIT MOOC Data, Part 2

19 Jun

MHxPC130493603 is an accomplished young woman. MHxPC13, as her friends call her, extended her avocational reach by signing onto the Harvard MOOC Justice course last year, the self-same year in which she finally earned her Master’s degree. I know she achieved that credential last year, because according to her MOOC registration she was born in 2013, thus rather narrowing down the possible dates of completion.

OK – we can’t use that date. We’ll either have to write MHxPC13 off as a smart aleck, or relearn the lesson that data entry has its vagaries, and mistakes can and do sporadically suborn the process (MIT’s ever-helpful Jon Daries tells me because YOB information is user-supplied, the first-named accounting likely applies). It’s not as if we haven’t met up with date outliers before, because we have; see the March 27 and May 15 posts, for example, but in any case these birth-date impossibilities have to be filtered out, if the authentic dates of birth are to teach us anything about the MOOC cohort.

And the NAs and (blanks) that command a very considerable number of the YOB cells need to be overturned as well, in part because a pivot table’s Group Selection potential is sabotaged by any non-numeric miscreants in a grouped field. So here’s what I’d do: run two Find and Replaces in the YOB column (remember that we’re working here with that copied-and-pasted-worksheet I described last week, drawn from the EdX downloaded source), one replacing NA with 2013, the second exchanging (blank) for 2013. Since we need to filter 2013 anyway (and I’d filter any date from say, 2000 and beyond. It’s not altogether unreasonable to imagine a 15-year-old essaying a MOOC, though), and because that value, however useless, is numeric, no additional harm will be done the data by imputing the 2013 to the NAs and (blanks).

And once those dummy 2013s have taken their places, we can pivot table away, starting with a simple census of MOOC devotees by age, grouped for illustration’s sake by bins of three years each (again, recall that we’re working with unique student IDs):

Row Labels: YOB (grouped in three-year bands; then filter the bands starting with 2000-2002)

Values: YOB


Then fit YOB into Values again, this time cast into % of Column Total mode:


It isn’t – and is – surprising to find 63% of the MOOC takers crouching beneath the 30 age mark, surprising perhaps in view of the universalizing promise of the MOOC idea. It’s clear that captivation with the Harvard/MIT offerings isn’t smoothly distributed across the age demographic.

And what about the offerings themselves? That question returns us to the original EdX data set, along with the Person Course Documentation PDF and its course-code identities:



I for one would be interested in a gender-enrollment breakout by course (remember again that the pivot tables presented here will have filtered out NAs, blanks, and the like), a simple enough proposition, I think:


Row Labels: course_id

Column Labels: gender

Values: course_id (Count; show as % of Row Total)

I get



Pretty striking, I’d say. Even as the cumulative female presence fills but a hairsbreadth more than a quarter of the MOOC virtual seats (again, these are per-course data; many students have booked multiple courses), the variation is end-to-end, streaking from a nadir of 5.55% for MIT’s 2.01x – Elements of Structures class (sounds like a course on Lego to me), to an acme of 49.23% for Harvard’s pH278x – Human Health and Global Environmental Change section. The numbers and the disparities clearly mean something – after all, the MOOCs are student self-selected – but the interpretations require some care, textured by the understanding that the course types themselves impose a skew on the findings.

Now if you want to heighten the granularity to the university level – that is, gather figures on enrollments by school, of which there are only two here, of course – we’ll need to derive those data formulaically, by dislodging the school names from the course_id contents into a new field. We want then to see “Harvard” in splendid isolation in that field, and not HarvardX/CB22x/2013_Spring, for example. My idea: cell-point to the nearest free column – for me, U – name it Institution, and enter in U2:


That elementary expression buys into the either/or condition of the university names. If the leftmost character in A2 is H, that yield inarguably stands for Harvard; anything other than H and the school has to be MIT.

Copy the formula down the column and lower a Copy > Paste Special > Values atop it all in order to free up a byte or two. Now try something like

Row Values: Institution

Column Labels: gender

Values: Institution (Count)


What’s noteworthy and by no means expected (for me at least) is Harvard’s overall enrollment edge, in view of the fact that its five courses are far outnumbered by MIT’s eleven. Turn the Count into % of Row Total terms and:


A spacious inter-institutional gender gap yawns before us like a somnolent student (again, you don’t need grand totals here. You’d probably want to center the row labels, too). There are assumedly some provocative things to be learned here, over and above the near-tautological aside that women seem to prefer the Harvard courses.

And we could ask if age differentials mark the Harvard/MIT divide as well. Remember that we’re working with EdX’s native workbook now, and so we need to eliminate the NAs and blanks from the YOB field here too. As per the earlier precedent we could run a Find and Replace on the NAs, with 2013 standing in for them. The blank cells here, however, need to be treated differently. Remember that our first Find/Replace go-round had to replace cell entries that actual read (blank), because these were the copy-and-pasted outcomes from a pivot table, in which wholly blank cells are indeed labelled (blank). But the EdX data records blanks with nothing; those cells are…blank. Find and Replace here thus necessitates absolutely placing nothing in the Find cell – but that also means that you can’t kick off the Find/Replace by simply clicking the I column heading and going ahead. That rudimentary act would select all 1,000,000-plus cells in the column, hundreds of thousands of which are blank, of course, and which would receive a 2013 in turn, even as they have no relation to the MOOC data. The tack then is to click in the Name Box and type I2:I641139 and click Enter selecting those and only those cells associated with the student records. Then Find cells bearing nothing, so to speak, and replace these with 2013.

Anyway, folks, when the priming is done you can try this:

Row Labels: YoB (again, grouped by three-year spans, after which the post-1999 bins are filtered, as before)

Column Labels: Institution

Values: Institution (Count, % of Column Total)


Look closely and you’ll find meaningful heterogeneity across the schools. MIT students appear to be younger, as well as more overwhelmingly male; 72.89% of them were born in 1985 or later, but the like figure for Harvard is 59.98%. Different courses, different course populations.

All of which means I think we need a Part 3.

Course Work: Harvard/MIT MOOC Data, Part 1

12 Jun

Even as the lexical imperatives of wordplay bid us to label MOOCs the next big thing, the higher-education jury continues to deliberate their outsized promise just the same.

MOOCs – or Massive Open Online Courses – make their syllabi available to just about anyone with a dongle and a dream of learning a university-level course, and on just about anything; and again, while the intersection of access, pedagogy, and affordability can get jammed, at least we have some massive data on the matter with which to work now, a recent courtesy of EDX, the consortium running the show for the MOOCs conducted by Harvard and MIT.

And yes, the workbook is appropriately massive, socking more than 640,000 entries into its groaning rows and into the storage device of your choice, here. You’ll also surely want to grab hold of the Person Course Documentation PDF from the same page, defining as it does data field and the MOOC course names.

Each of the 640,000 rows IDs a student enrollment in a MOOC, and because each registrant has a Constitutionally-protected right to grab a virtual seat in more than one course the data thus do not individuate discrete students, even as that information might be something you’d want to have; after all, you might very well want to aggregate gender and nationality, for example.

Indeed – if you extemporize a single-field pivot table, namely

Row Labels: userid_DI

And provided your processor cooperates, you’ll shake out 476,532 unique student IDs onto the table; and so it seems to me that if you do want to do something with these you’ll need to stamp them into a separate, subsidiary data set, by pivot-tabling the IDs and companion fields and then running a copy-paste special-values right atop it, reducing the data to a standard, slimmed-down collection of records (and it’s far easier to play around with a data set than pivot table results). On the other hand, the original 640,000 entries we downloaded continue to be of value, too; they can account for class enrollments by gender, for example, in which the discrete defining units are the classes themselves.

So here’s what I’d do: poke these fields into a pivot table, e.g., something like this:

Row Labels: userid_DI

final_cc_cname_DI (country; to be discussed)


YoB (that is, year of birth)

LoE_DI (in effect, highest academic credential)

And turn off Grand and Subtotals and turn on Tabular layout.

But it seems as if my primo idea isn’t problem-free.  Eye the first few records of the table:


It’s that first record that’s plowing the furrows in my brow. That person – a she – is enumerated twice, asserting national affiliations both to Canada and the United Kingdom. And that means, so to speak, she’s indiscrete – she’s being counted twice.

Explication, please. According to MIT’s Jon Daries, those country data derive from one of two sources: the modal IP address of the student that is the IP that registers most frequently across their interactions with the course), and if that isn’t available or inconclusive, the workbook picks up the country that the student may (or may not) have offered in a MOOC survey.

And that means that what the country data do not invariably portray is citizen nationality; rather, the field serves up (rather literally) in the first instance a student’s predominant user location, and only then, absent a definitive IP reading, a student-provided terra firma coordinate, which may not correspond to nationality either. Thus a good many students who’ve signed on to more than a single course) check in with multiple countries (about 20,000, still only a jot more that 4% of the global student contingent, and I do mean global) too, a consequent of their world-wide hops, skips, and jumps. And for we, the record-keepers, that very itinerancy, a perfectly practicable and pleasing prospect for the distance learners themselves – poses a problem, because right here, we want to be able to count each student exactly once.

And so after much beard-stroking, it seems to me that discretion as usual trumps valor. I’d simply extirpate final_CC_cname_DI from the above pivot table, and then commission another compensatory table:

Row Labels: userid_DI


And we can figure what to do with this one later.

In any case, once we sign off on those decisions (although I’d use invisible ink if I were you), click in our first table and turn to PivotTable Tools > Options > Select (in the Actions button group) > Entire PivotTable, click Copy, earmark a destination (it could be a new workbook if you wish), and punch in a Paste Special > Values (a straight copy-paste would have merely duplicated the pivot table, but here we want plain old records, in their pre-table state. Note as well that all the fields were zoned into the pivot table’s Row Labels area; any reassignment of any of these into Values would have quantified them prematurely. But we want to preserve the original qualitative data here in the pasted data set).

And for those of you who can’t get enough of the complications, I discovered that this new, values-pasted iteration brought along a minute hold of blank student ID records – 17, if you’re counting, almost certainly a residuum from students who, having registered for multiple classes, reported their highest education in only one of these; and their actual IDs are blanked here in the pasted data because pivot tables default to an outlining format in the Row Labels area, thus actually inscribing each unique ID but once (as per the first screen shot here, for example). So here you need only conduct a Z to A sort and then just delete the 17 bottomed-out records. And for a final, if obvious caution, you’ll see that many of the gender, date of birth, and highest education cells yet remain blank; these data simply appear to be unavailable.

Be that as it may, now we can actually start doing something with these data. Begin with a simple gender breakout of the MOOC students (remember again we’re working with the copy-and-pasted data set):

Row Labels: gender

Values: gender (Count)


Then filter out the blanks, NAs, and Os:


(I assume the data coded O point either to transgendered individuals or some data corruption; if you roll through the records you’ll also see that many of these show no data for any of the fields other than ID. And note as well that all subsequent pivot table shots here will have filtered the blanks and kindred data out.)

You should be left with something like this:


Impose a % of Column Total upon the above breakout:


(You don’t really need the Grand Total here, either).

In light of the general 57%-43% female-male apportionment of the general US college population, the data above are rather striking – but remember that the Harvard/MIT MOOC student body, as it were, is far from exclusively American, as we hope to demonstrate later.

And what of highest education by gender?

Row Labels: gender

Column Labels: LoE_DI (of course you can rename these gnarly fields, right in the table)

Values: gender (Count, and % of Column Total)


(Percentages read down each column.) Striking as well – although perhaps more subtly so. The distinctly minority women’s MOOC cohort here is nevertheless the better-educated, and in rather straight-line fashion, intimating some particular women’s self-selection mechanism (and note: while the column labels above exhibit the default alphabetic sort order, you can reorganize them hierarchically by right-clicking the labels, clicking Move, and observing the sub-menu instructions, e.g.)


And what about Year of Birth? Good question – I’m still working on that one, because I’m getting over 800 students born in the 21st century, including 61 who arrived in 2013. I mean, I know those folks at Harvard and MIT are smart, but are they really teaching neurolingustics and differential calculus in their day care centers?


Get every new post delivered to your Inbox.

Join 148 other followers