$5,000,000,000 Worth of Salaries: The Maryland State Worker Data

27 Mar

Transparency, it’s wonderful; but when the sun trains its rays on government data full on, sometimes the red faces you see out there aren’t sunburned, but rather embrowned with embarrassment.

Point your TomTom toward the sovereign state of Maryland somewhere in America’s Mid-Atlantic in order to see what I mean. For some time now, the state has made its employee salary data available to inquiring minds – and by employee name. And that means if your neighbor’s, or friend’s, or cousin’s check has the governor’s name on it, you know exactly how much they make. Will you ever look at them the same way again?

In any event, all that illumination, or incrimination, shines right here, on the Baltimore Sun’s site:

(Click the “For this data in CSV form, click here.” link.)

The workbook will run roughshod over 12.8 MB of your hard drive, but that’s what 135,000 employee records will do to a file.

In any case, once there, you’ll find the data quality pretty good, but not flawless. For one thing, the first two data rows clearly aren’t fit for purpose; row 3 in particular seems to have played host to some manner of dummy data, as all its cells report some variant of the 123456789 sequence. You need then to delete these mislaid rows, though you probably didn’t need me to tell you as much. I’d also probably delete the A column (field-named orig-sheet) that appears to share sheet-identifying codes with us that can’t imaginably advance our narrative intents.

But once you get past the quotidian but necessary column auto fits, it gets a little more exciting. Interested as I was in Maryland’s salaries for its university employees, I pushed the On button for a pivot table, beginning with something like this:

Filter: organization-name (select the STATE UNIVERSITIES & COLLEGES item)

Row Labels: organization-subtitle


What’s wrong with this picture? Hint: We’ve seen it before, and now we’ve seen it again. But before the suspense gets intolerable, I’ll let on: it’s those two entries for Towson University – in plain, apparent violation of the pivot table dictum that field items shipped to the Row Label area can be listed only once. And in fact, the dictum holds; note the discrepant lengths of the Towsons, intimating loudly that some of the data here have been superfluously spaced. And that’s true – and for the quick fix, insert a temporary column in the data alongside the organization-subtitle field, and enter


(Your reference here would be F2 if you had deleted that A column.) Copy the above expression down the column, copy all these results, click G2, and aim a Paste Special > Values at the cell, thus imprinting the whole column with hard-coded space-lopped data. And then delete the temporary column, which can now be honorably discharged.
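For the script-inclined, the same space-lopping logic in Python; the subtitle strings are invented, and the guess that the Excel formula in question is =TRIM() is mine:

```python
# A minimal sketch of the space-stripping fix, presumably Excel's =TRIM(F2),
# applied to hypothetical subtitle strings.
def excel_trim(text):
    """Strip leading/trailing spaces and collapse internal runs, as TRIM does."""
    return " ".join(text.split())

subtitles = ["TOWSON UNIVERSITY  ", "TOWSON UNIVERSITY"]
cleaned = [excel_trim(s) for s in subtitles]
# The two discrepant Towsons now collapse into a single Row Label item.
```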

And while we’re at it, note the two entries for what is surely one and the same Frostburg State University. These variant spellings can be brought into line by selecting the G column and issuing something on the order of this Find and Replace:


 Refresh the pivot table as it stands, and the Row Labels reappear, duly rectified.

And given the workbook’s potpourri of fields there stand to be any number of interesting investigative possibilities in there. For example, why not try to associate salary with what Maryland calls eod, or entry on duty (which I take to mean employee start date; some field headings are elaborated here, though I had to do a site search for the meaning of eod. I’m also not quite sure about Samuel Esposito, eod dated April 21, 1921 – soon to celebrate his 93rd year of service to the state, along with the coterie of staffers with eods dating back to the 50s):

Row Labels: eod

Values: ytd-gross (Average)

(The site observes that “The ‘YTD’ refers to the time specified in the database query by the state, not to the current date.” I for one am construing the data, perhaps mistakenly, as the effective yearly totals for calendar 2012. After all, it’s now 2014, and the yearly results should be in by now. But note as well, by the way, the small packet of 15 records bearing 2013 dates, all but one of them time-stamped January. Because some of these report overtime earnings, I’m assuming these are live, actually-compiled data, their post-2012 eod notwithstanding.)

Be that as it may, over 10,000 dates stake the Row Labels area in our table, and so some grouping might be the parsimonious thing to do here. But my Group Selection button lets me know I Cannot Group that Selection. Why? Sort the dates and you’ll see why; six of the records date their eod 1900-01-00, a dummy, text placeholder that as a consequence stymies any attempt at grouping (all presumptively grouped data have to be up to the grouping task). I’d take the relatively easy way out here; click among those ersatz dates, sort A to Z (if you had clicked an actual date in order to start the sort, you’d sort by Oldest to Newest instead), and beam a row insertion between 135722 and 135723, thus casting the stand-in dates away from the data source. Then you can try, after resetting the operative pivot table range to $A$1:$P$135722 (via PivotTable Tools > Options > Change Data Source in the Data button group; again, the P may be lettered differently in your table if you’ve deleted some fields):

Row Labels: eod (group, say by year)

Values: ytd-gross (Average, currency format)

Ytd-gross (Count)
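Stepping back, the dummy-date quarantine that makes such grouping possible can be sketched in Python; the rows are invented, and a placeholder like 1900-01-00 simply refuses to parse:

```python
from datetime import datetime

# Hypothetical salary records; 1900-01-00 stands in for an unknown eod.
rows = [
    {"eod": "1985-06-17", "ytd_gross": 41000},
    {"eod": "1900-01-00", "ytd_gross": 38000},
]

def is_real_date(text):
    """A dummy like 1900-01-00 raises ValueError and is flagged as ungroupable."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

groupable = [r for r in rows if is_real_date(r["eod"])]
```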


(That’s an excerpt, of course.) The correlation between year and average salary: -.584, which is not insensible; all things being equal, the longer the tenure, that is, the “smaller” the starting year, the higher the earnings (Note: if you essay the correlation by invoking the numbers in the A and B columns of the pivot table itself, that is,


you’ll be slapped with a #DIV/0! message for your troubles. That’s because even though the dates lining the Row Labels are full-fledged numeric data, they revert to temporary label status when they’re shunted to Row Labels. To realize the correlation you’ll have to copy the pivot data elsewhere).
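Once the pivot data are copied elsewhere, the correlation is plain arithmetic; here’s a from-scratch Pearson sketch over invented year/salary pairs, chosen only to exhibit the negative association:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson r; earlier start years pairing with higher pay yields a negative r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

years = [1970, 1980, 1990, 2000, 2010]
avg_gross = [62000, 55000, 51000, 44000, 40000]
r = pearson(years, avg_gross)  # strongly negative for this toy series
```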

But whatever you do, just remember: the next time you visit Maryland, be nice to the 135,728 red-faced citizens there trying hard to keep a low profile. Don’t let them know what you know.



US Politicians’ Worth: A Wealth of Disclosures

20 Feb

If the citizen’s right to know shades into a gossipy prurience, the right to know nevertheless prevails; thus if the disclosures of the estimated net worth of American lawmakers compiled by the Center for Responsive Politics and reposed in the Enigma public data site titillate or outrage, one assumes the greater good is being served just the same. Those estimates are here:


First, don’t be fooled by the 2014 date stamp, as I was at first blush. After all, the worth list, sorted as it is highest-to-lowest, is headed by the irrepressibly well-heeled Mr. Trump, who if memory serves only pulled into the White House in 2017, whether he wanted to or not. The 2014, then, must point to the date of the estimates instead (and as such, the year field and its relentlessly uniform 2014s could be deleted, though keep in mind that Enigma stores Center worth data for previous years as well), and if a good many of the numbers appear to you in scientific-notational form (e.g. 1.07E+08) that’s just one more way by which Excel means to tell you that their columns are presently too narrow. Just auto-fit away, as usual. (Note that the origin field in H is likewise uninformative, comprising nothing but the Center for Responsive Politics entry, and the source field in I is even sparser, amounting to nothing but its header.)

Get past those preliminaries and you’ll need to remind yourself about the codes informing the chamber field: E – Executive, S – Senate, H – House of Representatives, and J – Judiciary. That last code estimates the worth of the nine now-serving Supreme Court Justices, an interesting set of disclosures about a coterie one might deem slightly above the financial fray; the S and H signifiers apparently name and enumerate each and every current Senator and Congressperson, a nicely comprehensive complement. The twelve executive entries, however, stand in a different relation to the dataset; they identify presidential hopefuls for the 2016 election and not office holders, with the exception of the winner.

Perhaps the first attention-grabber in the data to seize your lapels is the trio of worth estimates lined up for each politician. The Center for Responsive Politics explains that the data spring from disclosure forms completed by office holders which “do not require exact values”, but rather “the range of value into which an asset falls”. The Center thus filled ranges with lower and upper estimates and averaged the two into the avgvalue field, those computations in turn exhibiting considerable plus-minus variation. Indeed, 77 of the politicians declared a negative lower-end asset value, with another 11 submitting negative upper estimates. Of course, our data can’t take that question further.
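The Center’s midpoint arithmetic, and the negative-estimate tallies, can be replayed over hypothetical disclosure ranges:

```python
# Hypothetical (lower, upper) disclosure estimates, in dollars.
disclosures = [(-50_000, 150_000), (1_000_000, 5_000_000), (-20_000, -5_000)]

# avgvalue, per the Center's method: the midpoint of each declared range.
avgvalues = [(low + high) / 2 for low, high in disclosures]

# Tallying negative lower- and upper-end declarations, as the post does.
negative_lower = sum(1 for low, _ in disclosures if low < 0)
negative_upper = sum(1 for _, high in disclosures if high < 0)
```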

In any case, working with the averages we could broach a first obvious question, namely how worth associates with chamber:

Rows: chamber

Values: avgvalue (formatted to two decimals with a comma)

I get:


Remember of course that the Executive universe is small, numbering but 12 cases. As such the 1.482-billion-dollar average estimate for Donald Trump slingshots the mean skyward, in spite of negative-asset claimers Scott Walker, Rick Santorum, and Martin O’Malley. Whether that means erstwhile Maryland governor O’Malley truly has less money than you and I is perhaps a query better put to his accountant, but the data maintain as much.

But the larger, if ingenuous, conclusion urged upon us is the reality that American national politicians – including Supreme Court justices – are overwhelmingly multi-millionaires. If the averages of the averages are credible, Senators possess over $10,000,000 worth of assets, with Representatives checking in at $7,500,000, accumulations that of course can’t be ascribed to their $174,000 salaries. At the same time, however, variation within the Congress is massive, including outlier Darrell Issa, a Republican Congressman from southern California worth, per his average estimate, over $436,000,000.

For a sense of the dispersion of wealth, then, you can drag avgvalue in to values again, setting it to Stddevp (for the operational difference between Stddev and Stddevp, look here):


When the standard deviations exceed the averages you’re dealing with a lot of variation, even after discounting the ten-figured Mr. Trump.
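The operational difference between the two boils down to the divisor; a quick sketch with made-up worth figures:

```python
import statistics

worths = [1_000_000, 5_000_000, 40_000_000]  # hypothetical avgvalue figures

sample_sd = statistics.stdev(worths)       # Stddev: divides by n - 1
population_sd = statistics.pstdev(worths)  # Stddevp: divides by n
# Stddevp treats the list as the whole population, and so always runs smaller.
```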

Now for perhaps the juicier question: Might net worth vary palpably by party affiliation? Conventional wisdom would respond in the affirmative, i.e. Republicans, with their putative sympathies for the haves, could be expected to have more than the latter-day Blues as well. But either way, the dataset should enable an answer of sorts to the question.

The marching order here is to isolate party, as it’s referenced by its initial in the name field. Because that initial appears (almost) in every case immediately after the open parenthesis alongside each surname, we can take over the empty source field, rename it Party, and write in I2 (assuming you haven’t deleted the extraneous year field):


That expression seeks the character position of the open parenthesis and adds one to it, enabling MID to grab the next single character, which denotes political party. Copy the formula down the field and that should work, with nine exceptions – the Supreme Court justices, whose party memberships have been withheld from the data, an exclusion that imposes a #VALUE message upon their formulas. The workaround: rewrite the formulas thusly:


The nifty IFERROR function returns the default, nested formula result unless it evaluates to an error, in which case it’ll interject the defined if-false result, in the above case J, for Judiciary.
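For the curious, the same grab-the-initial logic, IFERROR fallback included, in Python; the Excel formula in the comment is my reconstruction, and the names are illustrative:

```python
def party_of(name):
    """Mimic =IFERROR(MID(name, FIND("(", name) + 1, 1), "J"):
    return the single character after the open parenthesis,
    or J (Judiciary) when no parenthesis turns up."""
    pos = name.find("(")
    return name[pos + 1] if pos != -1 else "J"
```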

Once that revamped expression is copied down the field the pivot table follows:

Rows: Party

Values: avgvalue (Average, with appropriate formatting)

I get:



The thesis is confirmed, apparently (the I represents the two independent legislators informing the list, by the way, one of whom is Bernie Sanders). Republicans are worth on average 47% more than the (very) relatively proletarian Dems; but toss Stddevp back into the table and we see:


The mammoth standard deviation among GOP sectarians has a good deal to do with the Trump billion, and Issa’s $436,000,000, too. Thus, if you delete the row bearing Trump’s data and refresh the pivot table, you’ll see:


Check it out. The Republicans, we see, in actuality average a smaller net worth and standard deviation than Democrats. Go further and delete all 12 Executive entries – none of whom, with the exception of the incumbent, are elected officials, after all – and run the pivot table once more:


Here, Democrats possess 9.94% more worth on average than Republicans.

Now that’s unconventionally wise.

The College Transcript: Downloading an Upgrade

19 Aug

It’s homely, to be sure, but if you want to go somewhere you gotta have one. And no – I’m not talking about your passport photo but your college transcript, an obstinately prosaic but indispensable means of entrée to your next academic or career step.

The transcript – an enumeration of a student’s courses and performances gathering into what we Yanks call the Grade Point Average (GPA) – has undergone a rethink of late. A piece this past February trained its lens on a number of initiatives aiming to drill qualitative depth into the transcript’s tale, sinking some analytic teeth into its default, alphabetically-driven narrative by linking its grades to students’ work and detailed progress toward a degree.

And that got me to thinking: if it’s depth we’re seeking, why not endeavour to learn something more from the numbers and the letters by re-presenting the transcript as a…spreadsheet?

It makes perfect sense to me, though you’d expect me to say that. But after all: submit a transcript to some standard tweaks and you wind up with a dataset, one suitable for sorting, pivot tabling, formulaic manipulation, and charting, too. And once the tweaking stops, the transcript can treat its readers to a round of different, edifying looks at the data – and that’s what I call depth, too.

Transcripts aren’t things of beauty, but they sport no small amount of variation just the same. So to understand what we’re working with, more or less, take a look at this one – the transcript of Mr. Ron Nelson, who made his academic record available here:


In the interests of exposition, I’ve subjected the baseline transcript above to a round of fictional retakes that of course don’t represent Mr. Nelson’s actual attainments (for one thing, his record dates back nearly 25 years). A few of the points that call for renovative scrutiny, then: First, note the blank column coming between the Course No. and Course Title columns, an excess which must, for spreadsheet purposes, be curtailed. Second, the multi-columned iterations of Course Titles and associated grades need to be cinched into a single field. Third, the academic term headings (e.g. Spring Semester 1991), and TERM TOTALS and CUMULATIVE TOTALS lines have to be sent elsewhere; they report information that simply isn’t of a piece with the grade/grade point records that the dataset should comprise.

Next, if you’re new to the GPA you need to know how that defining metric is figured. While of course variation again abounds, the textbook illustration looks something like this: Class grades are typically assigned along an A-to-D continuum along with a failing F, in what are in effect quantified decrements of a third of a point, e.g., A, A-, B+, B, etc. In the typical system an A earns 4 points, an A- 3.67, a B+ 3.33, and so on (the D- grade is rarely offered, by the way). An F naturally awards no points.

Each grade-point achievement is in turn multiplied by the number of credits any given course grants, resulting in what are usually called quality points. Thus a B- grade in a three-credit class yields 8.01 quality points – 2.67 times 3. An A in a four-credit course evaluates to 16 quality points, or 4 times 4. The GPA, then, divides the sum of quality points by the sum of credits attempted. Thus this set of grades:


Works out to a GPA of 2.92.

It’s pretty intelligible, but with a proviso. The GPA must incorporate the number of credits associated with a failing grade into its denominator, and so these performances:


Calculate to a GPA of 2.33. But the 15 credits recorded above really only bestow 12 usable credits upon the student, and that dual count needs to be remembered.
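The dual count is easy to mishandle, so here’s the arithmetic in Python; the point scale and sample courses are illustrative, not Mr. Nelson’s record:

```python
# An assumed grade-point scale, per the textbook illustration above.
GRADE_POINTS = {"A": 4.0, "A-": 3.67, "B+": 3.33, "B": 3.0,
                "B-": 2.67, "C+": 2.33, "C": 2.0, "D": 1.0, "F": 0.0}

def gpa(records):
    """records: (grade, credits) pairs. F credits stay in the denominator."""
    quality = sum(GRADE_POINTS[g] * cr for g, cr in records)
    attempted = sum(cr for _, cr in records)
    return quality / attempted

def usable_credits(records):
    """Credits actually earned; an F bestows none."""
    return sum(cr for g, cr in records if g != "F")

sample = [("A", 3), ("B", 3), ("F", 3)]
# 21 quality points over 9 attempted credits, but only 6 usable credits
```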

With that extended preamble noted, my spreadsheet-engineered transcript demo (spreadscript?) presents itself for your consideration here:

Transcript demo

In excerpt, the sheet looks like this:


Note the paired Term and Date columns; though one might be moved to declare the former field superfluous, it seems to me that its textual Spr/Fall entries could enable a pivot table broken out by seasonality, i.e., GPAs by all Spring and Fall courses across the student’s academic career. The Date field, on the other hand, is duly numeric, thus lending itself to chronological resorting should the current sequence of records be ordered by some other field. And the grades have been visually differentiated via conditional formats.

The Credits total in the upper right of the screen shot reflects a necessary bypassing of the F grade for Music 101 per our earlier discussion (the grades are stored in the F column), and realized by this formula:


The SUMIF here is instructed to ignore any F in the F column via the “not” operator bound to the formula’s criterion. Note the quotes SUMIF requires for operators qualifying the criterion. The GPA, on the other hand, divides the quality point total by all 112 credits (you will have noted that the spreadsheet outputs the quality points in H via a lookup array range-named gr in Q1:R10. And in the interests of simplicity I’ve let subsidiary course events and their codes, e.g., class withdrawals and incompletes, go unattended).

Now the data become amenable to pivot tabling and other assessments. For example, if we want to break out GPAs by term we can try:

Rows: Date (You’ll want to ungroup these, if you’re working in release 2016)

Values: Hours/Credits (Sum)

Quality/Points (Sum, rounded to two decimals)

Because we need to total the date-specific quality points and divide these by the respective-date credit totals, a calculated field must be implemented, e.g.


Click OK, again round off to two decimals, and you should see:


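What the calculated field is doing, in Python terms: sum quality points and credits per term first, then divide, rather than averaging per-course GPAs (the courses below are invented):

```python
from collections import defaultdict

# Hypothetical per-course rows: (term date, credits, quality points)
courses = [("1991-09", 3, 12.0), ("1991-09", 4, 9.32), ("1992-01", 3, 6.99)]

totals = defaultdict(lambda: [0, 0.0])  # term -> [credits, quality points]
for term, credits, qp in courses:
    totals[term][0] += credits
    totals[term][1] += qp

# The calculated field: summed quality points / summed credits, per term.
term_gpa = {term: round(qp / cr, 2) for term, (cr, qp) in totals.items()}
```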
Once the GPA field is put in place you can, for example, break out credit accumulations by Discipline, or subject, by replacing Date with Discipline:


Or try a frequency analysis of credit totals by grade:

Row: Grade

Values: Hours/Credits (Sum)


(Note: because of the priorities with which Excel sorts text characters, grades accompanied by the + symbol initially appear at the bottom of any letter sort, e.g., you’ll initially see B, B-, and B+. You’ll need to right-click the B+ and select Move > Move “B+” up twice. And of course the same adjustment should be applied to C+.)
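If you’d rather automate that ordering than nudge items by hand, a custom sort key does it; a sketch, with invented grades:

```python
def grade_key(grade):
    """Sort grades sensibly: plus before plain before minus within each letter."""
    letter, suffix = grade[0], grade[1:]
    return (letter, {"+": 0, "": 1, "-": 2}[suffix])

grades = ["B", "B-", "B+", "A", "C+", "A-"]
ordered = sorted(grades, key=grade_key)
# -> ['A', 'A-', 'B+', 'B', 'B-', 'C+']
```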

Of course these outcomes could be charted, e.g.


And if you are interested in seasonality:

Rows: Term

Values: Hours/Credits

Quality Points (Both Sum, and both rounded to two decimals)



(By the way, you’re not duty-bound to earmark Hours/Credits and Quality Points for the table if you want to display GPA at the same time. Once constructed, GPA becomes available in its own right, and need not be accompanied by its contributory fields.)
And all these and other reads on the data could be assigned to a dashboard, too.

Thus the transcript-as-spreadsheet could break new presentational ground, supplementing the row-by-row recitation of subjects and scores that students and recipient institutions currently face, with a suppler way around the data. They could even be made unofficially available to students themselves via download, empowering the spreadsheet-savvy among them to map and understand their grades in novel ways (one trusts that no one’s accepting a transcript bearing a student’s email address).

But is it a thing of beauty? Maybe not, but don’t you like the colors?

The Global Terrorism Database, Part 1: Surmises and Surprises

20 Jul

Anyone with an investigative concern with terrorism – and that complement is regrettably large – would do well to point themselves at the Global Terrorism Database (GTD), the self-described “most comprehensive unclassified data base on terrorist events in the world” tended by the University of Maryland’s National Consortium for the Study of Terrorism and Responses to Terrorism, or START. Dating from 1970 and comprising over 156,000 records of incidents world-wide, the database in spreadsheet form is available to interested parties for download, provided the requestor complete a simple requisition form.

The database, then, is vast (75MB) and instructive, though you’ll need to read the very real qualifying cautions in the GTD FAQ page, particularly accountings of the missing data for 1993, shifts in incident counting methodologies beginning in 1998, and the Database’s definitions of terrorism. You’ll also probably need to download the GTD Codebook for a resume of the database’s 111 fields.

But while not faultless, you’ll want to consult this resource, even as some questions about its spreadsheet incarnation remain to be asked as you work with its holdings. For example, note the zeros in the imonth and iday fields (columns C and D) that proxy for unknown incident months and days. However, only 23 month cells here answer to the “unknown” description, while a more measurable 894 days are currently lost to the compilers. These relatively small lacunae will necessarily impinge on event breakouts by chronology, of course, but again, their compromising effect is small, calculating to about .6% of all records.

Less clear are the entries informing the approxdate field in E; the codebook (page 12) allows that approxdate archives those incident dates whose ascertainment is unclear, but in most of the approxdate cases the same information already populates the contiguous iyear, imonth, and iday fields. Moreover, the approxdate data are textual, and as such aren’t ready for analytic use without some formulaic backs and forths.

I’m also not sure what the country and region codes in H and J respectively bring to the data, in view of the actual country and region names that sidle these fields. But that’s a quibble; if the fields prove unneeded they can simply be ignored, after all.

But once you’ve oriented yourself to the database-cum-workbook, some obvious but most interesting pivot tables beg our consideration, for example, terrorist incidents by country and year:

Row: country_txt

Columns: iyear

Values: iyear (Count, Show Values as % of Row Total)

(For presentation’s sake you may want to beat back the fusillade of zeros invading the above table cells. Try File > Options > and click off the “Show a zero in cells that have zero value” box beneath the Display options for this workbook heading.)
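The Show Values as % of Row Total arithmetic can be replayed over a toy incident list standing in for the GTD’s country/year records:

```python
from collections import Counter

# Invented (country, year) incident records.
incidents = [("US", 1970)] * 3 + [("US", 1971)] + [("Iraq", 2014)] * 4

counts = Counter(incidents)
row_totals = Counter(country for country, _ in incidents)

# Each country/year count divided by its country's (row) total, in percent.
pct_of_row = {key: 100 * n / row_totals[key[0]] for key, n in counts.items()}
```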

Now if you bring a 2016 sensibility to the data you’ll be surprised, at least some of the time. Terrorist acts in the United States peaked in 1970, with 17.38% of all acts – 468 in absolute numeric terms – across the 46-year span ending in 2015 having been perpetrated in that year. Double-click that 17.38% and a new table springs from that figure, and onto a new sheet. Pivot table those 1970-specific data:

Rows: provstate (bears US state names)

Values: provstate (Count)

And you discover that 98 terrorist acts – as defined here – were incurred by California, with another 86 having been inflicted in New York. Substitute gname (g for group, presumably, in column BG) for provstate in Rows and find 108 acts mounted by generic Left-Wing Militants, 69 undertaken by what the GTD calls Black Nationalists, 54 wrought by Student Radicals, and so on. Compare that outpouring with the grand total of 38 incidents tracked in the US last year – that sum lifting a curve that has pulled upward since 2011, to be sure, but certainly far flatter than the Vietnam-era total registered 46 years ago (the implication, then, by the way, is that the immediately pre-1970 counts were perhaps comparably numerous, but the GTD doesn’t have those data. On the other hand, 1970 seems to stand as a towering apex among the years, even the ones that immediately follow it).

Subjected to a simple pivot table line chart, enabled by referring the country_txt field to a Slicer within which one can isolate one country at a time, and stationing iyear in Rows, the American (United States) incident totals read thusly (recall the missing 1993 data, though):


The charts for some other countries not unacquainted with terrorist attacks:


There I would have supposed the spikes would have bolted upwards earlier, during the throes of the Troubles.


Again, I would have surmised an earlier peak, but note the welcome trough in acts in 2015.


More expectable, perhaps, though note the 1975 year-inception point and the remarkably sparse incident count through 2003.

Note at the same time, however, that the charts reckon percentage distributions of incidents, and don’t size up absolute incident counts for the countries. Those read as follows:

US – 2693

Ireland – 274

Israel – 2085

Iraq – 18770

Iraq’s enormous total is thus very much of recent vintage.

Of course any surprise at any of the charted histories above may have more than a little to do with one’s definitional investments in the term terrorism. It may be easy not to liken the Weather Underground and its 45 attacks from 1970 through 1975 to the latter-day insurgencies of Al Qaeda (1,854 attacks under variant names), but that aversion needs to be thought through, and the GTD did – though of course you’re welcome to contest and/or refine the understandings.

For a worldwide incident count, click the Clear Filter icon on the Slicer, in effect selecting all countries simultaneously:


The chart result:


10.74% of all the recorded attacks were conducted in 2014, with a small but appreciable decrement to 9.41% characterizing the following year. Still, an extraordinary 39.6% of all attacks crowded the six years spanning 2010 through 2015 (note that by imposing a five-year grouping interval on the iyear data, the pivot table nevertheless casts the most recent data into a six-year tranche, 2010-2015).
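The five-year grouping arithmetic itself is simple; a sketch (note that plain arithmetic would open a fresh 2015-2019 bin, where Excel, per the above, folded 2015 into a final six-year 2010-2015 tranche because the data stop there):

```python
def bin_label(year, start=1970, width=5):
    """Assign a year to a five-year tranche anchored at 1970."""
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"

# bin_label(2012) lands in "2010-2014"; bin_label(2015) starts a new bin.
```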

In any case, with 111 fields and in spite of its lengthy blocks of blank cells, the GTD’s massive trove of data has a lot of permutations in there.

Hire Education: U.S. Tuition Stats, Part 1

11 Jul

You’ll have a grand time at the American university of your choice – about 30 grand, that is, and that’s for only one year’s worth of food fights, and all those other required credits. That promo is presented on behalf of the nation’s universities, who want you, or a loved one, or Uncle Sam himself, to burn that plastic until it starts to melt.

And you won’t feel any better when I tell you that my college stay was basically free, courtesy of New York’s City University and its long-since-superannuated tuition policy.  But you’re likely to cheer up when you hear the update, though: that my alma mater, none other than Queens College, nowadays drops an invoice of about $5,900 on the notebooked laps of its enrolees, at least as of the 2012-13 year.  I know that because I’ve called up the data from the Department of Education’s site and its College Affordability and Transparency List, which I’m affording to you here:

 Copy of CATClists2012

You’ll see among other things that the workbook’s sheets report different data for different years, so keep that in mind; you’ll also want to inspect the Introduction tab for some important definitional background on key terms. In any event, the data are pretty serviceable, though they could stand a tweak or two.

Let’s begin with the eponymous Tuition sheet, totalling the annual costs for 4269 sundry institutions (can you get a football scholarship to the Professional Golfers Career College?) in 2012-13. I’d adjudge the code-bearing Sector field in column A surplus to requirements, as the actual rubrics for which the codes stand sit next door in the Sector Name field in B.

More problematic, though, is the data-obstructive fact that some schools are identically named – and while some of these are sister institutions (e.g., St. John’s College in Maryland and New Mexico), those kinships trouble the data just the same. Cropping the records with the ever-handy Remove Duplicates tool (by looking out for duplicates in the Name of institution field only, but not saving the reduced data set here; we merely want to know the very fact of redundancies) made 58 records disappear, but again we do want to retain all the nominal twins; they’re unique schools, after all.

And even as we want to keep these same-names in our fold, we nevertheless need to somehow impart a measure of identificatory uniqueness to each and every school. The surest bet would be this: head on over to the J column, the data set’s first empty corridor, call it Unique Name, and enter in J2:


That unpretentious string concatenation assimilates each school’s exclusive UnitID number to its name, e.g.  University of Pittsburgh-Pittsburgh Campus-215293.

That works, but all those numbers aren’t pretty. Since the great bulk of school names are one-of-kind, we’d just as soon work with those, and string the UnitID only to those schools that need further distinction. I’d thus try this expression instead:


That is: if a school name surfaces more than once, call in that UnitID appendix. Otherwise, leave the name as we find it.  Then concretize the results with Copy > Paste Special > Values.
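In Python terms the two-step (count the name, append the UnitID only where needed) looks something like this, over invented schools; the Excel analogue would be an IF/COUNTIF wrapped around the concatenation, though the exact cell references are mine to guess:

```python
from collections import Counter

# Hypothetical (UnitID, school name) rows; the St. John's pair is ambiguous.
schools = [
    ("215293", "University of Pittsburgh-Pittsburgh Campus"),
    ("100001", "St. John's College"),
    ("100002", "St. John's College"),
]

name_counts = Counter(name for _, name in schools)

# Append the UnitID only where the bare name surfaces more than once.
unique_names = [
    f"{name}-{uid}" if name_counts[name] > 1 else name
    for uid, name in schools
]
```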

Now here’s something else. Note the List A and List E fields in columns H and I, identifying tuitions (see the Variable values tab for elaboration) landing in the upper 5% or lower 10% of all schools, respectively. Schools in fulfillment of either benchmark are coded 1, with a 0 reserved for non-qualifiers. I find the columns curiously duplicative; it seems to me that the two discriminations could be made to co-exist in one field, by assigning 5-percenters an A and the lower 10-percent B, for example, after which a pair of simple COUNTIF expressions would be able to  number both cohorts.

But either way, I’m confused. If you simply sum the code-numbers in column H – the one ranking the upper 5% tuition charges – you’ll indeed realize 216, or 5.06% of the school population. But filter the schools in receipt of that defining 1 and you’ll find a good many that aren’t remotely to be numbered among the top 5%, e.g. Minnesota’s Rainy River Community College, offering to teach you what you need to know for a svelte $5,323 a year.

But that’s when I had my pint-sized Eureka moment. It seems as if the 5% metric redlines schools only within their sector, a decision rule I confirmed by some additional filtering. Thus Rainy River, a sector-4, 2-year public (that is, governmentally-operated, as opposed to the British usage of the modifier) establishment, does in fact “merit” its high-end credential – relative to its institutional peers.
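That within-sector decision rule can be mimicked in Python; the exact percentile convention below is my guess at the Department’s method, and the tuition rows are invented:

```python
from math import ceil

# Hypothetical (sector, tuition) rows; the flag marks top tuitions
# within each sector, not across the whole list.
rows = [("2-year public", 5323), ("2-year public", 3000), ("2-year public", 2500),
        ("4-year private", 45000), ("4-year private", 20000)]

def flag_top_within_sector(rows, share=0.05):
    """Flag roughly the top `share` of tuitions per sector (at least one school)."""
    by_sector = {}
    for sector, tuition in rows:
        by_sector.setdefault(sector, []).append(tuition)
    cutoff = {s: sorted(ts, reverse=True)[max(1, ceil(len(ts) * share)) - 1]
              for s, ts in by_sector.items()}
    return [1 if tuition >= cutoff[sector] else 0 for sector, tuition in rows]

flags = flag_top_within_sector(rows)
# Rainy River's modest $5,323 earns the flag relative to its own sector's peers.
```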

And once that insight strobes across our collective mindset, we can completely forego the sheet’s native Top 5 and Bottom 10% convolutions and pivot-table our way toward much the same. Try:

Row Labels: Unique Name

Values: 2012-13 Tuition and fees (Sum)

Report Filter: Sector Name

Once in place, select, say, the 4-year, private not-for-profit sector. Then right-click somewhere in Row Labels and proceed toward Filter > Top 10… and continue:


You may need to sort the tuition results in descending order, and you won’t really need the Grand Total. I get:



And no, I hadn’t heard of Landmark College, a Vermont-based institution wholly dedicated to students with various learning disabilities. I was also slightly, but only slightly, surprised to find Columbia University ensconced in second position, reflective perhaps of New York’s bumped-up cost of living. (If you’re wondering, Harvard asks for a bargain $39,966.) And note the two St. John’s Colleges, differentiated by their IDs.
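The pivot’s Top 10 maneuver reduces to: restrict the records to one sector, sort tuitions in descending order, and keep ten. A Python equivalent, with invented records:

```python
# Invented (school, sector, tuition) records standing in for the sheet.
records = [
    ("School A", "4-year private not-for-profit", 47000),
    ("School B", "4-year private not-for-profit", 39966),
    ("School C", "2-year public", 5323),
    ("School D", "4-year private not-for-profit", 31000),
]

sector = "4-year private not-for-profit"

# Report-filter by sector, sort tuitions in descending order, keep the top 10.
in_sector = [r for r in records if r[1] == sector]
top_ten = sorted(in_sector, key=lambda r: r[2], reverse=True)[:10]
```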

And now you can play around with the permutations – filtering the various sectors for top and bottom tuition rankings. And if you’re looking for perspective, consider the objurgations flung at the British government for lifting yearly university tuitions to a maximum of £9,000 (about $15,300) for a three-year, degree-culminating regimen. All told, in other words, that’s about $45,900. But that’s what Columbia charges for a single year – times four.

So going to college? Start looking for a loved one – now.

Course Work: Harvard/MIT MOOC Data, Part 2

19 Jun

MHxPC130493603 is an accomplished young woman. MHxPC13, as her friends call her, extended her avocational reach by signing onto the Harvard MOOC Justice course last year, the self-same year in which she finally earned her Master’s degree. I know she achieved that credential last year, because according to her MOOC registration she was born in 2013, thus rather narrowing down the possible dates of completion.

OK – we can’t use that date. We’ll either have to write MHxPC13 off as a smart aleck, or relearn the lesson that data entry has its vagaries, and mistakes can and do sporadically subvert the process (MIT’s ever-helpful Jon Daries tells me that because YOB information is user-supplied, the first-named accounting likely applies). It’s not as if we haven’t met up with date outliers before, because we have; see the March 27 and May 15 posts, for example. In any case these birth-date impossibilities have to be filtered out, if the authentic dates of birth are to teach us anything about the MOOC cohort.

And the NAs and (blank)s that command a very considerable number of the YOB cells need to be overturned as well, in part because a pivot table’s Group Selection potential is sabotaged by any non-numeric miscreants in a grouped field. So here’s what I’d do: run two Find and Replaces in the YOB column (remember that we’re working here with that copied-and-pasted worksheet I described last week, drawn from the EdX downloaded source), one replacing NA with 2013, the second exchanging (blank) for 2013. Since we need to filter out 2013 anyway (in fact I’d filter any date from, say, 2000 onward, though it’s not altogether unreasonable to imagine a 15-year-old essaying a MOOC), and because that value, however useless, is numeric, no additional harm will be done to the data by imputing the 2013 to the NAs and (blank)s.
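The two Find and Replaces amount to a sentinel-imputation pass; here is the same cleanup in Python, run over a made-up YOB column:

```python
# Sample YOB column as the copied-and-pasted worksheet might supply it;
# "NA" and "(blank)" are the non-numeric miscreants.
yob = ["1985", "NA", "1992", "(blank)", "2013"]

# Impute the 2013 sentinel to NAs and (blank)s and convert everything to
# numbers; the grouped pivot can then filter the 2000-and-later bins away.
cleaned = [2013 if v in ("NA", "(blank)") else int(v) for v in yob]
keepers = [y for y in cleaned if y < 2000]
```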

And once those dummy 2013s have taken their places, we can pivot table away, starting with a simple census of MOOC devotees by age, grouped for illustration’s sake by bins of three years each (again, recall that we’re working with unique student IDs):

Row Labels: YOB (grouped in three-year bands; then filter the bands starting with 2000-2002)

Values: YOB


Then fit YOB into Values again, this time cast into % of Column Total mode:


It isn’t – and is – surprising to find 63% of the MOOC takers crouching beneath the 30 age mark, surprising perhaps in view of the universalizing promise of the MOOC idea. It’s clear that captivation with the Harvard/MIT offerings isn’t smoothly distributed across the age demographic.
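The three-year grouping and the % of Column Total recast can be mimicked off-sheet too. A pure-Python sketch, where the `band` helper and the sample birth years are mine:

```python
from collections import Counter

# Invented birth years, post-filtering of the 2000-and-later entries.
years = [1985, 1986, 1987, 1985, 1990, 1991]

def band(year, width=3, start=1985):
    """Assign a year to a three-year band, as Group Selection would."""
    low = start + ((year - start) // width) * width
    return f"{low}-{low + width - 1}"

# Census by band, then the % of Column Total recast.
counts = Counter(band(y) for y in years)
total = sum(counts.values())
shares = {b: round(100 * n / total, 1) for b, n in counts.items()}
```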

And what about the offerings themselves? That question returns us to the original EdX data set, along with the Person Course Documentation PDF and its course-code identities:



I for one would be interested in a gender-enrollment breakout by course (remember again that the pivot tables presented here will have filtered out NAs, blanks, and the like), a simple enough proposition, I think:


Row Labels: course_id

Column Labels: gender

Values: course_id (Count; show as % of Row Total)

I get



Pretty striking, I’d say. Even as the cumulative female presence fills but a hairsbreadth more than a quarter of the MOOC virtual seats (again, these are per-course data; many students have booked multiple courses), the variation is end-to-end, streaking from a nadir of 5.55% for MIT’s 2.01x – Elements of Structures class (sounds like a course on Lego to me), to an acme of 49.23% for Harvard’s pH278x – Human Health and Global Environmental Change section. The numbers and the disparities clearly mean something – after all, the MOOCs are student self-selected – but the interpretations require some care, textured by the understanding that the course types themselves impose a skew on the findings.
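The per-course gender breakout is a crosstab with each row rebased to 100%. Reduced to essentials in Python, with invented enrollment records (the course ids below are illustrative, not the sheet’s exact strings):

```python
from collections import defaultdict

# Invented (course_id, gender) records; one row per per-course enrollment.
enrollments = [
    ("MITx/2.01x", "m"), ("MITx/2.01x", "m"), ("MITx/2.01x", "m"),
    ("MITx/2.01x", "f"),
    ("HarvardX/PH278x", "f"), ("HarvardX/PH278x", "m"),
]

# Count per course and gender...
counts = defaultdict(lambda: defaultdict(int))
for course, gender in enrollments:
    counts[course][gender] += 1

# ...then recast each count as % of its course's row total.
row_pct = {
    course: {g: round(100 * n / sum(genders.values()), 2)
             for g, n in genders.items()}
    for course, genders in counts.items()
}
```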

Now if you want to heighten the granularity to the university level – that is, gather figures on enrollments by school, of which there are only two here, of course – we’ll need to derive those data formulaically, by dislodging the school names from the course_id contents into a new field. We want then to see “Harvard” in splendid isolation in that field, and not HarvardX/CB22x/2013_Spring, for example. My idea: cell-point to the nearest free column – for me, U – name it Institution, and enter in U2:


That elementary expression buys into the either/or condition of the university names. If the leftmost character in A2 is H, that yield inarguably stands for Harvard; anything other than H and the school has to be MIT.
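The expression itself was presumably on the order of =IF(LEFT(A2,1)="H","Harvard","MIT") – my reconstruction, not a quotation. Its either/or logic in Python, against sample ids that mimic the sheet’s pattern:

```python
# Sample course_id values on the HarvardX/... and MITx/... pattern.
course_ids = ["HarvardX/CB22x/2013_Spring", "MITx/6.00x/2012_Fall"]

# The either/or described above – a leading "H" means Harvard,
# anything else means MIT – "filled down the column".
institutions = ["Harvard" if cid.startswith("H") else "MIT" for cid in course_ids]
```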

Copy the formula down the column and lower a Copy > Paste Special > Values atop it all in order to free up a byte or two. Now try something like

Row Labels: Institution

Column Labels: gender

Values: Institution (Count)


What’s noteworthy and by no means expected (for me at least) is Harvard’s overall enrollment edge, in view of the fact that its five courses are far outnumbered by MIT’s eleven. Turn the Count into % of Row Total terms and:


A spacious inter-institutional gender gap yawns before us like a somnolent student (again, you don’t need grand totals here. You’d probably want to center the row labels, too). There are assumedly some provocative things to be learned here, over and above the near-tautological aside that women seem to prefer the Harvard courses.

And we could ask if age differentials mark the Harvard/MIT divide as well. Remember that we’re working with EdX’s native workbook now, and so we need to eliminate the NAs and blanks from the YOB field here too. As per the earlier precedent we could run a Find and Replace on the NAs, with 2013 standing in for them. The blank cells here, however, need to be treated differently. Remember that our first Find/Replace go-round had to replace cell entries that actually read (blank), because these were the copy-and-pasted outcomes from a pivot table, in which wholly blank cells are indeed labelled (blank). But the EdX data records blanks with nothing; those cells are…blank. Find and Replace here thus necessitates placing absolutely nothing in the Find field – but that also means that you can’t kick off the Find/Replace by simply clicking the I column heading and going ahead. That rudimentary act would select all 1,000,000-plus cells in the column, hundreds of thousands of which are blank, of course, and which would receive a 2013 in turn, even as they have no relation to the MOOC data. The tack then is to click in the Name Box, type I2:I641139, and press Enter, selecting those and only those cells associated with the student records. Then Find cells bearing nothing, so to speak, and replace these with 2013.
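The point of the Name Box detour is to bound the replacement to the record range. The same idea in Python, with `None` standing in for a truly empty cell and the range cutoff invented for illustration:

```python
# A column with genuinely empty cells (None) inside the record range, plus
# a trailing cell beyond the records that must be left untouched.
column = ["1985", None, "1990", None]
record_rows = 3  # stand-in for the I2:I641139 selection

# Replace blanks with the 2013 sentinel only within the selected record range.
filled = [
    "2013" if (cell is None and i < record_rows) else cell
    for i, cell in enumerate(column)
]
```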

Anyway, folks, when the priming is done you can try this:

Row Labels: YOB (again, grouped by three-year spans, after which the post-1999 bins are filtered, as before)

Column Labels: Institution

Values: Institution (Count, % of Column Total)


Look closely and you’ll find meaningful heterogeneity across the schools. MIT students appear to be younger, as well as more overwhelmingly male; 72.89% of them were born in 1985 or later, but the like figure for Harvard is 59.98%. Different courses, different course populations.
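Those born-in-1985-or-later shares are simple fractions once the pivot has counted the bins. A quick check of the arithmetic on invented birth-year lists:

```python
# Invented birth years for the two schools' registrants.
yob_by_school = {
    "MIT": [1990, 1987, 1975, 1992],
    "Harvard": [1980, 1991, 1970, 1960, 1988],
}

# Share of each school's students born in 1985 or later.
shares = {
    school: round(100 * sum(y >= 1985 for y in years) / len(years), 2)
    for school, years in yob_by_school.items()
}
```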

All of which means I think we need a Part 3.