Campaign Contribution Data, Part 1: The New York Numbers

20 Sep

Read any good campaign financing reports lately? OK – it’s precisely that kind of question that explains why you cross the street when you see me coming your way.

Social skills aren’t my strong point, but I won’t be deterred. If you want – or need – to know who’s paying for America’s presidential candidates this year, i.e., Clinton, Trump, and everyone else past and present, click into the Federal Election Commission’s site and you’ve come to the right place. All kinds of spreadsheets in there aim to give the taxpayers their money’s worth, including a state-by-state drill-down of contributions, among which of course are the New York data. It’s a big state and a big sheet, so please download it via the above link and its ensuing NY.zip reference:

vote1

The sheet’s 391,513 records log candidate contributions harking back to October 2013 and carrying the activity up to August 1 of this year (really July 31; the lone August entry on the books notches a donation of $0, and I’m not sure what it means to tell us), a short while before both major-party conventions came and went. 3,204 of these records, though, mark refunded (or redesignated) gifts whose figures are expressed as negative numbers, and which presumably have to be squared with some equivalently positive sum. But before we slide that and other attention-begging data issues beneath the microscope, rest assured that all the monetary sums in column J – contb_receipt_amt – and the dates in K – contb_receipt_dt – are uniformly numeric. Cool – 783,026 cells you don’t have to worry about. And contbr_st in F is wholly ignorable: because the sheet by definition records contributions issuing from New York, NY is the only entry you’ll find in the field. But you will have to auto-fit a number of the dataset’s columns, a rather expectable necessity.
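If you want to corroborate that negative-value count yourself, a quick COUNTIF will do it – a minimal check, assuming the 391,513 records occupy rows 2 through 391,514:

=COUNTIF(J2:J391514,"<0")

That should return the 3,204 cited above.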

Of course the data’s essential investigative question asks how much money each candidate has accumulated; and the answer should be straightforwardly realized via a pivot table. But before the table happens, something needs to be done about those 3,204 negative values. It seems to me that their manner of disposition here could be argued about, but I hold that the negative entries should simply be ignored – because the gifts they reverse were actual contributions that, for whatever reason, were later recalled, e.g. by a campaign shutdown (I’m assuming none was voided for reasons of impropriety), and so they should still be accounted in the candidates’ sums. These were monies that were raised, after all.

Again, that position could be contested, but I’ll go with it for the time being; and if I’m then to ignore the negative contributions I can appropriate the next available column, S, call it Contribution, and enter in S2:

=IF(J2>0,J2,"")

We’re simply stipulating that if the actual contribution in the corresponding J cell exceeds zero, use it; otherwise, return a double-quote result that will neither be tallied by the COUNT function nor acknowledged at all by AVERAGE (it will be noted by COUNTA, however).

Copy the formula down S and then proceed:

Rows: cand_nm

Values: Contribution (Sum, formatted to two decimals with Use 1000 Separator (,) checked. Note, ironically, that no comma punctuates the 1000 in the checkbox’s own label.)
Contribution (again, here Show Values As > % of Column Total)

I get:

vote2

We see Hillary Clinton’s extraordinary sway over the New York donor cohort (and no, I haven’t heard of all those candidates), her receipts trumping the Donald’s 40 times over. Her margin over erstwhile rival Bernie Sanders is more than nine-fold, though of course the latter’s campaign exchanged the hat for the towel some time ago. What’s also nice about the data, as it were, is the steadfast uniqueness of the candidate names in Rows. It’s a pleasingly no-tweak list: no faux duplicates among the 24 aspirants, no alternate names, no misspellings.

Having gotten this far you may want to break the data out additionally by date, a determination that asks a presentational decision or two of us. First I’d whisk the second Contribution entry – the % of Column Total reading – out of the table, because the data subdivisions about to be instituted by the date parameter would inflict something of a mess on two sets of side-by-side values.

Next we – or you – need to do something about date granularity. If you install contb_receipt_dt in Columns and group its data by Years and Months, 27 columns of numbers pull across your screen (in addition to Grand Totals), an unfolding that might or might not serve readability and/or your analytic designs terribly well. Group by Years and Quarters, on the other hand, and the table retracts to 10 columns. Note, though, that the final item here – Qtr3 for 2016 – comprises but one month, July.

Whatever you decide, you won’t be able to miss Clinton’s recent lift-off in her bank account (and, pitched to a smaller magnitude, that of Trump), as well as the reddening of the books of failed candidates, their negative numbers presumably signalling monies returned in the aftermath of campaign shut-downs.

You may also want to pursue a corollary interest in the number of contributions and their per-candidate averages. Now that we’ve assumed our stance on the negative-number issue, we can pivot-table thusly:

Rows: cand_nm

Values: Contribution (Count, no decimals)

Contribution (again, by Average, two decimals, 1000 separator)

I get:

vote3

Some remarkable disclosures emerge. We see that contributions to the Sanders campaign actually outnumbered Clinton’s (as of July 31), but their average amounts to barely one-tenth of the nominee’s per-gift figure, a most interesting and meaningful bit of stratification.

But not so fast. If you sort the contributions you’ll find nine Clinton contributions of seven (or eight) figures described as Unitemized; presumably these mighty aggregates enfold a great many individual gifts, and have the effect of sky-rocketing Clinton’s apparent average. In fact, those nine total over $41,000,000 – more than half the Clinton total – and mincing these into their discrete donations would likely roll up a Clinton contributor count that would overwhelm the Sanders figure – and ruthlessly depress the Clinton per-gift average at the same time. There’s nothing in the worksheet, however, that can tell us more about those nine behemoths. Some deep background, please. Note Donald Trump’s low average, along with the 10,000+ contributions to all candidates amounting to precisely $2,700 each, an apparent obeisance to the maximum permissible contribution of an individual to a candidate’s committee.
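If you’d like to verify that $2,700 count for yourself, a one-cell check will do – a minimal sketch, assuming the Contribution helper values from earlier still occupy S2:S391514:

=COUNTIF(S2:S391514,2700)

The result should land in the 10,000-plus neighborhood alluded to above.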

All very interesting. But there’s more to be said about the New York numbers – and the data aren’t always no-tweak.

US Patents: Inventing an Angle on the Data

12 Sep

Humans are nothing if not a resourceful lot; and one telling record of the human will to innovate continues to be written by the United States Patent and Trademark Office, whose tally of certified inspirations originating in both American states and other countries makes itself available here:

http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cst_utl.htm

You’ll note the site wants us to know that “…a user may import this report file into spreadsheet software such as Microsoft Excel to facilitate printing.” Could be, but we’re not instructed how that import might be carried out, apart from the decidedly public-domain select-copy-and-paste to which I’ve resorted here:

united-states-patent-data

You’ll recognize that the State/Country data in column A have been ordered to wrap their text in awkward places; turning off the Wrap Text effect, in conjunction with a column auto-fit, should set things straight, literally. You should also delete the wholly blank B column, because it’s…blank.

Now before you get to work you may want to flit across the aggregated data in rows 2-4. Here we see the striking and determined ascendancy of international patents, surmounting the US total for the first time in 2008 and continuing to pull away from the host country’s numbers ever since. That means something, of course, perhaps speaking more to the alacrity of knowledge development worldwide than to any becalming of the imagination in the States. After all, patents in the US boomed 45% across the 2008-2015 frame, too.

But while those are most interesting data indeed, they don’t need us to discover them; they’ve already been compiled. We, however, may want to work with the state and country details arrayed below by pivot tabling them, and if we do, a round or two of preliminary decision-making needs to be thought through.

First, you’ll note the two rows separating the US state data (and these include American territories such as Puerto Rico, as well as Washington DC, the country’s capital that doesn’t quite hold state status) from those of the other countries, a segregational practice of which you’ll probably approve. States are states and countries are countries, after all, and any admixture of the two might confuse the analysis – maybe. There might be some profit in comparing patents from California with those issuing from Poland, but that prospect leaves me hesitant, though I suspect there’s a minority view to be propounded here.

But assuming you’re pleased with the status quo you’ll need to copy the header row in 1 to rows 7 and 67, thus topping the two data sets appropriately. Next – if we’re pivot tabling – I’d delete the data in the All Years field, because keeping them in their place will risk a double-count of the patent numbers once the tables do their aggregating thing. Left as is, All Years will be regarded as just another year, and tabulated accordingly.

Next, you could treat each row to a Sparkline line chart, identifying C8:P65 as the data range and thereby ensuring that (what is now) the B column data – bearing pre-2002 sums of patents dating across the 1963-2001 span – are excluded from the line construction. 39 years’ worth of patents can’t be made to comport with the year-by-year counts that adjoin them, it seems to me. Select the Q8:Q65 location range, and your OK will plot some interesting, and very far from uniform, trajectories down the range, e.g. Delaware’s one-year dive in patent production last year, or South Carolina’s 140% splurge between 2008 and 2015 (a quick check of that figure appears after the next screen shot). Overall, the lines certainly describe a cross-state, upward pull in patent yields, but again the anomalies, e.g. Louisiana,

patent1

could be deemed at least slightly provocative. Remember of course that Sparklines’ conformations are drawn relative to their ranges’ values; South Carolina’s spike cannot be matched in absolute terms to California’s, for example, and the very small patent numbers for the US territories need to be understood as such, whatever their Sparklines look like.
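As for that South Carolina figure, a simple growth formula corroborates these eyeball readings – a sketch, assuming 2008 lands in column I and 2015 in column P of the pasted layout, and applied here to whichever state occupies row 8:

=(P8-I8)/I8

Format the result as a percentage; copy it down an adjacent free column and you have a numeric companion to every Sparkline in the range.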

Those qualifications understood, we could likewise apply Sparklines to the country data. Note that Czechoslovakia’s string of zeroes (following its pre-2002 2121) reflects that country’s reorganization into the Czech Republic and Slovakia, both of which do present substantive patent numbers. The country slopes are remarkably comparable, though some curious investigator might want to do some wondering about Italy’s slump in 2007 and its subsequent bounce back.

And now I’d like to perform a bit of a back-track. Earlier I had reflexively begun to prepare the patent data for some pivot tabling refinements, e.g., by offering that the empty B column should be made to disappear, and by filing the usual brief for a data-set row header. But we – or I – need to ask: given the nature of the rows comprising the data set(s), what might a pivot table really teach us? We need to remind ourselves that pivot tables are, in the first instance, instruments of aggregation, and there doesn’t appear to be much among the state/territory data to aggregate – no recurring, labelled entries there to submit themselves for consolidation, at least not as the data presently stand.

It might be possible, for example, to append a Region field to the country patent data by recalling that B column we had so unblinkingly eliminated, and devising a set of items meant to recur – e.g., Europe, Africa, Oceania, etc. – but that device would impose other problems on the set, the ones hinted at in my oft-cited discussion of dataset reconstruction, which is perhaps worth reviewing if you’ve decided to pivot table the data anyway.

So given the data as we’ve received them, and appreciating that rows 1-4 have already done some of the aggregating work for us, the Sparkline strategy might stand as among the most trenchant services we could do for the data – though that superlative leaves me hesitant, too.

But if we’re Sparklining, there’s another subtle point to file away in your presentational rolodex. I had earlier advised that the erratic line breakages inflicted on the labels in the A column by the sheet’s Wrap Text effect should be smoothed away by simply turning off the wraps; and in fact that’s a near-imperative move for the Sparklines. That’s because the wrapped-text feature by definition heightens the rows whose text it wraps; and because, unlike conventional Excel charts, Sparklines are cell-bound, they’re going to be heightened too – and the resulting distortion will skew the relation of the X to the Y axis.

I like that observation. Can I at least copyright it?

Team USA Stats, Part 2: Some Data Gymnastics

4 Sep

You can’t tell the players without a scorecard, they’ll tell you in the States, and you can’t tell the data without the formulas.

You’ve heard more memorable pronouncements than that opener, I’ll grant, but that less-than-bromidic avowal above makes sense. We saw in the last post how the Team USA height, weight, and birth state data threw more than a few curves at us, and baseball isn’t even an Olympic sport any more (it’s coming back in 2020, though); fail to straighten out the data curves and your analysis will be straitened.

Now that line’s a few sights more memorable, but I do go on. Our next round of data inspection takes us through the DOB (date of birth) field, and we want to get this one right, too. Quality control here starts by aiming a COUNT function at the range J2:J559 that populates DOB. Since dates are numbers, and since COUNT only acknowledges a range’s duly numeric data, we should aspire here to a count of 558, or one date per cell. But my COUNT totals 520, a shortfall that exposes 38 dates manqué that, facades aside, are nothing but text.
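The check itself is a one-liner – assuming, per the above, that the DOB entries indeed fill J2:J559:

=COUNT(J2:J559)

Anything short of 558 flags the presence of text masquerading as dates.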

Our COUNT is all the more incisive in view of the fact that every entry in the J column really does look like an unassailable date. But now that we’ve made ourselves aware of the turncoats among the loyal numerics, we’d do well, if possible, to rehabilitate them into the real things too. Here’s a pretty painless way: commandeer the next free column (it could be S, if you’ve left the formulas from last week’s exercises in R alone; if so, you’ll have to format S in Date terms), title it Birth Date, and enter, in row 2:

=IF(ISTEXT(J2),DATEVALUE(J2),J2)

And copy it down. The formula asks if the entry in J2 is textual. If it is, the DATEVALUE function – a rather useful transformative means – turns pure text such as 12/14/1989 (if that expression has been stored as text) into 12/14/1989, the date version. If the entry in J is an authentic date, on the other hand, the formula simply returns the cell entry as is.

Surprise – my eminently sensible tip doesn’t always work. Copy the formula and you’ll be treated to nine #VALUE!-laden cells; a first review of these implicates an old nemesis – a superfluous space in the source field, e.g.:

olyp1

DATEVALUE can’t handle that textual intrusion, and so this refinement:

=IF(ISTEXT(J4),DATEVALUE(TRIM(J4)),J4)

Looks right, because TRIM’s job is to make superfluous spaces unwelcome.

But that revision doesn’t work either. Puzzled but intrigued, I went for the next move: a copying of one of the problem entries in J into Word, wherein I turned on the Show/Hide feature (in Home > Paragraph) that uncovers normally unseen codes. I saw:

olyp6

Look closely and you’ll detect that special character to the immediate left of Word’s paragraph symbol. That Lilliputian circle appears to signal a non-breaking space, and apart from any conjecture about what it’s doing there and why, we know one thing: it isn’t an ordinary space, and thus won’t be trimmed.

Thus again, the fix here may be short on elegance but long on common sense: edit out the circle from the nine problem entries in J (in fact some of the cells require two taps of Backspace in order to rid them of their codes; and if you first click on an actual date and sort Largest to Smallest, all the #VALUE! errors will cluster at the bottom).
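A more formulaic route is also conceivable – a sketch only, and one that assumes the culprit really is the standard non-breaking space, whose character code is 160:

=IF(ISTEXT(J2),DATEVALUE(TRIM(SUBSTITUTE(J2,CHAR(160)," "))),J2)

Here SUBSTITUTE swaps the non-breaking space for an ordinary one, which TRIM can then happily discard before DATEVALUE does its converting.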

The manual edit works, leaving us with yet one last irritation – the birth date of Shooting team member Daniel Lowe, which reads

11/181992

No need to get fancy here – just enter that slash.

Now, I think, you can go about your analytical business, e.g., breaking out athletes by birth month. You may recall my consideration, in a piece on 2012 Olympic data, of the alleged August effect, in which American athlete births in that month appeared to significantly depart from a chance prediction. Let’s see what the current team data tell us, via a pivot table:

Rows: Birth Date (grouped by Month)

Values: Birth Date (Count)

Birth Date (again, by % of Column Total)

(We don’t need Grand Totals here). I get:

olyp2

Here August – generally the most fecund month in the United States – shares the modal figure with June and March, its proportion substantially smaller than August’s 2012-team contribution. The numbers here simply suggest no special birth skew impacting the US complement, at least for this Olympics.

We now can also calculate each athlete’s age in conjunction with the most able assistance of the nifty and unsung YEARFRAC function. Enter the Olympics’ start date – August 5, 2016 – in any available cell, name the cell start, and proceed to column T, or whichever’s next available on your sheet. Name it Age and in row 2 try (assuming the corrected dates lie in the R column):

=YEARFRAC(R2,start)

YEARFRAC calculates the distance in years between the two dates on either side of its comma. Thus, cell-reference four-gold-medalist Katie Ledecky’s birthday – March 17, 1997 – in YEARFRAC, and with the start date cell holding down the second argument you get 19.38, Ledecky’s age in years on day one of the Olympics (note that you can’t actually type 3/17/1997 into the function, because YEARFRAC will treat the entry as text. You need to either reference the cell bearing that date or enter 35506, the date’s native numeric equivalent).
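Another way around that problem, for what it’s worth, is the DATE function, which manufactures a bona fide date serial from its year, month, and day arguments – so, for the Ledecky example:

=YEARFRAC(DATE(1997,3,17),start)

should likewise return 19.38.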

Copy down the column and this pivot table beckons:

Rows: Sport

But guess what…

olyp5

Yep, it’s that superfluous space thing again, this time practicing its mischief on four records among the Track and Field data. The simplest repair in this case, as it turns out: select the Sport field and run a Find and Replace at the column, finding Track and Field[space] and replacing it with Track and Field. That works because, in this case, each of the errant four has incurred just one trailing space.

Now introduce the Age field to Values (Average, formatted to two decimals). Bring back Age a second time, now exhibiting Count sans decimals. If you sort the results Largest to Smallest you’ll see the 12-member equestrian team holding down the age-senior position, with Team USA’s eight boxers computing to a lowest-age 20.75.

We could also correlate average athlete weight by event, an association which might drum up some less-than-obvious numbers, e.g.

Rows: Sport

Columns: Gender

Value: Weight (Average, formatted to two decimals)

I get:

olyp4

Of course the per-team numbers are small, but they make for interesting reading, particularly the respective by-sport gender disparities (and note some absent teams among the men’s delegation).

I was surprised by the greater average weights of the two basketball teams measured against their rugby colleagues, even if the latter is (officially) the contact sport. And I did a double-take when I caught up with the respective boxing team weights; women boxers outweigh their male teammates by an average of 18 pounds. But here we’ve been thrown a sampling curve – the six male pugilists are concentrated in the lower weight divisions, even as the women – comprising exactly two boxers – weigh 132 and 165 pounds.

Eek  – there was a lot of hard work to do in there; I think I deserve a podium finish for this one.

Team USA Stats, Part 1: Some Data Gymnastics

26 Aug

Now that Super Mario has bored his way back to Tokyo, let us praise Great Britain’s mighty Olympic team, and its world’s best 67 medals, followed by the United States and its 121.

Don’t read that twice – just say hello to the New Math. Here in England, where the media put the mallets to their collective tympani for all-Team GB all the time, one’s ear had to be pressed very close to the radio for news about any athletic glory redounding to anyone else.

But ok. Two weeks of harmless sporting jingoism does wonders for the commonweal, one supposes, and so now I can tell my co-residents here that, glory aside, the United States Olympic team has something the British contingent doesn’t: a spreadsheet about its members, available worldwide here:

http://www.teamusa.org/road-to-rio-2016/team-usa/athletes

Just click the Sortable Roster link.

The workbook’s name could be asked about for starters, because properly structured, any data set should be agreeable to sorting. You’ll also take note of the cell borders sectioning off respective sport (team) rosters, demarcations that no longer demarcate once one takes the workbook up on its offer and actually sorts the data by say, Last Name or Height. Because the borders will remain exactly where they were drawn – even when the sorts reorder the records – they’ll now be bordering near-random assortments of athletes.

But now to the data. The Team USA site lets us know that 124 of the team’s 558 members, about 22%, are California-born, an impressive disproportion over and above the state’s 12% contribution to the American demographic whole. If we want to break team representation out by all states, then, a pretty straightforward pivot table should be up to that task:

Rows: Birth State

Values: Birth State (count)

Straightforward, but problematic, e.g. this excerpt:

oly1

We’ve seen this before, and now we’re seeing it again. The Olympics may encourage diversity, but promoting disparate spellings of the same state name is grounds for a DQ, at least in this event. Note the pairs of Calif., Colo. and Fla. in the screen shot, a spate of duplications (and there are others in there) fomented by superfluous spaces. Note as well the misspelled Cailf., and it seems that full attention hasn’t been paid to the business of getting the data in shape.

But that’s where we come in. First we can sprint over to column R, the free column alongside SCHOOL/COLLEGE. The rows in R seem to have been formatted as Text, and so I clicked the R heading and redefined the column in Number terms. Then in R2 I entered, simply:

=TRIM(L2)

And copied it down R, selected and copied those results, and pasted their values atop the L entries. (Having discharged that remit you could then go ahead and delete the contents of R.)

That standard corrective works, so far as it goes, but it won’t respell Cailf. That kind of repair might require a record-by-record edit that could make washing your socks seem exciting by comparison, though I for one would opt for the former activity (and discrepancies notwithstanding, I also get just 113 Californians, 111 if you break the residences out by the Current State field instead. I’m also not really sure what distinguishes Hometown State from either the Birth or Current State identifiers). But if you do need to know about team-member state origins (and non-American birthplaces as well), this kind of work just has to be done. Sorry.
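(As a rough cross-check on those state counts – assuming the trimmed Birth State entries sit in L2:L559 and that California is rendered as Calif. – something like

=COUNTIF(L2:L559,"Calif.")

should report the Californian complement, though of course it won’t catch stragglers such as Cailf.)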

And what about athlete weights, a numeric metric that could be productively associated with sport, height, and gender, and perhaps even date of birth? Don’t be disconcerted by the left alignments, but here too we meet up with an issue – namely, the more than 50 weights that sport (ok – pun intended) their values in text format, tending to cluster among the Rugby, Golf, and Equestrian members, by the way. But this gremlin is easily sent on its way: sort the field largest to smallest, thus packing all the text data at the very top of the field. Select the problem data in I2:I56 and click the attendant exclamation-pointed notification:

oly2

Click Convert to Number, and the weights acquire real, measurable poundage (note the weight and height for gold-medal swimmer Ryan Held are missing).

But what about the Height data? The metaphor may grate, but the entries here are squarely interstitial, purporting quantitative information in wholly textual mode. As expressed here, 5’11” is nothing but text; if you want that datum to assume a usably numeric form, this recommendation asks you to convey the above height in its cell as 511 instead, and impose a custom format upon it that interposes those apostrophes between the “feet” and “inch” parameters. Either way the entry is really 511, and that value may not work with your aggregating intentions. Another tip would have you enter a height in inches – in our case 71 – and formulaically dice the number into a foot/inch appearance, which nevertheless again ships the data in text status.
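That second tip, by the way, might look something like this – purely a sketch, with U2 standing in as a hypothetical cell holding the 71:

=INT(U2/12)&"'"&MOD(U2,12)&""""

which returns the text string 5'11" – presentable, but, as noted, still text.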

In any case, we need to deal with the data as we have them, and I’d allow that the simplest course is to get these labels into numeric mode, i.e. inch readings. In that connection, I’d return to column R, title it Height in Inches or some such, and enter in R2:

=VALUE(LEFT(H2,1)*12+VALUE(MID(H2,3,LEN(H2)-3)))

To translate: the formula commences its work by detaching the first character in H2 – a 5 (I’m working with the default arraying of athlete records here, the first of which posts a height of 5’11”), and ascribes a numeric value to it via VALUE, supported by the given that all foot-heights should comprise one digit. That result is next multiplied by 12, yielding 60 inches thus far. I then isolate the 11 in 5’11” by applying a MID function to the task. The LEN(H2)-3 argument that registers the number of characters MID is to extract from the entry in H2 reflects the fact that any entry in the H column should consist of either 4 or 5 characters, e.g., 5’11” or 5’6”. Subtract 3 from either count and you come away with either 1 or 2 – the number of characters MID needs to pull from the entry in order to capture its inch value. Thus in our case we can add 60 and 11, culminating in 71 inches for the archer Brady Ellison. Copy the formula down R and eliminate the decimals, and our heights should be ready for the next round of analytical moves.

Almost. It seems my post-copy vetting of the height-in-inches data in R reports more than a dozen #VALUE! notifications – because some of the heights in the H column look like gymnast Kiana Eide’s 5’3, or indoor volleyballer Thomas Jaeschke’s 6-6. Neither reveals the inches punctuation, and Jaeschke’s height buys into a different notation altogether; and my formula can’t handle those discrepancies.

So it’s time for a Plan B. First run this find-and-replace on the heights in H:

oly3

(That is, replace the inch quotes with nothing.) That pre-formulaic fix should eliminate all the inch punctuations, directly exposing the inch numbers to the right of the cell. Then in R2 write:

=VALUE(LEFT(H2,1)*12+VALUE(RIGHT(H2,LEN(H2)-2)))

What’s changed here is the latter half of the expression, which now splits 1 or 2 inch characters from the right of the cell, depending on the single or two-character length of the inch totals. Copy this one down R and we should be in business.

Not. Two utterly obstinate athletes, field hockey aspirant Jill Witmer and soccer player Lindsey Horan, feature a single apostrophe beside their inch figure, a minuscule disparity that defeats my best efforts at a global formula rewrite – along with the data-less Ryan Held. Here discretion trumps valor – I’d just delete the incorrigible apostrophes and Held’s #VALUE! message, and take it from there. Now I have real heights.

Ms. Witmer – or whoever entered her data – sure is playing hockey with my fields.

The College Transcript: Downloading an Upgrade

19 Aug

It’s homely, to be sure, but if you want to go somewhere you gotta have one. And no – I’m not talking about your passport photo but your college transcript, an obstinately prosaic but indispensable means of entrée to your next academic or career step.

The transcript – an enumeration of a student’s courses and performances gathering into what we Yanks call the Grade Point Average (GPA) – has undergone a rethink of late. A piece in insidehighered.com this past February trained its lens on a number of initiatives aiming to drill qualitative depth into the transcript’s tale, sinking some analytic teeth into its default, alphabetically-driven narrative by linking its grades to students’ work and detailed progress toward a degree.

And that got me to thinking: if it’s depth we’re seeking, why not endeavour to learn something more from the numbers and the letters by re-presenting the transcript as a…spreadsheet?

It makes perfect sense to me, though you’d expect me to say that. But after all: submit a transcript to some standard tweaks and you wind up with a dataset, one suitable for sorting, pivot tabling, formulaic manipulation, and charting, too. And once the tweaking stops, the transcript can treat its readers to a round of different, edifying looks at the data – and that’s what I call depth, too.

Transcripts aren’t things of beauty, but they sport no small amount of variation just the same. So to understand what we’re working with, more or less, take a look at this one – the transcript of Mr. Ron Nelson, who made his academic record available here:

trans1.jpg

In the interests of exposition, I’ve subjected the baseline transcript above to a round of fictional retakes that of course don’t represent Mr. Nelson’s actual attainments (for one thing, his record dates back nearly 25 years). A few of the points that call for renovative scrutiny, then: First, note the blank column coming between the Course No. and Course Title columns, an excess which must, for spreadsheet purposes, be curtailed. Second, the multi-columned iterations of Course Titles and associated grades need to be cinched into a single field. Third, the academic term headings (e.g. Spring Semester 1991), and the TERM TOTALS and CUMULATIVE TOTALS lines, have to be sent elsewhere; they report information that simply isn’t of a piece with the grade/grade point records that the dataset should comprise.

Next, if you’re new to the GPA you need to know how that defining metric is figured. While of course variation again abounds, the textbook illustration looks something like this: class grades are typically assigned along an A-to-D continuum along with a failing F, in what are in effect quantified decrements of a third of a point, e.g., A, A-, B+, B, etc. In the typical system an A earns 4 points, an A- 3.67, a B+ 3.33, and so on (the D- grade is rarely offered, by the way). An F naturally awards no points.

Each grade-point achievement is in turn multiplied by the number of credits any given course grants, resulting in what are usually called quality points. Thus a B- grade in a three-credit class yields 8 quality points – 2.67 times 3. An A in a four-credit course evaluates to 16 quality points, or 4 times 4. The GPA, then, divides the sum of quality points by the sum of credits achieved. Thus this set of grades:

trans2

Works out to a GPA of 2.92.
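Spreadsheet-wise, that division reduces to a single formula – a sketch assuming a hypothetical layout in which the per-course point values occupy B2:B6 and the corresponding credit counts C2:C6:

=SUMPRODUCT(B2:B6,C2:C6)/SUM(C2:C6)

SUMPRODUCT multiplies each grade’s points by its credits and sums the results – the quality points – which are then divided by the credit total.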

It’s pretty intelligible, but with a proviso. The GPA must incorporate the number of credits associated with a failing grade into its denominator, and so these performances:

trans3

Calculate to a GPA of 2.33. But the 15 credits recorded above really only bestow 12 usable credits upon the student, and that dual count needs to be remembered.

With that extended preamble noted, my spreadsheet-engineered transcript demo (spreadscript?) presents itself for your consideration here:

Transcript demo

In excerpt, the sheet looks like this:

trans4

Note the paired Term and Date columns; though one might be moved to declare the former field superfluous, it seems to me that its textual Spr/Fall entries could enable a pivot table broken out by seasonality, i.e., GPAs by all Spring and Fall courses across the student’s academic career. The Date field, on the other hand, is duly numeric, thus lending itself to chronological resorting should the current sequence of records be ordered by some other field. And the grades have been visually differentiated via conditional formats.

The Credits total in the upper right of the screen shot reflects a necessary bypassing of the F grade for Music 101 per our earlier discussion (the grades are stored in the F column), and is realized by this formula:

=SUMIF(F:F,"<>F",G:G)

The SUMIF here is instructed to ignore any F in the F column via the “not” operator bound to the formula’s criterion. Note the quotes required by SUMIF for operators clarifying the criterion. The GPA, on the other hand, divides the quality point total by all 112 credits (you will have noted that the spreadsheet outputs the quality points in H via a lookup array range-named gr in Q1:R10. And in the interests of simplicity I’ve let subsidiary course events and their codes, e.g., class withdrawals and incompletes, go unattended).
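In case it helps to see it spelled out, that quality-point lookup in H presumably reads something like the following – a sketch that assumes the letter grades sit in F and the credit hours in G, with gr mapping each grade to its point value:

=VLOOKUP(F2,gr,2,FALSE)*G2

The exact-match FALSE argument matters here; we don’t want a B+ approximating its way to some neighboring grade.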

Now the data become amenable to pivot tabling and other assessments. For example, if we want to break out GPAs by term we can try:

Rows: Date (You’ll want to ungroup these, if you’re working in release 2016)

Values: Hours/Credits (Sum)

Quality/Points (Sum, rounded to two decimals)

Because we need to total the date-specific quality points and divide these by the respective-date credit totals, a calculated field must be implemented, e.g.

trans5

Click OK, again round off to two decimals, and you should see:

trans6

Once the GPA field is put in place you can, for example, break out credit accumulations by Discipline, or subject, by replacing Date with Discipline:

trans7

Or try a frequency analysis of credit totals by grade:

Row: Grade

Values: Hours/Credits (Sum)

trans8

(Note: because of the priorities with which Excel sorts text characters, grades accompanied by the + symbol initially appear at the bottom of any letter sort, e.g., you’ll initially see B, B-, and B+. You’ll need to right-click the B+ and select Move > Move “B+” up twice. And of course the same adjustment should be applied to C+.)

Of course these outcomes could be charted, e.g.

trans9

And if you are interested in seasonality:

Rows: Term

Values: Hours/Credits

Quality Points (Both Sum, and both rounded to two decimals)

GPA

trans10

(By the way, you’re not duty-bound to earmark Hours/Credits and Quality Points for the table if you want to display GPA at the same time. Once constructed, GPA becomes available in its own right, and need not be accompanied by its contributory fields.)

And all these and other reads on the data could be assigned to a dashboard, too.

Thus the transcript-as-spreadsheet could break new presentational ground, supplementing the row-by-row recitation of subjects and scores that students and recipient institutions currently face, with a suppler way around the data. They could even be made unofficially available to students themselves via download, empowering the spreadsheet-savvy among them to map and understand their grades in novel ways (one trusts that no one’s accepting a transcript bearing a student’s email address).

But is it a thing of beauty? Maybe not, but don’t you like the colors?

Notes on a Continuing Saga: More Trump Tweets

12 Aug

It has been a tumultuous three months for Donald Trump, his party, and his country, and probably not in that order. His nomination at the Republican convention, his wife’s sounds-familiar speech there (truth be told, with two lifted passages in toto her rhetorical trespasses probably wouldn’t even get her thrown out of school), and his barrage of subsequent, incendiary pronouncements have made for an interesting campaign, no?

I’m counting those three months’ worth of controversy here, because we last paid a visit to Mr. Trump’s tweet account about that long ago, and you doubtless want to know what communicative mischief he’s been up to in the interim.

So I retraced my steps back to the trusty twdocs.com site for yet one more take-out order of Trump’s latest dispatches from the hustings – and he has been dispatching, to be sure. (Note: not knowing twdocs’ distribution policy on its downloads, I have again not made the workbook available here. If you can filch $7.80 from petty cash you’re in business, though. Note in addition there may be some issues with opening the downloads in Excel 2016. Contact twdocs if events warrant.)

Since May 10, the date of his final tweet considered in my May 12 post, the man who put the candid in candidate has pumped out an additional 983 tweets, broken out thusly:

t1

While his output was never neatly curved, Trump’s tweet numbers have been patently tempered of late, notably down from his January-February distributions of 481, 471, and 418. One might be moved to explain the July spike with a guess about a tweet frenzy stoked by the Republican nominating convention July 18-21, but the nominee signed off on 37 tweets in the course of that four-day event – in keeping with the remainder of his July activity, though his 18 tweets on the 21st do jostle the average.
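Should you want to replicate that convention-week count, a COUNTIFS keyed to the date field will do the job – a sketch only, assuming the tweet timestamps occupy, say, A2:A984 of the download (a hypothetical layout; twdocs’ column order may differ):

=COUNTIFS(A2:A984,">="&DATE(2016,7,18),A2:A984,"<"&DATE(2016,7,22))

The upper bound of July 22 captures every tweet stamped on the 21st, time of day included; the formula should return the 37 noted above.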

Now in the interests of historical compare-and-contrasting, I applied the same key-word search (whose mechanics are detailed here; again, the percentages denote the fraction of tweets containing the word or phrase) I had conducted in May to the same terms here, more specifically to the 983 post-May 10 tweets. The results in May:

t2

And now:

t3

Of course the citations of erstwhile rivals have all but disappeared from the current distributions, but a few surprises have been sprung upon the latter list, not the least of which perhaps is the halving of mentions of the tweeter himself. I’m not sure how this newfound diffidence is to be explained, and by this most unshrinking of candidates, other than to allow that the press of his nascent campaign has redirected Trump’s fingers to other keys and targets. I would not have predicted the relative boom in references to Bernie Sanders, either, many of which malign his failed campaign and capitulation to Democratic nominee Hillary Clinton.

But of course no surprises attend the steep escalation in tweets aimed at Hillary Clinton, his now-official opponent. Indeed – of the 270 post-convention tweets Trump has filed (remember the screen shot above dates from May 11), the Clinton/Hillary-bearing tweets have moved up to 15.56% and 28.89% respectively, with the Bernie/Sanders splits bouncing to 11.11%/5.56%. Moreover, tweets sporting the name Trump have retrenched again, down now to just 14.07%. One assumes again this is a manner of zero-summing at work; given the choice between self-puffery or the chance to assail his opponent, the latter takes the day. One has to make the most of his 140 characters, after all.

Thus the adjective “crooked”, the modifier Trump dependably pairs with Hillary Clinton’s name (in fact he calls her Crooked once in a while in stand-alone capacity, as if it were her first name), finds its way into 16.28% of all the post-May 10 tweets, with the sobriquet “Crooked Hillary” informing 13.84%.
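For the record, the key-word arithmetic needn’t be fancier than a wildcarded COUNTIF – again a sketch, assuming the tweet text sits in a hypothetical B2:B984:

=COUNTIF(B2:B984,"*crooked*")/983

COUNTIF is case-insensitive, so Crooked and crooked are both swept up; format the result as a percentage and you should land in the neighborhood of the 16.28% cited above.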

In sum, the tweets make for interesting reading, and on a variety of levels; apart from their vituperative cast, they mint the impression of a problem-free campaign on a roll, poised to smash a hapless opponent in November.

You may also want to decide if a measure of revisionism seasons the tweets. If you’re downloading, scan Trump’s tweets of August 1, the day he averred in an ABC television interview that Vladimir Putin is “…not going into Ukraine, OK, just so you understand. He’s not going to go into Ukraine, all right? You can mark it down. You can put it down. You can take it anywhere you want”. It was rather immediately pointed out to Trump that Russian troops have held down parts of the Ukraine for some time, and his tweet replies: “When I said in an interview that Putin is ‘not going into Ukraine, you can mark it down,’ I am saying if I am President. Already in Crimea!” It’s your call, seasoned journalist.

Now for another one of those spreadsheet points that, in the interests of staving off allegations of revisionism of my own, I hadn’t previously understood. When I attempted to filter, or group, tweets for the July 18-21 span during which the Republican convention was convened, I entered these values in the Grouping dialog box:

t4

That seemed like the thing to do, but a click of OK brought about:

t5

That is, the Grouping instructions, phrased Starting at and Ending at, in fact seem merely to identify the first and last dates in the greater grouping scheme, including the less-than and greater-than residual categories. In order to admit the 21st into the actual data mix, then, I needed to enter 7/22/2016 into the Ending at: field, yielding

t6

That’s a pretty quirky take on grouping; but neither Mr. Gates, nor Mr. Trump, take my calls.

The Global Terrorism Database, Part 3: Surmises and Surprises

5 Aug

Among its definitional essentials, of course, is the idea that terrorism is aimed at someone, and/or on occasions somethings; and the Global Terrorism Database’s densely-fielded and sub-fielded data set has a great deal to report on the terrorist toll.

In the interests of first exposition consider three superordinate fields: the 22-category Target Type (targtype1_txt, column AJ), Number Killed (nkill, CW), and Number Wounded (nwound, CZ), all of whose particulars are itemized in the GTD coding book available for download.

Again, the permutations are plenteous, and so only a few can be proposed here. We could start with a breakout of target types by five-year groupings (once again note the absent 1993 data):

Rows: targtype1_txt

Columns: iyear

Values: targtype1_txt (by % of Column Total). Click PivotTable Tools > Grand Totals > On for Rows Only; the column totals must of necessity yield 100%, and as such we don’t need them.

I get:

Smart1

(Remember again that the percentages read downwards.)

I was struck by the decline in Private Citizens and Property targets, that cohort’s percentage having peaked in the 2005-2009 tranche following a steeply-sloped ascending curve. Note as well the downturn in what the GTD calls Business targets – their early prominence a likely pointer to Vietnam-era sorties against corporate sites – and the severe fluctuations in Military targets across the 1970-1984 span, which call for some considered drill-downs into country and/or region. Again, we need to recall that the percentages record intra-tranche distributions; in absolute-numeric terms, the 2010-15 interval (and yes, it’s six years) was by far the most terror-ridden.

Apropos the above conjecture, if we confine the target data to the United States by electing a Slicer (for country_txt) and ticking that country, I get:

Smart2

My guess about Business needs to be rephrased, in view of the leap in such targets in the 2000-2004 tranche. But if we turn off the % of Column Total enhancement (by replacing that selection with No Calculation) and restore the table’s absolute numbers I get:

Smart3

We see here that the pullback in actual event counts across the tranches complicates the analysis. 273 business targets incurred a terrorist act in the 1970-1974 tranche, nearly five times the 58 logged in 2005-2009.

But of course the human toll of terrorism transcends the target counts, and the GTD brings those numbers to light. We could begin by viewing fatality totals by region and tranche:

Rows: region_txt

Columns: iyear

Values: nkill

I get:

Smart4

The monumental spikes in terrorist-inflicted deaths in the Middle East, South Asia, and Sub-Saharan Africa are declared with chilling clarity, along with the striking recession in victims in Western Europe. The aberrant total for North America in the 2000-2004 tranche is a consequence of 9/11, of course.

We could then look at fatalities by country and tranche. Reintroduce country_txt to a Slicer, and in the interests of presentability transport the grouped year data into rows. Summarize the nkill values by average (rounded to two decimals), and bring back nkill into values again, this time by sum. Click Slicer for United States, for example, and I get:

Smart5

Thus even as the US led the world in incidents from 1970-1974 with 931, those attacks were predominantly non-lethal. Again the tragically high total and average for 2000-2004 are explained by 9/11.

The data for the United Kingdom:

Smart6

Apart from the 9/11 cataclysm (which ultimately can’t be ignored, of course), the average attack in the UK, particularly in the earlier tranches, was substantially more deadly than its counterpart in the United States.

For Iraq:

Smart7

The enormous lethality of the country’s terrorist carnage, both in absolute and average terms, is unambiguous, though the drop in the per-incident average for the latest tranche is notable and probably worth investigative scrutiny. Needless to say, clicking through a series of countries should be both sobering and instructive, and can alert us to the wealth of information the GTD has garnered on the terrorist phenomenon, including a multitude of parameters we haven’t explored here – but it’s all there. There’s obviously much to learn here.

Now for a not-terribly-well-known spreadsheet advisory that could, given the formidable size of the GTD data source, greatly slim its file size and perhaps accelerate its processing speeds (and no, I didn’t always know this tip either). When a user inaugurates a pivot table, a hidden copy of its data source, called a cache, installs itself out of view; and it is the cache that the tables query, not the native source that appears before us as a matter of course in the workbook. And that means that the user can actually delete the data source, but continue to query and pivot table its clandestine twin.

I know; the prospect of querying an invisible data set sounds slightly fearsome, but if you’ve been pivot tabling, that’s exactly what you’ve been doing all along anyway. And by deleting the up-front data source, the one you actually see, you may thus end up halving the size of the workbook – and with a data set as imposing as the GTD the savings can be measurable.

There are of course a few downsides to the practice. For one thing, if the data set is active – that is, if your intention is to continue to add records to its complement – you’ll obviously need the data on hand, and you’ll likewise want them in place if you simply want to inspect them. And if you need to cell-reference the data in formulas – as we did last week with the FREQUENCY alternative – I know of no way you can make that happen without the data set out there, duly committed to a standard worksheet tab.

And that also means that, if you have in fact deleted the original data and want them back, you can double-click the Grand Totals cell on any pivot table. Boom – the data return, in an Excel table, no less.

And that sure beats getting them back by closing the workbook without saving it, right?