Team USA Stats, Part 1: Some Data Gymnastics

26 Aug

Now that Super Mario has bored his way back to Tokyo, let us praise Great Britain’s mighty Olympic team, and its world’s best 67 medals, followed by the United States and its 121.

Don’t read that twice – just say hello to the New Math. Here in England, where the media put the mallets to their collective tympani for all-Team GB all the time, one’s ear had to be pressed very close to the radio for news about any athletic glory redounding to anyone else.

But ok. Two weeks of harmless sporting jingoism does wonders for the commonweal, one supposes, and so now I can tell my co-residents here that, glory aside, the United States Olympic team has something the British contingent doesn’t: a spreadsheet about its members, available worldwide here:

http://www.teamusa.org/road-to-rio-2016/team-usa/athletes

Just click the Sortable Roster link.

The workbook’s name invites a question for starters, because, properly structured, any data set should be agreeable to sorting. You’ll also take note of the cell borders sectioning off respective sport (team) rosters, demarcations that no longer demarcate once one takes the workbook up on its offer and actually sorts the data by, say, Last Name or Height. Because the borders will remain exactly where they were drawn – even when the sorts reorder the records – they’ll now be bordering near-random assortments of athletes.

But now to the data. The Team USA site lets us know that 124 of the team’s 558 members, about 22%, are California-born, an impressive disproportion over and above the state’s 12% contribution to the American demographic whole. If we want to break team representation out by all states, then, a pretty straightforward pivot table should be up to that task:

Rows: Birth State

Values: Birth State (count)

Straightforward, but problematic, e.g. this excerpt:

oly1

We’ve seen this before, and now we’re seeing it again. The Olympics may encourage diversity, but promoting disparate spellings of the same state name is grounds for a DQ, at least in this event. Note the pairs of Calif., Colo. and Fla. in the screen shot, a spate of duplications (and there are others in there) induced by superfluous spaces. Note as well the misspelled Cailf.; it seems that full attention hasn’t been paid to the business of getting the data in shape.

But that’s where we come in. First we can sprint over to column R, the free column alongside the SCHOOL/COLLEGE field. The rows in R seem to have been formatted as Text, and so I clicked the R heading and redefined the column in Number terms. Then in R2 I entered, simply:

=TRIM(L2)

And copied it down R, selected and copied those results, and pasted their values atop the L entries. (Having discharged that remit you could then go ahead and delete the contents of R.)

That standard corrective works, so far as it goes, but it won’t respell Cailf. That kind of repair might require a record-by-record edit that could make washing your socks seem exciting by comparison, though I for one would opt for the former activity (and discrepancies notwithstanding, I also get just 113 Californians, 111 if you break the residences out by the Current State field instead. I’m also not really sure what distinguishes Hometown State from either the Birth or Current State identifiers). But if you do need to know about team-member state origins (and non-American birthplaces as well), this kind of work just has to be done. Sorry.
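
If the misspellings prove few and findable, a nested SUBSTITUTE can spare you at least some of that drudgery – a sketch only, folding in the TRIM from above and assuming Cailf. is the lone offender:

=SUBSTITUTE(TRIM(L2),"Cailf.","Calif.")

Copy that down R, paste the values atop L as before, and any remaining oddballs still get the hand treatment.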

And what about athlete weights, a numeric metric that could be productively associated with sport, height, and gender, and perhaps even date of birth? Don’t be disconcerted by the left alignments, but here too we meet up with an issue – namely the more than 50 weights that sport (ok – pun intended) their values in text format, tending to cluster among the Rugby, Golf, and Equestrian members, by the way. But this gremlin is easily sent on its way; sort the field by largest to smallest, thus packing all the text data at the very top of the field. Select the problem data in I2:I56 and click the attendant exclamation-pointed notification:

oly2

Click Convert to Number, and the weights acquire real, measurable poundage (note the weight and height for gold-medal swimmer Ryan Held are missing).

But what about the Height data? The metaphor may grate, but the entries here are squarely interstitial, purporting quantitative information in wholly textual mode. As expressed here, 5’11” is nothing but text; if you want that datum to assume a usably numeric form this recommendation asks you to convey the above height in its cell as 511 instead, and impose a custom format upon it that interposes those apostrophes between the “feet” and “inch” parameters. Either way the entry is really 511, and that value may not work with your aggregating intentions. Another tip would have you enter a height in inches – in our case 71 – and formulaically dice the number into a foot/inch appearance (a sketch follows below), which nevertheless ships the data in text form once again.
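
That second tip might look something like this – a sketch, assuming the inch count occupies a hypothetical cell A2:

=INT(A2/12)&"'"&MOD(A2,12)&""""

For 71 the formula hands back 5'11" – INT supplies the feet, MOD the leftover inches, and the quadrupled quote mark tacks on a literal inch symbol – but the result, as advertised, is text.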

In any case, we need to deal with the data as we have them, and I’d allow the simplest intention is to get these labels into numeric mode, i.e. inch readings. In that connection, I’d return to column R, title it Height in Inches or some such, and enter in R2:

=VALUE(LEFT(H2,1))*12+VALUE(MID(H2,3,LEN(H2)-3))

To translate: the formula commences its work by detaching the first character in H2 – a 5 (I’m working with the default arraying of athlete records here, the first of which posts a height of 5’11”) – and ascribing a numeric value to it via VALUE, supported by the given that all foot-heights should comprise one digit. That result is next multiplied by 12, yielding 60 inches thus far. I then isolate the 11 in 5’11” by applying a MID function to the task. The LEN(H2)-3 argument that registers the number of characters MID is to extract from the entry in H2 reflects the fact that any entry in the H column should consist of either 4 or 5 characters, e.g., 5’11” or 5’6”. Subtract 3 from either count and you come away with either 1 or 2 – the number of characters MID needs to pull from the entry in order to capture its inch value. Thus in our case we can add 60 and 11, culminating in 71 inches for the archer Brady Ellison. Copy the formula down R and eliminate the decimals, and our heights should be ready for the next round of analytical moves.

Almost. It seems my post-copy vetting of the height-in-inches data in R reports more than a dozen #VALUE! notifications – because some of the heights in the H column look like gymnast Kiana Eide’s 5’3, or indoor volleyballer Thomas Jaeschke’s 6-6. Neither reveals an inches punctuation, and Jaeschke’s height buys into a different notation altogether; my formula can’t handle those discrepancies.

So it’s time for a Plan B. First run this find-and-replace on the heights in H:

oly3

(That is, replace the inch quotes with nothing.) That pre-formulaic fix should eliminate all the inch punctuations, directly exposing the inch numbers to the right of the cell. Then in R2 write:

=VALUE(LEFT(H2,1)*12+VALUE(RIGHT(H2,LEN(H2)-2)))

What’s changed here is the latter half of the expression, which now splits 1 or 2 inch characters from the right of the cell, depending on the single or two-character length of the inch totals. Copy this one down R and we should be in business.

Not. Two utterly obstinate athletes, field hockey aspirant Jill Witmer and soccer teammate Lindsey Horan, feature a single apostrophe beside their inch figure, a minuscule disparity that defeats my best efforts at a global formula rewrite – along with the data-less Ryan Held. Here discretion trumps valor – I’d just delete the incorrigible apostrophes and Held’s #VALUE! message, and take it from there. Now I have real heights.
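
Had I preferred flagging the holdouts to hand-editing them, an IFERROR wrapper would at least have kept the copy-down intact – a sketch:

=IFERROR(VALUE(LEFT(H2,1))*12+VALUE(RIGHT(H2,LEN(H2)-2)),"check me")

Any entry the formula can't digest earns a "check me" instead of a #VALUE!, and a quick filter on that label rounds up the stragglers.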

Ms. Witmer – or whoever entered her data – sure is playing hockey with my fields.

The College Transcript: Downloading an Upgrade

19 Aug

It’s homely, to be sure, but if you want to go somewhere you gotta have one. And no – I’m not talking about your passport photo but your college transcript, an obstinately prosaic but indispensable means of entrée to your next academic or career step.

The transcript – an enumeration of a student’s courses and performances gathering into what we Yanks call the Grade Point Average (GPA) – has undergone a rethink of late. A piece in insidehighered.com this past February trained its lens on a number of initiatives aiming to drill qualitative depth into the transcript’s tale, sinking some analytic teeth into its default, alphabetically-driven narrative by linking its grades to students’ work and detailed progress toward a degree.

And that got me to thinking: if it’s depth we’re seeking, why not endeavour to learn something more from the numbers and the letters by re-presenting the transcript as a…spreadsheet?

It makes perfect sense to me, though you’d expect me to say that. But after all: submit a transcript to some standard tweaks and you wind up with a dataset, one suitable for sorting, pivot tabling, formulaic manipulation, and charting, too. And once the tweaking stops, the transcript can treat its readers to a round of different, edifying looks at the data – and that’s what I call depth, too.

Transcripts aren’t things of beauty, but they sport no small amount of variation just the same. So to understand what we’re working with, more or less, take a look at this one – the transcript of Mr. Ron Nelson, who made his academic record available here:

trans1.jpg

In the interests of exposition, I’ve subjected the baseline transcript above to a round of fictional retakes that of course don’t represent Mr. Nelson’s actual attainments (for one thing, his record dates back nearly 25 years). A few of the points that call for renovative scrutiny, then: First, note the blank column coming between the Course No. and Course Title columns, an excess which must, for spreadsheet purposes, be curtailed. Second, the multi-columned iterations of Course Titles and associated grades need to be cinched into a single field. Third, the academic term headings (e.g. Spring Semester 1991), and the TERM TOTALS and CUMULATIVE TOTALS lines, have to be sent elsewhere; they report information that simply isn’t of a piece with the grade/grade point records that the dataset should comprise.

Next, if you’re new to the GPA you need to know how that defining metric is figured. While of course variation again abounds, the textbook illustration looks something like this: Class grades are typically assigned along an A-to-D continuum along with a failing F, in what are in effect quantified decrements of a third of a point, e.g., A, A-, B+, B, etc. In the typical system an A earns 4 points, an A- 3.67, a B+ 3.33, and so on (the D- grade is rarely offered, by the way). An F naturally awards no points.

Each grade-point achievement is in turn multiplied by the number of credits any given course grants, resulting in what are usually called quality points. Thus a B- grade in a three-credit class yields 8.01 quality points – 2.67 times 3. An A in a four-credit course evaluates to 16 quality points, or 4 times 4. The GPA, then, divides the sum of quality points by the sum of credits achieved. Thus this set of grades:

trans2

Works out to a GPA of 2.92.

It’s pretty intelligible, but with a proviso. The GPA must incorporate the number of credits associated with a failing grade into its denominator, and so these performances:

trans3

Calculate to a GPA of 2.33. But the 15 credits recorded above really only bestow 12 usable credits upon the student, and that dual count needs to be remembered.
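
If you want to see that arithmetic in spreadsheet terms, a minimal sketch – assuming the grade points have been parked in A2:A6 and their credits in B2:B6 – would read:

=SUMPRODUCT(A2:A6,B2:B6)/SUM(B2:B6)

Because an F contributes zero points but its credits still inhabit the SUM in the denominator, the dual count takes care of itself.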

With that extended preamble noted, my spreadsheet-engineered transcript demo (spreadscript?) presents itself for your consideration here:

Transcript demo

In excerpt, the sheet looks like this:

trans4

Note the paired Term and Date columns; though one might be moved to declare the former field superfluous, it seems to me that its textual Spr/Fall entries could enable a pivot table broken out by seasonality, i.e., GPAs by all Spring and Fall courses across the student’s academic career. The Date field, on the other hand, is duly numeric, thus lending itself to chronological resorting should the current sequence of records be ordered by some other field. And the grades have been visually differentiated via conditional formats.

The Credits total in the upper right of the screen shot reflects a necessary bypassing of the F grade for Music 101 per our earlier discussion (the grades are stored in the F column), and is realized by this formula:

=SUMIF(F:F,"<>F",G:G)

The SUMIF here is instructed to ignore any F in the F column via the “not” operator bound to the formula’s criterion. Note the quotes required by SUMIF for operators clarifying the criterion. The GPA, on the other hand, divides the quality point total by all 112 credits (you will have noted that the spreadsheet outputs the quality points in H via a lookup array range-named gr in Q1:R10. And in the interests of simplicity I’ve let subsidiary course events and their codes, e.g., class withdrawals and incompletes, go unattended).
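
And the quality-point lookup itself presumably runs along these lines – a sketch, assuming the grade sits in F2, the credits in G2, and gr pairs each grade with its point value:

=VLOOKUP(F2,gr,2,FALSE)*G2

The FALSE insists on an exact grade match, and the multiplication by G2 delivers the quality points.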

Now the data become amenable to pivot tabling and other assessments. For example, if we want to break out GPAs by term we can try:

Rows: Date (You’ll want to ungroup these, if you’re working in release 2016)

Values: Hours/Credits (Sum)

Quality Points (Sum, rounded to two decimals)

Because we need to total the date-specific quality points and divide these by the respective-date credit totals, a calculated field must be implemented, e.g.

trans5
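
For the record, the entry in the dialog’s Formula box amounts to something like this – a sketch, noting that Excel wraps field names containing spaces or slashes in single quotes:

='Quality Points'/'Hours/Credits'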

Click OK, again round off to two decimals, and you should see:

trans6

Once the GPA field is put in place you can, for example, break out credit accumulations by Discipline, or subject, by replacing Date with Discipline:

trans7

Or try a frequency analysis of credit totals by grade:

Row: Grade

Values: Hours/Credits (Sum)

trans8

(Note: because of the priorities with which Excel sorts text characters, grades accompanied by the + symbol initially appear at the bottom of any letter sort, e.g., you’ll initially see B, B-, and B+. You’ll need to right-click the B+ and select Move > Move “B+” up twice. And of course the same adjustment should be applied to C+.)

Of course these outcomes could be charted, e.g.

trans9

And if you are interested in seasonality:

Rows: Term

Values: Hours/Credits

Quality Points (Both Sum, and both rounded to two decimals)

GPA

trans10

(By the way, you’re not duty-bound to earmark Hours/Credits and Quality Points for the table if you want to display GPA at the same time. Once constructed, GPA becomes available in its own right, and need not be accompanied by its contributory fields.)
And all these and other reads on the data could be assigned to a dashboard, too.

Thus the transcript-as-spreadsheet could break new presentational ground, supplementing the row-by-row recitation of subjects and scores that students and recipient institutions currently face, with a suppler way around the data. They could even be made unofficially available to students themselves via download, empowering the spreadsheet-savvy among them to map and understand their grades in novel ways (one trusts that no one’s accepting a transcript bearing a student’s email address).

But is it a thing of beauty? Maybe not, but don’t you like the colors?

Notes on a Continuing Saga: More Trump Tweets

12 Aug

It has been a tumultuous three months for Donald Trump, his party, and his country, and probably not in that order. His nomination at the Republican convention, his wife’s sounds-familiar speech there (truth be told, with two lifted passages in toto her rhetorical trespasses probably wouldn’t even get her thrown out of school), and his barrage of subsequent, incendiary pronouncements, have made for an interesting campaign, no?

I’m counting those three months’ worth of controversy here, because we last paid a visit to Mr. Trump’s tweet account about that long ago, and you doubtless want to know what communicative mischief he’s been up to in the interim.

So I retraced my steps back to the trusty twdocs.com site for yet one more take-out order of Trump’s latest dispatches from the hustings – and he has been dispatching, to be sure. (Note: not knowing twdocs’ distribution policy on its downloads, I have again not made the workbook available here. If you can filch $7.80 from petty cash you’re in business, though. Note in addition there may be some issues with opening the downloads in Excel 2016. Contact twdocs if events warrant.)

Since May 10, the date of his final tweet considered in my May 12 post, the man who put the candid in candidate has pumped out an additional 983 tweets, broken out thusly:

t1

While his output was never neatly curved, Trump’s tweet numbers have been patently tempered of late, notably down from his January-February distributions of 481, 471, and 418. One might be moved to explain the July spike with a guess about a tweet frenzy stoked by the Republican nominating convention July 18-21, but the nominee signed off on 37 tweets in the course of that four-day event – in keeping with the remainder of his July activity, though his 18 tweets on the 21st do jostle the average.

Now in the interests of historical compare-and-contrasting, I applied the same key-word search (whose mechanics are detailed here; again, the percentages denote the fraction of tweets containing the word or phrase) that I had conducted in May to the same terms, more specifically to the 983 post-May 10 tweets. The results in May:

t2

And now:

t3

Of course the citations of erstwhile rivals have all but disappeared from the current distributions, but a few surprises have been sprung upon the latter list, not the least of which perhaps is the halving of mentions of the tweeter himself. I’m not sure how this newfound diffidence is to be explained, and by this most unshrinking of candidates, other than to allow that the press of his nascent campaign has redirected Trump’s fingers to other keys and targets. I would not have predicted the relative boom in references to Bernie Sanders, either, many of which malign his failed campaign and capitulation to Democratic nominee Hillary Clinton.

But of course no surprises attend the steep escalation in tweets aimed at Hillary Clinton, his now-official opponent. Indeed – of the 270 post-convention tweets Trump has filed (remember the screen shot above dates from May 11), the Clinton/Hillary-bearing tweets have moved up to 15.56% and 28.89% respectively, with the Bernie/Sanders splits bouncing to 11.11%/5.56%. Moreover, tweets sporting the name Trump have retrenched again, down now to just 14.07%. One assumes again this is a manner of zero-summing at work; given the choice between self-puffery or the chance to assail his opponent, the latter takes the day. One has to make the most of his 140 characters, after all.

Thus the adjective “crooked”, the modifier Trump dependably pairs with Hillary Clinton’s name (in fact he calls her Crooked once in a while in stand-alone capacity, as if it’s her first name), finds its way into 16.28% of all the post-May 10 tweets, with the sobriquet “Crooked Hillary” informing 13.84%.
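
For those keeping score, proportions like these fall out of a wildcard COUNTIF – a sketch, assuming the tweet texts have been range-named tweets:

=COUNTIF(tweets,"*crooked*")/COUNTA(tweets)

COUNTIF is case-blind, so Crooked and crooked alike make the cut.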

In sum, the tweets make for interesting reading, and on a variety of levels; apart from their vituperative cast, they mint the impression of a problem-free campaign on a roll, poised to smash a hapless opponent in November.

You may also want to decide if a measure of revisionism seasons the tweets. If you’re downloading, scan Trump’s tweets of August 1, the day he averred in an ABC television interview that Vladimir Putin is “…not going into Ukraine, OK, just so you understand. He’s not going to go into Ukraine, all right? You can mark it down. You can put it down. You can take it anywhere you want”. It was rather immediately pointed out to Trump that Russian troops have held down parts of the Ukraine for some time, and his tweet replies: “When I said in an interview that Putin is ‘not going into Ukraine, you can mark it down,’ I am saying if I am President. Already in Crimea!” It’s your call, seasoned journalist.

Now for another one of those spreadsheet points that, in the interests of staving off allegations of revisionism of my own, I hadn’t previously understood. When I attempted to filter, or group, tweets for the July 18-21 span during which the Republican convention was convened, I entered these values in the Grouping dialog box:

t4

That seemed like the thing to do, but a click of OK brought about:

t5

That is, the Grouping instructions, phrased Starting and Ending at, in fact seem to merely identify the first and last dates in the greater grouping scheme, including the less-than and greater-than residual categories. In order to admit the 21st into the actual data mix, then, I needed to enter 7/22/2016 into the Ending at: field, yielding

t6
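
As a cross-check on the grouped counts, a COUNTIFS can tally the convention-week tweets directly – a sketch, assuming the tweet dates hold down column A, and note the sub-midnight endpoint, which mirrors the Ending at quirk above:

=COUNTIFS(A:A,">="&DATE(2016,7,18),A:A,"<"&DATE(2016,7,22))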

That’s a pretty quirky take on grouping; but neither Mr. Gates, nor Mr. Trump, take my calls.

The Global Terrorism Database, Part 3: Surmises and Surprises

5 Aug

Among its definitional essentials, of course, is the idea that terrorism is aimed at someone, and/or on occasions somethings; and the Global Terrorism Database’s densely-fielded and sub-fielded data set has a great deal to report on the terrorist toll.

In the interests of first exposition consider three superordinate fields: the 22-category Target Type (targtype1_txt, column AJ), Number Killed (nkill, CW), and Number Wounded (nwound, CZ), all of whose particulars are itemized in the GTD coding book available for download.

Again, the permutations are plenteous, and so only a few can be proposed here. We could start with a breakout of target types by five-year groupings (once again note the absent 1993 data):

Rows: targtype1_txt

Columns: iyear

Values: targtype1_txt (by % of Column Total). Click PivotTable Tools > Grand Totals > On for Rows Only; the column totals must of necessity yield 100%, and as such we don’t need them.

I get:

Smart1

(Remember again that the percentages read downwards.)

I was struck by the decline in Private Citizens and Property targets, that cohort’s percentage having peaked in the 2005-2009 tranche, following a steeply-sloped ascending curve. Note as well the downturn in what the GTD calls Business targets, a likely pointer to Vietnam-era sorties against corporate sites, and the severe fluctuations in Military targets across the 1970-1984 span call for some considered drill-downs into country and/or region. Again, we need to recall that the percentages record intra-tranche distributions; in absolute-numeric terms, the 2010-15 interval (and yes, it’s six years) was by far the most terror-ridden.

Apropos the above conjecture, if we confine the target data to the United States by electing a Slicer (for country_txt) and ticking that country, I get:

Smart2

My guess about Business needs to be rephrased, in view of the leap in such targets in the 2000-2004 tranche. But if we turn off the % of Column Total enhancement (by replacing that selection with No Calculation) and restore the table’s absolute numbers I get:

Smart3

We see here that the pullback in actual event counts across the tranches complicates the analysis. A total of 273 business targets incurred a terrorist act in the 1970-1974 tranche, nearly five times the 58 counted between 2004 and 2009.

But of course the human toll of terrorism transcends the target counts, and the GTD brings those numbers to light. We could begin by viewing fatality totals by region and tranche:

Rows: region_txt

Columns: iyear

Values: nkill

I get:

Smart4

The monumental spikes in terrorist-inflicted deaths in the Middle East, South Asia, and Sub-Saharan Africa are declared with chilling clarity, along with the striking recession in victims in Western Europe. The aberrant total for North America in the 2000-2004 tranche is a consequence of 9/11, of course.

We could then look at fatalities by country and tranche. Reintroduce country_txt to a Slicer, and in the interests of presentability transport the grouped year data into rows. Summarize the nkill values by average (rounded to two decimals), and bring back nkill into values again, this time by sum. Click Slicer for United States, for example, and I get:

Smart5

Thus even as the US led the world in incidents from 1970-1974 with 931, those attacks were predominantly non-lethal. Again the tragically high total and average for 2000-2004 are explained by 9/11.

The data for the United Kingdom:

Smart6

Apart from the 9/11 cataclysm (which ultimately can’t be ignored, of course), attacks in the UK, particularly in the earlier tranches, were on average substantially more deadly than those in the United States.

For Iraq:

Smart7

The enormous lethality of the country’s terrorist carnage, both in absolute and average terms, is unambiguous, though the drop in the per-incident average for the latest tranche is notable and probably worth investigative scrutiny. Needless to say, clicking through a series of countries should be both sobering and instructive, and can alert us to the wealth of information the GTD has garnered on the terrorist phenomenon, including a multitude of parameters we haven’t explored here – but it’s all there, and there’s obviously much to learn.

Now for a not-terribly-well-known spreadsheet advisory that could, given the formidable size of the GTD data source, greatly slim its file size and perhaps accelerate its processing speeds (and no, I didn’t always know this tip either). When a user inaugurates a pivot table, a hidden copy of its data source, called a cache, installs itself out of view; and it is the cache that the tables query, not the native source that appears before us as a matter of course in the workbook. And that means that the user can actually delete the data source, but continue to query and pivot table its clandestine twin.

I know; the prospect of querying an invisible data set sounds slightly fearsome, but if you’ve been pivot tabling that’s exactly what you’ve been doing all along anyway. And by deleting the up-front data source, the one you actually see, you may thus end up halving the size of the workbook – and with a data set as imposing as the GTD the savings can be measurable.

There are of course a few downsides to the practice. For one thing, if the data set is active – that is, if your intention is to continue to add records to its complement – you’ll obviously need the data on hand, and you’ll likewise want them in place if you simply want to inspect them. And if you need to cell-reference the data in formulas – as we did last week with the FREQUENCY alternative – I know of no way you can make that happen without the data set out there, duly committed to a standard worksheet tab.

And that also means that, if you have in fact deleted the original data and want them back, you can double-click the Grand Totals cell on any pivot table. Boom – the data return, in an Excel table, no less.
And that sure beats getting them back by closing the workbook without saving it, right?

The Global Terrorism Database, Part 2: Surmises and Surprises

28 Jul

Last week’s post on the fecund Global Terrorism Database wound down with a selective look at some countries toward which popular scrutiny might be expected to be drawn, e.g., Iraq, Ireland, and Israel. But a more extensive set of drill-downs across all the data might impart some educative breadth and depth to an understanding of the terrorism phenomenon – or phenomena.
For example, by isolating five-year tranches of incident frequency we should be able to learn something about the tectonic movement of terrorist activity. We could look to this pivot table:

Rows: country_txt

Columns: iyear (grouped by units of five years)

Values: country_txt (in effect we’re really interested in the number of times a country’s name appears in the dataset; each mention of course identifies an incident).
We could then filter by tranche, e.g.:

start1

And for presentation purposes we could right-click anywhere among the Rows data and click in turn Filter > Top 10, calling up the default top 10 items (you’ll still need to sort these results though, here Largest to Smallest. Top 10 doesn’t sort automatically, though it should).

For 1970-74, then, I get (remember to sort):

start2

I suspect you’re surprised. Would you have imagined the United States the world’s most incident-ridden country during the above years, and the United Kingdom in second position – and with totals far greater than any other country?

But then filter for 1975-79 (again remember to sort):

start3

The picture is sharply redrawn. We see Italy experiencing a more than twenty-fold incident spike in these five later years, and with Spain incurring ten times as many acts as in the earlier tranche. The United Kingdom remains the second most terrorized country, and the aggregate, all-nation incident total of 7,169 overwhelms the 1970-74 count of 2,668.

The outcomes take a decidedly Latin American turn in the following tranche:

start4

Vast national changes roil this table, along with the disappearance of the United States from the top 10. Proceed with the five-year breakouts and the incident distributions swing manifestly, and here not surprisingly, to the Middle East (remember though that the absent 1993 data depress the 1990-94 counts artificially).

Now if you want to learn to whom the GTD imputes responsibility for the acts, while continuing to image the results by year and country, I’d dislocate country_txt from Rows and restore it to a Slicer. I’d next introduce gname (comprising identities of apparent perpetrators) into Rows (but you can leave Count of iyear in place in the Values area; each gname entry is associated with a year, and so counting gnames communicates nothing we can’t learn from counting iyears, which are posted to every entry).

Click Italy in the slicer for the 1975-79 tranche and observe an extraordinarily diffuse 120 groups implicated in at least one act, in addition to a perhaps nearly-as-startling 470 acts – 48% of all – ascribable to unknown parties, itself a researchable theme, it seems to me. Tick United Kingdom for the same period and discover that 541 of its 866 incidents, or 62.47%, are laid to the Irish Republican Army; the country’s Unknown proportion, however, amounts to a slight 1.73%. Filter for 1980-84 and tick the Slicer for El Salvador, the modal country for that parcel of time with 2,555 events, and again behold the enormous Unknown attribution, accounting for 49.08% of all incidents. Not far behind are the 1,166 attacks launched by the Farabundo Marti National Liberation Front, or FMLN, a pastiche of left-wing insurgencies that was legalized in 1992. Indeed, its incident total swelled to 1,485 in the 1985-89 tranche, then fell back to 678 in 1990-94 and recorded no acts in the 1995-1999 phase. In fact, no terrorist records appear for El Salvador after 1997.

Apropos the above, the GTD’s Region field lets us drill up to findings such as this one, for starters:

Rows: Region

Values: Region

Region (again, by % of Column Total)

I get:

start5

But those are coarse 45-year aggregates. Add iyear to Columns and subtract the first region column data from Values for neatness’ sake. Grouping again by five-year bundles I get:

start6

The shifts are massive and evident (and remember the percentages read downward). The early, relative vulnerabilities of North America and Western Europe have been supplanted by the latter-day terrorist effusions in South Asia and the Middle East, with South America’s peak in activity in the 1980-1994 swath shrinking remarkably, along with comparable abatements in North America and Western Europe. Anyone seeking a cogent charting of the political winds gusting across the globe in the past 45 years would do well to start here.

Now for a technical consideration. I’ve alluded a number of times to the missing 1993 data which, among other things, obviously undercount the 1990-94 values (and even so, the Middle East totals for the four available years in that demarcation more than double the 1985-89 figure). One means for adjusting for the shortfall is to recalibrate the years into effective five-year partitions, e.g. 1980-84, 1985-89, 1990-95, 1996-2000, etc., an intention which should also succeed in trimming the curious, six-year 2010-15 interval back to five years; but because a given pivot table can’t (so far as I know) group values irregularly – that is, by variable intervals – we need to crack the wrapper open for Plan B, one driven by the FREQUENCY array formula, and that should allow us to break out incident totals both by years and country.

It works something like this: enter the first free column in the dataset sheet (it should be EH) and name it ccount (for country count) or something like it. Slip in a new worksheet and enter this range of year intervals down a column (I’m working in G6:G15):

start7

Note the six years distancing 1995 from 1989, again a compensation of sorts for the missing 1993 data. Then enter a country name in I7 in the new sheet (I’ve started with the United Kingdom), name that mono-cellular range ctry, and enter back in the source data sheet, in EH2:

=IF(I2=ctry,B2,"")

That formula tests the country names in the I-column, country_txt field for a match with the name you’ve entered in the ctry cell. If the match avails, the incident’s year in B is posted; otherwise, nothing, as it were, is emplaced. Copy the formula all the way down EH and name the just-formed range ccount. Then return to the new sheet and select H6:H15, the cells alongside the year references. With the selection in force enter in H6:

=FREQUENCY(ccount,G6:G15)

And conclude the process with the trademark Ctrl-Shift-Enter triad befitting array formulas. The grouped incident totals for the United Kingdom should write themselves down H6:H15, commensurate with the years G6:G15. Again, note that FREQUENCY can support intervals of fluctuating width, thus enabling the counting of the five data-available years between 1990 and 1995, with each year in the G column standing for the maximum allowable year in each bin. Enter any other country in I7, and its totals should likewise break out in the respective bins.

And there’s still more to say about the GTD.

The Global Terrorism Database, Part 1: Surmises and Surprises

20 Jul

Anyone with an investigative concern with terrorism – and that complement is regrettably large – would do well to point themselves at the Global Terrorism Database (GTD), the self-described “most comprehensive unclassified data base on terrorist events in the world” tended by the University of Maryland’s National Consortium for the Study of Terrorism and Responses to Terrorism, or START. Dating from 1970 and comprising over 156,000 records of incidents world-wide, the database in spreadsheet form is available to interested parties for download, provided the requestor complete a simple requisition form.

The database, then, is vast (75MB) and instructive, though you’ll need to read the very real qualifying cautions in the GTD FAQ page, particularly accountings of the missing data for 1993, shifts in incident counting methodologies beginning in 1998, and the Database’s definitions of terrorism. You’ll also probably need to download the GTD Codebook for a resume of the database’s 111 fields.

But while not faultless, you’ll want to consult this resource, even as some questions about its spreadsheet incarnation remain to be asked as you work with its holdings. For example, note the zeros in the imonth and iday fields (columns C and D) that proxy for unknown incident months and days. However only 23 month cells here answer to the “unknown” description, while a more measurable 894 days are currently lost to the compilers. These relatively small lacunae will necessarily impinge on event breakouts by chronology, of course, but again, their compromising effect is small, calculating to about .6% of all records.
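
Those tallies, by the way, come straight out of COUNTIF – assuming, per the above, that imonth and iday keep to columns C and D:

=COUNTIF(C:C,0)

=COUNTIF(D:D,0)

The first should return the 23 unknown months, the second the 894 unknown days.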

Less clear are the entries informing the approxdate field in E; the codebook (page 12) allows that approxdate archives those incident dates whose ascertainment is unclear, but in most of the approxdate cases the same information already populates the contiguous iyear, imonth, and iday fields. Moreover, the approxdate data are textual, and as such aren’t immediately ready for analytic use without some formulaic backs and forths.

I’m also not sure what the country and region codes in H and J respectively bring to the data, in view of the actual country and region names that sidle these fields. But that’s a quibble; if the fields prove unneeded they can simply be ignored, after all.

But once you’ve oriented yourself to the database-cum-workbook, some obvious but most interesting pivot tables beg our consideration, for example, terrorist incidents by country and year:

Row: country_txt

Columns: iyear

Values: iyear (Count, Show Values as % of Row Total)

(For presentation’s sake you may want to beat back the fusillade of zeros invading the above table cells. Try File > Options > and click off the “Show a zero in cells that have zero value” box beneath the Display options for this workbook heading.)

Now if you bring a 2016 sensibility to the data you’ll be surprised, at least some of the time. Terrorist acts in the United States peaked in 1970, with 17.38% of all acts – 468 in absolute numeric terms – across the 46-year span ending in 2015 having been perpetrated in that year. Double-click that 17.38% and a new table springs from that figure, and onto a new sheet. Pivot table those 1970-specific data:

Rows: provstate (bears US state names)

Values: provstate (Count)

And you discover that 98 terrorist acts – as defined here – were incurred by California, with another 86 having been inflicted in New York. Substitute gname (g for group, presumably, in column BG) for provstate in Rows and find 108 acts mounted by generic Left-Wing Militants, 69 undertaken by what the GTD calls Black Nationalists, 54 wrought by Student Radicals, and so on. Compare that outpouring with the grand total of 38 incidents tracked in the US last year – that sum lifting a curve that has pulled upward since 2011, to be sure, but certainly far flatter than the Vietnam-era total registered 46 years ago (the implication, then, by the way, is that the immediately pre-1970 counts were perhaps comparably numerous, but the GTD doesn’t have those data. On the other hand, 1970 seems to stand as a towering apex among the years, even the ones that immediately follow it).

Subjected to a simple pivot table line chart, enabled by referring the country_txt field to a Slicer within which one can isolate one country at a time, and stationing iyear in Rows, the American (United States) incident totals read thusly (recall the missing 1993 data, though):

GTD1

The charts for some other countries not unacquainted with terrorist attacks:

GTD2

There I would have supposed the spikes would have bolted upwards earlier, during the throes of the Troubles.

GTD3

Again, I would have surmised an earlier peak, but note the welcome trough in acts in 2015.

GTD4

More expectable, perhaps, though note the 1975 year-inception point and the remarkably sparse incident count through 2003.

Note at the same time, however, that the charts reckon percentage distributions of incidents, and don’t size up absolute incident counts for the countries. Those read as follows:

US – 2693

Ireland – 274

Israel – 2085

Iraq – 18770

Iraq’s enormous total is thus very much of recent vintage.
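
Totals like these can be pulled with a quick COUNTIF aimed at the country_txt field – a sketch, assuming that field holds down column I:

=COUNTIF(I:I,"Iraq")

Swap in any other country name and its career count reports back.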

Of course any surprise at any of the charted histories above may have more than a little to do with one’s definitional investments in the term terrorism. It may be easy not to liken the Weather Underground and its 45 attacks from 1970 through 1975 to the latter-day insurgencies of Al Qaeda (1,854 attacks under variant names), but that aversion needs to be thought through, and the GTD did – though of course you’re welcome to contest and/or refine the understandings.

For a worldwide incident count, click the Clear Filter icon on the Slicer, in effect selecting all countries simultaneously:

GTD5

The chart result:

GTD6

10.74% of all the recorded attacks were conducted in 2014, with a small but appreciable decrement to 9.41% characterizing the following year. Still, an extraordinary 39.6% of all attacks crowded the six years spanning 2010 through 2015 (note that by imposing a five-year grouping interval on the iyear data, the pivot table nevertheless casts the most recent data into a six-year tranche, 2010-2015).

In any case, with 111 fields and in spite of its lengthy blocks of blank cells, the GTD’s massive trove of data has a lot of permutations in there.

Drug Overdoses: The Data From Connecticut

11 Jul

The broadened availability of data is potently arming the multi-theater war on drug abuse in America. Open-data concern Socrata – the firm whose interface front-ends a great many US governmental open data sites – describes its role in the opioid epidemic here, for example. And the data.gov site has called up the sobering Accidental Drug Related Deaths 2012-2015 site from data.ct.gov, the Connecticut open data venue. You can download it here. Click the CSV link for Accidental Drug Related Deaths 2012-2015.

Without substantiating the accidental, as opposed to the willed, character of the fatalities (that information is presumably sourced elsewhere), the Connecticut workbook details instances of lethal overdosing in the state for precisely four years, January 1, 2012 to December 31, 2015; and once some basic spreadsheet rough spots are smoothed, the data have much to reveal.

First, of course, the Date field need be auto-fit, as do a few other columns. Second, I cannot explain the intermittency of the CT references in the Death State field; on the other hand, since all the data emanate from Connecticut in any event the field can be overlooked or deleted.

But a somewhat more provocative complication besets the CaseNumber field in column A. It is clear that the cells’ two-digit prefixes insinuate the incident year; but because the final four digits of some of the entries – e.g. the one in A3 – happen to resemble standard year references they are treated as such (particularly for 2012 incidents), as per one of Excel’s standard date formats. Had the field been prospectively formatted into text mode the ambiguity would have been forestalled; but faced with a mixed-format fait accompli the journalist will need to decide first of all if the field will make an analytical difference, and it probably won’t. Reliable and usable date data feature alongside the problem codes in column B, after all; but if you do insist on some unity of appearances in A perhaps the simplest tack would be to select column A, format the field in Short Date terms, and left-align the entries; thus the actual date in A3 will now read 12/1/1989 – still a date, but one that seems more-or-less of a piece with the other, labelled codes, which of course remain impervious to the new format. They’re labels.

You’ll also want to look closely at the Residence City field. The city name Waterbruy in F133 is almost surely Waterbury, and so needs to be emended; go ahead and run all the names through this single-field pivot table:

Row Labels: Residence City

And you’ll find No Haven, likely a truncated North Haven, and other suspect localities: Port Chester and Portchester, Stafford Spgs and Stafford Springs, West Haven and West Haven (the former revealing two interpolated spaces), and possibly a few others. If it’s your intention to break out the data by Residence City, then, these discrepancies must be researched, and repaired if necessary.
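
The space-stricken duplicates, at least, will yield to the same TRIM corrective we ran at the Olympic roster – a sketch, assuming Residence City keeps to column F and a free column stands by:

=TRIM(F2)

Copy it down, paste the values atop F, and the misspellings alone remain for hand repair.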

With that caution in mind we can nevertheless proceed to some meaningful pivot tabling, beginning with perhaps the most obvious:

Row Labels: Date (group by Year, if necessary; remember that the 2016 version of Excel will execute that grouping automatically, though you’ll then need to filter out the <1/1/2012 and >12/31/2015 lines):

Values: Date (Count)

Date (again, by Show Values as % of Column Total).

I get

d1

The overdose toll is grimly ascendant, with the 2015 totals more than doubling the 2012 figure.

Ethnicity, here identified as Race, would likewise compel the investigator:

Row Labels: Race

Values: Race (Count)

Race (again, % of Column Total)

I get:

d2

If nothing else, the figures commit instructive violence to “minority” stereotypes about heavy drug use. As per the 2010 census Connecticut’s African-American population stood at 10.34%, even as overdose deaths across the four-year period among that group register 7.30%. And while estimates for 2012 put the state’s Hispanic/Latino representation at 14.2%, the merged Hispanic figures here – 10.64% in all – fall well beneath that proportion.

Other, standard scrutinies are available as well, for example age, grouped here in five-year intervals:

Row Labels: Age (grouped by five years)

Values: Age (Count)

Age (again, by % of Column Total)

(note two records lack age data. These can be filtered from the table without impairing the grouping.)

I get

d3

I was surprised by the 44-48 and 49-53 modal intervals, my having assumed that the overdoses would have clustered further down the age range. But that’s why we look at the data.

Gender by year might also prove informative:

Row Labels: Date (again grouped by year and appropriately filtered)

Column Labels: Sex (the two blanks can likewise be filtered)

Values: Sex (Count, as the data are textual)

I get:

d4

Refigure these as % of Row Total and withhold Grand Totals for rows:

d5

No striking inter-year gender differential obtains, though of course we see that overdose deaths befall an overwhelmingly male population.

For location of incident:

Row Labels: Location

Values: Location (Count)

Location (% of Column total)

d6

(Note the 17 blank records were filtered.)

As the dataset reports apparent accidental overdoses, the fact that over half of the deaths occurred in a residence, though not necessarily that of the victim, is unsurprising.

For a final consideration here – though plenty of other permutations are there across the data – it occurred to me that some relation between overdoses and day of the week might obtain. If that inquiry is redolent of a non sequitur, so be it – why would anyone impute a relation between the two, you may wonder? Still, I thought a look might be worth the taking. And it was.

Because the incident dates command column B, the standard means for developing the answer would be to head into the first free column abutting the data – column AF – and enter, starting in AF2:

=WEEKDAY(B2)

After replicating that formula down the column, you’d typically move to learn the distribution of overdose events by day via a series of COUNTIFs, or a pivot table tally; but a more parsimonious approach, one that would look past that column replete with 2100 WEEKDAY formulas, comprises seven array formulas instead.

Those formulas would ask you first to stream numbers 1 through 7 down a column – let’s say in cells AH5:AH11. I’d then write in AI5:

=SUM(IF(WEEKDAY(B$2:B$2129)=AH5,1,0))

And because the above is an array formula the entry process concludes with the storied Ctrl-Shift-Enter triumvirate. I’d then copy the formula down through AI11.

What the formula is doing is assessing each cell in the B2:B2129 range – the dates – for its weekday number (Sunday defaults to 1, and so on, through Saturday’s 7). The formulas then ascribe 1 to each cell whose weekday matches the neighboring entry in the AH column, and finally sum them. The routine yields these outcomes:

d7

You’ll observe that the totals for days 6 and 7 – Friday and Saturday, per WEEKDAY’s default numbering – are notably higher than those for the other five days of the week. The finding then: that, for whatever reason, more lethal overdoses in Connecticut occur on or around the weekend. A consequence of heightened recreational drug use on those days? It’s possible, but that conjecture needs to be researched.

But there’s a catch: add the totals above and they sum to 2128, or the number of records in the dataset. But two of the dates in the B column are in fact missing, and so the above array routine is somehow admitting the blank cells to its count. And why? It’s because a blank cell’s reference evaluates to zero – in date terms, the fictive January 0, 1900 – and WEEKDAY reads that as a 7, a Saturday under the default scheme. As a result, the weekday counts above errantly ascribe two day-7 entries to the tally that aren’t there.

It took me a while to realize a workaround, but I came up with this, entered in AI5 (again concluding with Ctrl-Shift-Enter):

=SUM(IF(LEN(B$2:B$2129)>0,IF(WEEKDAY(B$2:B$2129)=AH5,1),0))

It looks a bit fearsome, but what’s happening is that the formula determines if the entry in B exhibits a character length greater than 0 – that is, it asks in effect if the cell is blank, by looking for non-blanks. If that logical test is passed, then the array formula proceeds per its original remit. The effect, then, is to exclude the blank cells – the Sunday count, for the record, comes to 339 – and to pare those two phantom entries from the day-7 total. But the weekend effect remains in force.
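
And if the Ctrl-Shift-Enter ceremony grates, a SUMPRODUCT equivalent gets there with an ordinary Enter – a sketch:

=SUMPRODUCT((B$2:B$2129<>"")*(WEEKDAY(B$2:B$2129)=AH5))

The first term zeroes out the blank cells before WEEKDAY ever gets to misread them, and the second does the day matching.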
