Some Q and A about A and E Waiting Times

31 Jan

We don’t usually think of them that way, but field headers are a species of first-line documentation: we need to know what fields mean before we can work with them, quite apart from the brute operational need for fields to be headed at all (header-less worksheets are out there to be found; I can think of a couple).

That is a none-too-arresting truism to be sure (most truisms aren’t too arresting), but when headers put understanding on hold, the truism suddenly matters. I’m thinking about the UK’s National Health Service Accident and Emergency (A&E) data on the waiting times that entrants to the Service’s emergency facilities endure before they’re attended to. It’s a perennially newsworthy concern, and some pretty current data are here:

[Workbook: 2015.01.25-AE-Timeseries4w3Rl]

But before we think about those aforementioned headers, some other, more standard spreadsheet considerations require a look, too. Begin with the oldest-to-newest sort ordering the Week Ending data in the Period field, a curiosity you may or may not find nettling, depending on the intentions you’re bringing to the sheet. If you’re doing nothing but reading the records you’d likely have opted for a recency-driven sort, in which the latest entries top the field; if, on the other hand, you’re keen to mortar-and-pestle the data with a pivot table or some other analytical instrument, you shouldn’t care at all, because you’ll be sorting as and when you wish. Note as well, of course, that those W/Es have seen to it that the presumptive dates filling the Period column are labels, a substantive devaluation that could and should have been forestalled. If you run a Find for W/E[space] and Replace with nothing, you should be able to recover the data’s rightful quantitative standing (though be prepared for some issues if your dates and times honor the U.S. month-first protocol). And when that rework does its job you can rename the field Week Ending.
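
If you’d rather leave the source labels untouched, a formulaic alternative – a minimal sketch, assuming the first label sits in A18 – would write this in a spare column and copy it down:

=DATEVALUE(SUBSTITUTE(A18,"W/E ",""))

SUBSTITUTE strips the W/E prefix, and DATEVALUE promotes the residual text to a bona fide date (again, subject to your system’s day/month ordering).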

Moreover, if you are going to pivot table the data set, you need to do something about the blank row 18 – namely, by deleting it – and treat the grand-totalling row 17 to much the same remedy. Again, you don’t want 17’s totals uncorking a double-count all over your tables. And the merged superordinate headings in 15 have to go the way of 17 and 18, or else your pivot tables won’t go at all (just delete the entries; the row can be left alone).

And now for my opening pronouncement. Perhaps it’s me, but I experienced more than a jot of difficulty in my strivings to understand whence the percentages in columns M and N came. You see how the two metrics are meant to be differentiated: one conveys the percentage of all A&E patients attended (that verb probably requires scrutiny) in four hours or less, the other the fraction of Type 1 incidents seen within that time (Type 1 denotes “A consultant led 24 hour service with full resuscitation facilities and designated accommodation for the reception of accident and emergency patients”, we’re informed here).

The problem – to which my obtuseness is probably central – is in the clarity of the headers’ implied instructions to divide this field by that field. What field, that is, denominates the Percentage in 4 hours or less (type 1) data? The datum in cell M18, for example, reports 95.6%, and after a round of head scratching I returned my fingers to the keyboard and at last prevailed with this triumphant formulation:

=1-I18/E18

Considered in field terms, I divided

Type 1 Departments – Major A&E / Type 1 Departments – Major A&E

Subtract that result from 1 and you do get 95.6%, in single-decimal terms. But you get it – I seem to have divided one field by another of the very same name, and in fact I have. Of course they’re topping different columns, but…I wouldn’t have done that, quite apart from my early inclination to think about dividing I18 by H18, or Total Attendances. That 95.6% then needs to be understood as confining itself to Type 1 on both sides of the division, and not all attendances.

And now – now – that other percentage, the 96.9% in N18, reads more sensibly:

=1-L18/H18

Or,

1 – attendances greater than 4 hours (all) / Total attendances

(remember that L18 counts attendances that took longer than four hours, and so the division computes the greater-than-four-hour fraction, which needs to be subtracted from 1, or 100%, in order to develop the less-than-four-hour percentage). But I might also have replaced Total with All, so that the adjective reads identically across those two fields. Why should we have to wonder about the difference between Total and All?

Another point: the merged-cell heading in I15, A&E attendances > 4 hours from arrival to admission, transfer or discharge, seems to point to patient receptions in excess of four hours; but the percentage recounted above computes attendances of four hours or less. But as I said, maybe it’s me.

A concluding recommendation, even if no one asked me for it: click anywhere in M or N and see the 10-or-so decimal places congesting the column. Might not all the confusion have been allayed if the actual formulas occasioning the M and N results had been left in their places, instead of the hard-coded replacements that we’re seeing? We’d then know exactly what was divided by what, to the relief of all those bloggers confessing their obtuseness in shameless public view.

Ebola Data: Worth Thinking About

22 Jan

If the subject is Ebola, you want the data to be equal to the epidemiological tasks to which they’re likely to be applied – that is, as timely and accurate as they can be. The Humanitarian Data Exchange (HDX) site, an outpost of the United Nations Office for the Coordination of Humanitarian Affairs (OCHA), has been tracking the disease’s incidence across the nine countries (as of this writing) from which Ebola reports have issued, and in nearly to-the-minute fashion (my download posts records dated January 20). You can download it here (access the csv rendition; I couldn’t get the Excel offering to work).

Once inside, you’ll be able to get to work on the sheet’s 3200+ rows of time-stamped data (as of this writing), and a parameter – Indicator – that will force your attention span to… snap to attention. Indicator comprises 36 different items, a good many of them same-sounding; and if your program calls for a scrupulous run through the data, you’ll have to make sure you understand what each of these uniquely means. For example, the eight discrete items headed by the phrase “Cumulative number of…” simply demand clarification, as do the 12 “Proportion of…” rubrics. Put another way, each country receives 36 parameters’ worth of records about its Ebola status.

Moreover, keep in mind that the Proportion data are registered in whole-number terms, and as such would be more meaningfully comprehended by dividing each by 100 (and thanks to HDX’s email reply to this question). Thus a proportion of 43 is to be quantitatively read as 43%, for example.

That obligation to clarify might be illustratively served by pivot-tabling Ebola cases by Indicator and one country, say Sierra Leone:

Row Labels: Indicator

Values:  Value (Sum)

Slicer (or Report Filter): Country (filtered for Sierra Leone)

And here’s where I made my first mistake. I saw that the country had apparently incurred 257,286 confirmed cases of Ebola as of January 20th, but that figure isn’t remotely true – and fortunately not. My pivot table had added each successive, as-of-the-specified-date case total – a textbook case of double, triple, and quadruple counting – when in fact, of course, each figure tabulates the cumulative case total up to just that date.

And that makes for a rather cautionary item. After all, review an unassuming dataset of sales recorded by, say, salesperson, and Jane or Rob’s transactions are there for the summing. But adding successive cumulative totals makes for one very wrong answer. It’s as if the data have already submitted to something like Excel’s Show Values As > Running Total In… option. It’s the records that are doing the aggregating for us, as the dates proceed.
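
One way out – a minimal sketch, assuming the cumulative totals for a single country and indicator have been sorted oldest-to-newest down column C – is to de-cumulate the running totals back into per-period increments in a spare column:

=C3-C2

Copy that down (the first period simply keeps its own C2 value), and the resulting increments can be summed without fear of double-counting.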

And all that means that for certain reads the data should be lined up one date at a time. For example, the Sierra Leone data for October 8, 2014 total 2455 cumulative, confirmed cases of Ebola. The number of confirmed cases in the previous 21 days as of that date stood at 924, which indeed yields the 38 (really 38%) in the “Proportion of confirmed Ebola cases that are from the last 21 days” cell – that is, 924/2455.

Note at the same time the empty cells that, for whatever reason, have no data to report – and doubtless data collection in affected regions has its challenges. October 3 counts no confirmed cases of Ebola, even though it precedes the 8th by only five days and even as it follows October 1 and its 788 cases by but two. And because the dataset’s 51 dates aren’t identically spaced in time, one needs to reflect on the mathematics of those “…last 21 days” items. As with the cumulative data above, these numbers are rolling, but here they confine their locomotion to a moving 21-day band. That means of course that days in the data set fewer than 21 days apart share many of the same cases in their respective enumerations (as another HDX e-mail acknowledged), certainly not a bad thing by itself. After all, the recency and magnitude of Ebola case distribution is a perhaps literally vital concern for researchers. But the journalist needs to appreciate the necessary resort to a kind of double-counting in the process. Still, the variable distancing of the 51 dates from one another does ask a question about how case counts should be properly interpreted.

And there’s something else. Cumulative figures should naturally increment across time, or at the very least remain unchanged; but the cumulative death totals (for all countries) for October 29 and 31, 2014 appear as 9 for both days, following a total of 2990 for October 25. And November 5 presents a figure of 1739, still notably smaller than the October 25 report. It seems that only certain countries were considered for the 29th and 31st, an undercount that seriously disrupts the accumulating total. To see what I mean, try:

Row Labels: Indicator (Filter for Cumulative number of confirmed Ebola deaths only)

Column Labels: Date

Values: Value (Sum)

This one requires follow-up, I’d allow.

And at the risk of baring my medical ignorance I’ll put one more table on the table:

Row Labels: Indicator (filtered for the four “Case Fatality rate” items)

Values: Value (Average).

I get

[Screenshot: Eb1 – average case fatality rates by indicator]

Why should probable Ebola cases exhibit so markedly higher a fatality rate? Could the differential be ascribed at least in part to post-mortem assessments of individuals whose conditions went undiagnosed or to whom medical attention was delayed? I don’t know, but the rate differentials are notable and investigation-worthy.

Trumped Cards: Consumer Credit Card Complaint Data

15 Jan

The topical suggestions on the US www.data.gov home page flash across its master search field like a battery of rotating advertisements, and it was the Credit Card Complaints catchphrase that road-blocked my fleeting attention span, if you must know.

Ok – yours will likely resonate to a more exotic theme, but it must have been the cool alliteration that got me. Either way, if you’re looking for 22.5 megabytes and 334,000 records worth of consumer discontent about their plastic you’ve come to the right place:

http://catalog.data.gov/dataset/consumer-complaint-database

(You’ll also want to review the companion Data Dictionary in there, at its own linked destination, for elucidations of and field references for the above workbook. I have a bit more to say about these later.) Click the Comma Separated Values Download link, and when the hurly burly’s done save it all as an Excel sheet. (I’m now beginning to think that a workbook’s CSV derivation is what explains the un-fit columns that burrow their way into so many spreadsheets across the net; here too, our data call for the redoubtable autofit.)

Once you actually get to the data (which are rather current; beginning with December, 2011, my download called up complaints through January 8) you’ll be enraptured by their organization and fitness. The dates are actually dates (that is, they’re verifiably numeric), and their fields are awash in pivot-table friendliness – lots of meaningful permutations in there to be spun. Let’s start basic, with a state-by-state complaint count:

Row Labels: State

Values: State (Count, necessarily; then by Show Values As > % of Column Total)

Sort the numbers by largest to smallest and you won’t be struck by California’s petition to call itself the credit card complaint capital of the US. That soubriquet, of course, merely hooks adventitiously into the state’s largest-in-the-nation population; but still, my Wikipedia census (circa 2012) fractionates Californians into 12.1% of the country’s total – and they’re registering 15.16% of the complaints, doubtless a meaningful disparity. (And, blank identities aside, you’ve taken note of the 62 rows spilled onto the pivot table, taking the data rather over and above America’s 50-state inventory. Those excess rows point to US territories, whose 2-character codes are deciphered here; and so California’s complaint quotient is in actuality a bit higher, if you filter out the territories.) They’re laid back out there, but unhappy.
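
You could corroborate that share without the pivot table, too – a sketch, assuming the State entries occupy column B beneath a header row:

=COUNTIF(B:B,"CA")/(COUNTA(B:B)-1)

COUNTIF tallies the California records, COUNTA counts every populated State cell (the -1 discounts the header), and the quotient should reproduce that 15.16%, give or take the blanks.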

And other state discrepancies break towards statistical significance, too. Florida’s second-place complaint count – 9.79% – doesn’t really square with its 6.2% of the US population, for example. Perhaps they’ve experienced an influx of Californians.

And how about the problems with Product, i.e., the kinds of credit card services most likely to put the pique in the public?

Row Labels: Product

Values: Product (again, Show As > % of Column Total). Never mind the Grand Totals.

[Screenshot: credit1 – complaint shares by product]

Mortgage complaints, however defined, trouble complainants most widely (though these evince considerable inter-state variation), with debt collection – presumably remonstrances against dunning tactics – running a few laps behind in second position. (It goes without saying, however, that the numbers here can’t be confidently understood without sizing up the respective customer base for each product, though I just said it.) The few complaints about student and payday loans might surprise Britons accustomed to recurrent news coverage of both types of transactions.

Of course the data could also be broken out by time, e.g.

Row Labels: Date received (Grouped by Years and Months)

Values: Date received (Count, Show Values As > % of Parent Row Total)

I get (in excerpt)

[Screenshot: credit2 – complaint counts grouped by year and month]

That % of Parent Row Total command is new to this blog, and merits some detailing. It works something like this: because the Years/Months grouping naturally accords priority to years, that chronological unit is vested with a Parent status, with the kindred month rows sired as a kind of progeny. The above screen shot fits subtotals within subtotals, then; 2013’s 32.36% marks that year’s slice of all consumer complaints, but its monthly percentages proportion themselves not to all months in turn, but only to those contributory to 2013’s totals. Thus January’s 9.07% is 2013-specific, even as the strikingly scant 4.50% for the previous January binds itself to 2012 alone. It’s what % of Parent Row Total does.
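
A worksheet cross-check of that January 2013 figure – a sketch, assuming the Date received entries fill column A:

=COUNTIFS(A:A,">="&DATE(2013,1,1),A:A,"<"&DATE(2013,2,1))/COUNTIFS(A:A,">="&DATE(2013,1,1),A:A,"<"&DATE(2014,1,1))

The numerator counts January 2013’s complaints, the denominator all of 2013’s, and the quotient should land on that 9.07%.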

When it’s all unrolled before you onscreen you’ll observe the very substantial uptick in complaints for 2014, though it’s difficult – though not unimaginable – to suppose that service quality faltered so badly from but one year to the next. More plausibly, though again not definitively, more consumers were made aware, or perhaps made themselves aware, of the complaint filing process. This one might be worth researching as a result.

Now about that Data Dictionary. Jump down to the Field reference section and see that most, but not all, of the text fields are termed categorical, what are elsewhere called nominal – that is, fields whose items can’t be meaningfully ordered or measured, e.g., gender, race, or nationality. What’s slightly odd is that the zip (or postal) code field is likewise understood here as plain text, even as each and every available code (there are some blanks as well) is a certifiable number. What the sheet presumably wants us to understand, then, is that these numbers aren’t for adding or otherwise manipulating. They’re “text” in the way that telephone and social security numbers are text.

But ok; those are quibbles, and does the worksheet really need a 334,565th complaint?

Data Input and Scholarly Output: More UK Research Excellence

6 Jan

Don’t touch that dial, and hands off your mouse wheel; it isn’t déjà vu that’s spooking your reverie but rather another look at the UK’s 2014 Research Excellence data, this one pictured by an extensive piece in the Times Higher Education magazine’s January 1-7 issue. THE (the acronym, not the article) makes considerable room across its pages for a subject-by-subject, peer-judged drilldown of the research impact unleashed by the country’s university departments, an expository angle that points it quite a few degrees from the Research Fortnight study of university funding prospects and their relation to the Excellence figures that we reviewed a couple of posts ago (and thanks to Fortnight’s Gretchen Ransow, who sent me RF’s original data in its pre-Guardian/Telegraph/Datawrapper state).

Impact? It’s defined by THE, at least in the first instance, by a department’s GPA, or grade point average – a summatory index of department staff research activity scored on a 1-to-4 scale, the higher the more impressive. The foundational calculus is pretty off-the-shelf:

(4 × the number of staff rated 4 + 3 × the number rated 3 + 2 × the number rated 2 + 1 × the number rated 1) / count of all submitted staff

Thus a department of 20 members, 10 of whom are adjudged 4-star researchers and 10 pegged at 3, would earn a GPA of 3.5. But because departments were free to submit just the work they elected, and were equally free to detain other researchers from judgemental view, THE coined a compensatory intensity-weighted GPA, arrived at after swelling the above denominator to count all staff whose work could have been slid beneath the REF microscope, even if these were kept off the slide, for whatever reason. If the above department, then, had 30 assessment-eligible instructors in its employ, its intensity-weighted GPA would slump to 2.33 (its 70 grade points now divided by 30; of course, a department’s intensity-weighted GPA can never exceed its GPA, and can only match it if all faculty had submitted work).
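
In spreadsheet terms – a minimal sketch, assuming the staff counts at each star level occupy B2:E2 (4-star through 1-star) and the eligible headcount sits in F2 – the two measures would read:

=SUMPRODUCT({4,3,2,1},B2:E2)/SUM(B2:E2)

=SUMPRODUCT({4,3,2,1},B2:E2)/F2

Feed the first formula 10, 10, 0, 0 and it returns 3.5; point the second at an F2 of 30 and the GPA duly slumps to 2.33.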

In any case the REF has kept us in mind by marshalling its raw data into none other than a spreadsheet, obtainable here:

http://results.ref.ac.uk/DownloadResults

(And yes, you’ll have to autofit the F column. Note as well that the sheet does not offer the entire-staff figures with which THE scaled its intensity-weighted GPA.)

You’ll see each department receives three sub-ratings – Outputs, Impact, and Environment (explained here), these weighted in turn (Outputs contributes 65% of the whole) and shaken well into the Overall figure, the impact metric which THE reports on its pages (in effect, then, the Overall datum operationalizes impact).

If nothing else, the dataset seems to exhibit all due good form, what with its proper concern to treat as items what could have, more troublesomely, been awarded field status (again, see my discussions of the matter in my August 22 and 29, 2013 posts). For example, the Profile field properly ingathers these items – Outputs, Impact, Environment, Overall – where a heavier hand would have parcelled them into independent fields. But a few curiosities fray the edges nevertheless.

If, for example, you want to reconsider the data in pivot table form, well, go ahead. And when you do you’ll get here:

[Screenshot: re21 – Create PivotTable prompt proposing a single-column range]

Allow me to suggest that the single-columned range requesting your OK isn’t what you’re looking for – but what’s the request doing there to begin with? Scroll about the sheet until you pull over by the merged cell L6:

[Screenshot: re22 – merged title cell at L6]

And it’s that unassuming, title-bearing text that’s jamming the table construction, by squatting directly atop the data and stringing itself to the content directly atop it in rows 3 to 5 (that latter row is hidden, and features another bank of apparent, and extraneous, field headers; I can’t quite explain what they’re doing there). Thus the contiguous data set – the warp and woof of pivot tables – starts in row 3, and the pivot table launch process regards it as such; but because its “headers” head column A only, that’s all Excel chooses to work with for starters; hence the curtailed A-column range in the screen shot above. The fix of course is simply to delete the title in L6, restoring the data set’s inception to row 8 (note that because L6 is a merged cell, it takes in row 7, too).

Next, in the interests of corroboration, you may very well want to go on to calculate the sub-ratings streaming across each data row. Let’s look at the numbers in row 9, the Outputs for the Allied Health Professions, Dentistry, Nursing, and Pharmacy area at Anglia Ruskin University. The 4 through 1 ratings (along with unclassified, the rating set aside for work of the most wanting sort) – really field headers – patrol cells L8:P8, and so I’d enter the following in Q9:

=SUMPRODUCT(L9:P9,L$8:P$8)/100

Here the nifty SUMPRODUCT, a deft function whose Swiss-army-knife utility is very much worth learning (and that goes for me too), multiplies paired values in L8/L9, M8/M9, etc. across the two identified ranges (or arrays, as they’re officially called) and adds all those products, before we divide the whole thing by 100, because the REF’s staff counts are expressed in percentage terms. The $ signs hold L8:P8 fast; they store the rating values, and because they do, they’ll be brought along to all the 7,500 rows below.

But if you’ve gotten this far you haven’t gotten very far at all, because you’ll be told that Anglia Ruskin’s Health Professions scholarly exertions earned that department a 0, even though THE awards it a 2.92 – and if you don’t believe that, it says so right there on page 41 of its hard copy.

Someone’s wrong, and it’s us – though with an excuse, because the REF spreadsheet has shipped its field headers in label format. 4* isn’t a number, and neither is “unclassified”. Multiply them, and you get nothing – in our case, literally. Of course the remedy is simple – replace 4* with a 4, and so on, though it should be observed that you can leave “unclassified” alone; after all, multiply a value by 0 or by “unclassified”, and either way you get zero.

In any case, once we get it right you can copy the formula down the Q column, name its field Averages or whatever, and you’re ready to get your pivot tables in gear. But the larger, or at least the prior, question that could be rightfully asked is why the REF did what it did. It’s clear that its data stalwarts are happy to make their handiwork available to folks like us; that’s why they’ve given us a spreadsheet. How, then, do we account for the little bumps in the road they’ve raised across our path? There’s nothing secretive or conspiratorial about this, of course, but the very fact that the REF averages of departmental accomplishment just aren’t there leaves them to be computed by interested others. And that bit of computation could have been eased with bona fide, quantified ratings in L8:P8, and a repositioned title in L6. Those little touches would have been…excellent.

Set Theory: Tennis Stats, Beneath the Surface

28 Dec

Don’t bother looking; the year-end men’s tennis rankings are in, and I’ve failed to crack the top 100 – or even a few orders of magnitude beyond, if you insist on shining the flashlight on the whole list. I’m searching for a new coach but Ivan Lendl won’t take my calls, and I remain open to suggestions.

But if you need to know the names of the boys in white who did manage to serve-and-volley their ways past yours truly, check out the CoreTennis site’s Association of Tennis Professionals (ATP) listings for men, along with the Women’s Tennis Association standings, too. Here’s the men’s file:

[Workbook: atp 2014 – CoreTennis ATP rankings]

(You’ll probably want to untie the ribbons on the wrapped text in there and refit the columns, too.)

Now while the players are indeed ranked, the measure by which they’re ordered – the number of matches each contested – doesn’t quite seem to make the point. Are the best players those who simply step onto the court most often? I would think not, at least not invariably, though it makes considerable sense to suppose that players who’ve plied more matches are the ones who’ve likely gotten farther into the tournaments in which they’ve appeared. Or a large match quotient might tell us little more than the player’s determination to slot more tournaments into his diary. In fact, the correlation between a player’s match total and his winning percentage (the PCT field) works out to .681, impressively high to be sure, but not invariable. In any case a click of most of the headers will grant a re-sort of the data by that field instead (and note the pasted, sortable headers preserve their hyperlinks back to the CoreTennis site, and as such will respond to clicks in their cells with a return to their web sources. By creeping into the first link-bearer via the mouse, selecting all the others with a click-free Shift-right-arrow feint, and right-clicking the Remove Hyperlink option, however, unimpaired, standard cell editing will be enabled anew).
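
(That correlation, by the way, can be recomputed in-sheet with the CORREL function – a sketch, assuming the 40 match totals run down C2:C41 and the PCT figures down G2:G41; the column assignments here are mine:

=CORREL(C2:C41,G2:G41)

Any two same-sized numeric ranges will do.)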

But the decision to sort the data by matches played is more editorial than operational, as is the confining of the dataset to a 40-player elite. So if you’re looking for something operational, then, look at the PCT field and its train of leading zeros. One runs into this supererogation rather often, and I don’t quite know why the sheetmeister here couldn’t have served up this custom format:

[Screenshot: tennis1 – custom number format dialog]
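
(If the screen shot’s details escape you, a custom format on the order of .000 – my supposition, not necessarily the one pictured – would render 0.681 as .681, leading zero duly suppressed.)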

(Warning: tennis-themed pieces are dangerously pun-provocative. I’m trying to control myself, but no promises.)

Moreover – and this is an operational quibble, and a not entirely fair one here – it seems to me that the analyst, again as opposed to the reader, would be able to shape the data set sans any ranking field at all. And that’s because a pivot table can rank its contributory fields with the Show Values As > Rank Largest to Smallest/Smallest to Largest option as needed. But of course CoreTennis can’t be expected to capitulate to the spreadsheet lobby on this count.

Now about those headers, in particular three of the last four:

[Screenshot: tennis2 – date-mangled set-score headers]

The source data don’t read that way, of course – so what’s happening here? Remember that, as with last week’s post, I simply copied-and-pasted the rankings to my local spreadsheet (and by the way, appearances to the contrary, the data do not fill an Excel table, in that term’s contemporary sense; they’re nothing but a data set, all snug in their columns and rows), and the recipient cells have chosen to treat the headers as dates. Thus Excel reformats the first of the problem headers, natively headed 6-0 (and totalling the sets the player won six games to none), into June 1900, the reckoning my US-date-formatted system assigns the expression. You want to see 6-0, though, and perhaps the simplest restorative is to type the Neolithic, text-defining apostrophe into the cell and simply re-enter 6-0, and repeat for the remaining dated cells. Note on the other hand that cell H1 – the one recording the 0-6 header – is, and always was, a texted thing, because if Excel regards the 6 in 6-0 as June, it can’t regard the 0 in 0-6 as anything at all. And when all else fails, the torpid text format jumps right in.

And had CoreTennis asked me, and rest assured they didn’t, I would have reassigned those three-lettered player country designations to a field all their own, thus freeing those data up for some useful pivot tabling. But as those abbreviations share a space with their citizens’ names, we might want to think about a formulaic extraction of the country codes.

And that’s a job for the MID function, a device I recruited in this post to extricate two-digit month identifiers from larger expressions. Here the task is a touch more formidable, because we don’t know exactly where in the respective player/country cells the country codes start. So here’s one – and just one – stab at the problem, working from cell K2 and literally addressing Roger Federer’s info in B2:

=MID(B2,FIND("(",B2)+1,3)

Here MID outsources its second argument – the character position at which the extraction is to commence – to FIND, which discovers and enumerates the whereabouts of the open parenthesis/bracket, in turn always one character to the left of the inception of the country code, thus accounting for the +1. The final 3 instructs MID to grab that many characters from the cell, as all the country codes comprise precisely 3 letters.
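
To walk it through – assuming B2 reads Roger Federer (SUI), a plausible rendering of the actual cell – FIND locates the open bracket at position 15, the +1 points MID at position 16, and the three characters grabbed from there spell SUI.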

Once you get that going, copy the formula down the K column (which I’ve field-named Country). Now you can pivot table a country breakout, e.g.

Row Labels: Country

Values: Country (Count)

Sorting the numbers from Largest to Smallest I get:

[Screenshot: tennis3 – player counts by country]

France leads the country representation, with Spain (Espana) counting one fewer player (the women’s top 40 leads with the USA and the Czech Republic, at five entries each). You could then slide the PCT field into Values, summarize these by Average, and subject the numbers to the same customizing pictured in the first screen shot up there:

[Screenshot: tennis4 – average PCT by country]

(Remember the numbers above average different players’ percentages, and as such don’t weight these for the number of matches each played. But those numbers are sufficiently close to accord them weighting parity, and in any event there may be good reason here to treat the individual percentages as equally contributory.) Note Serbia’s (SRB) chart-topping .884. That kudo of course belongs to Novak Djokovic, but he’s a universe of exactly one. Needed here: more players to rank. And if you want to see my name in there, needed here: Ivan Lendl.

Let’s REFer to the Research Excellence Data

22 Dec

Compare and contrast, the test-makers like to say, and that alliterative injunction packs a redoubled punch across the Research Excellence Framework (REF) data put together by Research Fortnight, the self-described “leading independent source of news, analysis, funding opportunities and jobs for the academic research community” in the United Kingdom. You can compare the estimated six-year funding prospects for 154 UK universities, driven by a weighted assay of their research standings and Research Fortnight-devised power ranks, and then compare the very handling of those numbers by the data managers at two national publications, the Guardian and the Telegraph, here:

http://www.theguardian.com/news/datablog/ng-interactive/2014/dec/18/university-research-excellence-framework-2014-full-rankings

and here:

http://www.telegraph.co.uk/education/universityeducation/11299261/League-tables-the-top-universities-for-research.html

(Transparency check: remember, I teach an Excel class for the Guardian.)


Ok; now do you notice any formatting curiosities in the Guardian’s data capture?

[Screenshot: re1 – Guardian table excerpt]

Why, for example, should the wholly-integered Fortnight Power Ranks splurge to two decimal places, a question that could be even more pointedly posed of the counts in the Number of departments field? And indeed – why Velcro a second decimal onto the RF quality index data, when no meaningful values fill that space?

I’m not sure I can answer those questions, but eager to slake the acquisitive pangs of my hard drive, I clicked the petite Get the data link spaced below the Guardian table, and was taken nowhere. A pair of right clicks atop the table – one to spark an internet-familiar menu, the other to stab at the Export to Microsoft Excel option – did what it usually does: nothing:

[Screenshot: re2 – failed Export to Microsoft Excel attempt]


Having vainly fought the good fight, it was time for a shameless resort to Plan C – a drag across the data, and a simpering copy-and-paste of them all into a blank Excel sheet (you may be able to start the copy with a right click > Select all try). But what I saw, apart from the need to unwrap the text from their constricting cells and auto-fit the Research staff column, was nothing if not interesting.

For one thing, those Power Ranks are more than just zero-freighted. That digital exorbitance is aggrandized by still more, only-now-seen values pulling rightward from their decimal points, and if you like your bewilderment full-blown, take a hard look at those decimals. Each one mirrors the substantive value to its left, e.g., 11.0011, 81.0081, 53.0053, and this curious echo appears to sound across every number in the table in the Excel edition. I don’t presume to know what it means, but it’s there, as is the second decimal point speckling all but six of the ostensible numbers in the Power Rating and most of the other fields: 97.3097.3, 37.8037.8, and the like, an intrusive fillip which, among other things, degrades 147 aspiring Rating numbers into text.

There’s more. The 2,409 Oxford staff attributed by the Guardian site undergo a logarithmic leap to 24,092,409 in our spreadsheet, making for an astonishingly favorable faculty-to-student ratio, one bettered only by University College London’s 25,662,566 – and that’s in a city numbering a village-small 8,500,000 residents.

What happened here? Again I really don’t know, although doubtless there are folks somewhere out there who have the answer (I read my emails). The revelatory powers of a simple copy-and-paste, through which net-sourced data make their actual, lengthier, suppressed values/text known to the spreadsheet setting, is something I’ve experienced before, and must have something to do with the movement from source data to web page. But in any case, my mystification sends me back to my semi-rhetorical question: do the complications we’ve encountered here matter?

Again, for the reader qua reader they don’t. As it stands and apart from its redundant decimals, the Guardian table presents no particular challenges to legibility, making for a rather straightforward read, once you acquaint yourself with the meaning of its fields. But the problem, of course, besets the analyst, those seekers of new and arresting finds who want to saddle up atop the data and take them for a ride. You can’t do much with 97.3097.3 if you want to count or average it.

That’s not to suggest that the data are beyond repair, but the process could get messy. To bring research staff sizes back to reality, for example, you could divide them, but staff sizes above 1,000,000 require a denominator of 10,000, even as the smaller others make do with 1,000. And you could perform a Find and Replace for all the data’s decimals, replacing each point with nothing and thus requantifying these entries; but again, dividing these in order to restore their original magnitudes seems to require differently-sized divisors.
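
(In fact, if the corruption really does nothing but append each number’s own digits to itself – my supposition, based on the examples above – an exact, one-size restorer is available: the four-digit 2409 became 2409 × 10001 = 24,092,409, a three-digit value would have been multiplied by 1001, and so on. Assuming a corrupted staff count sits in A2, this sketch would undo either case:

=A2/(10^(LEN(A2)/2)+1)

LEN counts the digits, half of them measure the original value, and the divisor rebuilds the multiplier accordingly.)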

On the other hand, you could copy and paste the Telegraph’s rendition (itself a tricky business, as the table needs to be scrolled through; try keeping the left mouse button down and positioning it just beneath the data as you continue to scroll; you may also have to hand-enter the field headers) and go to work. The numbers here are numbers (and by the way, the Power Ratings are calculated by dividing a university’s predicted share of funding by Oxford’s 6.24% share).

It’s clear, then, that the Telegraph thought through the data – precisely the same data – differently.  You’ll have to ask them all what they thought – and “them” includes Datawrapper, credited with doing the heavy lifting for the Guardian.

Let me know what they tell you.


Its Byte is Worse Than Its Bark: New York’s Tree Census

15 Dec

I am two with nature, professes Woody Allen, and this New Yorker is only too happy to corroborate his math.  I know nothing about the flora that roots itself in the city’s spaces, and with my gaze locked in perceptual auto-filter, I hardly see them at all. The only branch I’ve noticed lately is the one that belongs to my bank.

In other words, I can’t see the trees for the forest, but they’re there – about 623,000 of them, and the New York City Open Data site’s borough-wide arboreal inventory fills me in on each and every one of them, should I at last decide to make the acquaintance. Look here for the data:

https://data.cityofnewyork.us/browse?q=Street%20Tree%20Census&sortBy=relevance&utf8=%E2%9C%93

The census has pretensions to completeness, and given its address-level molecularity I’d guess it comes pretty close. In the interests of cross-county comparisons (borough equals county) I downloaded and copied-and-pasted the records to a consolidating master workbook, thus harvesting all 600,000 or so trees into one Sequoia-large data set, one that at 60 MB would do severe ecological damage to WordPress’s downloading channel. In other words, you’ll need to consolidate the above link’s offerings on your own. (Note: the Staten Island data originate in DBF, or dBase, mode, but follow the standard File > Open check-in with a specification for dBase Files (*.dbf), and the data will effect their proper spreadsheet mien. Note in addition that Staten Island’s Treepit numbers – referencing a field about which I know absolutely nothing – were minted in label format, and I don’t know why; but I’ve yet to work with them, either, so the matter remains moot, though curious.)

And the meaning of the entries in the set’s fields isn’t entirely forthcoming; preeminent among the recalcitrant is Species, whose coded tree names had me searching for a Boy Scout, at least until I stepped on this document in the wilderness:

http://www.nycgovparks.org/sub_your_park/trees_greenstreets/treescount/images/species_list.pdf

Courtesy of none other than New York’s tree census.

After saving it all as a text file and pasting it into a new workbook sheet, feeding the names into a text-to-columns grinder and mashing the columnized names via a concatenating ampersand or two (e.g. =B3&" "&C3&" "&D3&" "&E3), I nailed and glued the whole goopy effluvium into a tottering VLOOKUP table that seems to be standing up to New York’s December gusts and showers. After next inundating the sheet’s next available column (which I titled Species Name) with lookup formulas that exchanged the codes for the full species names, and paving the formulas over with a copy > Paste Special > Values hard-coded surface, the deed was done.
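
Those lookup formulas ran along these lines – a sketch whose ranges and sheet name are mine, assuming the species codes occupy column F of the census sheet and the code-to-name table sits in M2:N168 of a sheet called Lookup:

=VLOOKUP(F2,Lookup!$M$2:$N$168,2,FALSE)

The FALSE insists on exact code matches, and the $ signs hold the table still as the formula journeys down its 600,000 rows.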

Now what? Well, note for starters that the folks of the New York design studio Cloudred have cultivated an interactive chart of the trees and their species/borough distributions, leaving your character-based correspondent here to wonder about such other tree novelties in there as might be learned. So how might the tree census tally, broken out by the city’s zip (i.e. postal) codes, work? Try running this by, or through, a pivot table:

Slicer (or Report Filter): BOROUGH

Row Labels: ZIPCODE

Values: ZIPCODE (Count)

Then try Slicing Manhattan, for example, and sorting the results Largest to Smallest. I get, in excerpt:

[Screenshot: trees1 – Manhattan tree counts by zip code]

Now here is where some New York-percolated smarts would hand-craft your understanding of the numbers, should you want to key tree demographics to zip codes and their constituent neighborhoods (and if you’re in need of percolation, look to this zip code/service-area legend):

https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm

Heading the count is 10021, coding none other than Manhattan’s fabled Upper East Side, among the country’s poshest districts and 50% greener than silver medalist 10025, right across Central Park in the island’s Upper West Side, itself literally just above third-place 10024, somewhere in the middle reaches of the Upper West Side. The sociologists among you will wonder if class, politics, or the politics of class have committed their invidious inputs to the tree populations, or rather if territorial receptivities to the planting enterprise better explain the variance. And for the curious, Staten Island – containing far fewer human New Yorkers than any other borough, but second only to Queens in the hardwood department – can market its 10312 as New York’s most tree-swarmed zip code, reporting 23,156 of those life forms to the authorities. (Note by the way the 24,000 zip-code-less entries in the data set; and I see no enumerations of park-resident trees, clearly an enormous and apparently uncounted cohort. It appears as if our data survey address-bearing trees alone, unless I’ve got it way wrong.)

And you’ve been asking about New York’s tree diameters, and I hear you. They’re recorded here in what must be inches, and we can kick off with something simple:

Row Labels: BOROUGH

Values: DIAMETER (Average, squeezed to two decimal places).

The averages diverge notably, though you’ll need a greener thumb than mine pressing these keys to understand and explain why. Perhaps Manhattan’s penumbrous skyline stunts tree girths, but again you’re hearing that supposition from the wrong person – just ask Woody. If, on the other hand, you try this table out:

Row Labels: Species Name

Column Labels: BOROUGH

Values: DIAMETER (Average)

[Screenshot: trees2 – average diameters by species and borough]

You’ll find some inter-borough comparisons across the same species that place the Manhattan diameters on the low side.

A few final points: as could be anticipated with a 623,000-record data set, some distinctly outsized numbers appear to have crept into the park. The 148-32 86th Avenue address in Queens’ Jamaica neighborhood boasts a 2100-inch-diametered tree, and my Google Maps photo just doesn’t capture it. At 175 or so feet in width, I think it would show up.

The last point: the data with which we’ve been working hold up the back end of what New York Open Data showcases in the first instance as a set of borough tree-positioning maps of the census. Here’s the one for Manhattan:

[Screenshot: trees3 – Manhattan tree-positioning map]

Maximum coverage – maximized information? That’s called a rhetorical question.
