
Toiling with, and in, the Fields: Restructuring the Data, Part 2

29 Aug

If I may grab back the baton bearing the idea with which I had begun my sprint last week, you’ll recall I’d sung the praises of a certain manner of data organization, namely this:

cons1

Rather than this:

cons2

(And yes, as Andrew Marritt commented last week, you can indeed do the work espoused here with Google/Open Refine (for a lucid tutorial on the data-righting process Refine calls transposing, see this post). Whether Refine does the deed identically, or more speedily or fluently, is something one could talk about, but the option is in place. For the sake of continuity, though, we’ll honor the generic, delightfully retro Alt-D, P protocol.)

In any case, what ALT-D, P does among other things is trade in on the ambiguity between a field and a field item, by demoting the test subject fields with which we had started above to item status in a single, superordinate field, and you’ll recall as well that the routine’s dénouement comprised a double-click of the resulting pivot table’s Grand Total cell, which at last recombined the data into the recommended arrangement. And if you’re wondering about alternatives, guide our data through a standard pivot table (that is, sans ALT-D, P) and double-click Grand Total there; all you’ll do is replicate the data layout with which you started. And the larger point looms: if here you want the pivot table to look like this

cons3

you’d be bidden to tug the subject fields into the Values area, and not Column Labels:

cons4

Because again, each subject continues to hold to its field status; and if you do try to assign the subjects to Columns – something you might reflexively do – the subjects won’t line up alongside one another; rather, they’ll nest in precisely the superordinate-subordinate hierarchy that the Column/Row Labels areas always promote – and again, if that sounds impenetrably techie, try it, and you should see what I mean. But pile all the subjects into the same field, and now your pivot table assumes your wished-for state:

cons5

And that’s what Alt-D, P does. It consolidates multiple data sets by strapping all the data into three generic Row, Column, and Values fields, perhaps in part because it has to be prepared to deal with variously-headed fields across the data sets.  Consolidate two data sets, each comprising three fields with incomparable names (and that’s something you can do here) and you’re left to pack them all into three consolidated fields – sporting six names. Three fields, six possible names – do the math. In the interests of decision-making, then, Excel fires three fields in the crucible, and names them, with all due practicality, Row, Column, and Values. Well, it beats Moe, Larry, and Curly.

But that’s all deep background, and if something’s been lost in the translation you can prevail nevertheless. The fact is that we’re not consolidating data sets here; we’re exploiting Alt-D, P for its three-column, data-digestible, revamping potential, and by way of another exercise we can return to the very scene of our original exposition – that very NPP file (which I’ve modified in the interests of pedagogical simplicity, by expurgating those fields that could be easily derived in a pivot table) about which I mused on September 28:

NPP_FPDB_PopulationbyGender take two

Remember that we want to single-columnize all those year/gender fields, the better to pivot table them (for all those reasons we’ve detailed above). Feed the data through the Alt-D, P process we stepped through in part 1, and you should be brought here (in excerpt below):

cons6

(My inclination here, and one you may not share, would be to entertain the TableTools > Convert to Range (in the Tools button group) possibility, so as to step over those nasty table-specific formulas – and you could also rename the field headings).

Now even as we’ve succeeded in reforming the data we’d almost surely want to break the Column data into the standard year and gender variables, a prospect that necessitates another, albeit more garden-variety, chore. We need, in effect, to split the Column data above into yearly and gender apportionments, and so I’d head the right-adjoining column Year and write, immediately below (assuming it all starts in row 1):

=LEFT(B2,4)

And copy down. As for the gender variable, I’d hop into the next column, call it Gender, and write in E2:

=IF(RIGHT(B2,6)="Female","W","M")

That is: if the cell’s 6 rightmost characters yield the term Female, register a W in the cell, and if the condition isn’t met, post an M therein. Copy down and you’re done. (You could also optionally impose a Copy > Paste Special > Values treatment upon these columns, and simply delete the Column column.)
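(To make the mechanics concrete with a made-up Column entry – the labels in your own download will take this general shape, but I’m inventing the specimen: suppose B2 holds 2010 Female. Then

=LEFT(B2,4)

returns 2010, the first four characters, while

=RIGHT(B2,6)

returns Female, whereupon the IF above posts its W; a label whose six rightmost characters don’t spell Female – 2010 Male, say – draws an M instead.)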

Now you’ve granted yourself three substantive breakout variables – State (again, primevally called Row), Year, and Gender, for your pivot table delectation.

Toiling in, and with, the Fields: Restructuring the Data, Part 1

22 Aug

No entomologist am I, but if you have a bee in your bonnet it seems to me you take the hat off – that is, you do something about it. The bee buzzing in and around my haberdashery? You’ve heard it from me before, and now you’re going to hear it again – it’s that curious relation between fields and the items furrowing the fields, or put otherwise, parameters and their values.

Past exposition notwithstanding, the matter could stand a bit of review, although the archivists among you could reopen my September 28, 2012 blog as well for a quick reread. In any case, consider these demo data (and don’t inquire after the all-over-the-place student test scores – they were randomly generated):

con1

To click along with me, download the above data here:

Student grades by subject

Not much to controvert here, you’d allow. But in fact the field status accorded each and every academic subject strikes a blow against spreadsheet efficiency, and for this reason: plow the data into a pivot table, and if you want to calculate a student’s test average, for example, you’ll need to plant each subject into a calculated field – and that’s a bigger-than-necessary, click-happy deal. And ask yourself how this would play were you forced to roll your eyeballs across 50 such fields instead of our half-dozen. Moreover, if you wanted to shift the mathematical operation – say, move from an average to a maximum – you’d either have to edit the calculated field, or calculate a new one.

Restructure the data, however, into something that starts like this

con2

Wherein each subject presents itself as a value in the same, unitary field, and that pivot-tabled average enables itself thusly and simply:

Row Labels: Name

Values: Score (Average)
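(And should you want to corroborate the pivot table’s verdicts out in the open worksheet, the single Score field submits to the garden-variety conditional functions too. Assuming the restructured data park Name, Subject, and Score in columns A, B, and C – and granting me a made-up student named Smith –

=AVERAGEIF(A:A,"Smith",C:C)

returns Smith’s average across all six subjects in one stroke, no calculated field required.)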

Yet one could reasonably opine that all that analytical elegance comes with a price, charged to the data entry department. After all, the student/test score matrix comprises 70 cells’ worth of data (absent the header entries) – ten student names keyed to six scores per. My “improvement”, on the other hand, enjoins the user to enter 180 cells – six test subjects and their associated scores, each wired in turn to a student name. That’s a lot of elbow grease, and merely to lubricate the wheels of a pivot table; and in fact that’s a point well taken, one we need to keep in mind; but just the same, I’ll stand by the idea that the greater analytical good is served by lining the data up in the manner I recommend above. And anyway, if the data are already dubiously arrayed, we want to know what pains we’d need to take in order to overturn that fait accompli, and whether those pains are worth it. The question, in other words: what would we have to do in order to get the data as pictured in the first screen shot to look like the data pictured in the second, short of simply re-entering them?

Here’s how (understand that we’re just reciting the instructions here, and want to reserve the explanations for another time. You can also read this boffo tip first-hand here, where I came upon it):

  • First, click anywhere among the data.
  • Then press Alt-D and follow by tapping P, thereby blowing the doorknob off one of the great clandestine, closeted, latter-day Excel options, the PivotTable and PivotChart Wizard (as the caption attending Alt-D whispers to us, the Alt-D keystroke tandem bores through the time tunnel back to older Excel command sequences).
  • Tick Multiple consolidation ranges, even if our intentions here remain consolidation-free.
  • Click Next and tick “I will create the page fields” (though it appears you can get away here with “Create a single page field for me”, too), an alternative that time-stamps the command’s antiquity. “Page”, after all, is the terminological forebear of “Report Filter”.
  • Select the range of student/test score data and click Add.
  • Click Finish. You should see

con3

a pivot table, its default grand totals set to sum, but don’t worry about that. Now here’s your homework assignment: Look right, at the PivotTable field list:

con4

How many fields do you count? I get three – but remember, we started the table-making process with seven, the student names and their six subjects. Think about it.

Now double-click the grand, grand total – in our case, the value 3812. You’ll see (in part):

con5

And that’s exactly what we want: all the data are columned in the way we want them, with all the test subjects redefined as items, all lowered beneath the same field, and in a real table besides. And now you can do

Row Labels: Student

Values: Value (Average)

And so on. But there’s more to say about all this.

(Note: Many thanks to commenter Andrew Marritt for apprising me about Google (now Open) Refine’s ALT-D, P data-reorienting equivalent. Hope to say more about this in the next post, too.)

Sociology Section Enrollments, Part 2: Collective Behaviors

16 Oct

Curious but methodical was how I quotably described the American Sociological Association’s section membership spreadsheet, calling attention first of all to the sheet’s functional header row in row 2, which manages to ignore the data resting atop it in row 1; and in that connection I owned up to my own curiosity about row 1’s resistance to any sort applied to the data set beneath it. It turns out that curiosity was apparently misplaced. If, for example, you proceed to sort this data set:

asa21

Its numbers will sort properly – for whatever reason – even if you click among the data in the top row, even as that attainment seems at least slightly curious too, at least to me (for a résumé of some of the vagaries of sorting see this post by Excel ace Allen Wyatt).

And while I’m mid-course correcting, you’ll recall the supposition I entered in the previous post – namely, that ASA members are confined to a maximum of three section enrollments, a deduction I prised from the enrollment averages springing from the race data workbook. It turns out, however, that I was wrong again; my misunderstanding was righted by ASA Director of Academic and Professional Affairs Teresa Ciabattari, who tipped me to the fact that no enrollment limitations are imposed on members. But that very open-endedness rouses the question of why per-member enrollments hover around the 2 mark. Academics are busy people, of course, and their collective professional experience may have centered the normal curve around that number. What isn’t here is a frequency distribution of sorts, one that would count the numbers of members who affiliated with this or that many sections.
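(Were that member-level data ever to come my way – say, a list bearing each member’s section count in column B, a column I’m imagining, not citing – the distribution would be a COUNTIF away; for example,

=COUNTIF(B:B,3)

would count the members enrolled in exactly three sections, and repeating the formula for 0, 1, 2 and the rest would trace the whole curve. But that exercise awaits data I don’t have.)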

And I learned something else not entirely extraneous to the above: that section memberships must be paid for, the costs of which are interestingly variable. I was thus bidden to wonder if the dues might be negatively, or even positively, associated with section enrollment numbers (the most expensive: $52, for the Community and Urban Sociology section, a fee that at the same time grants members a subscription to the City & Community quarterly).

They aren’t, apparently. The correlation between section dues and their numbers comes to .127, a value that surely won’t quicken the heart of a sociologist, but at least tells us that money isn’t everything.
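(The calculation itself is a one-liner: with the dues in one column and the section enrollment counts alongside – I’m guessing at B2:B53 and C2:C53 for the 52 sections; your ranges may differ –

=CORREL(B2:B53,C2:C53)

hands back that unquickening coefficient.)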

But back to the curious/methodical theme. Consider the organization of the membership-by-race worksheet as it was conceived by the ASA:

asa22

Now suppose you’ve been entrusted with the raw membership data in their primeval, pre-spreadsheet form, along with a remit to put the numbers to some good analytical use. As such, you need to bang together a sheet that serves your purposes and only yours, and not the interests of a general readership that intends nothing more than a breezy scan of the numbers. What, then, would your sheet look like?

Here’s the form that functional recreation might assume: First, I’d efface the gray banding that could discolor a piquant conditional formatting of the cells. Next, I’d decouple all those merged cells in row 1 and see to it that each and every column/field was properly headed, e.g. in excerpt:

asa23

And by extension, I’d instate Row 1 as the data set’s header.

But these steps toward a remake of the worksheet conceal a false start – because if I’m shaping the sheet for myself and not for the general reader, I’ll decide that I don’t need the section % fields in the data set at all, headers notwithstanding. Once I’ve subjected the set to a proper restructuring on my terms, I can tap into the Show Values As > % of Row Total option and derive section percentages by race when needed; and if I don’t need to know those percentages, then I don’t need the section percentage fields either.

So on behalf of a prudent minimalism I’d delete all the section % columns, and by the same token I’d delete the Membership field; its aggregates can likewise be derived in a pivot table from the individual race section numbers.

But a question needs to be asked about row 2, the sheet’s erstwhile header. That row records ASA membership totals by racial category, and those data are not of a piece with the holdings in the following 52 rows. Row 2 enumerates the actual totals of ASA members by claimed racial identity, but they do not inflict a double-count upon the sheet’s fields just the same. Because members can enlist in as many sections as they wish, the totals for the eight race fields won’t add up to the figures in 2. Thus, the enrollment numbers in row 2 possess a discrete, non-duplicating character, but at the same time they don’t belong with their companion rows. On balance, then, I’d say the row has to go.

And for something like the coup de grâce – remember, I’m earmarking this data set chiefly for myself – I’d run the records through the Get & Transform routine described here, thereby unpivoting (blame Microsoft for that word; I take no responsibility) the columns into this most pivot-table-friendly arrangement, in excerpt:

asa24

And of course I can rename the Attribute field Race, or whatever suits me (but if I’m working in the privacy of my own room and left to my own analytical devices, I may not even bother to rename the field).
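(And if you want a non-pivot assurance that the deleted Membership field really was redundant: with the three fields sitting in, say, columns A, B, and C – my guess at the placement – a

=SUMIF(A:A,"Sex & Gender",C:C)

re-adds the race counts for that section – the very aggregation the deleted Membership field had been storing.)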

And once I’ve gotten this far, my three fields can do almost all of the work of the original sheet design, and some more besides, for example:

asa25

That is, here I rank section enrollment preferences by race (the ranks read down the column), and thus learn that Sex & Gender proves the most popular section for White sociologists, while Racial & Ethnic Minorities stands in the first position for both African American and Multiple Selected ASA members. And for the 1162 members who, for whatever statistical or preferential reason, slip into the Missing category, Political Sociology tops the enrollments. And again, the Get & Transform rendition of the data set enables the user to tee up a Slicer, through which race- or section-specific data – whose items are all now domiciled in a single field – can be neatly filtered.

But the larger point is that even if my revamp does nothing more than emulate the potential functionality of the sheet I downloaded from the ASA site, I’ve achieved with three fields what took the original 18. That’s prudently minimalist, isn’t it?

Going Back To Your Routes: NYC Bus Data, Part 2

26 Jan

With its buzzing confusion of 1’s and 0’s, the spreadsheet ledger of the stops and starts of the Q44 looks for all the world like the leaked innards of some secret binary code, but the numbers are assuredly public; each 1 remembers the time and the place when and where a Q44 passed by or came to momentary rest at a scheduled stop.

And if you sum all the 1’s (add the 0’s too, if you like) your formula is going to report 507,218 back to you, rather a large total that counts the number of times the Q44 traversed a stop in September, 2015. And do the subsequent math; divide the sum by 138, the number of apparently operative stops on the route (remember that stop 103900, wherever it is, is empty and apparently isn’t serving riders) and you get the approximate number of Q44 runs (about 3,700) across the month. Divide that result in turn by September’s 720 hours and you’ll find a Q44 heading your way around five times an hour, and that sounds about right.
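(Spelled out, for the checkers among you:

=507218/138

comes to about 3,675 – the month’s worth of runs – and

=3675/720

pares that down to roughly 5.1 buses an hour.)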

But companion analyses of the stops aren’t quite as forthcoming. Take note of the 1’s informing any particular row, for example 14959, a pinpointing of the 44’s daily (or nightly) round at 4:39:28 AM. A sideways scan along the row turns up several rivulets of consecutive 1’s staking contiguous stops, e.g.

bus21

Now even those among you who’ve never stepped aboard a 44 will reject the likelihood that four of them would find themselves queued along adjacent stops at 4:39 in the morning. Far more probable, of course, is a tracking of a very solitary bus sailing along its appointed round at that time with all due speed, rolling through the four stops in the two-or-so minutes culminating in the next interval – 4:41:36 (thanks to Open Bus contact Justin Tillman for corroborating the above). As a result of that sort of double and quadruple count, an unerring read of the number of different buses motoring through a particular interval’s row fails to resolve, at least not without a deep and lengthy think.

But there may be practicably more to learn by thinking vertically instead – that is, for example, by asking about the timing activity of buses pulling into, or past, any given stop (thanks again to Justin for commending the strategy); and those comings and goings are headed down the columns.

But if stop-specific activity is what you want to track you won’t want to pursue an average time-between-buses metric; that’s because averages can’t do justice to the commuter’s daily scourge – the bunching of buses in maddening fraternity that should have been properly spaced apart (something I’ve actually seen today; I’m in New York and have witnessed several bunched 44s). After all – think of 5 buses in an hour wheeling their way past stop A, each arrival lock-stepped precisely 12 minutes after the previous one. Then think of one bus traveling through A and succeeded 55 minutes later by four buses packed into the next 300 seconds. Either way you get 5 buses an hour, or one every 12 minutes – and that average isn’t telling your story.

The stat that you need, then, is a standard deviation, a measure that’ll say something about the variation in the bus spacing, and looks past the deceptive flattening that averages perpetrate upon the stop times. And indeed it’s standard deviations on which Open Bus relies in their Final Averages spreadsheets (click on the 2015/9 data link to see one. You’ll have to parse these data with a Text to Columns routine once you download them, by the way), and for two reasons, I think: the one I’ve just cited, and the scheduling realities of routes – some routes simply run more often than others, and the same route runs more or less often depending on the time of day. Again, then, an average can only mislead and fail to properly compare the numbers, both between and within different bus lines.
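(To hang numbers on the contrast: call the clockwork scenario five 12-minute gaps, and the bunched one a 55-minute gap followed by four gaps of 1.25 minutes – my rough carve-up of those 300 seconds, not gospel. Then

=STDEV.P(12,12,12,12,12)

returns 0, while

=STDEV.P(55,1.25,1.25,1.25,1.25)

returns 21.5 minutes – the same 12-minute average, telling two very different stories.)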

But the route toward standard deviation is paved with issues, not the least of which is the assignment of each bus stop to its own field, and not to the item status I’ve espoused many times in past posts. On the other hand, with 2,000,000 cells of data the sheet simply can’t reinvent itself into a like number of record rows, so we have to make do with what we have, and my hopeful suspicions notwithstanding I’ve been unable to set forth a single array formula that could be copied beneath the columns of each stop. So what I’m pushing now with all due diffidence is a pivot-table-fronted “solution”, one that imperfectly requires that only one stop be considered at a time – and that’s imperfect to be sure. But it goes like this:

Row Labels: Any stop (I’ve selected the first, stop_503964; filter for its 1’s only)
Time

Values: Time (Show Values As > Difference From > Base Field: Time, Base Item: (previous))

bus22

In an occupied cell – I’ve chosen I1 – enter:

=STDEV.P(B5:B10000)/60

I get, for stop 503964, a standard deviation of 10.28 minutes between bus appearances.

Now to translate: I’ve selected the Time field – comprising the rawest of UNIX raw data counting the number of seconds elapsed between the values in the data and January 1, 1970 – because we’re concerned to figure times elapsed between buses sighted at the particular stop. We’re not now interested in the date of the events, but rather only their timings relative to all the other bus sightings. We need to filter for the 1’s populating a given stop simply because that value represents the appearance of a bus at the given time, and so we want to compare these and only these. The Time field is made to reappear in the Values area because we need to subject Time’s numbers to a mathematical operation – each time’s difference from the immediately previous time, and these won’t appear unless they’re called up, as it were, by the times in Row Labels. Thus each time at which a bus appears is subtracted from the next bus appearance, returning the intervals out of which the standard deviation is fashioned. The result is then divided by 60, because all the times are expressed in seconds. The standard deviation formula is itself external to the pivot table, because while you can ask a pivot table to perform standard deviations here it’ll calculate the SD for each item, and there’s no standard deviation for one value.

To learn the standard deviation for any other stop you need to drag the active stop field name off the table, substitute the next desired stop, and again filter for the 1’s. Imperfect indeed, because the process is necessarily sequential. But it works, more or less.

Got that? While you work on it, I’m trying to figure out how many rides I get on my Metro Card – now that’s a challenge.

Data Input and Scholarly Output: More UK Research Excellence

6 Jan

Don’t touch that dial, and hands off your mouse wheel; it isn’t déjà vu that’s spooking your reverie but rather another look at the UK’s 2014 Research Excellence data, this one pictured by an extensive piece in the Times Higher Education magazine’s January 1-7 issue. THE (the acronym, not the article) makes considerable room across its pages for a subject-by-subject, peer-judged drilldown of the research impact unleashed by the country’s university departments, an expository angle that points it quite a few degrees from the Research Fortnight study of university funding prospects and their relation to the Excellence figures that we reviewed a couple of posts ago (and thanks to Fortnight’s Gretchen Ransow, who sent me RF’s original data in its pre-Guardian/Telegraph/Datawrapper state).

Impact? It’s defined by THE, at least in the first instance, by a department’s GPA, or grade point average – a summatory index of department staff research activity scored on a 1-to-4 scale, the higher the more impressive. The foundational calculus is pretty off-the-shelf:

(4 × the number of staff rated 4 + 3 × the number rated 3 + 2 × the number rated 2 + 1 × the number rated 1) / the count of all staff

Thus a department of 20 members, 10 of whom are adjudged 4-star researchers and 10 pegged at 3, would earn a GPA of 3.5. But because departments were free to submit just the work they elected and were equally free to detain other researches from judgemental view, THE coined a compensatory intensity-weighted GPA, arrived at after swelling the above denominator, by counting all staff whose work could have been slid beneath the REF microscope, even if these were kept off the slide, for whatever reason. If the above department, then, had 30 assessment-eligible instructors in its employ, its intensity-weighted GPA would slump to 2.33 (its 70 grade points now divided by 30; of course, a department’s intensity-weighted GPA can never exceed its GPA, and can only match it if all faculty had submitted work).
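(In spreadsheet terms, that toy department’s two numbers fall out of

=(4*10+3*10)/20

which returns the 3.5, and

=(4*10+3*10)/30

which sinks to 2.33 once the ten non-submitters swell the denominator.)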

In any case the REF has kept us in mind by marshalling its raw data into none other than a spreadsheet, obtainable here:

http://results.ref.ac.uk/DownloadResults

(And yes, you’ll have to autofit the F column. Note as well that the sheet does not offer the entire-staff figures with which THE scaled its intensity-weighted GPA.)

You’ll see each department receives three sub-ratings – Outputs, Impact, and Environment (explained here), these weighted in turn (Outputs contributes 65% of the whole) and shaken well into the Overall figure, the impact metric which THE reports on its pages (in effect, then, the Overall datum operationalizes impact).

If nothing else, the dataset seems to exhibit all due good form, what with its proper concern to treat as items what could, more troublesomely, have been awarded field status (again, see my discussions of the matter in my August 22 and 29, 2013 posts). For example, the Profile field properly ingathers these items – Outputs, Impact, Environment, Overall – when a heavier hand would have parcelled them into independent fields. But a few curiosities fray the edges, nevertheless.

If, for example, you want to reconsider the data in pivot table form, well, go ahead. And when you do you’ll get here:

 re21

Allow me to suggest that the single-columned range requesting your OK isn’t what you’re looking for – but what’s the request doing there to begin with? Scroll about the sheet until you pull over by the merged cell L6:

re22

And it’s that unassuming, title-bearing text that’s jamming the table construction, by squatting directly atop the data and stringing itself to the data directly atop it in rows 3 to 5 (that latter row is hidden, and features another bank of apparent, and extraneous, field headers. I can’t quite explain what they’re doing there). Thus the contiguous data set – the warp and woof of pivot tables – starts in row 3, and the pivot table launch process regards it as such; but because its “headers” head column A only, that’s all Excel chooses to work with for starters; hence the curtailed A-column range in the screen shot above. The fix of course is simply to delete the title in L6, restoring the data set’s inception to row 8 (note that because L6 is a merged cell, it takes in row 7, too).

Next, in the interests of corroboration, you may very well want to go on to calculate the sub-ratings streaming across each data row. Let’s look at the numbers in row 9, the Outputs for the Allied Health Professions, Dentistry, Nursing, and Pharmacy area at Anglia Ruskin University. The 1-4 ratings (along with unclassified, the rating set aside for work of the most wanting sort) – really field headers – patrol cells L8:P8, and so I’d enter the following in Q9:

=SUMPRODUCT(L9:P9,L$8:P$8)/100

Here the most nifty SUMPRODUCT, a deft function whose Swiss-army-knife utility is very much worth learning (and that goes for me too), multiplies paired values in L8/L9, M8/M9, etc. across the two identified ranges (or arrays, as they’re officially called) and adds all those products, before we divide the whole thing by 100, because the REF’s staff counts need to be expressed anew in percentage terms. The $ signs hold L8:P8 fast; that range stores the rating values, and because it does, it will be brought unchanged to all the 7,500 rows below.

But if you’ve gotten this far you haven’t gotten very far at all, because you’ll be told that Anglia Ruskin’s Health Professions scholarly exertions earned that department a 0, even though THE awards it a 2.92 – and if you don’t believe that, it says so right there on page 41 of its hard copy.

Someone’s wrong, and it’s us – though with an excuse, because the REF spreadsheet has shipped its field headers in label format. 4* isn’t a number, and neither is “unclassified”. Multiply them, and you get nothing – in our case, literally. Of course the remedy is simple – replace 4* with a 4, and so on, though it should be observed that you can leave “unclassified” alone; after all, multiply a value by 0 or by “unclassified”, and either way you get zero.
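(A quick check with invented percentages – these are not Anglia Ruskin’s actual numbers – shows the repaired formula behaving itself: were L8:P8 to hold the numeric 4, 3, 2, 1, 0 (or leave the last as unclassified; SUMPRODUCT treats the text as a zero either way) and L9:P9 to hold, say, 30, 40, 25, 5, 0, then

=SUMPRODUCT(L9:P9,L$8:P$8)/100

returns (4×30+3×40+2×25+1×5)/100, or 2.95.)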

In any case, once we get it right you can copy the formula down the Q column and name its field Averages or whatever, and you’re ready to get your pivot tables in gear. But the larger, or at least the prior, question that could be rightfully asked is why the REF did what it did. It’s clear that its data stalwarts are happy to make their handiwork available to folks like us; that’s why they’ve given us a spreadsheet. How, then, do we account for the little bumps in the road they’ve raised across our path? There’s nothing secretive or conspiratorial about this, of course, but the very fact that the REF averages of departmental accomplishment just aren’t there leaves them to be computed by interested others. And that bit of computation could have been eased with bona fide, quantified ratings in L8:O8, and a repositioned title in L6. Those little touches would have been…excellent.

Eurostat Data: Some Footnotes and NUTs

30 Nov

Sorry to butt in in the middle of breakfast/lunch/dinner (tick one – I’m requesting your location just once, and my intentions are cookie-free), but allow me to ask: if a web site makes its data – the same data – available in both spreadsheet and PDF incarnations, what are we being told? It seems to me that the latter possibility speaks to the former; that is, the shipping of information into a read-only, do-not-touch mode suggests the other shipment is meant to be used.

One man’s opinion; and having so opined, I took my self-satisfied take on the matter to the data depot in the European Union’s Eurostat site, the statistical office for that far-flung association. There’s a lot of stuff in there; climb its tree structure and you’ll come away with all manner of facts and figures and nuts and bolts about member nations, a great deal of which (perhaps even all) clothe themselves in spreadsheet raiment. And again for seekers of alternatives, a good many of these parcels of stuff can come to you in PDF suit-ups too.

And so I shot an arrow into the air, and it came down upon this spreadsheet

Eurostat_Table_guest establishments

(Its original name isn’t nearly as graceful)

and headed in row 1 “Number of establishments and bed-places by NUTS 2 regions”, that pun-evocative acronym abbreviating Nomenclature of territorial units for statistics (look here for more), a vocabulary for hierarchically tiering EU territories by area.

In fact the downloads offer themselves to us thusly:

 euro1

For starters, at least, I went for the XLS With Footnotes and short description, the workbook I’ve linked above.

Once you’ve auto-fit the columns (and as usual, the spreadsheet author hasn’t) you’ll understand that rows 1 and 2 have to be pushed away from the data below with your true-blue blank row in between, as does row 321, directly beneath the cell that among others names the Batman region (face it – you can’t help but admire my comedic restraint here). Next lower yourself into the data and think about the colons (e.g. row 139), which plant data-not-available markers all over the sheet. True, a replacement zero wouldn’t work here, representing as it would an authentic, quantified data point; but still, the labels are data weeds in a field otherwise marked out for values (and hold on to that thought).

But then there’s column C, and E, and G, and all those other empty corridors elbowing their way between the yearly numbers. Were these vacuities nothing but nothing, that is, airy space-makers between the data, we could simply delete them. But in fact the columns aren’t quite all empty, and have after all received a marching order. Remember my download asked for a workbook with footnotes somewhere in the workbook, even as I couldn’t quite fathom what that ask was going to deliver. It turns out that C, E, and G (I know; sounds like a major chord) and their compatriots have reserved their spaces for those very footnotes. Look for example at S130:S133 and their (u) entries; and if you look in turn back to what should now be row 322, its (u) explication, flagging a datum of low reliability, reads loud and clear.

And those largely but not wholly unattended footnote columns really mean well, because had the (u)s been made to cling to the values they footnote in their several cells, those values would have regressed to a text standing as a result, and would have repelled their data-purporting charge. Thus the sheet designer has rightly put the footnote-labels in their place – a place at a decisive remove from the numbers to which they’re keyed, in a column all their own. And yet the colons, themselves footnoted in row 322 as well, punctuate the columns of the values, albeit in otherwise unoccupied cells.

But cavils aside, what is the analyst to do with those footnote-spattered columns? How, for example, can a pivot table look past those extraneous, heartily non-contributory soft spots in the data set, and succeed in doing something manageable with the whole potpourri?

I think it can, by doing something exceedingly simple. How about naming the footnote columns? Once named, the columns advertise themselves as bona fide fields, however sparse, which can then be properly ignored by the pivot tabler who wants to study only the annual numbers in the year fields.

Simple indeed, but putting new fields into play won’t complete the assignment. You’ll still need to do something about the colons, and because they’re sprinkled among the data-toting year columns, that something won’t be our standard sort-them-to-the-bottom-and-jab-in-an-empty-row patch – a nice but futile try here, because a sort keyed to one column would still leave colons strewn about all the other columns by which you hadn’t sorted. Better here, then, might be a find and replace, in which the colons could be overwritten by zeros. Those numeric replenishers would at least restore some numeric standing to the cells, qualifying the data to run through their quantitative paces, including some grouping activity. Again, however, the zeros won’t be entirely truth-telling, as the NUT in question will be understood to have no guesting places, when in fact we don’t really know whether it does or not.
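(If you’d like to size up the job before overwriting anything, a COUNTIF will tally the colons in any given year column – with one of those columns running, I’m supposing, through D3:D320:

=COUNTIF(D3:D320,":")

Your actual ranges will differ, but the count tells you how many not-quite-zeros you’re about to mint.)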

The larger question, then, is what we do with data of the NA kind, however labelled. Another alternative here: find the colons and replace them with, say, the number 1,000,000, sort the numbers in ascending order, and hammer in the blank row right there. That is, a prudent surrender. There isn’t much else you can do with records that don’t know what’s in their fields.

I also feel obliged here to return the discussion to the field-item disquisition I’ve conducted on these pages, and the analytically advisable demotion of the year fields to item status, all rounded up beneath one, grand field.

And the plea to fairness again obliges me to begin to close the circle and track back to Eurostat’s square-one download options. Had I downloaded the data Without footnotes, I would have spread this sheet before me instead (in excerpt):

euro2

And that looks more like it, doesn’t it? But we still have those colons, don’t we?

But please – go back to breakfast/lunch/dinner. And sorry about the coffee spill  – dry cleaning should get it out. Don’t forget to bill me.

NHS Bed-Occupancy Data: I Need to Sleep On It

19 Sep

If health service provision is here, can the provision of data be far behind? That affirmatively-answered question comes to you from the UK’s National Health Service and corollary entities, and a great many of the holdings are columned and rowed, e.g.

http://www.england.nhs.uk/statistics/

http://www.hscic.gov.uk/searchcatalogue

http://www.ons.gov.uk/ons/taxonomy/index.html?nscl=Health+and+Social+Care

And among the many, you’ll find this workbook, compiling a series of counts of overnight bed occupancy numbers and rates across 226 hospitals nationwide:

 Beds-Open-Overnight-Web_File-Final-Q1-2014-15-34655

 

Now in fact the book’s Area Team by Sector sheet appears to have done much of the analytical work for us, by aggregating the totals by what it calls Area Teams from the finer-grained NHS Trust by Sector sheet alongside of it, with “sector” presumably denoting the medical rubrics beneath which the bed figures gather: General & Acute, Learning Disabilities, Maternity, and Mental Illness. Aggregation there is, but the sheet’s geography suggests that attainment was spurred by some instrument other than a pivot table, and that’s fine, of course. Where does the Constitution of your or any other country mandate pivot tables upon its citizen data crunchers?

But there could be more. Apart from reconsidering the question advanced by last week’s post, the one asking what remains to be done with data that seem to have already been put through their paces, we’re yet left with the Occupied by Specialty sheet, breaking out the bed-occupancy numbers by 76 medical treatment categories. If you want to pivot-table that one, think about these emendatory matters:

  • Blank row 17 must be deleted.
  • As must the grand totals reaching across row 16. Again, a totals data row masquerading as just another record will double any tabled results – and that’s just wrong; and affirming England in F16 as but another Org Name is likewise, and self-evidently, wrong.
  • The data in Columns B and C – Year and Period – are identical throughout and hence superfluous, but as such inflict no actual impairment on the sheet and could simply be ignored. I’d delete them, though.
  • Note the same-width speciality columns and their variably-spaced headings; the broken text wraps in the latter are a necessary consequence of the former. That’s a simple design complication, but if the concern is to promote parity among specialities through column equivalence the answers are none too forthcoming. A scaled-down font would ultimately achieve a universal heading fit, but will take its toll on legibility. Dilating columns to about 22 from their current 12 would smooth out the wrap-raised bumps, but users will have to do that much more scrolling across the sheet before they’ll be able to see it all.
  • On a deeper level (and we’ve been here before, too): the specialities, each of which has been allotted a field all its own, should have been demoted to item status and made to stand beneath one all-subsuming field. Given the plurality of specialties it would be useful to be able to tick each one up and down a single Slicer – but you can’t, because Slicers slice through one field at a time:

nhs1

76 specialities? You’ll need 76 Slicers.

And the transformative Alt-D, P recombination of fields into the desirable data set I detailed in my August 29, 2013 post won’t work here, because Alt-D, P will err on the side of a hyper-inclusiveness, investing only the left-most Year field with Row Label status and relegating all the other labelled fields into item status along with the specialties – and that won’t work. You don’t want Org Name to liken itself to Ophthalmology; one’s a bona fide Row/Column parameter, the other a proper Values-sited datum. Try it, if you don’t see what I mean.

  • Click on any cell among the specialty data and it turns out the bulk of them drag a long train of decimals behind or before them. It is clear these numbers must have emerged from some spell of formulaic activity – they’re averages, after all, and they’ve simply been treated to a bit of presentational sprucing by rounding themselves off – kind of. Now click in any number-sporting cell and observe the format – Custom. Check into Number Format and you’ll see

nhs2

 

Now I have neither used nor understood that format yet, and I need to take it into the shop and probe its whys and wherefores, assuming it’s been used with all due number-managerial intent. But click in a number in the Area Team by Sector sheet and its number format reports Accounting instead. I can’t account (no pun intended, but let’s go with it) for the disparity here, but it matters, and it matters in this way: remember our numbers are really extended affairs, pulling across all those decimals, and if you Accounting-format them, throw them into a pivot table and group them, they’ll look something like this:

nhs3

That is, the decimals overthrow their Accounting round-offs and burst into the pivot table as is, and it isn’t a pretty sight, and it can’t be reformatted, because Row Labels are labels and numeric-format resistant. But impose the Custom format above upon a set of values and they obediently group, as they do in Occupied by Specialty e.g.

nhs4

 

But for other numbers in the workbook, Accounting-formatted grouping does happen, and for some other Custom-formatted values it doesn’t. I’m duly puzzled; there has to be an explainable take on these several outcomes, but I don’t have it yet, and it requires a think-through; hope to get back to you on it shortly. But I’m on vacation now, and they tell me if I don’t take it now I can’t roll it over. Hope to see you in about two weeks.

Time (and Record) Management: OECD Data

25 Jul

The OECD clearly has a lot of time on its hands – after all, think how long it takes to say The Organisation for Economic Co-operation and Development. But time likewise figures in its research remit, here keying a survey of the time management habits of denizens of its member nations (at least most of them, along with representations from China, India, and South Africa, designated OECD partners), and recapitulated here:

OECD_1564_TUSupdatePortal

 

The data – and it’s the Total sheet I’m reviewing – spread out some neat cross-national takes on how respondents apportion their daily round, normalized for each nation’s activities to the day’s 1440 minutes (and I assume normalization means reconciling the inter-national variance in activity definitions; see the Activity category tab). You’ll be interested to know, for example, that South Africans appear to be able to devote 562 minutes a day or night – nine hours and 22 minutes – to sleeping, and that the Japanese and Norwegians, earmarking but 463 minutes apiece to that circadian necessity, seem to be drinking the most coffee. It is also noteworthy that respondents from Turkey – the only preponderantly Muslim country in the national roster – spend the most time in religious activity (row 32), at 19 minutes (South Africans come in one minute behind). It’s all interesting, but I can’t resist the geekish aside to the effect that the row 32 numbers are variously positioned – some middle-aligned (along the vertical axis), others right-aligned (and you might very well also want to pump up the 8-point text prevailing throughout).

But at the same time the numbers urge a familiar question upon the plenary: For whom is the spreadsheet intended – a public of passive, if interested, readers, or that A-Team of caffeinated deep thinkers who want nothing better than to do something with the data? If you’re siding with the latter insurgents, and want to act upon what it is you’re viewing here with your canteen of slicers (and Slicers) and dicers, then the data – and we’ve seen this before – have to be reshaped. That’s a question we’ve asked in so many words before, and it requires asking here as well.

Let’s see. For starters, we need to do something about row 3, harboring as it does survey date identifiers impersonating a row of data. And a similar reprisal need be practiced upon those purple subtotal rows, which should be deleted. Leaving them in place and proceeding to drop the data as they stand into a pivot table will do nothing less than double any summed results, and yes, we’ve seen this before (e.g., my January 24, 2013 post).

On the other hand stripping those purple rows will also deprive the sheet of their associated headings, e.g., Paid Work or Study, Unpaid Work. But again, rows of data should be of a piece; insinuating rows of titles into the average daily minutes is tantamount to playing checkers, when in fact we’re playing chess. Moreover, the Men & Wom field in the A column sufficiently identifies the activities subject to the minutes-per-day estimates. But there’s more to be said about this.

But before we expound the larger issue betokened above, swing over to column AC and its swath of computations averaging the numbers lining up to their immediate left. Try cell AC4:

ti1

And try explaining it. Note all the references in the expression are alphabetically contiguous; and given its cellular continuity how do we defend the formulation above, when we have

=AVERAGE(C4:AB4)

at our disposal? I can’t answer my own question.

And indeed – those averages ignore the three partner nations on the other side of AC, thus plunking the wrong kind of data amid the time-estimate numbers. And the blank AD column has to go.

But there’s that larger issue, reprising an earlier question, the one I asked and attempted to resolve in my August 22, 2013 post. The Total sheet confers field status upon tufts of data that should more rightly conceive of themselves as items relenting to a larger, governing field. Thus, for example, the field-defined member countries in the sheet should be made to submit to the controlling aegis of a Country field. It seems to me, then, that a reconstructed data set would wheel out records looking something like this:

ti2

Again, umbrella fields on the order of Gender or Country are far mightier enablers of the kinds of grouping and ordering feats that pivot tables perform, and well justify the necessary surcharge the additional data entry would exact (again, turn back to August 22).

For example, the data reforms I’m (again) proposing would free a pivot tabler to rank national time outlays by any activity, by grabbing the Rank Largest to Smallest marble from the Show Values As bag of alternatives. The point is that Rank Largest to Smallest can do what it does only among items planted in the same field – and not between data settled in different ones.

So if you’re wearing the A-Team colors, why not reorganize the data here and shop it to the OECD – if you have the time?

My Little Town: London Real Estate Data

3 Apr

Location, location, location; and no, it’s not the real estate that’s got me waxing bromidic, it’s the spreadsheets, of course, and those recurring second thoughts about how their data are conveyed, and exactly how and where they’re…located.

Recurring, indeed, at least on these WordPressed pages; and apropos that perennial restiveness let us pack up and move to the Average House Prices Borough workbook on offer at the data.london.gov.uk site, and learn a little about that city’s combustible and ever-newsworthy real estate sector, and its no-less-combustible spreadsheet. It’s coming to your neighbourhood here:

 Average-house-prices-borough

The data here clamber about eleven scarily similar sheets, strewing price and sales volume numbers across the 33 London boroughs (they’re counting the interstitial City of London as a borough here). And while the data are readable, are they usable? That is, can you do something with them?

A loaded question, replete with half-full answers, because of course the answer depends. Take, for example, the identically-structured Median and Mean Annual sheets, asking after their receptivity to pivot tabling. My answer: iffy. The sheets, you will note, drape the data in year-per-field mode, e.g.:

 prop1

And that peculiar positioning, as I took some pains to urge upon you in my August 22 and 29, 2013 posts, clogs the pivot tabling enterprise, and so if you want to effect the kind of data transposition I described there (but won’t belabor) here, you’ll need to:

  • Delete the vacant row 2.
  • Consider this: because the restructuring espoused in the August posts trisects the data into Row, Column, and Value fields only (which in turn are cast into a three-legged, or three-field, Excel Table), and because it is invariably the leftmost field in a data set that resets itself in Row terms, the Area field text data will assimilate into the Column field, even though its borough names are not of a piece with the numeric year data. What you could do, then, is simply delete the not-terribly-meaningful Code field, thus vesting Area – now the leftmost of the fields – with Row status.
  • Earmark only A1:S34 for the pivot-table-eligible makeover, that is, just the London borough-identifying rows. The summatory data beneath row 34 communicate a different kind of data.

You could do much the same with the Mean Quarterly sheet, but its double-headed fields – Q1, Q2, etc., supervened by the year-referencing 2010, 2011, etc. in the merged cells in row 1 – won’t work. Delete row 1 and rewrite the quarterly heads to a differentiating Q12010, Q22010, and so on. (You’d need to do something similar with the Upper Quartile Quarterly, which for all the world sounds like a scholarly journal. Here, though, you’d have to pad the 1,2,3,4 quarterly headers with the year data instead.)

It’s also occurred to me that once you’ve submitted the data to the above process you could march on and develop pivot-table breakouts along Inner-Outer London lines, as per the aggregated Inner-Outer mentions in the lower reaches of most of the sheets, which won’t appear in the pivot tables we’ve commended above. (For a detailing of the Inner-Outer borough opposition see this piece. There is, by the way, a dispute about where the borough of Newham goes.) So once you’ve forced the data into the Row, Column, Value triumvirate you can then add a fourth field (say, called I/O, and that’s not a computer throughput allusion) stocked with borough Inner-Outer identifiers.

To head in that direction you first need to hammer together a lookup table, the data for which I’ve carried off from this bit of real estate:

http://www.londoncouncils.gov.uk/londonfacts/londonlocalgovernment/londonboroughs.htm

 

Camden I
Greenwich I
Hackney I
Hammersmith and Fulham I
Islington I
Royal Borough of Kensington and Chelsea I
Lambeth I
Lewisham I
Southwark I
Tower Hamlets I
Wandsworth I
Westminster I
City of London I
Barking and Dagenham O
Barnet O
Bexley O
Brent O
Bromley O
Croydon O
Ealing O
Enfield O
Haringey O
Harrow O
Havering O
Hillingdon O
Hounslow O
Kingston upon Thames O
Merton O
Newham O
Redbridge O
Richmond upon Thames O
Sutton O
Waltham Forest O

 

You’ll then need to edit the borough names here so that they correspond to those in the workbook, e.g., trim Royal Borough of Kensington and Chelsea, and keep an eye on Hammersmith and Fulham (best practice: copy one instance of each of those two from the Row data into the lookup table). Copy the table somewhere into the workbook, and call it, say, InnerOuter. Then once you’ve put Row, Column, and Value in place, say with the Median Annual data, write this formula in D2 (after naming this new table column I/O):

=VLOOKUP([@Row],innerouter,2,FALSE)

Tables being tables, the formula automatically writes itself down the D column (the [@Row] syntactical bit is how Tables express a cell upon which you’ve clicked, instead of the standard A2, for example. They’re called structured references, and aren’t quite a day at the beach; see a Microsoft discussion here).
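(And because any borough name that doesn’t exactly match its lookup-table twin will bounce back an #N/A, a more cautious variant – my embellishment, not a requirement – wraps the lookup in IFERROR:

=IFERROR(VLOOKUP([@Row],innerouter,2,FALSE),"check spelling")

so that the mismatches announce themselves in words rather than error values.)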

Once you’ve fused the data thusly, e.g.:

prop2

(the formatting is basic Table defaulting at work)

You can pivot table this take, for one:

Row Labels: Column

Column Labels: I/O

Values: Value (Average, and formatted as you see here)

 prop3

Remember those are median house prices cross-tabbed by year and Inner/Outer values.

I live in Outer London, by the way. Is that why Outer prices are so much lower?