Excel Live in London, Nov 3-24: Yours Truly On the Bill

21 Oct

You write enough posts, they give you a free ad. Here’s mine: Your correspondent is scheduled to give a quartet of Excel courses in London on consecutive Monday evenings, November 3-24, to be exact, at the offices of the Guardian newspaper near the Kings Cross station. The topics:

• Pivot Tables
• Formulas and Functions
• Visual enhancements (e.g., charts, sparklines, conditional formats)
• Grabbing and Grooming the Numbers (Text to Columns, Data Validation, removing duplicates, issues in spreadsheet design, etc.)

Each offering can be taken singly or in discounted group mode.

To learn more, see http://www.theguardian.com/guardian-masterclasses/the-essentials-of-excel-abbott-katz-digital-course

Handouts autographed upon request.

Mystery Solved, II: A Crackerjack Sequel

8 Oct

Didn’t I tell you Debra Dalgleish is a pivot table expert? You’ll recall I prevailed upon her expertise, in last week’s nail-biter of an installment about grouped pivot table data whose tousled decimal points would not be rounded.

And you’ll further recall my self-satisfying workaround: importing a faux, zero-bearing record into the data set, a supplement that convinced the numbers above it to submit to a rounding at last.

That patch did what it set out to do, but at a cost: the empty zero could bias record counts, averages, minimums and the like, and so my Excalibur exhibits a bit of rust.

And that’s where Debra Dalgleish comes in. In an email communique beamed all the way from Canada she reported this spot-on resolution:

Once the grouping sequence is set in motion and you get here, e.g.


Once there now simply alter the Starting at number to the degree of precision you need, for example:


Or eliminate all the decimals if you wish. Then click OK. The grouping commences, and hews to your decimal significance:


Tres cool. Note that the tweak need only recalibrate the Starting at field, and happily swerves past the Ending at junction. All due kudos to Debra for the superior tack, one which decisively obviates the need to contrive my counterfeit zero record, and which smuggles a meaningful round-off into the grouping process at just the right point.

That last allusion reminds us about the essential, petrified character of grouped pivot table numbers. They’re typically emplaced in the Row Labels area because they’re…labels, in spite of their indubitably numeric inception. After all, a grouped figure such as


can’t really be anything but. We would normally hold this truth to be of the self-evident character kind – namely, that a hyphenated phone number is not a quantitative thing – but again, because grouped numbers are numbers one needs to appreciate the loss in translation once a grouping is thrust upon them.

Of course again if the data happen to boast a zero or two, the data will group nicely all by themselves as a matter of course, and as such won’t need to call up the Starting At intervention. But Debra Dalgliesh’s tip should work all the time.

And it’s holiday time for me again. Sorry – I’ve been working nights to earn this vacation.

Mystery Solved, I Think

3 Oct

Mystification has its place. As an antidotal corrective to haughtiness and unbecoming outbreaks of self-confidence, an anomic stumble or two through the data can work wonders for the reality checker within.

Beware, my brother’s philosophy professor warned, of people who are sure of themselves; and sureness gets taken for a ride through the Beds Open Overnight workbook that scratched at my head in the previous post and across the last couple of weeks.

The problems, again, are those intermittently cooperative pivot tables we had put to the task of grouping various worksheet data, some reporting the data in fastidiously rounded-off integers, and some in tranches spewing rancid jets of decimals, like this:


These several outcomes appeared to perhaps have be geared to respective formats, with the excesses above perpetrated by the garden-variety Accounting mode, and the prim, decimal-less groupings tonsured by some exotic custom format, e.g.


But, if I may anticipate your question, when I painted the custom format atop the numbers suited in Accounting, the decimals yet refused to go away; nothing changed. And that’s mystifying.

In search of enlightenment I retold the mystery to Michelle Tenwick, the contact person making herself thoughtfully available right there in the NHS Trust by Sector sheet. She replied, in part:

“Here the variations seen in the formatting of the data fields are as mentioned so that they are consistent with previous publications. Most importantly, the figures remain the same regardless of whether they are custom or accountancy format.”

That’s all true, but not quite explanatory, and indeed – why would the sheets have been troubled to assume different numeric formats to begin with?

I then put the matter before Debra Dalgleish, a bona fide pivot table expert who’s authored three books on the feature. She offered that because some of the data cells reveal hyphens in lieu of numbers – apparently standing in for 0, that would cause Excel to treat the entire field in Text terms, thus resisting any efforts at rounding. Perhaps, but when I drew up a demo data set comprising nothing but decimal-freighted numbers, the rounding off would not happen.

I was thus left to speculate that the operative formattings on the sheets – be they Custom or Accounting – have nothing to do with the problem. My suspicion – though remember, I’m officially mystified – was that the data themselves had been put through some sort of ringer, perhaps via a download from some other, presently unknown source or sources (note Tenwick’s note about consistency with past publications). I should add that my past experience with the grouping/rounding issue has instructed me that decimal-retention is the default, and so it’s the data here that are amenable to the rounding that pose the question.

But then I had an idea, one which returns me, via the back door, to Debra Dalgleish’s thought about the hyphenated entries, viewed this time on its head. Remember that in fact the hyphens aren’t text – they’re numeric zeros, formatted as an Accountant might (keep the jokes to yourself). It seems, after having subjected the data to a bit of controlled experimentation, that as long as the data has at least one zero, however it’s formatted, the grouping will consent to a rounding. It may be that the intervallic intelligence of the grouping feature requires the grounding of a zero somewhere in the data, thus establishing a floor atop which the intervals can be built.


But whatever the reasoning, a data set seems to call for at least one zero, at least if you like your grouped numbers all smoothed and polished and rid of those scruffy decimals. But what if your data have no zero? You could throw in a dummy, zero-valued record, but that intention would have to be thought through, particularly if you’re counting and/or averaging the data.

And so if the data are zero-less you may also alternatively want, or need, to round the numbers in a user-introduced column in the data set with the ROUND, or FLOOR, or CEILING functions, a matter I’ve already broached here:


And that link points to wider pivot table strategy, an open secret of the enterprise: the trick is to do as much preparatory work in the data set before it washes up onto the pivot table. For all its aggregating might, pivot tables won’t do everything, and a little data-organizational proactivity can help the table along.

And that’s because the pivot table Group Selection command is zero-tolerant.

NHS Bed-Occupancy Data: I Need to Sleep On It

19 Sep

If health service provision is here, can the provision of data be far behind? That affirmatively-answered question comes to you from the UK’s National Health Service and corollary entities, and a great many of the holdings are columned and rowed, e.g.




And among the many, you’ll find this workbook, compiling a series of counts of overnight bed occupancy numbers and rates across 226 hospitals nationwide:



Now in fact the book’s Area Team by Sector sheet appears to have done much of the analytical work for us, by aggregating the totals by what it calls Area Teams from the finer-grained NHS Trust by Sector sheet alongside of it, with “sector” presumably denoting the medical rubrics beneath which the bed figures gather: General & Acute, Learning Disabilities, Maternity, and Mental Illness. Aggregation there is, but the sheet’s geography suggests that attainment was spurred by some instrument other than a pivot table, and that’s fine, of course. Where does the Constitution of your or any other country mandate pivot tables upon its citizen data crunchers?

But there could be more. Apart from reconsidering the question advanced by last week’s post, the one asking what remains to be done with data that seem to have already been put through their paces, we’re yet left with the Occupied by Specialty sheet, breaking out the bed-occupancy numbers by 76 medical treatment categories. If you want to pivot-table that one, think about these emendatory matters:

  • Blank row 17 must be deleted.
  • As must the grand totals reaching across row16. Again, a totals data row masquerading as just another record will double any tabled results – and that’s just wrong; and affirming England in F16 as but another Org Name is likewise and self-evidently wrong, too.
  • The data in Columns B and C – Year and Period – are identical throughout and hence superfluous, but as such inflict no actual impairment on the sheet and could simply be ignored. I’d delete them, though.
  • Note the same-width speciality columns and their variably-spaced headings; the broken text wraps in the latter is a necessary consequence of the former. That’s a simple design complication, but if the concern is to promote parity among specialities through column equivalence the answers are none too forthcoming. A Scaled-down font would ultimately achieve a universatl heading fit, but will take its toll on legibility. Dilating columns to about 22 from their current 12 would smooth out the wrap-raised bumps, but users will have to do that much more scrolling across the sheet before they’ll be able to see it all.
  • On a deeper level (and we’ve been here before, too): the specialities, each of which has been allotted a field all its own, should have been demoted to item status and made to stand beneath one all-subsuming field. Given the plurality of specialties it would be useful to be able to tick each one up and down a single Slicer – but you can’t, because Slicers slice through one field at a time:


76 specialities? You’ll need 76 Slicers.

And the transformative Alt-D, P recombination of fields into the desirable data set I detailed in my August 29, 2013 post won’t work here, because Alt-D, P will err on the side of a hyper-inclusiveness, investing only the left-most Year field with Row Label status and relegating all the other labelled fields into item status along with the specialties – and that won’t work. You don’t want Org Name to liken itself to Ophthalmology; one’s a bona fine Row/Column parameter, the other a proper Values-sited datum. Try it, if you don’t see what I mean.

  • Click on any cell among the specialty data and it turns out the bulk of them drag a long train of decimals behind or before them. It is clear these numbers must have emerged from some spell of formulaic activity – they’re averages, after all, and they’ve simply been treated to a bit of presentational sprucing by rounding themselves off – kind of. Now click in any number-sporting cell and observe the format – Custom. Check into Number Format and you’ll see



Now I have neither used nor understand that format yet, and I need to take it into the shop and probe its whys and wherefores, assuming it’s been used with all due number-managerial intent. But click in a number in the Area Team by Sector sheet and its number format reports Accounting instead I can’t account (no pun intended, but let’s go with it) for the disparity here, but it matters, and it matters in this way: Remember our numbers are really extended affairs, pulling across all those decimals, and if you Accounting-format them, throw them into a pivot table and group them, they’ll look something like this:


That is, the decimals overthrow their Accounting round-offs and burst into the pivot table as is, and it isn’t a pretty sight, and it can’t be reformatted, because Row Labels are labels and numeric-format resistant. But impose the Custom format above upon a set of values and they obediently group, as they do in Occupied by Specialty e.g.



But for other numbers in the workbook, Accounting-formatted grouping does happen, and for some other Custom-formatted values. it doesn’t. I’m duly puzzled, but has to be an explainable take on these several outcomes, but I don’t have it yet, and it requires a think-through; hope to get back to you on it shortly. But I’m on vacation now, and they tell me if I don’t take it now I can’t roll it over. Hope to see you in about two weeks.

Class Attendance, Part 2: British Elites and Their Schools

12 Sep

Now that we’ve righted those faux schooling percentages – that is, the ranges of school-attendance-reporting, whole numbers posing as fractions of 1 – of the achievers enrolled in the Elite Britain study’s spreadsheet –a few additional nuts and bolts need to be tightened in turn (and it goes without saying that the rectifications described in last week’s post work for the Profession by school type sheet in the workbook, too).

First, since we’ve moved past the percent misrepresentations, the field-heading (%) indicators dotting columns E through H need to go. Second, I’m not crazy about the Percentage known field; its informational contribution is trifling, because if necessary those percentages could be derived (Population known/Population size, after all), and a little design parsimony won’t impoverish the narrative.

One noteworthy stat suggests itself: the proportion of university attendees who walked through the ivied doors of Russell Group institutions. These “leading” schools – 24 in number (out of a couple of hundred in the country all told; how the global count tallies depends on designations of institutes and the like) – seemed to have commanded the lion’s share of elites’ tuitions; by simply dividing the summed Russell Group by Any University fields we arrive at 67.55% of all those in the survey who plied some manner of higher education, correlative evidence of an “elite” clustering. On the other hand, of course, correlations don’t prove a relation of necessity between A and B.  In theory, at least, a modestly-backgrounded young person could rightfully earn a place at an estimable school, an attainment which by itself would not serve to clinch the “elite” thesis. Speaking in the abstract, a duly deserved seat somewhere in the Russell Group would simply and only validate the achiever’s just desert. An equitably framed reward system that succeeds in promoting the best and brightest and assembling these in better schools (however understood) could be said to be doing nothing but its job; but of course, our data can’t by themselves point to that ideal.  Sociological robustness would call for research traveling back to the square ones – the economic class inceptions of the 4,200 biographies numbered here, before untrammelled merit could be properly ushered into the equation.

Of course, a purely methodological question jabs at all of the above: the disproportions of individual profession counts yanking that 67.55% in…proportion to their numbers. Residents of the house of Lords and Parliament contribute 31% of the survey complement, but in fact their Russell Group/Any university figure of 70% doesn’t badly skew the overall ratio. And if instead we divide all Russell Groupers by the entire Population known, the aggregate 55.85% (that’s sans the House numbers) compares most comparably with the Houses’  57.73%; not much difference there at all.

And a final consideration might qualify as something of a non-finding, notable precisely in virtue of its apparent uneventfulness. As we have them, the data don’t seem, to fall back upon a perfectly dreadful modifier, particularly actionable.  If anything, they resemble a pivot table, fait accompli – broken out, kind of, into their fields and items (and that’s a nasty equivocation to be sure) in a kind of done-deal finality. As such one could wonder where the data could be taken next, and I’m not quite prepared to signal a direction. Is there such a thing as a spreadsheet that has nothing, or nothing else, to teach us? Maybe, but only maybe, and the matter isn’t wholly captive to a smallness in the numbers that might stunt the analysis. One of the issues here asks after the prospects for aggregation that the data hold out, as the presently stand. One could, for example, perhaps dispatch a pivot table to group the professions and the associated numbers by their respective sizes, but I’m not sure what profit would accumulate thereby. Do university percentages respond to sector largeness or smallness? You’d have to look and hard before you’d put your investigative standing behind anything other than the null hypothesis here.

What might work is a coding scheme that would align sectors by some shared nature, and let the school numbers fan out along accordingly; but that plan too would have to be preceded by some deep thinking.

So here, then, is one of the larger questions: take a pivot table, the product of some concerted, reasonably well-devised intention, and treat it as an initial data set. Where do you take it next?

I think that’s a fair question – at least a reasonably well-devised one.

Class Attendance: British Elites and Their Schools, Part 1

5 Sep

Who put the higher in hierarchy? I don’t know, but the reality of class and its near-ubiquitous sidekick, the ruling elite(s), have been giving sociologists something to write about for the past 150 years or so. And if the definition-mongers can’t decide what class really means, the very tenacity of the debate has entrenched the term deeply in the discourse.

And if the stereotypes have it even half-right, nowhere does the beguilements of class charm its masses so seductively as in  England; and a recent Guardian spreadsheet, its data ported from a study by the Social Mobility and Child Poverty Commission, advances education as a proxy attestation for the imponderable class. Here it is:

 Background of different professions

The workbook’s two sheets plant their feet on two educational rungs – (pre-university) school and university types – and see how their distributions break across a set of moving-and-shaking professions (which aren’t identical in the sheets, by the way). The samples are small, but the universes from which they’re drawn are transparently described in the Elitist Britain methodology addendum, likewise available at the above link (as is the Methodology – Annex spreadsheet, which actually names the individuals surveyed). And the samples are largely just that – fractions of the smallish universe pools, for example, BBC Executives, clarified by the addendum as “Senior BBC Executives  on the BBC website for transparency purposes”. The spreadsheet numbers 187 of the functionaries, but the Profession by school type tab finds actual information for but 93 of them (and 125 on the Professional by higher education sheet).

At least it’s all up-front, and the workbook means to put an evidentiary face on British elitism by tracking the higher achievers on its radar and their schools.  And once you get past the perennial column auto-fits and confront the Profession by higher education figures, the point is more-or-less made. 59% of the Cabinet members for which the data obtain, for example, were Oxbridge attendees, with 77% having found their way into a Russell Group school (the Group is a consortium of “better” universities). Still, you’ll find interesting if expectable variation, e.g. the 0% Oxbridge representation among pop stars (defined as “UK artists who had one of the top 40 selling albums of 2010, 2011, 2012 or 2013″), although 27% of these did misspend their youth at a Russell Group school (although it also appears that attendance at, rather than graduation from, is counted here).

Again note the Percentage known data are exactly that, proportions of members of the sample for whom educational data have made themselves available (with the rather curious exception – in both sheets – of Select Committee Chairs, entered as 85, and not .85. Simplest tack, here: replace with .85 and commit to the Percentage format) . And the attending fields purport to convey attendance data in percentage terms, too – but they don’t.  They’re rather delivered in whole-number terms, and the 91 in cell E2 in the Profession by higher education sheet, for instance, won’t register as the putative .91 but as a number 100 times greater instead; and if you want to work with numbers in lieu of the percentages – and see say, a 114 in E2 (a rounded 91% of the 125 BBC execs for whom information availed) – well, then, you can’t multiply E2 by C2, because you’ll get 11,375. Now the remedial spreadsheet math isn’t difficult by itself, but it has to be done somewhere; and here’s what I’d do.

In a blank cell – I’m working in J2, for a reason – enter $C2/100 and copy it (yes, I do plan on explaining). Then select E2:H28, the range of to-be-percentaged numbers, and click Paste > Paste Special > Multiply:


 And OK.

And here’s what’s happened in view, and in review, of our intention: Remember we want the 91 – that ostensible percentage – in E2 to culminate in a 114 – which, after all, is 91% of its denominator, the 125 BBC executives for whom university-attendance data are on hand. Thus we want in effect to be able to multiply 125 by .91, and the strategy we’ve blueprinted here makes its first move by dividing the number in C2 – that 125 – by 100; the resulting 1.25 is then to be multiplied by the 91 in E2, realizing that 113.75 for starters. What the Paste Special orchestration does then, is copy and paste $C2/100 atop E2:H28 – but instead of the standard-issue paste, it multiplies every value in the range by its respective C-columned value. Thus the Pop star, Russell-Grouped 27% percent –  again, really the number 27 as it first appears –is multiplied by the 72 Stars in C14 for whom the educational information is available, and 72/100*27 gets you 19.44 actual Pop stars – and if you’re wondering about that .44 of a Star, hang on.

Now it’s true – an odd, unsettlingly circular cast tinctures the Paste Special Multiply routine, pasting as it does a formula (and it is a formula – don’t look for pasted values this time) that names the selfsame cell into which it’s pasted, but that’s how it works (as do some formulas imparting conditional formats, by the way), and because it works you’re spared the conventional way around the whole problem:  making yourself at home in I2 (in the first unoccupied column at hand) and entering


thereby recalling the aforementioned 113.75, copying the above across I2:L28, and finally aiming a Copy > Paste Special > Values back onto E2:H28, and deleting the now-duplicative I2:L28.

And now for that .44 and kindred decimals. Fact-checkers tend to discourage sentences reporting that “113.75 BBC executives attended some university”, although I can think of quite a few pop stars whose talent quotients fail to rise to integer level. What we need here is a numerically real, as opposed to a format-dappled round-off, and if we want to get there without incurring still more formulaic overhead there’s a venerable, but faintly recognized, option in there waiting for our call.


First, select the now-heavily-decimaled E2:H28 and format the data to zero decimals. But again, that refinement merely streamlines the numbers’ display, and doesn’t streamline their values. But watch this: click File > Options (remember – I’m in the 2010 release) > Advanced > and scroll here:


There it is – Set precision as displayed – one of the only, if not the only, format-motored command that actually re-quantifies the value of a number as per its present onscreen appearance. Format the decimals in 113.75 away until you see 114, turn Set precision as displayed on, and what’s left is a real 114. Multiply it by 2, for example, and you’ll get an authentic, usable, 228, not 227.5.

A last spreadsheet-grammatical point here: as parenthetically observed above, the Paste Special > Multiply script pastes formulas, not hard-coded values, into its targeted cells, e.g.


But this seems to work, too:


And indeed, this very matter happened to have found its way into last week’s post. Any thoughts about those Excel-spawned parentheses?


English-Wales Name Data, Part 2: Sum Interesting Formulas

28 Aug

My bookmark tucked its ribbon here:


Last week’s post, you recall, paused at the above teaser, a solitary, unexplained but efficacious formula for coaxing a weighted-average length for 2013’s top 100 boys’ and girls’ birth names from the data. Now I need to reward your patience by explaining.

A weighted average, after all, insists that a proper, proportioned due be paid every name’s incidence across the birth cohort, and the formula in our inbox appears to due just that. It enlists the SUMPRODUCT function to the task, a predominantly simple mechanism that in the first instance works this way:

Suppose I’m presented with this range of transactions in A1:C5:

(It’s clear I have no idea what these comestibles really cost.) What I want to know, of course, is what the bill comes to in toto, and the standard means toward that end for is pretty evident: enter a bank of formulas, each multiplying every Unit Sold by its allied Cost per unit, and add the four purchases.  But SUMPRODUCT lets us unify the process:


That is: simply enter the paired ranges, or arrays, as they’re known in these parts, and SUMPRODUCT does the rest – it multiplies each Units Sold by its collinear Cost per unit – B2 by C2, B3 by C3, etc. and then sums them all to boot.

But what are we hoping to do? Again, we’re aspiring to learn the weighted average length of all the birth names in our workbook, and SUMPRODUCT will consummate that hope – provided we do a bit of a syntactical rethink.

Observe that nowhere among our five extant England-Wales-name workbook fields will you find name length data, and we even politely declined the offer of the suitable, supplementary Length field that was tendered in the previous post. Again, we want to see if we can figure average name length without a length field’s well-meaning intercession.

So get back to our bookmarked SUMPRODUCT, which is, and is not, essentially coterminous with the unassuming apples and oranges computation. You’ll note first of all that both SUMPRODUCTS reference two arrays, but our average-length-finder couches the first of these with a LEN, one that stuffs a 402-row range inside its bulging parentheses, even as our previous meet-ups with LEN in previous posts sized the contents of but one cell per formula. And the very fact that we can grant LEN this superimposition atop an array reference at all is, to me at least, imperfectly obvious. But you can.  But we also need to understand in turn that SUMPRODUCT is an off-the-shelf embodiment of an array formula, a class of expressions that power through multiple calculations in but one go, a capability that in fact was made clear by the more rudimentary fruit-and-veg example, when you think about it – and you should (for an instructive introduction to array formulas look here).

Thus the LEN(D2:D403) excerpt measures the length of each name in the D column – and then multiplies each length by the corresponding number of children bearing that name (reported in the E column) – and then adds all those results…and divides that grand total by the sum of all the births. And that’s a weighted average. Here’s the formula again:


Thus the calculated length of D2 is multiplied by the associated number of births in E2, as are the like values in D3 and E3, and so on, all the way down to D403 and E403. These are then all added, turning out an aggregate length in characters. That number – which happens to be 2,238,510 – next submits to the divisor comprising the number of all births – 343,022; and 2,238,510/343,022 confers a weighted name-length average of 6.25285. Got that?

Of course it makes sense – it has to, or else let me be the first signatory to your class action suit against Microsoft. But array formulas do require a think, because the kind of syntactical sense they espouse doesn’t always tally with standard user expectations.

But I for one was perplexed by a more generic thought-provoker, one that brought me back in touch with that ancient mathematical fundament, the order of operations. It turns out, for example, that the SUMPRODUCT under our microscope could have been alternatively written this way:


Look closely, because that variant hasn’t slithered out of a photocopier. It is different from our earlier take, in virtue of the parenthesis (or bracket, depending on your IP address) right-flanking the E2:E403 segment before the divisor symbol. Here, then




are regarded of a piece, and indeed, the entire SUMPRODUCT formula ends there:




and the /SUM(E2:E403) simply tacks itself discretely to the above taking the SUMPRODUCT result further, but from the outside, as it were.


But everything you see in our original SUMPRODUCT belongs to it; nothing there is external to the function as it is written. Here, SUM(E2:E403) appears to button itself to E2:E403 alone, and then divide it; but apart from the fact that that fraction doesn’t seem to make much sense, it isn’t what’s happening anyway. Consider the difference between


=6*(4/2) and =(6*4)/2


Guess what – there is none. Either way you get 12.


P.S. If you’re a name devotee, the US Social Security site Popular Baby Names page issues a free pass to an enormous holding of birth name data dating back to 1883 in easy-to-convert text file mode, apparently listing every first name given to at least five babies each year. And they have state-by-state breakouts, too.


Get every new post delivered to your Inbox.

Join 163 other followers