
Team USA Stats, Part 2: Some Data Gymnastics

4 Sep

You can’t tell the players without a scorecard, they’ll tell you in the States, and you can’t tell the data without the formulas.

You’ve heard more memorable pronouncements than that opener, I’ll grant, but that less-than-bromidic avowal above makes sense. We saw in the last post how the Team USA height, weight, and birth state data threw more than a few curves at us, and baseball isn’t even an Olympic sport any more (it’s coming back in 2020, though); fail to straighten out the data curves and your analysis will be straitened.

Now that line’s a few sights more memorable, but I do go on. Our next round of data inspection takes us through the DOB (date of birth) field, and we want to get this one right, too. Quality control here starts by aiming a COUNT function at the rows J2:J559 that populate DOB. Since dates are numbers, and since COUNT only acknowledges a range’s duly numeric data, we should aspire here to a count of 558, or one date per cell. But my COUNT totals 520, a shortfall that exposes 38 dates manqué that, facades aside, are nothing but text.

Our COUNT is all the more incisive in view of the fact that every entry in the J column really does look like an unassailable date. But now that we’ve made ourselves aware of the turncoats among the loyal numerics, we’d do well, if possible, to rehabilitate them into the real things too. Here’s a pretty painless way: commandeer the next free column (it could be S, if you’ve left the formulas from last week’s exercises in R alone; if so, you’ll have to format S in Date terms), title it Birth Date, and enter, in row 2:

=IF(ISTEXT(J2),DATEVALUE(J2),J2)

And copy it down. The formula asks if the entry in J2 is textual. If it is, the DATEVALUE function – a rather useful transformative means – turns pure text such as 12/14/1989, when that expression has been formatted as text, into 12/14/1989 the date, numeric value and all. If the entry in J is an authentic date, on the other hand, the formula simply returns the cell entry as is.

Surprise – my eminently sensible tip doesn’t always work. Copy the formula and you’ll be treated to nine #VALUE!-laden cells; a first review of these implicates an old nemesis – a superfluous space in the source field, e.g.:

olyp1

DATEVALUE can’t handle that textual intrusion, and so this refinement:

=IF(ISTEXT(J4),DATEVALUE(TRIM(J4)),J4)

Looks right, because TRIM’s job is to make superfluous spaces unwelcome.

But that revision doesn’t work either. Puzzled but intrigued, I went for the next move: a copying of one of the problem entries in J into Word, wherein I turned on the Show/Hide feature (in Home > Paragraph) that uncovers normally unseen codes. I saw:

olyp6

Look closely and you’ll detect that special character to the immediate left of Word’s paragraph symbol. That Lilliputian circle appears to signal a non-breaking space, and apart from any conjecture about what it’s doing there and why, we know one thing: it isn’t a superfluous space, and thus won’t be trimmed.

Thus again, the fix here may be short on elegance but long on common sense: edit the circle out of the nine problem entries in J (in fact, some of the cells require two taps of Backspace to rid them of their codes; and if you sort the dates Largest to Smallest, after first clicking on an actual date, all the #VALUE! errors will cluster at the bottom).
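An alternative, by the way, for those who’d rather let a formula do the surgery – assuming the intruder really is the standard non-breaking space, whose character code is 160 – is to fold a SUBSTITUTE into the original expression:

=IF(ISTEXT(J2),DATEVALUE(TRIM(SUBSTITUTE(J2,CHAR(160)," "))),J2)

SUBSTITUTE swaps each non-breaking space for an ordinary one, which TRIM is then happy to sweep away; either route gets you to the same place.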

Either way, that works, leaving us with yet one last irritation – the birth date of Shooting team member Daniel Lowe, which reads

11/181992

No need to get fancy here – just enter that slash.

Now, I think, you can go about your analytical business, e.g., breaking out athletes by birth month. You may recall my consideration, in a piece on 2012 Olympic data, of the alleged August effect, in which American athlete births in that month appeared to significantly depart from a chance prediction. Let’s see what the current team data tell us, via a pivot table:

Rows: Birth Date (grouped by Month)

Values: Birth Date (Count)

Birth Date (again, by % of Column Total)

(We don’t need Grand Totals here). I get:

olyp2

Here August – generally the most fecund month in the United States – shares the modal figure with June and March, its proportion substantially smaller than August’s 2012-team contribution. The numbers here simply suggest no special birth skew impacting the US complement, at least for this Olympics.

We can now also calculate each athlete’s age, with the most able assistance of the nifty and unsung YEARFRAC function. Enter the Olympics’ start date – August 5, 2016 – in any available cell, name the cell start, and proceed to column T, or whichever’s next available on your sheet. Name it Age and in row 2 try (assuming the corrected birth dates lie in the S column):

=YEARFRAC(S2,start)

YEARFRAC calculates the distance in years between the two dates on either side of its comma. Thus, cell-reference four-gold-medalist Katie Ledecky’s birthday – March 17, 1997 – in YEARFRAC, and with the start date cell holding down the second argument you get 19.38, Ledecky’s age in years on day one of the Olympics (note that you can’t simply type 3/17/1997 into the function, because Excel will read that bare entry as division rather than a date. You need to either reference the cell bearing that date or enter 35506, the date’s native numeric equivalent).
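If you’d rather not memorize serial numbers, the DATE function can also stand in for the literal – a sketch, using Ledecky’s birthday again:

=YEARFRAC(DATE(1997,3,17),start)

DATE(1997,3,17) hands YEARFRAC the genuine article, serial number and all.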

Copy down the column and this pivot table beckons:

Rows: Sport

But guess what…

olyp5

Yep, it’s that superfluous space thing again, this time practicing its mischief on four records among the Track and Field data. The simplest repair in this case, as it turns out: select the Sport field and run a Find and Replace on the column, finding Track and Field[space] and replacing it with Track and Field. That works because, in this case, each of the errant four has incurred exactly one trailing space.

Now introduce the Age field to Values (Average, formatted to two decimals). Bring back Age a second time, now exhibiting Count sans decimals. If you sort the results Largest to Smallest you’ll see the 12-member equestrian team holding down the age-senior position, with Team US’s eight boxers computing to a lowest-age 20.75.

We could also correlate average athlete weight by event, an association which might drum up some less-than-obvious numbers, e.g.

Rows: Sport

Columns: Gender

Values: Weight (Average, formatted to two decimals)

I get:

olyp4

Of course the per-team numbers are small, but they make for interesting reading, particularly the respective by-sport gender disparities (and note some absent teams among the men’s delegation).

I was surprised by the greater average weights of the two basketball teams measured against their rugby colleagues, even if the latter is (officially) the contact sport. And I did a double-take when I caught up with the respective boxing team weights; women boxers outweigh their male teammates by an average of 18 pounds. But here we’ve been thrown a sampling curve – the six male pugilists are concentrated in the lower weight divisions, even as the women – comprising exactly two boxers – weigh 132 and 165 pounds.

Eek  – there was a lot of hard work to do in there; I think I deserve a podium finish for this one.

NHS Sick-day data: It Depends Where and When

20 May

Not every sick day is a sick day, a definitional caution to which every employee – and employer – will doubtless assent. People take days off for all sorts of reasons, and even when they feel perfectly fine. The Fit for Work Scotland service, a UK National Health Service initiative, avers that “Most [worker] absence is for genuine reasons and is likely to end within 7 days”, but apart from thresholding the word “most”, any significant inter-industry variation in sick-day rates begs an interesting question – namely, why. And an NHS workbook downloadable here (click the 12/15 link, and save as an Excel workbook) recording 35,000 monthly-rate figures across various care institutions in England (that is, England the country, not the United Kingdom writ large) and across parts of seven years doesn’t answer it, but does cast a beam or two of enlightenment on the matter just the same (and note this look at what appear to be more or less the same data; some nice “interactive” charts plot them, too, but I don’t seem to be getting precisely the same results as it does).

First, you may want to do something about the date field, whose contents beat back my attempts to impute a usable date format to them. I had wanted to assume that its entries, e.g.

days1

reflected a European regional syntax that had duped my American-formatted system into regarding them as merely textual; but when I tried the data on a PC in a London library, of all places, nothing changed.

So if you’re getting what I’m getting, and you want the dates to behave like actual dates, poke a new column alongside Date, name it Date2 or some such, and enter in what is now B2:

=DATEVALUE(RIGHT(A2,3)&"-"&LEFT(A2,4))

DATEVALUE takes date-appearing text and recycles it into the real thing. The expression above in effect reverses the cell constituents into something like APR-2009, through a couple of concatenations. Copy it down B, format as you like, and the dates are yours, though I’m still not sure why that British computer didn’t understand the originals.

Now we can perform a series of practicable breakouts that should say some instructive things about the sick/absence-rate phenomenon, and its supporting numbers in the FTE (Full Time Equivalent) Days Sick and FTE Days Available fields. Both the HEE_region_name and Org Type fields seem particularly apt here, e.g. as per this pivot table:

Row Labels: Org Type

Values: SA Rate (%) (Average, formatted to two decimals)

I get:

days2

We see the notable rate spread across the health-care entities; at the far, high end sit the Ambulance services, with a daily absence rate of 5.81%, more than two-and-a-half times the figure for practitioners in Clinical Commissioning Groups. The variations are telling and doubtless significant, given the millions of FTE worker days accumulated across the 35,000 records. (Note that according to this Wikipedia entry SHAs, or Strategic Health Authorities, were abolished in March 2013, but the worksheet contains SHA data through March 2014.)

Note that these averages in effect average the records’ numbers; that is, they simply pursue the mean of the SA Rates as they present themselves, regarding each as equally contributory to the breakouts above. But one might want of course to weight the rates by each record’s actual FTE days, and doing so would bid us to extemporize a calculated field, which I’ll call WeightedAvg:

days3

Here we simply divide FTE Sick by FTE Available; and the Org Type row labels sift and combine those records to be respectively averaged. But why are we multiplying the calculated fraction by 100? For a presentational reason: the SA Rate (%) data aren’t really percentages, but rather workaday numbers that feign a percentage status – a rate of 4.59 is stored as 4.59, not as a true 4.59% (indeed, percentage-formatted, 4.59 would display as 459%) – so we want the WeightedAvg results to lift themselves to the same magnitude as the SA Rates, for the sake of consistency.
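In calculated-field terms – and assuming the source field names read exactly as they appear on the sheet, FTE Days Sick and FTE Days Available – WeightedAvg would amount to something like:

='FTE Days Sick'/'FTE Days Available'*100

Excel supplies the single quotes around multi-word field names for you when you build the field, typically via Fields, Items & Sets > Calculated Field on the PivotTable Tools ribbon.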

And once the field is calculated and shown its place in the pivot table alongside the SA Rate (%) figures, I get

days4

(Note that the very parsing by Org Type offers itself as a kind of diffused simple average; the various Org FTE totals incline the Org comparisons toward a manner of parity, as if each average were erected atop the same FTE total.)

A few appreciable disparities point to themselves across the averages, e.g., the comparisons for PCT (Primary Care Trusts), Clinical Commissioning Group, and Specials; and the weighted composite average exceeds the simple average by 11.4%.

We could also organize the sick-rate averages by HEE (Health Education England) Region, simply by substituting the HEE_Region_Name field for Org Type in the existing table:

days5

Noteworthy here – at least to me – are the relatively small rates for the London regions. One could have allowed that the giant city’s urban, stress-stoking fray might have had its effect on its health caregivers, but the numbers suggest quite the opposite. It is in the North West and North East regions – centered around Manchester and Newcastle, respectively – that the rates ascend.

Now back to those dates. The Quality Watch site linked above line-charts SA rates along an axis comprising months:

days6

It means to describe seasonal variation in the rates, with summers associated with fewer absences. It occurred to me that an alternative graphical tack could chart seasonality by aggregated months – that is, SAs for all Januarys, all Februarys, etc. This pivot table:

Row Labels: Date2 (our new, date-formatted field, grouped by Months only)

Values: WeightedAvg (the calculated field again)

days7

Underwrites this pivot table chart (which I’ve realized in comparable Y-axis intervals):

days8

Here too the summer trough sinks below the average, but it slopes more gently than the Quality Watch graph, and that’s because the latter has pushed its Y-axis points farther apart, pronouncing the monthly divergences.

Of course we still haven’t answered the question I begged at the top of the post – that is, the why of SA variation. It’s possible of course that ailment-inducing winters account in part for some of the seasonality. But why ambulance-organization staffers should absent themselves from work in higher numbers than the other cohorts, for example, is a good and fair question, but one that can’t be directly broached by the workbook.

But isn’t that where you come in?

Daunting Date Data: UK Unemployment Claims

4 Aug

Here’s a droplet from the torrent of spreadsheets loosed weekly by the UK Office of National Statistics, one that embodies an interesting, if messy, set of problems for the analyst:

Claimant count data

The data historicize British monthly unemployment claims from a June 1881 point of inception, and drill down to May of this year, across two different sheets.

There are, first of all, significant comparability issues across the years and the sheets, some of which are expounded in the linked document on the Historic Data sheet as referenced in cell A5. If your intentions here are investigatory, those assuredly beg your attention. Indeed, the qualifier in cell A7 of that sheet warns against a melding of the UK and Historic Data – the latter only counting claims from 1881 through 1970 – a recommendation that presumably justifies the very fact of two sheets.

That’s one issue and a real one, albeit of the extra-spreadsheet kind. But there yet might be something to learn from each sheet regarded singly – namely, something about unemployment claims by year.

And to advance that prospect, all good pivot tablers will doubtless move to group the Historic Data (let’s start there) by year(s) and proceed to average their claim rates (which may roughly proxy for unemployment percentages).

And it’s right there where you have to start paying attention. The dates tumbling down the curiously untitled field (we’ll call it Date) in A don’t seem to be dates at all, but rather indolent text; but for once, as we hope to see, these impersonators will serve the analysis well – because Excel can’t evaluate putative dates that were calendared before its baseline day of January 1, 1900. That day receives the date number 1, with every succeeding day counting off its relation to the foundational January 1 by an increment of 1. Thus December 17, 1983 quantifies itself as 30667, for example, the number of days distancing itself from the January 1, 1900 touchstone.
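If you’d like to watch that machinery at work, enter something like this in a General-formatted cell (region settings permitting):

=DATEVALUE("17 Dec 1983")

and 30667 should stare back at you, because DATEVALUE returns the underlying serial number rather than a prettified date.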

Thus Excel can’t ratify Jun 1881 in native date mode (though a VBA-macro-driven workaround is described here) – and indeed, even if all the entries in the Historic Data sheet postdated January 1, 1900 they’d still resist grouping, because they’re text.

But for our purposes, that insufficiency shouldn’t matter. To commence what could be billed as faux date grouping, all we do is check into the D column, head it Year, and enter, in D12:

=VALUE(RIGHT(A12,4))

That expression drags the 1881 out of A12’s Jun 1881 and upgrades it to full-fledged quantitative standing. Copy it down D and you’re in business, priming this sort of pivot table:

Row Labels: Year

Values: Rate (%) (Average, honed to two decimal points; note that 31 months report NA in lieu of an actual rate)

Because the Years are duly numeric, we can group these into, say, five-year bins, yielding:

 torrent1

You’ll note the Depression-driven 19.10% in the 1931-1935 bin.

Now there is, apropos the discussion above, a formatting curiosity that’s wormed its way into the dates. You’ll see that the A12:A234 range bears the General format, befitting a non-committal receptivity to any and all date/numeric types, and these precisely dovetail the pre-1900, text data. But A235 – entertaining the Jan 1900 date – asserts a Custom format, consonant with a numeric entry – even as we’ve allowed that all the field’s cells are text. And sure enough – click A235 and jab at the Number Format drop-down menu in the Home > Number button group:

torrent2

You’ll see that every formatting variant therein fails to vary, each one reporting the same Jan 1900 expression; and that very constancy, Custom format notwithstanding, means – text.

But ok – again, that discovery does nothing but rubber-stamp our earlier averral, to the effect that all of the entries set down in the Date field are merely, and only, text. But what’s interesting about Jan 1900 – and by that I mean ‘Jan 1900 – is that, some wordplay intended, it isn’t completely valueless.

To explain: given the overweening centrality of numbers to the spreadsheet remit, Excel accords certain quantitative properties to apparent numbers costumed as text. Enter ’17 in A3, for example (and the apostrophe textualizes the entry) and go on to enter =A3*2 in A4, and you’ll get…34, a really, truly, usable number. And less obviously, enter ‘Jan 1900 in A3 and type =A3*2 in A4, and you’ll get 2 – and that’s because Excel takes the liberty of treating ‘Jan 1900 as January 1, 1900 – numerical value 1. Multiply it by 2, then, and you get 2.

Thus even as ‘Jan 1900 owns up to its text-label lineage – and as such, can’t be grouped by a pivot table, for example – it’s happy to consort with numeric operators, and pass itself off as a number when asked to assent to the deception.

And if you want to work with the presumptive years in the UK sheet, you’ll want

=VALUE(LEFT(A6,4))

because the years in the sheet’s A column position themselves before the month in the UK date cells. And because they do, they won’t return the same kind of de facto numeric values that surprised me in the Historic Data sheet. If you want those, you’ll have to restructure the data thusly, e.g., the 1971 Jan entry in A6:

=DATEVALUE(RIGHT(A6,3)&" "&LEFT(A6,4))

The RIGHT/LEFT extraction reverses the placements of the 1971 and the Jan, after which an authentic date/number should emerge.

Should. But scroll the A column downwards, and you’ll drive past entries such as this one, in A230:

 1989 Sep#

That footnoting pound sign, or hashmark (its name depends on the side of the Atlantic on which you’re reading this), inflicts one more dollop of complication into the mix, requiring this formulaic rewrite:

=DATEVALUE(MID(A230,6,3)&LEFT(A230,4))

The nested MID rescues the Sep from its entanglement with that sign, as the month name begins at character position 6 in A230.

Ho hum; all in a date’s work.

Travelin’ Man: David Cameron Trip Stats, Part 2

24 Mar

It was somewhere in the closing miles of the first leg of our own journey – a skein of uncertain steps shuffling across the David Cameron overseas trip data – that I red-flagged my hardy companions – that’s you – about a bump in the road, a set of roadblocks blockading what, under other data-design circumstances, could have been a direct route toward calculating the Prime Minister’s trip durations. To wit: had the designer been so minded, Start and Finish Date fields would have been made to come along for the ride on the sheet at the outset, these posting embarkation and return dates for each of the PM’s 116 trips. Subtract Start from Finish (plus one; see last week’s post), then, and trip duration is right there for all to see.

But the data don’t break that way, not with 91 of the trip dates read into the manifest in label terms, e.g.

3-5 July 2011

Question, then: can we get there from here? Can we wrest three days from that miserly text construction above, and its 90 fellow travellers?

Good question, though transparency impels me to confess I’m the one who asked it. For one thing, and as previewed by the previous post, it seems to me that the Finish Dates pose the far easier ask – because the text on the other side of the date-offsetting dash owns the month and year information, and the specifying properties attaching to both.

That incentive took me to cell E5, the first cell in the Finish Date field we had mapped in the last post. And there I wrote:

=IF(ISNUMBER(A5),A5,(DATEVALUE(RIGHT(A5,LEN(A5)-FIND("-",A5)))))

Ok – so what’s this one all about? It goes something like this: if the entry in A5 (in the Date of Trip field) is a number – and again (previous post, again), all the Cameron trips comprising exactly one day assume numeric form – then simply return that number/date in A5. If, on the other hand, A5 holds one of those pertinacious labels with a forged date passport, we need to isolate all the text to the right of its dash, which, if nothing else, looks like a date; then, appearances notwithstanding, we brace the isolate with a DATEVALUE, which in fact turns a label imposter into a real live, measurable date.

This formula works in E5; and because it does, I entitled myself to copy it to all the other Finish Date cells, wherein it works, too – until E57, where David Cameron’s trip to Strasbourg hits the runway and comes to rest right by the #VALUE! message – and he was expecting “Bienvenue, Monsieur Prime Minister”.

What happened in E57, and about a dozen other Finish Date cells that resound with the same error message? My standard suspicion – that these faltering cells had been saddled with superfluous spaces in ruinous places that were jamming the formulas – proved wrong.

But what proved right was a small – a very small – difference that typified, and beset, the errant formulas. It’s in the dash, of all things; if you look closer, you’ll note that the problem cells inscribe a dash just a touch lengthier than the one nestled in the formulas that do work. My nested FIND up there asks for this

=FIND("-",A5)

And the delinquent cells don’t have that; they have this:

–

And that slightly longer character – an en dash, it appears – isn’t the same dash. Trust me.

And because it isn’t, you can select the Date of Trip range and drive this Find and Replace through it:

 trav21

(You can actually select the respective dashes in any cells in which they featured, and copy-and-paste them into the appropriate Find and Replace fields.)

And that works, too. Every Finish Date cell now boasts a finish date.

But then there is the matter of the Start Date(s).  Extracting whatever is there from the left of every dash in the Date of Trip entries and turning these out in numeric terms is conceivable, e.g. for A5:

=VALUE(LEFT(A5,FIND("-",A5)-1))

 That is, find the dash (after the makeover endorsed above), grasp all the characters to its left, and reintroduce them to their numeric provenance via VALUE. For A5 we’d get 20. Moving into the Trip Duration field in F, you could press ahead with

=DAY(E5)-D5+1

That is, use DAY to skim the day value from the full-fledged data in Finish Date (and that’s 21 in E5), subtract 20 from it, and add 1 to account for both the 20th and 21st in the duration (again, 21-20=1).

Now rewrite the expression in D5 to read:

=IF(ISNUMBER(A5),A5,VALUE(LEFT(A5,FIND("-",A5)-1)))

 Again, that IF statement leaves the actual, native dates in the A column as they are; they require no retooling. And remember that the real dates always return 1 as their trip duration (see above), requiring in turn this amendment to the expressions in the Trip Duration field:

=IF(ISNUMBER(A5),1,DAY(E5)-D5+1)

If you’re still with me, let’s understand that this all works – most of the time. But it won’t work for the Date of Trip plunked down in A19 – 30 November- 2 December 2010. In fact, on the one hand, none of the formulaic activity on that row concedes an error message. On the other, what we get for Trip Duration in F19 is -42335 – because the Start Date formula in D19 treats 30 November as an authentic date – from 2015, because that’s precisely what happens when you actually enter 30 November in a cell. Absent a concerted year reference, Excel defaults the year to the one during which you’ve entered it. And the same thing happens in A88-A90. (What’s remarkable here is that the VALUE – not DATEVALUE – function is prepared to turn 30 November into a number – and it does).
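One partial patch – a sketch, not a cure – would borrow the year from the already-computed Finish Date and graft it onto the Start Date extraction, e.g. for A19:

=DATEVALUE(LEFT(A19,FIND("-",A19)-1)&" "&YEAR(E19))

With D19 thereby promoted to a full-fledged date, the duration would become =E19-D19+1 instead – but trips that straddle two years, as we’re about to see, still send that gambit off the rails.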

Question, then: given the determined inconsistency of the data, is it possible to write a formula that could contend with them all, and succeed in realizing Trip Duration in every case? I’m not exactly sure; but if it is, you’d have to do battle with a rather variegated – and long – set of nested IF contingencies before you could walk away from it with head held high. What, for example, are we to do with

31 December 2013-2 January 2014 ?

And if you want to reply that in fact none of the Prime Minister’s trips overarched two years, is that quite the point? You want a formula that handles all comers – actual or potential.  Can it be written?

Over to you, learned colleague.

Left-Handed Complement: A Look at Data on Handedness, Part 1

6 Mar

I’m left-handed, sort of, predilection enough to qualify myself for a subscription to the Left-Handers Club newsletter, a miscellany of products and disquisitions about the minority group whose members belong to all the other ones, too. (OK; I’m an in-betweener – writes, eats, and dribbles a basketball with his left hand, throws, bowls, and actually shoots a basketball with his right.)

It was a recent piece in the newsletter that raised at least one of my eyebrows (can’t recall if it was the left or right one), and got me to thinking about possible rejoinders to its central claim – namely, that handedness somehow associates itself with birth month. A survey of its left-leaning readers, respondents gathered via self-selection, measured notable left-handed disproportions for August and July, with smaller precipitates stacking in January and February. Moreover, the survey has nicely packaged its data in spreadsheet form, complete with none other than a pivot table to season the analysis. Let’s devote this installment to the workbook itself, after which we hope to think about the results and position a supplementary data alternative to the thesis. You can grab the workbook with either hand here:

 Lefthander Birthdates

Note first, and for what it’s worth, the Line field numbering the 3000+ survey records in apparent chronological sequence of submission, as counted off in the neighboring Entry field. Again, for what it’s worth, note that the sequence is not unrelievedly consecutive; dozens of gaps poke at it throughout, and so the 3148 records top out in the Line field at number 3348. Does any of this matter? Probably not here, but it could, depending on where you’d want to take the data. If, for example, you wanted to group the records by Line number in bins of 100 each, you’d pull up short of 100 in nearly every bin. (I’m not sure what you’d do with the Entry data themselves and their five survey submission days flanked by the 9th and 13th of January, but that’s a quibble.)

More than a quibble, though, could be registered over the piecing together of the DATEVALUE expressions in the Date field. We spent some time on DATEVALUE in my February 21, 2013 post, but that introductory note doesn’t speak to the data contingencies placed before us here.

DATEVALUE remixes a collection of (basically) text data into actual, quantity-bearing dates, but it’s quirkier than I would have supposed. For example(s):

=DATEVALUE("6/8/2013")

And

=DATEVALUE("6/Aug/2013")  (as per European date syntax)

Will both yield 41492, the numeric identity of August 6, 2013 – the first assuming a day-first regional setting, as on a UK-configured system. Format the container cell in Date mode and the date obtains (one of those aforementioned quirks allows you even to enter Augu and the like in the expression without impairment, so long as the month name invokes at least its first three characters. The names whip up date values even if the attending year and day data are numeric, and not text. See the discussion immediately below in that regard. I told you it was quirky).

The birthdate workbook, however, concatenates, or combines, three cell references instead, e.g. cell H9:

=DATEVALUE(+E9&"-"&F9&"-"&G9)

The references signify Day, Month, and Year respectively (as per the Left Handed organization’s UK-based time protocol). The ampersands concatenate, or join, all the references into a single expression, along with the hyphens sidled by quotes in our case. The result is captured in H9.

 

But the formula raises some syntactical questions. For one thing, the + sign grazing E9 is superfluous – and so, for that matter, are the quote-surrounded hyphens. Had the formula read

=DATEVALUE(E9&F9&G9)

7/25/1958 would have worked itself out just the same. Edit the formula that way if you want to see what I mean. (Note by the way that the hyphens turn into slashes, because the H-column cells had received this format:

  hand1

in which the slashes rule.)

Now check out the formulas in the J and K columns, each directing a VLOOKUP to a table in the Countries tab. The plan here is to additionally locate each respondent’s country of birth inside what are labelled the Zone and Hemisphere rubrics, and deliver these to the J and K columns. Note the interjection in all cases of the FALSE term (or argument, as it’s officially called), e.g. cell J9:

=VLOOKUP(D9,Countries!$A$2:$B$238,2,FALSE)

FALSE is a compensatory means for negotiating a lookup table whose first column – the one comprising the values to be looked up (be they text or numeric) – isn’t sorted ascendingly. FALSE means that the lookup is to confine itself precisely to the contents of that first column, and not entertain any value that squeezes between two rungs in the column. Thus you need FALSE in order to work with this lookup table of students and their grades:

 

Mary     87
Alice    64
Jack     82
Jill     91
Carol    73

Once you sort column one, though, you can drop FALSE from the formula – truly.

Thus if we sort the first column in the data set on the Countries sheet – and then, in the service of additional formulaic understatement, name the range there (let’s call it Regions) – our lookup statement slims down to something like this (in J9, for instance):

=VLOOKUP(D9,Regions,2)

 

The expressions in the Hemisphere field imitate the above formula, save one departure – the 2 is replaced by a 3, the column number in which Hemisphere info awaits looking up. But are the lookups as we found them “wrong”, then? Not quite – they all issue irreproachably correct results, after all. But formula efficiency is better served by the rewrite.
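Spelled out – and assuming the Regions name takes in all three of the Countries columns – the slimmed Hemisphere lookup in K9 would read:

=VLOOKUP(D9,Regions,3)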

Next, push over to cell M9:

=IF(LEN(I9)=1,"0"&I9,I9)&IF(LEN(E9)=1,"0"&E9,E9)

 

What’s this formula striving to do? This: reap four-character yields down the M column by concatenating the respective contents of any given row’s I and E cells (in that order). Thus M9 returns 0725 – ultimately =I9&E9 – but because I9 offers but one character, a 7, the formula tacks on a zero (remember that the result is textual and possesses no direct quantitative standing) in order to pad the yield to four characters; and should the day value in E likewise be single-digited, the formula rolls in a zero there, too.

But again, the formula could be pared, first by formatting both the E and I columns thusly:

 

 hand2

As advertised, you’re looking at a custom number format comprising a two-digit 00 code. That code supplies a 0 alongside any single-digit number, but if the number already features two digits, the format simply leaves the value alone.

Once the E and I values are so formatted the data will take on this appearance:

Then get back to M9 and enter:

=I9&E9

And copy down the column. I get, in excerpt:

But that isn’t working, is it – not if we demand four-digit entries in each and every cell. It isn’t working because the & sign, which concatenates values in specified cells, demotes the values into text, and the 00 format only attaches to bona fide values (keep in mind that combining a 7 and a 25 here isn’t an act of addition; rather, it’s stringing the 7 and 25 together). The way out? Try this:

=VALUE(I9&E9)

And the 725 acquires numeric status. Then format the M column:

hand3

 

Four zeros this time, because you’re aiming for four-digit numbers.
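For what it’s worth, a one-cell alternative could hand the padding to the TEXT function and the numerifying to VALUE – a sketch, not a prescription:

=VALUE(TEXT(I9,"00")&TEXT(E9,"00"))

TEXT(7,"00") yields the text 07, the concatenation yields 0725, VALUE turns that into 725, and the M column’s 0000 format restores the leading zero for display.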

Got all that? This stuff is likely to be on the mid-term.

UK Inflation, Part 2: A Tale of Three Formats

21 Feb

Last week’s cliff-hanger of a post pulled up at a familiar precipice, dangling its topical teaser from the edge of but one more data-formatting conundrum – a long skein of cells donning this – or that – format.

The cells in question skid down the B, or Dates, column in the inflation uk spreadsheet we introduced last week, and they’re an eclectic bunch, styled in no fewer than three different formats, fitted out across four different ranges.

Cells B2:B488 – whose adjoining C column evinces RPI data, and whose adjoining vacant D and E columns have yet to receive any CPI and Change in weekly pay numbers – ostensibly dress themselves in the Accounting format, a standard numeric guise:

rpi1

The problem is that the data bob and weave somewhere in the interstice between numbers and text. Click the Number Format drop-down arrow and textual evidence abounds:

rpi2

Authentically numeric data would appear in all their numeric finery beneath the Number rubric above, and under other headings, too. The Aug-81 you see, on the other hand, evidences something else – something decidedly non-numeric. Moreover, if you write:

=COUNT(B2:B488)

– the function that counts a range’s numeric values – you’ll realize a grand total of 0. But if you multiply any cell between B2 and B488, e.g.,

=B100*2

You’ll muster an actual, numeric result. Strange.

I can’t account for the bizarre duality visited upon these cells, other than to wonder if they’ve experienced some sort of mutation in the course of being spirited away from some other, primary source (that this format precisely blankets only those RPI data that precede the entry of the CPI figures at row 489 suggests an external provenance.)

In any case, a workaround is at hand. You’ve doubtless taken note of the green-notched alerts attaching to each and every cell in B2:B488; select the range and note in turn the exclamation-marked indicator and its clarifying legend:

rpi3

I will confess some uncertainty over the official standing of the “date string” phrase, but that’s my problem. In any event, clicking the exclamation mark uncorks two Convert options, and because all our dates appear to be of 20th-century vintage, click Convert XX to 19XX (I don’t recall seeing these possibilities before either). Proceed, and our problem range has been reformed into real, obedient, well-meaning dates.

You’ll be happy to know that the next 268 rows in column B are already irreproachably date-formatted, and that brings us to B757, at which the data seem to have been handed over, at least for the next 13 rows, to what the number format field in the Number button group reports as a Text format. There’s a quick fix here, too: click in I757 and enter

=DATEVALUE(B757)

DATEVALUE knocks some numerical sense into date-looking text, thus making it fit for all those pivot tables and formulas you have at the ready. Copy it down to I769, then copy those results, apply a Paste Special > Values to B757:B769, and take the rest of the morning off.

But the admirable efficacy of DATEVALUE got me to thinking: could we have done this same formulaic thing to B2:B488? It turns out we could have indeed; that is, we could have written

=DATEVALUE(B2) etc.

and emulated the very same text-to-bona-fide-numeric date rewrite that Convert XX worked out.

It seems the two text-laden ranges – B2:B488 and B757:B769 – share some genetic code, even though the former triggered the green notches and exclamation marks, and the latter didn’t. I can’t explain the disparity, other than to float the idea that these patches of data emanate from heterogeneous sources; your comments and conjectures are welcome.

And last and numerically least, cells B770:B777 revert to duly certified date format. Once you’ve unified the date format throughout the Date range, you can rev up the pivot tables and attendant date groupings, etc.

A few codicils: you’ll need to either delete the data-source-attributing rows 778 and 779 or splay 777 and 778 with a blank row, thus exiling these extraneous text entries from the greater body of data. There’s also the matter of those month names gathering in the A column. I’m not sure what forthright advantage redounds to these unalloyed text data; we can already chisel months from the date data in B, after all. If for some reason you wanted to group months qua months, absent their association with their respective years (e.g. the RPI averages for all Aprils), you could go ahead with the B-column data.

Of course the larger question is to ask why the data were compiled in the way in which they were, uncombed and contrary. I don’t know, and short of braving a query to the folks at the Guardian, the evidence suggests a polyglot workbook glued from shards of data prospected from hither and yon. The aim, again: get all the data to speak the same language. Esperanto, anyone?