Archive | May, 2014

Pounds for Quarters: British Home Sales Stats

29 May

The first rule of data entry: No blank rows. The second: A field without a header is terra incognita. Rule two gets violated here, in the UK government’s count of the nation’s (that is, England and Wales) property transactions for the first quarter of this year (click the 2014 Price Paid Data sheet on the page when you get there).

Exactly why the headers went missing is one of those questions I can’t answer, other than to channel some data-designer’s sense that the names just weren’t necessary, even though they are, and even though field explications are out there, on the web site:

price1

(And for additional enlightenment look here; thanks to the Land Registry’s Roger Petty for the tip.)

Duly enlightened, I beamed another row atop the data and minted these names:

Column

A – Transaction No.

B – Price

C – Transfer Date

D – Postal Code

E – Property Type

F – Old/New (I would have coded these data O,N, not the slightly oblique N,Y)

G – Duration (i.e., purchase or rental)

H – Address

I – Secondary Address

J – Street

K – Locality

L – Town/City

M – District

N – County

The Duration codes abbreviate Freeholder or Leaseholder, the former a permanent mode of ownership, the latter fixed-term (for additional enlightenment, see this review).

Column O – Record Status – comprises an endless skein of As, and as such should be deleted. In fact, I’d delete H, I, and J as a practical operational tack (street names won’t add much to the narrative), but I’m leaving them be simply for exposition’s sake. (I should add that it seems to me that Column N really names cities and towns, and not the purported County, but I’m holding (here, at least) to the web site’s field-naming sequence).

Now we can proceed, at least after adding that the date data in C are over-formatted, interjecting as they do a constant 12:00:00 AM into every record, to no particular informational purpose. The Short Date format here should say bonne nuit to all those midnights.
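For anyone replaying this step outside Excel, the midnight-stripping maneuver is a one-liner in Python; the sample value below is hypothetical, but it mimics how the Transfer Date column stores its entries:

```python
from datetime import datetime

# A hypothetical Transfer Date entry, superfluous midnight included.
raw = "2014-01-15 00:00:00"

# Parse it, then keep the date alone: the stdlib counterpart of
# applying the Short Date format.
transfer_date = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").date()
print(transfer_date.isoformat())  # 2014-01-15
```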

Peculiarities aside, the data are nothing if not informative. What about, for starters, a basic price-by-month breakout?

Row Labels – Transfer Date (grouped by Month)

Values – Price (with suitable formatting, including the 1,000s separator, or comma)

price2

Remember that we’re looking at 145,000 transactions, and so that small pushback in price could be trend-worthy – though that is hardly the last word on the finding. Reintroduce Price to the Values area, this time via Count:

price3

That one’s a head-scratcher. Might the precipitous retraction in March activity owe more to a lag in the data gathering than any real buyer hesitancy? That’s a question for the data gatherers.
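For the record, the Sum-and-Count pairing the pivot table performs reduces to a pair of running tallies; a minimal Python sketch, with three invented transactions standing in for the 145,000:

```python
from collections import defaultdict

# Invented (month, price) pairs standing in for the real transactions.
sales = [("Jan", 250000), ("Jan", 180000), ("Feb", 300000)]

totals = defaultdict(int)   # Sum of Price, per month
counts = defaultdict(int)   # Count of Price, per month
for month, price in sales:
    totals[month] += price
    counts[month] += 1

print(totals["Jan"], counts["Jan"])  # 430000 2
```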

And what about price and a property’s old/new standing?

Row Labels: Transfer Date (grouped by Months)

Column Labels: Old/New

Values: Price

price4

 

Newer properties (that is, those coded Y; I told you I would have alphabetized these differently) are expectably dearer; but the February spike is not as easily foretold. And if you think the February new-property sales totals are statistically scant, re-present the Price data in Count mode (scotching the decimals in the process):

price5

Thus something real is at work, I think.

Now this one might be one part learned speculation, nine parts non sequitur, but here goes: What about average sales price by transfer day? Granted, I see no salient reason why transactions should so vary along such lines, but let’s throw caution to the winds. Take over column O (remember I’d deleted its original data inhabitants), name the field Day, and enter in O2

=WEEKDAY(C2)

and as usual, copy down the column (1 is Sunday, by the way). You could then run a Paste > Special > Values through the numbers. Then try

Row Labels: Day

Values: Price (Average)

Price again (Count)

I get

price6

No, I can’t explain Fridays – day number 6. It’s possible some institution-wide protocol stays the processing of files until the work week’s end, but that’s one outsider’s supposition. Nor can I account for the Friday downswing in average price, which is notable and likely not driven by the probabilities. Looking for a story line, anyone?
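The WEEKDAY breakout can be mimicked in Python too, with one caveat: Python’s weekday() calls Monday 0, where Excel’s WEEKDAY default calls Sunday 1, so Friday is a 4 below rather than a 6. The dates and prices are invented:

```python
from datetime import date
from collections import defaultdict

# Invented (transfer date, price) pairs.
sales = [(date(2014, 1, 3), 200000),   # a Friday
         (date(2014, 1, 6), 150000),   # a Monday
         (date(2014, 1, 10), 100000)]  # another Friday

by_day = defaultdict(list)
for d, price in sales:
    by_day[d.weekday()].append(price)  # Monday = 0, so Friday = 4

friday_average = sum(by_day[4]) / len(by_day[4])
print(friday_average)  # 150000.0
```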

And for some local color, I thought I’d take a look at London prices alone, lining them up thusly:

Report Filter: County (click Greater London)

Row Labels: District

Values: Price (Average)

Price (Count)

price7

It’s the borough of Kensington and Chelsea – redoubt of oligarchs and other ruling-class sectaries – that leads the pricing parade, but that’s a dog-bites-man truism. No more revelatory is London’s average £496,074 transaction figure, nearly double the overall average. What is interesting, however, is the city’s 20,784 sales during the quarter – 14.3% of the quarter’s total – when held up to London’s proportion of the England/Wales population – about 14.8%. In other words, the housing market in the capital appears to be no more frenetic, though considerably more pricey, than it is anywhere else – in spite of London’s vaunted economic hegemony.

So what does it mean? This: sell your house in Chelsea, and buy one in Bexley.

 

 

 

New York’s Citi Bike Data, Part 2: The Case of the $5,000 Ride

22 May

You get 45 minutes, you’ll recall, before your free (as it were) Citi Bike ride shifts into overtime and the journey begins to take its toll – literally:

bike1

That + sign requires elaboration, herewith furnished in an email from a Citi Bike staffer:

“Trips that are shorter than 30 minutes are no extra charge. Trips longer than 30 minutes incur overtime fees. After the initial 30 minutes, it’s an additional $4 for the second half hour, $13 for the third half hour, and $12 for every half hour after that. A trip begins when a bike is unlocked and ends when a bike is securely returned to any Citi Bike station.”

Keep those pricing strata in mind. In any case, because our Citi Bike data stopwatches every ride’s duration – in the very first field, in fact – we’re ready to run with that variable (inapt transportational metaphor there – sorry).

But note that the duration data express themselves in seconds – and because minutes is the more negotiable currency here I’ll call upon a simple transformational tool, one I confess to having badly underused. Enter 60 in any blank cell, and copy it. Select all the tripduration data in column A, and click Home > Paste > Paste Special… Divide:

bike2

And format to two decimals.

Paste Special Divide smartly substitutes for the traditional paste value routine – i.e., the inserted column, in which values in the source column are formulaically divided by 60 in the new column and pasted back via the Values option. Paste Special Divide here seems faintly circular, with a kind of self-referential division and paste-back slam-banged into the same cell. But it works.

Once done, you could plant a simple AVERAGE at the underside of column A and learn that the mean Citi Bike trip in February lasted 14.58 minutes, thus jibing with the riders’ purpose-driven riding intentions (that number compares interestingly with Washington DC’s 13.75-minute average for the first quarter of 2014; see those data here). Citi Bikers probably have to get somewhere soon; exercisers and riders in the park don’t start their picnic just 14.58 minutes after they’ve screwed their helmets on.

Now sort the durations in descending order and look for the value 45. Again, that’s the threshold in minutes after which the Citi Bike billing scheme begins to achieve lift-off. You’ll find that the 45ers and beyond command 2,921 rows, or an intelligibly small 1.3% of all the February rides. But look at the first record. It archives the activity of a woman, née 1982, who decoupled her cycle from the South Portland Avenue and Hanson Place dock in the Fort Greene quarter of hippest Brooklyn at 3:52 on February 8, and redeemed her vehicle at the posh Grand Army Plaza and Central Park South station, hard by the Plaza Hotel, at 12:41 on February 17 – 12,768.47 minutes, or nearly nine days, later.

What went on here? Do we have a case of extreme right-tailed forgetfulness, or a grave Citi Bike typo? Or is there a tale to be told about those nine wintry days in February? I for one am clueless, but cluelessness begets its prodigies of speculation; and at the very least I sense a New Yorker short story in there, or maybe a TV movie. How about Claire Danes for the lead?

I do go on, but there’s one thing about the cyclist who wasn’t there that we can know for sure. Throw 12,768.47 minutes into Citi Bike’s multivariate pricing equation and go figure; and when the lead snaps in the pencil the total charge for our lady in (the) red reads … $5,057.88. You’re welcome to check my math, but I get 419.116 half-hours at $12 per, and she was probably charged another 12 bucks for the .116 besides. Does anyone have Claire’s number?
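If you’d rather not snap any pencil lead, a literal reading of the staffer’s schedule can be scripted; note this sketch bills every started half hour in full, one of several plausible readings, which is why its grand total lands near, but not exactly on, the figure above:

```python
import math

def overtime_fee(minutes):
    """Overtime per the quoted schedule: first 30 minutes free, $4 for
    the second half hour, $13 for the third, then $12 for every half
    hour after that. Billing each started half hour in full is an
    assumption here; the schedule doesn't say."""
    if minutes <= 30:
        return 0
    if minutes <= 60:
        return 4
    if minutes <= 90:
        return 4 + 13
    return 4 + 13 + 12 * math.ceil((minutes - 90) / 30)

print(overtime_fee(100))       # 29
print(overtime_fee(12768.47))  # 5093
```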

But while you work on that, you can spend a bit more time on the rest of the data, which has lots to offer. For example, how about a pivot table listing of the ten busiest stations from which trips get set in motion? That one’s easy, for starters:

Row Labels: Start Station ID

Values: Start Station ID (Count)

Then right-click in the Row Labels and click Filter > Top 10… and click OK in the resulting dialog box.

Then, for additional ID, slide Start Station Name into Row Labels beneath Start Station ID, and see to it that the report layout is set for Tabular Form, with all sub and grand totals turned off:

bike3

It helps to be a New Yorker here, but if you aren’t, trust me – all of the above are in Manhattan.
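Counting trips per start station is pivot work, but if you’re scripting, the stdlib Counter gets you the Top 10 in one call; the station IDs below are invented:

```python
from collections import Counter

# Invented Start Station IDs, one entry per trip.
starts = [293, 293, 519, 293, 519, 444]

# most_common(10) is the stand-in for Count of Start Station ID
# plus the Top 10 row filter.
top_ten = Counter(starts).most_common(10)
print(top_ten)  # [(293, 3), (519, 2), (444, 1)]
```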

The options keep coming. Try adding trip duration to Values and:

 bike5

Those numbers are seven-figured; riders did nearly 2,000,000 minutes, or almost 32,000 hours’ worth of traveling, from the Lafayette/East 8th embarkation point (even as the Pershing Square station in midtown clocked more time with fewer trips). (Note: if your top 10 Values numbers don’t sort descendingly – as mine didn’t – right-click in either Row Label field and click Sort > More Sort Options… and observe below:

 bike6

So what about end stations? Barter the parallel start station fields for their end station equivalents and don’t touch the tripduration sum:

bike7

(And you may have to attend to that sort order nicety once more.)

Again, it’s the intersection of Lafayette and East 8th Streets that holds pride of cycling place, and it stands to reason: after all, if one travels from A to B, the prospects are good that a return trip to A bulks large on the itinerary. Indeed – the start and end Top 10s share nine stations. (And if you want those expand/collapse buttons to disappear – at least in version 2010 and above – click in the pivot table and proceed to PivotTable Tools > Options > Options in the PivotTable button group and:

bike8

Tick off that first display box.)

Again, the analytical possibilities here are quite considerable. How about breaking out average trip durations by both time of day and gender?

Row Labels: starttime (group by Hours)

Column Labels: gender

Values: tripduration (Average)

 bike9

Note that male trips are longest in the midnight hour, a datum that remains to be thought about; and while the trough in women’s durations at 3 and 4am is expectable, the contiguous spikes at 5 and 6 perhaps aren’t. And why are women’s trips (and there were 524 of them) nearly three times longer at 6am than men’s? In fact overall, female trip durations exceed males’, 16.22 minutes to 13.57. (And yes, you’ll collect different average durations if you import endtimes instead.)
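That two-way average is just a grouped mean over (hour, gender) pairs; sketched here with invented trips:

```python
from collections import defaultdict

# Invented trips: (start hour, gender code, duration in minutes).
trips = [(6, "W", 30.0), (6, "W", 24.0), (6, "M", 9.0), (0, "M", 20.0)]

sums = defaultdict(float)
counts = defaultdict(int)
for hour, gender, minutes in trips:
    sums[(hour, gender)] += minutes
    counts[(hour, gender)] += 1

# Average tripduration for women starting at 6 AM.
women_6am = sums[(6, "W")] / counts[(6, "W")]
print(women_6am)  # 27.0
```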

There’s quite a bit more to look at in here, but I gotta go; my movie idea is gaining traction, and I mustn’t keep Harvey Weinstein waiting.

He takes my calls, you see. Hey – we went to the same high school.

What’s Up, Dock? New York’s Citi Bike Data, Part 1

15 May

It doesn’t take much to make me happy – a loaf of bread, a jug of wine, and a couple hundred thousand records’ worth of data to play with. What else does a guy need?

OK, so I’m low maintenance – but what about that dataset? I’m thinking about the monthly data racked up by Citi Bike, the quick-ride bicycle rental service sponsored in large measure by Citibank, the firm whose name bears a serendipitously close resemblance to that of the service itself (must be another case of monkey-at-the-typewriter; it just happened, I guess). I downloaded the data for February 2014, the latest (and smallest) workbook on offer, here:

http://www.citibikenyc.com/system-data

The above superlatives aside, it’s still a big file – 224,000 records big – but nothing you can’t handle.

A cousin of sorts to London’s Boris Bike and Paris’ Vélib schemes (though that latter, far smaller city has far more biking stations than New York does), Citi Bike rents the two-wheelers from docking stations 24/7 to denizens needing to get here-to-there with dispatch; it’s a cheaper deal than a cab and, given New York’s viscous streets, very possibly faster.

Bear in mind that Citi Bike isn’t citywide. “Station locations are based on population and transit needs, and were selected through a participatory public input process”, its web site lets us know; what it means right now is that its 330 stations teem in the lower half of Manhattan and certain next-big-thing districts in the west of Brooklyn (check out its real-time bike-availability map here). As such, one could properly ask if its cycling demographic could be extrapolated to the remainder of the city, much less anywhere else.

In any event, an analysis of the Citi Bike data by Sarah Kaufman of New York University’s Rudin Center for Transportation Policy and Management has something to say about biking and gender, at least among Citi Bikers – namely that far more men – about 68% – lean over the handlebars than women, comparing notably to England’s 73% (see this review, which also points to an upward slope in female cycling in London), even as the Netherlands edges toward parity (Kaufman allows that with wider city coverage of biking services the ratio in New York should flatten as well). In relative terms, the female biking presence looms larger in Brooklyn (see Kaufman’s report).

Now to the workbook, and a few data quality issues. First, 6717 records – about 3% of the whole – bear no birth date and gender information; these absentees belong to a cohort Citi Bike calls Customers – riders who purchase one- or seven-day user passes and presumably aren’t duty-bound to complete their dossiers. The remaining 97% or so are Subscribers, who remit an annual membership fee entitling themselves to “free” 45-minute rides, after which overtime charges begin to gnaw at their credit cards. Thus if you want to break out the rides by age and/or gender you may want to sort the empty records to the bottom of the stack, and loosen them from the data set with our tried-and-true empty row technique (but don’t delete the records; you may want to rejoin them for other breakouts. You could also filter these inside a pivot table, but admitting blanks to a table could court grouping problems). There are also 101 rides ostensibly launched by men (yes, all men) born in 1917 or earlier – a spry bunch to be sure. And finally, note that the data track numbers of rides, not the numbers of discrete individuals taking them.

Those qualifications in place, dock your mouse in the gender field and its numeric codes – 1 for male, 2 for female, and 0 for those unknown Customers. Those identities are workable, but for presentation purposes I’d substitute alphas M for 1 and W for 2 via a pair of Find and Replaces, assuming your PC can withstand 224,000 records’ worth of replaces. And there’s no shortage of interesting data combos spearheaded by gender, e.g., this pivot table of ride-inception hour by gender:

Row Labels: starttime (Group by Hours only)

Column Labels: gender

Values: gender (Count)

 citi1

Bet you’re itching to know where those 433 riders actually went between 3 and 4 AM; well, the information is in the data set. (And note that female rides in February accounted for only 19% of all journeys, at some substantial remove from Sarah Kaufman’s global 32%).

And to uncover the gender proportions across the above times, click in the Values area and step through PivotTable Tools > Options >Show Values As > % of Row Total:

citi2

Female ridership peaks in midday, and you’re probably unsurprised by its decided underrepresentation at 3 AM.
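% of Row Total simply divides each cell by its row’s sum; here’s the arithmetic run over invented hour-by-gender counts:

```python
from collections import defaultdict

# Invented ride counts by (hour, gender).
counts = {(12, "M"): 800, (12, "W"): 250, (3, "M"): 400, (3, "W"): 33}

row_totals = defaultdict(int)
for (hour, gender), n in counts.items():
    row_totals[hour] += n

# Each cell as a share of its hour's total.
shares = {key: n / row_totals[key[0]] for key, n in counts.items()}
print(round(shares[(12, "W")], 3))  # 0.238
```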

And then I had a bright idea, an event of sufficient intermittency to leave me no choice but to share it with each and every one of you. The Citi Bike dataset may well be the first I’ve encountered in which each of its records incorporates a pair of latitudes/longitudes – the first fixing the point of ride inception, the second its terminus. And so it seemed to me that if you could plot the distance separating the two latitudes and then the comparable interval for the longitudes, you’d have in effect measured the two perpendicular sides of a right triangle, whereupon the no less tried-and-true Pythagorean theorem could be called off the bench to calculate the distance between start and end stations. Remember a² + b² = c²?

It sounds downright doable, and in fact it is; indeed, I was avid to recount just how and to the inevitable peals of reader appreciation, when Pythagoras tapped me on the shoulder, took me aside, and in his best broken English reminded me that his hypotenuse describes an unrelentingly straight, as-the-crow-flies line, the kind of locus beaming through Manhattan’s rectilinear uproar that only a crow – or a helicopter – could navigate. In reality, though, on-the-ground biking distances could, and probably should, be totalled more unassumingly, by simply adding the latitude-to-latitude and longitude-to-longitude distances – that is, two of the sides of that right triangle. Square root? No – more like a square route, then.

Having twisted the dimmer down on my bright idea, let’s try then to pace off the distance of February’s very first Citi Bike ride (sorted chronologically by Citi Bike in row 2), one embarking precisely at midnight from Washington Square East in deepest Greenwich Village, and coming to rest at 12:06:22 at Stanton and Chrystie Streets in the storied Lower East Side.

So what do we have to do? First, understand that lines of latitude are spaced equidistantly up and down the planet (or so the web tells me), with contiguous degrees of latitude about 69.65 miles apart. Thus, with the start and end latitudes declaring themselves in columns F and J, let’s claim the P column and write in P2

=ABS(J2-F2)*69.65

I get about .571. The ABS function relieves a formula’s result of its sign, such that applying ABS to either 3-7 or 7-3 will return just 4; and since we can’t know when subtracting a ride’s end latitude from its start will output a negative number (depending on the rider’s direction), the ABS enables us not to care about it.

Marking off the distance between any two longitudes is a touch more pause-giving, because these verticals radiate from the poles, the distances between them dilating as they near the equator. After some stops and starts across the web I came up with a New York-specific inter-longitude distance of about 52 miles between degrees. Going with that, let’s creep into the Q column and enter in Q2

=ABS(K2-G2)*52

 

Here I get around .221. Add .571 and .221, or

=P2+Q2

in the R column and our midnight cowboy seems to have pulled into Stanton and Chrystie Streets around .792 miles after he started (Pythagoras would have gotten around .61 miles, by the way).
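The whole two-leg routine condenses to one function; the 69.65 and 52 miles-per-degree figures are the ones above, and the coordinates in the call are invented:

```python
def taxicab_miles(lat1, lon1, lat2, lon2,
                  miles_per_lat=69.65, miles_per_lon=52):
    # ABS-style legs: north-south plus east-west, no hypotenuse.
    return (abs(lat2 - lat1) * miles_per_lat
            + abs(lon2 - lon1) * miles_per_lon)

# Invented coordinates a hundredth of a degree apart each way.
print(round(taxicab_miles(40.73, -73.99, 40.72, -74.00), 4))  # 1.2165
```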

Of course we’re approximating (after all, Broadway traces a hypotenuse-like diagonal through much of Manhattan), though I think our guess is pretty sound; but for rigor’s sake I turned to Stephen P. Morse’s cool distance-calculation site, on which users can submit pairs of latitudes and longitudes and learn how far it is from A to B. Working with my original Pythagorean metric (Morse seems to be working with right triangles, too), my numbers reached a 99% agreement with his; and after playing around with our original New York 52 miles-per-longitude and bumping it down to 49.2 miles I came closer still to Morse’s values. Thus by substituting 49.2 for 52 in the formula above, I wound up with a trip of .780 miles.

Whew. In any case, if you’re agreeable to the above you can copy the =P2+Q2 formula in R2 down its column (and name this field) and hasten to pave it all over with a Copy > Paste Special > Values, thus sending 220,000 byte-eating formulas packing. And you could probably delete the P and Q columns, too, dispelling another 440,000 formulas; those data have already played their part.

And now you can do some neat stuff. I told you I was happy; but now it’s time for lunch.

Hey – anyone know where that loaf of bread went?

 

Marathon Data, Part 3: A Matter of Time(s)

9 May

 

It’s the marathon we’ve been considering but it’s raised some hurdles en route, even as we seem to have risen above them without skinning our knees. And now that we’ve handcrafted our gender variable – to look back at one of those hurdles – we might as well mix it in a pivot table or two with some other ingredients, and season to taste.

We could, for example, see what we shall see about the distributional relation between gender and age category, as our data understand that latter attribute, e.g.

18-39
40-44
45-49
50-54
55-59
60-64
65-69
70+

You’re wondering – with good reason – about that 18-39 interval and its anomalous 21-year wingspan, its incongruity underlined by runners’ overwhelming representation in that most spacious category. Try, simply

Row Labels: Category
Values: Category (Count; it’s a text field, after all)

3race1

 

55% of the competitors converge inside the 18-39 bracket, with no drill-downs in sight. To coin a phrase, you can’t always get what you want. And note the invisible count for blanks, because by definition category-less records have no data to count. But what about a gender proportioning of the runners by age?

Steer Gender into Column Labels and swap Gender for Category in the Values area, because the former field is wholly populated, and hence thoroughly countable. Click PivotTable Tools > Options > Show Values As > % of Row Total and:

3race2

I’m moderately surprised. The data limn a near-straight-line association between category and gender (excepting the blanks, an interesting discrepancy); female participation droops with age, and, demographics being what they are, I would have supposed quite the opposite. Crudely line-charted, I get

race chart

 

The story angle’s all yours, journos.

And while we’re at it, what about split differences reckoned by age and gender?
Row Labels: Category
Column Labels: Gender
Values: Split Pct Difference (Average)

3race3

 

That’s interesting too. Women match their marathon halves more closely than men across the age gamut; and how that might square with women’s relatively slower times is something to be thought through. This one’s for you, sports physiologists.

And for some breakouts driven by runner country? Again, no field devotes itself to nationality alone, but the data are there, packed alongside runners’ first names in the Name field in what appears to be a constant position following the name. As such, we can open a column to the right of Name (that’s G in my workbook), dub it Country, and enter in G2:

=RIGHT(SUBSTITUTE(F2,")",""),3)

And copy down the column. That’s "" and not " " grazing the closed parenthesis in there, and it imparts a character of no length, if that’s not too metaphysical. Put otherwise, we’ve effectively deleted the ) character (for a review of SUBSTITUTE, see my May 17 post). The expression above exchanges the closed parenthesis trailing the three-letter country code for that "", delivering, for example:

Laura J (GBR

And once that reformation is effected a simple RIGHT procedure shaves each entry’s last three characters, and brushes them into the Country field.

That looks good, but in the course of vetting our new country data I tripped over a thimble’s worth of records naming countries that you won’t find anywhere in your atlas. For example, Stuart Richard McDonald (overall place finisher 22182) appears to hail from a country called ard:

3race4

You see what’s happened. A shard from Mr. McDonald’s first names has splintered into the Country field, with GBR joining a country club – that is, the Club field. It’s a data entry misstep, but if to err is human, the repair calls for a formula. I’ll try this rewrite:

=IF(RIGHT(F2,1)=")",RIGHT(SUBSTITUTE(F2,")",""),3),RIGHT(SUBSTITUTE(H2,")",""),3))

That one seems to work. The formula asks if the final character in the F cell data is that closed parenthesis. If it is, proceed with the first expression as we detailed it above. If it isn’t – that is, if the closed parenthesis sealing off the country code has gone missing – look to the H column instead and execute the RIGHT/SUBSTITUTE routine there; and you could finalize the process by putting a Paste Special > Values through its paces to cut back on the byte count. And now you can break out the numbers and names by country, too.
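The repaired formula’s logic, transplanted to Python for the skeptical (the names are invented; the field layout follows the post):

```python
def country_code(name_field, club_field):
    # Mirror the IF/RIGHT/SUBSTITUTE repair: if Name ends with ")",
    # strip the parenthesis and take the last three characters;
    # otherwise the code has spilled into the Club field, so look there.
    if name_field.endswith(")"):
        return name_field.replace(")", "")[-3:]
    return club_field.replace(")", "")[-3:]

print(country_code("Laura J (GBR)", "Serpentine RC"))  # GBR
print(country_code("Stuart Rich", "GBR)"))             # GBR
```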

And there’s another data curiosity making a rightful claim on your attention. You may have been drawn to the unrelieved pound signs (or hash marks, to you Britons) obtruding upon the DNS field (don’t be diverted by that field name; the data comprise runner split differentials expressed in time terms), here and there:

3race5

The pound signs won’t be relieved by a column autofit; rather, they hint at a mathematical impossibility of sorts. Because the numbers in effect express chronological time-of-day data, an outlaw consequence of the time format (see last week’s post), negative splits – in which second marathon halves are run more quickly than first – register an inexpressible negative time of day. You can’t have that in this cosmology, and hence the pound signs. If, however, you format the column in standard Number terms, actual usable negative numbers emerge. Just one of those spreadsheet things.

And here’s something else to look at. If you sort the data in the Split Pct Difference column smallest to largest – thus heading the column with the runners whose second halves were comparatively the fastest – the splits get rather interesting, starting with Makaela Alison Papworth, a woman whose 30,579th-place finish belies her first split of 4:00:02, a mere warm-up for a second half of 1:28:38. And if you can’t explain that 63% reduction neither can I. Let’s hope the race officials did.

And then there’s Jason Scotland, he of a first split of 2:45:15 and a second of 1:19:09. And if that sounds dodgy consider that Mr. Scotland, running in this year’s marathon as Scotland-Williams, appears to have smashed that personal best, breaking out times of 2:07:05 and 1:01:42 – the latter hurtling him into world-record territory. But in between – that is, in the 2013 race – he checked in with splits of 2:24:09 and 4:59:45.  Have we reached the limit of the benefit of the doubt, yet?

What do you think, Ms. Papworth?

 

Marathon Data, Part 2: A Matter of Time(s)

2 May

Our analysis of the London marathon data has proceeded apace, but the next bend in the road will slow our jaunt a trifle. We spent the latter part of last week’s post on a think about runners’ split ratios and how these comport with their race finish, and we found a comfortable if imperfect fit between finish positions and the evenness of runner splits. That is: the higher the finish, the greater the split symmetry. Now we want to try something different: to associate splits with the clocked times of the runners, rather than their finishing positions.

And this project poses a few new problems peculiar to the pivot-tabling enterprise. As with last week’s split/finish position table, we again want to group the runners, but this time by some grouped interval of the measured times. We can start here:

Row Labels: Finish Time

Values: Split Pct Difference (Average; again format Percentage to two decimals)

Enter the problem. Initiate a Group Selection for Finish Time and click off the Months grouping default. Then tap Seconds, Minutes, and Hours (note each of these can be clicked independently of the others – don’t press the Ctrl key as you click). But now what?

race21

That was a semi-rhetorical question, if you’re keeping score; and I’m asking it because once you ratify the above selections you get something like this:

race22

 

The data have indeed been grouped, after a fashion, but not in fulfillment of the wish I’d harbored for them. Here the results have been organized and broken out by respective units of time, but have not been binned by any interval I might want to specify for them, e.g. ten-minute tranches:

2:00-2:10

2:11-2:20

2:21-2:30

And so on. Then note that the Grouping shot above seems to have secluded the By field that typically solicits the user’s interval call, e.g.

 race23

There thus seems to be something about numbers – and remember, at base that’s exactly what they are – formatted in Time terms that rebuffs traditional grouping overtures. I’ll repeat my question, then – now what?

I should add that I’ve taken this general problem up elsewhere, namely on the datadrivenjournalism site, but in any event we want to group runner times in what should assume passably time-resembling appearances. So here’s what I’d do: Type the field heading Times in the first free column (mine is Q; remember that last week we had planted that Split Pct Difference field in the midst of the existing fields) and enter, in Q2:

=ROUND(K2*24,3)

Returning 2.079 (when formatted to three decimals) – and that’s not two hours and eight minutes (we’ll get to ROUND shortly). What it is is 2.079 hours, or in the language of clock time marathon winner Wilson Kipsang’s 2:04:44, or thereabouts. What we’ve done is simply multiply race times – which again are really numbers reflecting the portion of a 24-hour day taken by any runner – subjecting these to the 24x multiple. Thus three hours – or .125 of a day – will appear here as 3.000. Considered purely numerically, then, Kipsang’s victory quantifies as .087, or 8.7% of a day, which in 24-hour-multiplier terms batches up to 2.079 – a real number, one which can be grouped with all the others. 2.079 – not 2:079.

And what about ROUND, the function that materially rewrites the real (as opposed to formatted) value of the number to which it’s applied? I want it because had the formula plunged ahead without it here, that 2.079 would have exhibited many more decimal places, all of which would in turn be displayed in the pivot table (carry out the steps below without the ROUND supplement and you’ll see what I mean).
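The conversion is easy to sanity-check in Python; Excel stores 2:04:44 as a fraction of a day, and the ROUND(K2*24,3) maneuver surfaces it as plain hours:

```python
# 2:04:44 as Excel holds it: a fraction of a 24-hour day.
finish_fraction = (2 + 4 / 60 + 44 / 3600) / 24

# The ROUND(K2*24,3) step: a real, groupable number of hours.
hours = round(finish_fraction * 24, 3)
print(hours)  # 2.079
```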

And once all the backstaging is done we can proceed to this table:

race24

Row Labels: Times (grouped at least here for illustration purposes by .2, or 20% of an hour, or 12 minutes)

Values:  Split Pct Difference (again by Average, formatted in Percentage and two decimals)

I get, in excerpt:

race25

 

Here too, as with last week’s finish position/split difference correlation, split difference averages match up with fleetness of foot.

Now what about that gender workaround to which I pointed last week, a restorative step that would institute a gender field in the source workbook? After a not inconsiderable time (probably more considerable than it should have been) I toughed out something like this:

Custom Sort the data, first by Place Gender and then by Place Overall.

Shoot a new column between B and C, call it Gender, and in what is now C2 enter:

=IF(A2<MAX(A$2:A3),"M","W")

And copy down the column.

What is the formula attempting to do? First, the two-field sort puts Place Gender finishers in their place, such that both 1s, 2s, etc. stand neatly atop one another. Next, the second sorted field, Place Overall, proceeds from an assumption: that, for any pair of men and women holding down an equivalent Place Gender – for example, a man and woman finishing in a respective 600th place – the man’s time should be the swifter. That isn’t sexist, just a statistical and empirical likelihood (the first 263 finishers were men), and an assumption that clearly couldn’t be mapped to other sorts of data, e.g., test scores. The absolute reference for A2 in the nested MAX function allows the range to be extended down the column, e.g., A$2:A3, A$2:A4, A$2:A5, etc., thus identifying the maximum overall finish, or slowest time, to that point in the column. And since the maximum at those successive points should – should – be keyed to a female runner, the formula should post Ms and Ws alongside the appropriate names.

I think this works, though be aware that some finish-place numbers appear to have dropped out of the listing, and that a number of runners completed the course with exactly the same time, earning them duplicated finish numbers. But I don’t think those loose ends compromise the results, which I think are pretty good and presage an interesting series of gender-driven breakouts.
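For the doubters, here’s the same one-row-ahead MAX logic run in Python over an invented, pre-sorted Place Overall column (man then woman within each gender place):

```python
# Invented Place Overall values, sorted by Place Gender then Place
# Overall: within each pair the man's (lower) place comes first.
overall = [1, 264, 2, 300, 3, 310]

genders = []
for i, place in enumerate(overall):
    # MAX(A$2:A3)-style range: the running maximum reaching one row
    # past the current one.
    running_max = max(overall[: i + 2])
    genders.append("M" if place < running_max else "W")

print(genders)  # ['M', 'W', 'M', 'W', 'M', 'W']
```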

Anyway, it seems to me a Part 3 is called for in this discussion, or at least I’m calling for it. Two concluding points here, then: This post’s second screen shot, the one portraying the pivot-tabled Hour, Minute, and Second groupings, pictures an odd chronological datum:

See that 2 AM? In fact, the London Marathon workbook’s Half and Finish Time numbers all seem to record themselves in AM terms, and that’s really and clearly not what we want. Wilson Kipsang’s 2:04:44 hardly means to be understood as a middle-of-the-night four minutes after two, even as it is formatted as such. And it turns out that ridding the numbers of their AM/PM tags is a rather sticky wicket. (See, for example, this discussion.) Just be advised.

And the other point is this: click on that 2 AM, for example, and type something – anything – else in its cell. That something else suddenly overwrites the data, a bit of pre-emption that is, if I may be so colloquial, fairly mind-blowing. Try it; whip up a pivot table flowing from any data source and type into a Row Label cell. Your text takes over – and I thought you couldn’t do that, and I can’t explain it, at least not yet.

I need to think about it. Or is this something everyone knew all along?