Making Up For Lost Time: London Tube Lateness Data

22 May

If you haven’t heard it before – and if you’re espying this post in Sydney or Sofia you probably haven’t – you’re hearing it here:  ride the London underground 15 minutes longer than the ride is supposed to take and you get your money back.

In fact, I’d allow that a sizeable crew of Londoners haven’t heard that before either, but it’s true, and if you doubt me look here. Indeed – your intrepid correspondent has collected twice, a pair of lucrative postscripts to some down time I experienced down under.

It’s all true; but how do the powers that be know you’re on the up-and-up when you press your cash-back claim? They know because they’re watching your Oyster card, your ticket through the transit system that means to record your every point of departure and disembarkation, and whether you really were stuck on that Northern Line idling at Tooting Bec on the 17th at 2:12 pm (yes, editor, there is such a stop).

And that means in turn that all the trips of all the descenders into the underground have to be summed and reckoned, with much of that tallying brought to us here:

 Copy of tfl-tube-performance

The dates in the Key Trends sheet kick off in cell A2 from the second reporting period of the 2011-12 financial year, and don’t be flummoxed by the 02 at the head of the cell. That isn’t February, but rather the second of the fiscal year’s 13 28-day reporting intervals, a curious baker’s dozen that I suspect has something to do with 364’s divisibility by 13, and the cooperative absence of any underground service at all on December 25. But the apportionments aren’t quite that neat. In fact the day total for 2012-13 comes to 365 (note the 29 in cell C26, probably a leap-year emendation), while C27 registers a 27-day interval, and C39 and C52 have inherited 30-day stretches.

In any case, Key Trends gives us more to think about than period day distributions – for one thing, that arpeggio of cascading comments, in striking but largely illegible array:

late1

Apart from bidding us to think about how extended textual reflections might fitly reconcile themselves to the data to which they’re annexed, clarity calls for a most simple remedial step – just click the Show All Comments button in the Comments button group (Review tab), and make the comments disappear. To recall any comment in particular, click on its cell and try Show/Hide Comment, in the singular. Why the sheet insisted on its in-your-face comment revelation is a matter to put before Transport for London.

But once you do get to the data you’ll need to run through a number of think-throughs in order to decide what exactly it is you want to learn from them. Consider for starters the Number of Lost Customer Hours entries. These could be divided by the corresponding Excess Journey Times, thus returning the apparent number of passengers experiencing a tardy ride (of course in view of the 15-minute rule most of these won’t be compensated in any event). You’ll also need to multiply the Lost Hours by 60 so that you’ll be dividing minutes by minutes. If we commandeer the J column, then, and title it Number Late, we could enter in J2:

=F2*60/I2

(Note: no parentheses required here. Excel’s order of operations will attend to the math.) That expression gives me 23,609,932.35 or so latecomers, a rather large number to be sure, but one of course that needs to be juxtaposed to the total rider figure for any given 28-day period. And to proceed here we need to hazard an educated guess. Transport for London puts the annual tube traffic nowadays at around 1.265 billion riders a year. Giving leap years their due, divide that number by 364.25 (remember December 25 takes the day off) and a day’s worth of tube travel yields an average of 3,472,889.5 passengers, which in turn needs to be multiplied by the Days in Period numbers in C. Simplest tack: enter that average in a free cell and name it daily, then slide into the K column, call the prospective field Pct. Late, and post, in K2:

=J2/(daily*C2)

(This time you do need the parentheses. Note in addition the 3.4 million daily average is a current one, and as such may artificially, if slightly, depress lateness percentages for earlier years. The determined researcher could doubtless recover the actual year-specific totals.)
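By way of a minimal sketch of that naming step – the cell address here is purely illustrative – you could park the average itself in, say, N1 by entering:

=1265000000/364.25

and then, with N1 selected, type daily into the Name Box at the left of the formula bar and press Enter; the K2 formula above will then recognize the name.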

Copy down K and observe the notable flux among the percentages, particularly the spikes in K38 and K41. Those quantum leaps owe their king-sizes to job actions, and here you would profit from consulting the attached explanatory comments.

How you’d choose to evaluate these lateness proportions, which tend to oscillate across the 20-30% band (though the variation is real – observe the 14.88% for the period ending October 13 2012 in row 20), is perhaps a matter for the glass half empty/full disputants. A rough-hewn null hypothesis might predict that as many rides would pull into their rider-intended stops late as early, by some margin. A minute-late ride is late, after all, and so a 25% aggregate lateness measure might not fairly serve as a full-on indictment of the underground service. Still, late is late, once the baseline, expected lengths of journeys are decided.

But another simple but trenchant measure of punctuality offers itself from the data. We could divide the number of customer hours lost by Operated Kilometres (as Transport for London would spell it) counted in the G column; that is, the accumulated distance travelled by all trains in the period. It stands to reason that all things being equal greater distances travelled associate with more lost hours, and because we have all the data we need here – unlike the Pct Late metric that works with the same daily passenger average throughout – this little fraction may be all the more incisive. We could title the L column something like Ratio of Hrs Lost to KM, simply divide the value in F by the one in G, and copy down.
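A minimal sketch of that L-column formula – assuming, as above, that row 2 is the first data row, with the lost hours in F2 and the operated kilometres in G2 – would be, in L2:

=F2/G2

Copy it down and the period-by-period ratios fall out directly.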

You’ll also note the matrix of lost customer hours in its eponymous tab, an enhanced compilation of data pushing further back in time to the 2003-4 financial year and treated to an off-the-shelf conditional format. Just one question here: the matrix provides lost customer hours for the first reporting period of 2011-12 – even as the listings for that year in the Key Trends sheet only start at period 2.

Is that what they mean by lost customer hours?

US Railroad Accident Data: Getting Them On Track, Part 2

15 May

Well, give me this: I’ve been trying. I’ve put the data quality questions I put to you in last week’s railroad accidents post to the parties of the first part, the Federal Railroad Administration, and the rest has been silence – at least so far, apart from the auto-responses that, if nothing else, have persuaded me that I’ve indeed emailed the proper agency (and a pair of additional messages to Dr. Frank Raslear, the stated contact person for the accidents workbook, bounced back to me).

With silence comes ambiguity. Have my queries been remanded to some inert, serpentine queue, or have they rather cowed the Administration’s best and brightest into a collective hush?

My self-flattery is not becoming. Either way, and after the battery of amendatory spreadsheet moves I recommended in the last post, we’re left with a data set of 50,000-plus records that are, and aren’t, completely usable. They are, because the data qua data are fit for pivot tabling purposes and the like; and they aren’t because we – or I – remain unsure about their validity and meaning.

But let us suspend disbelief in the interests of proceeding apace, and assume – at least for now – that the data do remember the work hours as they truly unfolded (e.g. we’ll assume someone actually plied that 23½-hour shift clocked in there). You’ll recall that I had called for a pasting of all the data into a single sheet, and then for supplementing the set with a new field that figured work shift duration (and we can call it Shift Duration). I remain bothered about the apparent 2003 vintage of the records, but again, that’s what they’re telling us.

In any case, it occurred to me that we might first want to construct a pivot table that would simply count and differentiate accident reports by Accident Type – that is, return the number of records citing non-human and human factors:

Row Labels: Accident Type

Values: Accident Type (Count)

But guess what – when I initiated the standard pivot table routine a decidedly non-routine message jutted its jaw in my path:

 blograil1

I don’t recall having seen that alert before, but look behind the message to the Create Pivot Table window and note the dimming of the typically-defaulted New Worksheet radio button. Then return to the Review ribbon and view what I had failed to see during last week’s post: the illuminated Protect Workbook button. (It should be added here that the FRA’s instruction to unprotect the sheets – an effort that bore fruit last week – asks the user to “click on “Tools” from the top of your screen, then select “Protection” and click on “Unprotect”, all of which sounds like an Excel 2003-specific command sequence. Perhaps the data really are that old.) Protect Workbook – a preventive step I’d allow that few users actually take – seals off spreadsheets in ways that are wholly different from the work performed by Protect Sheet. The latter command bats away attempts to enter data (among other things); Protect Workbook stymies what it calls structural changes to the entire file, e.g., the introduction of a new worksheet – and that’s why New Worksheet is grayed out; there’s no new worksheet to be had. Hence the faulty destination reference in the alert (and as you see in the above shot Protect Workbook also resists pivot table construction on an existing sheet).

And while under default circumstances one can simply turn Protect Workbook on and off successively with alternating mouse clicks as per last week’s turn-off of Protect Sheet, the protection here is password-controlled:

blograiil2

And I can’t find the password, nor do I see an allusion to it on the FRA page. If I’m missing something you know where to reach me.

But not to worry. We can simply select the entire data set, paste it to a new, blank workbook, and do our thing as if nothing had happened.  I just can’t explain why this bump in the road decided to bump.

But obstacles aside, once we get there that pivot table looks like this:

blograil3

Shift the numbers into Show Values As > % of Column Total and Human Factors account for 26.98% of the reported accidents, an appreciable but smallish fraction of the whole.

Of course, the data’s ellipticality – that is, the relative dearth on FRA’s page of explanatory deep background on the data – throws a caution at any accounting of what the numbers really tell us, but this is what we have.

We could then pursue the surmise that human-factored accidents associate themselves with work shifts starting at the extremities of the day, that is very early or very late, and perhaps in different distributions from the non-human incidents. We could try something like this:

Row Labels: OnDutyTime

Column Labels: Accident Type

Values: OnDutyTime (Count; again, as % of Column Total. And because we’re compiling percentages that necessarily add to 100% we don’t need Grand Totals):

 blograil4

Recall that those are 50,000 cases tossed into the table, no small universe. Note the human factor percentage for shifts beginning at 6AM, which in absolute numeric terms comes to 1093 accidents. Of course, these data require a definitive set of denominators before salient disproportions in occurrence can be proclaimed; that is, we need to know the actual numbers of workers reporting to their job, and when.  After all, 6AM may simply be the start time that predominates.

And while we’re continuing to wonder about data quality, I’ve only lately come across a record – in row 1300 to be exact, in the pre-merged Quarter 4 sheet. It archives a worker who commenced his/her day at 7:01AM on October 19, and punched out at 1:01AM – on October 21, exactly 42 hours later. The accident on that shift befell the worker at 3:30AM on the 20th, the only day in the span that could have contained a 3:30AM. Among other things, it would appear that, accident notwithstanding, the worker remained on the job for another 21 or so hours. But you’ll have to ask the union about that one.
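For the arithmetic itself, a quick check – assuming, and it is only an assumption, that the B and C entries in that row carry full date-and-time values rather than bare times – could be entered in a free cell:

=(C1300-B1300)*24

That should return 42, the span in hours.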

But I’m still waiting for an answer from the FRA to my questions. I’ve emailed them, after all – and that means they know where to reach me, too.

US Railroad Accident Data: Getting Them On Track, Part 1

7 May

Falling asleep on the job – literally – can be an occupational hazard – literally. Some evidence, for example: the incidences of fatigue-associated accidents among railroad staffers, tracked by the US Department of Transportation and compiled here:

 HF_Deidentified_FRA_Work_Histories

Encountered straight on, the data seem analyst-compliant and only too happy to offer their services to interested pivot tablers and other deep thinkers; all the reported dates and incident times, for example, sport verifiably numeric status. But there are some sticking points in there that need to be blunted before they’re smoothed into talking points.

First, a data-organizational alert.  The records for the year have been sliced into quarter-specific sheets, thus placing your permit for pivot table construction on momentary hold; but by copying and pasting the four sheets’ worth of data into a solitary one you’ll be able to start building, and without bribing the inspector besides (the time-warped Alt-D, P-conjured PivotTable and PivotTable Chart Wizard consolidation routine we’ve discussed in an earlier post won’t work here, for reasons with which I won’t bore you).

And then we’re left to account for the 490 rows in Quarter 4 starting at 10024, all of whose 2,450 cells store data that, diverse formats notwithstanding, evaluate to zero. Speculation aside they should be deleted, and not merely separated from the substantive pack by our celebrated interpolated row. Rather, delete them simply because they’re effectively empty.

And for another curiosity – about which I may not have to inform you if you’ve already downloaded the workbook – note that the sheets are protected, a defensive mode I don’t recall having encountered in a net-available file before (and remember we’ve culled this one from an open data site). The fact of the protection is acknowledged in this supplementary page, that step taken “to prevent accidental data corruption”. Given the workbook’s placement and public availability on the Data.gov site I’m not sure where that corruption might be perpetrated; but in any case a simple tapping of the Review > Unprotect Sheet button will release the cells to your beneficent custodianship.

I was also temporarily mystified by the two Accident Types sharing that eponymous column, until it occurred to me that HF almost surely initials some human factor contributory to the accidents, corroborating the sheets’ title; and by extension NHF points to non-human causatives. And I don’t know whether the matched entries in the Subject ID  field always denote the same railroad worker; we see, for example, that subject ID 051500.0101 in the Quarter 2 sheet records 17 incidents. Same person?

And what about that title, in the wrapped, merged A1? That composited address comprises worksheet titles and field headers, and that won’t do. The very simple workaround: click the Merge & Center button; that’ll restore the headers to their appointed cells – and remember to either delete the title proper, or insert our redoubtable blank row between what are now rows 1 and 2. (But of course if you plan on carrying out that copy-and-paste multi-sheet consolidation, you need perform this cell decoupling only once.)

The On and OffDuty times, on the other hand, seem unproblematic; they appear to mean what they say – that is, recall the start and end points of a worker’s daily stint, although I’m assuming – and not necessarily rightly – that the associated accident in column E may have shortened any given day in the data set. But even here my confidence in that reading is not doubt-free.

To explain: in the prospective interest of correlating accident times with their points of occurrence during the given work shift, I’d move into the F column and enter, in F3:

=C3-B3

That simple subtraction should return a workday length in hours and minutes, discernible as such if you Custom format the results this way:

h:mm

(Note that a single h will nevertheless properly capture a double-digit hourly duration.)

Copy the formula down the column and, among other things (and provided you haven’t yet copied-and-pasted everything into a single sheet), you’ll find very many unaccountably lengthy shifts; in Quarter 2, for example, a 23:30 turn materializes in row 5989. Does that sound possible, or legal?
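If you’d rather eyeball those durations as plain decimal hours – easier to sort or filter for outliers – a variant sketch, again assuming the on- and off-duty times occupy B and C and that a shift doesn’t straddle midnight, could stand in F3 instead:

=(C3-B3)*24

Format the results as ordinary numbers and that row 5989 case reads 23.5.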

Moreover, the year references in the sheets appear to hark back to 2003, and I’m not sure the data aren’t more current than that. The workbook was first made public in 2008, but we’re told here that an update was patched in this March 31. If in fact the years are anachronisms and as such nothing but a grand, aggregated mistake, they could perhaps at the same time be defensibly ignored, precisely in virtue of their constancy. If, after all, the data emanate from the same year, the intrepid pivot tabler will concentrate the analysis on months and days alone.

And for another discrepancy: turn, for example, to Quarter 1’s row 650. It cites an accident at 12:30 am, even as the associated work shift outlined in B650 and C650 spanned 2:55 am to 9:55 am.

rail1

That can’t check out; 12:30 am fell outside that worker’s day, quite apart from the 2/15 date (see below).

And while you’ve probably brought this one to your own attention already (and if you haven’t it’s been brought there right above), just for the record we note the unvarying dates accompanying the accident times. I think we can safely ignore these without much commentary or controversy; I suspect a format misfire explains them all.

But I’d appreciate some answers to the other questions. I hope to get back to you on them, if the Federal Railroad Administration gets back to me.

Word Counts Refined: Part 2

30 Apr

It was not a week ago that I put the matter before you, in the course of our stone-unturning, spreadsheet-pressed search for keywords: In light of the methodology we’ve been touting, would it also be possible to uncover multi-word phrases, e.g. “tax relief”, in addition to the singletons “tax” and “relief” we’ve already learned to count?

It seems we can, and the means of choice toward that end is an array formula, a feature I’ve sedulously avoided until now because of its rep for conceptual obtuseness, a not entirely misdirected allegation. And while it’s true that array formulas have their moments – some of them rather hairy – and equally true that I’m not the industry’s go-to guy for array constructions, you’ll nevertheless profit from making at least a passing acquaintance with them.

Array formulas do the work of multiple formulas, but muscle them into but one expression. In fact, you’ve probably used array formulas already without having been formally introduced to them, because a good many of Excel’s built-in functions buy into the array programme. By way of very good example, consider SUMPRODUCT, which multiplies pairs of values and then, befitting its name, adds them all up. Thus if we subject these entries in A4:B8

text21

to SUMPRODUCT, written as follows:

=SUMPRODUCT(A4:A8,B4:B8)

Excel will multiply the pairs A4 by B4, A5 by B5, etc., and cap the process by totalling all those products, turning out in this case 1369. The by-the -book alternative, on the other hand, would of course require the user to compose five multiplication formulas, e.g. =A4*B4, =A5*B5 (positioned say in the C column), and so on, and bring these together with a =SUM(C4:C8); but SUMPRODUCT subsumes all that activity in its solitary expression (and imagine working with 500,000 pairs of values instead).

Thus SUMPRODUCT qualifies as an array formula because it scoots through all those multiplications, each one glancing off the same pair of parentheses and bounding into the same sum calculation. But some tasks call for user-inspired, customized array solutions, and the savvy to build them.
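Indeed, SUMPRODUCT’s labours could be emulated by a hand-rolled array formula – a sketch, reusing the A4:B8 values above:

=SUM(A4:A8*B4:B8)

committed not with Enter but with the Ctrl-Shift-Enter keystroke described below; it multiplies the five pairs and sums the products, just as SUMPRODUCT does.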

For example, compare these student answers to ten multiple-choice test questions with the correct ones:

text22

These data occupy H6:I16. To deliver an array-derived total of correct answers, I’d enter in a free cell

=SUM(IF(H6:H15=I6:I15,1,0))

Translation: let the formula compare the ten pairs of same-row responses, e.g. H6 with I6, H7 with I7, and so on. Each match delivers a 1, after which all the 1’s are summed (just for the record, the 0 in the expression – signifying the value-if-false contingency of the IF segment – could in fact be omitted, in contradistinction to its indispensability to a standard IF statement. But we’re speaking in Readers Digest terms here, so not to worry. And note that the above expression was originally mistyped: thanks to Mark Serebriakoff for the error catch).

And when that formula is good to go, don’t put it into play with a reflexive press of the Enter key. At the point of completion, array formulas always require the spreadsheet equivalent of a musical triad instead – Ctrl-Shift-Enter. That chordal touch presumably instructs the formula to do its array thing, which here should return 5. And know that the hallmark squiggly brackets that attend every array formula can’t be merely typed; they properly appear only after Ctrl-Shift-Enter is struck. (For a relatively straightforward introduction to array formulas see this entry.)

Now that our crash course has collided with your medulla let’s reprise the keyword question with which we inaugurated the exercise. In the first instance, we’re looking for a way to count the appearances of a specified two-word phrase in our text.

For the sake of the illustration, I’ve conducted the text-preparation routine I described last week on about 620 words lifted from the Conservative Manifesto, and pasted them at B2. You can view the set-up here:

 manifesto excerpt

Suppose I’m looking to count incidences of the phrase “We will”, a phrase that seems to proliferate here, and probably not insignificantly; I’ll enter “We will” in D1. And in another free cell I’ll enter, after having remembered the Ctrl-Shift-Enter coda:

{=SUM(IF(B2:B624&" "&B3:B625=D1,1,0))}

Now what is the formula meaning to do? It’s string-concatenating consecutive words in the single-column text range, but doing so again and again, for each pair of words brocaded down the entire column. Note here that the two ranges, or arrays, as they’re officially called, are staggered – B2 is concatenated with B3, B3 concatenates with B4, and so on, thus capturing every stepped pair all the way down the column (the " " interposes a space that emulates the phrase as it’s written in D1). But note – and this is rather essential – the arrays are equivalently sized; each one here consists of 623 cells, and that parity is fundamental to the array process.

Thus if any contiguous word pair equals the “We will” in D1 it receives a 1; and once followed all the way through, the array formula brings 16 “We wills” to the surface, and I’d suggest that so broadly distributed and inclusionary a phrase is a hook on which the Conservatives want to hang their hat. And you can copy the array formula, squiggles and all, the better to mint duplicates that can search for any other phrase(s); and you’ll be pleased to know that the dups will exhibit relative addressing, just like any other expression.
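And should a three-word phrase pique your interest, the same staggered logic extends – a sketch, assuming the phrase (spaces and all) sits in D1, with the ranges trimmed so that all three arrays stay equally sized:

{=SUM(IF(B2:B623&" "&B3:B624&" "&B4:B625=D1,1,0))}

Each array here spans 622 cells, and the formula again wants the Ctrl-Shift-Enter send-off.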

And by way of postscript to Part 1 we can also propose a friendly amendment to the COUNTIF usages we endorsed there. You’ll recall that we nailed together a most basic pivot table last week that would tally the occurrences of every word in the considered text:

Row Labels: Words

Values: Words (Count, of course; the data are textual)

But what had escaped my Starbucks-flooded attention was the most simple but powerful fact that the table itself performs a global COUNTIF; that is, it develops a count for every word in the corpus. And because all the counts are already there, you need only substitute a simple VLOOKUP that’ll return any particular word count from the pivot table columns.
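A sketch of that lookup, with the addresses purely illustrative: if the pivot table’s word labels run down D starting at D4, their counts alongside in E, and the word you’re chasing is typed in G1, then

=VLOOKUP(G1,D4:E500,2,FALSE)

returns that word’s count (the FALSE insists on an exact match).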

And you could even just sort the words alphabetically, or just bang out a Ctrl-F. That’s a Find as in Find and Replace; not FIND.

Word Counts Refined: Part 1

23 Apr

Last week’s keyword search of the Conservative Manifesto was happy to profit from a quirk of that quest: my apparently inadvertent success at finding a place for each of the words of the Manifesto’s copy-and-pasted text in their very own cells, e.g.

 text1

And once having got that far, we could easily go about the task of compiling keyword counts of Manifesto themes, by directing a series of COUNTIFs at the word isolates. But almost by definition quirks do not predominate, and the post allowed that most copies-and-pastes of text will, rather, assume this alignment:

text5

In which each line remands its text to a single cell; and if the keyword search is to find its quarry here, a Plan B, such as the one I screen-shot last week:

{=SUM(LEN(SUBSTITUTE(TRIM(A2:A100)," ","  "))-LEN(TRIM(A2:A100)))+COUNTA(A2:A100)}

 

might have to answer the call. But could there be a more economical, user-friendly Plan C out there, one that would be happy to look no further than the COUNTIF that’s served us so reliably till now? There is, I think, though you’ll need to start the process with another reliable friend – none other than Word.

But, you may want to ask, why do we want to make resort to what is in effect an add-in, however familiar? After all, and as intimated last week, we could bring about the dicing of words we see in the first screen shot above by pumping the data through Excel’s Text to Columns parser, lodging Space as our delimiter. That suggestion is rightly posed, but let us see how the text breakout according to Word serves us better.

First, copy and paste the relevant text into Word. (Note by way of additional preliminary that if the text is PDF-sourced, as is the Conservative Manifesto, you can Save As the doc instead as a bare text file, a healthful austerity measure that should streamline the editing process – because if you’re working with an ornately-designed PDF you’ll probably want to vet the content for pull quotes, e.g.

 text2

and recurring, redundant headers (The Conservative Manifesto 2015) that pull at and skew the aggregate text totals.)

Next, select all the text and click (remember, we’re in Word now) Insert > Table > Convert Text to Table. Once there, enter 1 for the number of columns in the ensuing dialog box:

 text3

But why?  We want a nice, minimalist 1 up there because we’ll soon want to cast the data into a pivot table, and a single Word table column translates into what is, after all, a single pivot table field. And if you want to know exactly why we’d want to pivot table the words, when we spent all of last week’s post doing nothing of the sort, I hope to establish relevance, as the lawyers say.

And before you take leave of Convert Text to Table click Other beneath the Separate text at heading and type a space in Other’s rectangular field – because the spaces in the document are to act as delimiters of the words on either side of them, chipping each into a table cell. Click OK and let the table pull the text into its unitary column (note: my laptop can’t distill the 30,000-word Conservative Manifesto into a one-column table. For one thing, my Number of rows reports a negative number when I earmark but one column for the table, and that can’t happen. If you were determined to word-count the Manifesto or any other very large document via this means you might have to table the text in installments.)

Once the column drills down its pages, we then want to do something about the non-alpha accoutrements attaching like burrs to the words – that is, the sundry punctuations that deal a joker to COUNTIF wild-card searches. By way of review, if we’re counting instances of the word “tax” and we’re confronted with “tax,” a COUNTIF on the order of

=COUNTIF(A1:A100,"tax")

won’t add “tax,” to its tally. And if you interject the perfectly legal "tax"&"*" instead, you’ll net “tax,”, but also “taxation” and “taxi”; and that’s probably not what you had in mind either. Rid the text of commas, quotes, and the like, and a clean search for “tax” gets cleaner, and it’s here where Word’s far mightier Find and Replace feature justifies our recourse to that application (for an eye-watering treatment of its herculean capabilities see Tibor Környei’s piece here).

Start up Find and Replace and enter:

text4

Note the Use wildcards ticked box; you need to click the More button (the one that presently shows <<Less, above) in order to find it.

And what’s this bit of cryptography doing? It’s Find and Replace’s way of searching for all non-alpha characters (the [!] signifies not; the [A-z] span looks for every alpha irrespective of case) and replacing them here with nothing; and what’s left are alphas only – the stuff of words. And it should work. Depending on the length of the excerpt with which you’re working, you may want to downsize its operative font substantially, thus manufacturing a smaller complement of pages (remember you’re not interested in reading the words here anyway; they’re to be pasted to Excel. And note also that I see that, for whatever reason, some non-alphas, e.g. [ and an underscore, don’t seem to capitulate to the Find and Replace routine. Because, as we’ve seen, [ is a bit of Find and Replace code, it may be interpreted as such and left alone. You can still purge these recalcitrants, though, via a standard Find and Replace – that is, Find [ and replace it with nothing).
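For the record, the entries behind that screenshot presumably run along these lines – a reconstruction, not gospel:

Find what: [!A-z]

Replace with: (leave the field empty)

with the Use wildcards box ticked; then click Replace All.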

And once your column of words has completed its run down the pages, copy and paste the whole thing to a spreadsheet, say to B2 (reserve B1 for a field heading, say Words). Next, make way for an old trick: enter 1 in A2, 2 in A3, jot a field header in A1 (e.g. Number), and run an autofill down the column (though you’ll probably have to drag the fill handle in lieu of the standard double-click of the handle, because the latter fill mode will come to rest alongside any blank cell in B, and blanks there should be). We’re numbering here because we want to dispose of the empty cells in B (ones that have likely received the source text’s Enter/line break codes, transported to the spreadsheet as blanks). Sort B by A to Z and let the blanks drop to the floor of the column. Insert a blank row between the bottom word and first blank, and then sort by the numbers in A, in order to restore the primeval word sequence of the text.

Now you’re ready to word count via the COUNTIF deployments we described last week. But again, apart from the non-alpha/punctuation sticking point we’ve unstuck here, COUNTIF had already begun to do this work last post. So what’s new? This: a pivot table, an elementary one:

Row Labels: Words

Values: Words (Count)

And that rudimentary construction delivers a global word frequency count of each and every word in the document, and as such affords an encompassing look at every potential keyword. Of course you’ll have to look past the irrepressibly populous “the”, “and”, “or”, “of”, and kindred gear teeth before you look any substantive keyword in the eye, but that’s part of the cost of doing business.

But there’s something else to consider. We’ve done a pretty good job of readying individual words for the counting. But what if we want to count a pair of them, conjoined in a phrase? We can count “budget” and count “deficit” now, but what about “budget deficit”? And no; it won’t do to count each of the two and simply sum the results, because that totalling won’t speak to instances of their adjacency. So what to do?

I think I have a workaround, so hold on.  I know you love a teaser.

The Conservative Manifesto: Word Counts of the World Unite

16 Apr

David Cameron wants your vote, but he can’t have mine; neither can Ed Miliband, Nick Clegg, Nigel Farage, or Natalie Bennett, if they and you must know. I’m a legal alien, you see, and can’t exercise the franchise here in the UK, where they tell me there’s an election May 7, and I believe them (they – and they know who they are – are perfectly happy to take my taxes, though; complaint desk, please).

Apropos all of the above, if you’ve been waiting for the Conservative Manifesto it’s here at last, minted in all its PDF, ethnically-sautéed glory, and ready to head your way from the (curiously) small Read it here link bordering the accompanying video. And in this connection the Guardian has put its abacus to the manifesto and  keyword-counted the document:

 man1

(Don’t be misled by the “mentioned on speech” title clause. The Guardian states elsewhere in its story that the counts are in fact drawn from the actual manifesto.)

Question then: could you and I do the same? Answer: yes, or something pretty much like it. After copying and pasting the Manifesto into a spreadsheet, we might be able to throw a function or two – more particularly COUNTIF or, depending on the way in which the data break, FIND – at the text and get an enumerative handle on the keyword counts.

And so in the interests of pre-testing I cracked open the Manifesto and dragged, copied, and pasted a swatch of text into a sheet, looking something like this:

man2

Surprise. Surprise, because pasted text typically narrows itself into a single column, its row/lines protruding therefrom, e.g. this copy and paste:

 man3

That shot isn’t a paragon of lucidity but trust me; all these lines and all of their text are seated in the A column; and undifferentiated data of this stripe explain the need for a Text to Columns feature, the enabler of choice for parsing the lines into smaller, meaningful fields.

But our data are nothing of the sort. The Manifesto text has already done its parsing, breaking each and every word into a cell all its own, for reasons I can’t immediately explain.

And I can’t explain this either: when I later attempted a select-copy-paste replication on the Manifesto, it hemmed itself back into a single column, as per the screen shot directly above. What had I done right/wrong the first time? Or had the Conservatives shipped a reformat while I wasn’t looking? I really don’t know, at least not yet; but I do have the unlovely word-to-a-cell version for you, and it’s here:

Conservative manifesto

Mysteries aside, one might ask about the respective analytical potential inhering in the two motifs exemplified above. I would submit that the rendition before us – in which each Manifesto word owns an unshared cell address – is superior, because its discrete word segregation empowers us to make productive use of the relatively simple COUNTIF.

And for an understanding of that productivity think about how COUNTIF works, in its ultimately binary inclination. If you enter

Day after day it rained

in A10 and ask COUNTIF to look for instances of the word “day” (and let’s hold off on the syntactical issues for the moment) you’ll return 1, not the 2 for which you were hoping. COUNTIF merely tells you if the searched-for item is there at all: if it is, it’ll serve up a 1. If it isn’t, you’ll get a 0 (and not an error message, by the way).

Thus if the text data are conventionally organized, that is, if they’re massed into a single column, COUNTIF can’t accurately tell you how many times a word appears in the excerpt, telling you instead only whether or not the word appears in the line, frequency notwithstanding. But our data, for whatever reason, grant each word a cell; and so the 1 that COUNTIF can deliver will suffice (consider also that, unlike standard database features, COUNTIF can conduct cross-column/field counts).

And by way of further extended preamble let’s try and identify where in the sheet the Manifesto itself actually starts. It seems to me that the Manifesto proper inaugurates itself on row 114, and closes at 3741 (row 3742 features a personal attribution that is clearly extra-Manifesto). I thus named the range sprawling across A114:AD3741 (AD being the outermost text-bearing column) Man (and that name should accompany your download). Note as well that the copy-paste as we see it may have inflicted a measure of apparent text relocation here and there, but if we’re conducting individual word searches, that shouldn’t much matter.

In any case, if we want to count instances of the word “tax”, for example, we could repair to the out-of-range cell A3744, type tax therein, and enter, in B3744

=COUNTIF(MAN,A3744)

You should get 85; I do, at any rate. But we see above that the Guardian counts 132 occurrences of the term, a departure that one assumes could be explained at least in part if the newspaper admitted word variants – e.g. “taxes”, “taxation”, along with “taxes.” or “taxes,” etc. – into its equation. One way we could approximate toward that widened space:

=COUNTIF(MAN,A3744&"*")

The above construction looks in effect for tax*, and so should assemble most of the kindred references to the central term (though it’ll also find and count “taxi”, and of course won’t find “levy” or “duty”. The science is inexact, though it’s still probably more science than art). And with that criterion in place, I get 146, a good deal closer to the Guardian total.

Note the syntax above calls for a string concatenation, even if the item to be counted turns out to be numeric. Thus, for example, counting every number exceeding 100 in cells A1:A100 would require

=COUNTIF(A1:A100,">"&100)

And that string stipulation sets COUNTIF apart from the standard IF, which welds operators to its truly numeric values, e.g.

=IF(A6>100,"Pass","Fail")

In any case, you can try COUNTIF in its several flavors on the other Guardian search terms, and see how your counts compare with theirs (I don’t know how the paper realized its counts, by the way). Thus I get just 19 for “housing”, even as the Guardian comes up with 42. “hous” somehow gets me 37, though, and “home” 79. Did the Guardian simply add totals for incomparably-spelled synonyms? It surely did for deficit/debt, for example, and when I combined the two I got 32 to the paper’s 30 – not bad. Again, the Guardian gets 73 for Jobs/employment; I get 70, but I’m pretty far off on Europe (31, 44 with Euro) and immigration (my count: 26). But I get an in-the-neighborhood 52 for “economy”, near the Guardian’s 58 (though “econo” yields me 109).
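A sketch of that sort of combining – with the synonyms typed straight into the formula rather than parked in cells – might run:

=COUNTIF(MAN,"deficit"&"*")+COUNTIF(MAN,"debt"&"*")

one COUNTIF per term, their results simply summed.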

Again, I don’t know just how the Guardian arrived at its counts, which in some cases put considerable distance between themselves and my totals. Maybe I’ll Tweet them about their methodology.

And by the way – would it be possible to roll up keyword counts even if the data hew to the one-column regimen? It is, but you might have to make recourse to a formula looking something like this:

{=SUM(LEN(SUBSTITUTE(TRIM(A2:A100)," ","  "))-LEN(TRIM(A2:A100)))+COUNTA(A2:A100)}

And explaining that one here would blast my word count sky high.

Chicago Crime Stats: Two Ways of Apprehending The Data

1 Apr

When Big Data gets too big, crack the manual and page furiously to Plan B; and if someone’s put the scissors to Plan B, then hold your head high and beat a dignified retreat. What’s that song about knowing when to fold ’em?

For spreadsheet operatives, too big arrives at record number 1,048,576 (there’s that header row, after all), where the retreat begins its pullout, and the white flag starts waving. Plan B, then, might consist of keeping the data at arm’s length – in an Access table, for example, and querying it from afar. That task is certainly practicable: click Data > From Access, identify the database, and click the desired table, e.g., this shot from Sean Lahman’s must-have baseball stats repository:

chi1

Decide in the dialog box that follows how you want the data to present themselves; if the record count overruns the sheet, click the Pivot Table Report radio button in answer to the Select how you want to view the data in your workbook prompt. You’ll get your pivot table and its Field List, all right, but it’ll be flying on automatic pilot; you won’t see the actual data, because you can’t.

This works; but when I read about data visualizer Joe Lieder’s charted considerations of the 5,700,000-record dataset of crimes perpetrated in Chicago between 2001 and 2015 I began to search for the data, along with the nearest white flag. I guessed right, and found the former in the city’s Data Portal. But what now?

With its familiar American open-data interface, the Portal does open a number of possible download routes (start exploring by clicking the blue Export button in the far right of the screen) including the ones taking us through Excel territory:

chi2

But 5,700,000 records just can’t be squeezed through those channels above. There’s also that OData possibility, up there, a data mode likewise affording at-a-distance access to Excel, provided you can download the free Power Query engine add-in; but it appears that my plebeian version of Excel doesn’t qualify for the utility (but don’t worry; I bought my copy).

In the interests of dignified retreat then, I asked the Chicago Data Portal to filter but two years’ worth of data, 2013 and 2014 instead – but at 577,000 records and nearly 100 MB, no small excerpt. Have your laptop take a deep breath before it attempts to force-feed it to your screen.

Now allow me to ask: what can a pure, un-charted spreadsheet bring to the analysis? How about, for starters, learning something about crime by hour of the day? Among other things, Lieder charts monthly crime fluctuations, but goes no farther. An hourly breakout, then, could be engendered by throwing the Date field data into a pivot table Row Labels area (but not into Column Labels; with around 228,000 unique date-times, they won’t fit in there); but a preliminary Oldest-to-Newest sort of the field turns up nearly 350,000 non-date – that is, text-formatted – entries, an apparent casualty of the AM/PM suffixes clinging to these.

I don’t know why some, and only some, entries should be so embellished (and note I’m asking after truly labelled AMs and PMs. Those time-of-day qualifiers can also associate themselves with certain actual, numeric date formats). After a good deal of dithering over the matter, I pried open a new column between C and D, called it Hour, formatted its cells to zero decimal points, and entered, in what was now D2:

=HOUR(C2)

And copy down the column, of course.

This works, but I don’t entirely know why. Again, there’s a whole lot of (apparent) labels in C, and yet HOUR manages to secure the number it’s seeking from each and every cell in the column, format notwithstanding. Requires additional scrutiny, I think.
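One small probe of the question – a sketch, assuming the date-times continue to occupy the C column – is to ask Excel directly whether a given entry really is text:

=ISTEXT(C2)

A TRUE confirms a bona fide label; copy it down a stretch of the column and you can see which entries HOUR is somehow parsing anyway.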

Then I’d step through this pivot table:

Row Labels: Primary Type (of crime, that is. I then continued with a Filter > Top 10, in view of the 33 types)

Column Labels: Hour

Values: Primary Type (Count; turn Grand Totals off)

And because the outcome reads densely, I went ahead and grouped the hours in bundles of four hours each:

chi3

(Deceptive Practice appears to denote crimes of business deception, by the way. Look here, for example. And note that the grouped hours above are merely numbers, and not derivatives from certified time-formatted values. Note in addition that HOUR retrieves just that – an hourly reading only – and as such will treat 7:58 as 7:00; and that means that the 8-11 really sections off 8:00 through 11:59, for example).

Among other things, the breakout places almost exactly 40% of all narcotics offenses in the broad-daylight hours of 8:00 AM through 3:00 PM (in effect), outnumbering the 35% it counts between 8:00 PM and 3:00 AM, when one might have stereotypically anticipated a spike in this kind of activity. Of course the greater daytime population and its widened customer base may offset the facilitating, clandestine ecologies of nocturnal settings.

No less surprising, perhaps, is the 50% fraction of motor vehicle thefts reported between 8:00 AM and 7:00 PM, when cover of darkness would likewise have been expected to abet the illicit repossessions.

But of course it’s dark at 7:00 AM some of the year, so let’s try this: insert another column, this to the right of Hour, call it Month, and post in what is now E2:

=MONTH(C2)

And copy down. Next, engineer this pivot table:

Row Labels: Hour (in the interests of precision, group by bins of two hours each)

Column Labels: Month

Report Filter: Primary Type (select Motor Vehicle Theft)

Values: Primary Type (Count, necessarily; the data are textual. Select % of Column Total; turn off Grand Totals)

 chi4

Do auto thefts mount in early morning and evening winter hours? Not decisively, but the percentages for the 16-17 (4:00 to 6:00 PM) tranche are slightly supportive. The winter-month percentages there do edge higher (remember that we’ve called for intra-month hour/crime percentages, not proportions across the months for a given hourly frame); in fact the three highest theft rates for the 16-17 bin do emerge in November, December and January. Indeed, the 6-7 (6:00 to 8:00 AM) reading for January, when it’s still dark, stands notably higher than for any other month, but April and May discernibly top October and November here, and so the results aren’t emphatic.

And for a presentational talking point, it could be asked whether this chart by Mr. Lieder:

chi5

in which, in effect, three variables are charted – Year, arrest percentage, and primary crime type – might be better captured by

Row Labels: Primary Type

Column Labels: Arrest (coded either True or False for an arrest or the failure to effectuate it)

Values: Arrest (% of Row Totals; Grand Totals turned off).

It’s just a thought.

And one more thought: the pivot table above turns up these labels:

chi6

And that means the same crime has been recorded with a pair of spellings, a classic no-no. While the absolute total of the two – 33 – isn’t enormous, you’d want to select just one of these entries for the authorized spelling and put the others through a Find and Replace.

Then refresh the pivot table.

 

And by the way, I have vacation next week. That’s the advantage of self-employment – you get to make your own hours.
