Word Counts Refined: Part 1

23 Apr

Last week’s keyword search of the Conservative Manifesto was happy to profit from a quirk of that quest: my apparently inadvertent success at finding a place for each of the words of the Manifesto’s copy-and-pasted text in their very own cells, e.g.

 text1

And once having got that far, we easily go about the task of piling keyword counts of Manifesto themes, by directing a series of COUNTIFs at the word isolates. But almost by definition quirks do not predominate, and the post allowed that most copies-and-pastes of text will, rather, assume this alignment:

text5

In which each line remands its text to a single cell; and if the keyword search is to find its quarry here a Plan B, such as the one I screen-shot last week:

{=SUM(LEN(SUBSTITUTE(TRIM(A2:A100),” “,”  “))-LEN(TRIM(A2:A100)))+COUNTA(A2:A100)}

 

might have to answer the call. But could there be a more economical, user-friendlier Plan C out there, one that would be happy to look no further than the COUNTIF that’s served us so reliably till now? There is, I think, though you’ll need to start the process with another reliable friend – none other than Word.

But, you may want to ask, why do we want to make resort to what is in effect an add-in, however familiar? After all, and as intimated last week, we could bring about the dicing of words we see in the first screen shot above by pumping the data through Excel’s Text to Columns parser, lodging Space as our delimiter. That suggestion is rightly posed, but let us see how the text breakout according to Word serves us better.

First, copy and paste the relevant text into Word. (Note by way of additional preliminary that if the text is PDF-sourced, as is the Conservative Manifesto, you can Save As the doc instead as a bare text file, an healthful austerity measure that should streamline the editing process – because if you’re working with an ornately-designed PDF you’ll probably want to vet the content for pull quotes, e.g.

 text2

and recurring, redundant headers (The Conservative Manifesto 2015) that pull at and skew the aggregate text totals.

Next, select all the text and click (remember, we’re in Word now) Insert > Table > Convert Text to Table. Once you check in, enter 1 in the ensuing dialog box:

 text3

But why?  We want a nice, minimalist 1 up there because we’ll soon want to cast the data into a pivot table, and a single Word table column translates into what is, after all, a single pivot table field. And if you want to know exactly why we’d want to pivot table the words, when we spent all of last week’s post doing nothing of the sort, I hope to establish relevance, as the lawyers say.

And before you take leave of Convert Text to Table click Other beneath the Separate text at heading and type a space in Other’s rectangular field – because the spaces in the document are to act as delimiters of the words on either side of them, chipping each into a table cell. Click OK and let the table pull the text into its unitary column (note: my laptop can’t distill the 30,000 Conservative Manifesto into a one-column table. For one thing, my Number of rows reports a negative number when I earmark but one column for the table, and that can’t happen. If were you determined to word-count the Manifesto or any other very large document via this means you might have to table the text in installments.)

Once the column drills down its pages, we then want to do something about the non-alpha accoutrements attaching like burrs to the words – that is, the sundry punctuations that deal a joker to COUNTIF wild-card searches. By way of review, if we’re counting instances of the word “tax” and we’re confronted with “tax,” a COUNTIF on the order of

=COUNT IF(A1:A100,”tax”)

won’t add “tax,” to its tally. And if you interject the perfectly legal “tax”&”*” instead, you’ll net “taxi,”, but also “taxation” and “taxi”; and that’s probably not what you had in mind either. Rid the text of commas, quotes, and the like, and a clean search for “tax” gets cleaner, and it’s here where Word’s far mightier Find and Replace feature justifies our recourse to that application (for an eye-watering treatment of its  herculean capabilities see Tibor Környei ‘s piece here).

Start up Find and Replace and enter:

text4

Note the Use wildcards ticked box; you need to click the More button (the one that presently shows <<Less, above) in order to find it.

And what’s this bit of cryptography doing? It’s Find and Replace’s way of searching for all non-alpha characters (the [!] signifies not; the [A-z] span looks for every alpha irrespective of case) and replacing them here with nothing; and what’s left are alphas only – the stuff of words. And it should work. Depending on the length of the excerpt with which you’re working, you may want to downsize its operative font substantially, thus manufacturing a smaller complement of pages (remember you’re not interested in reading the words here anyway; they’re to be pasted to Excel. And note also that I see that, for whatever reason, some non-alphas, e.g. [ and an underscore, don’t seem to capitulate to the Find and Replace routine. Because as we’ve seen, [ is a bit of Find and Replace code, it may be interpreted as such and left alone. You can still purge these recalcitrants, though, via a standard Find and Replace – that its, Find [ and replace it with nothing).

And once your column of words has completed its run down the pages, copy and paste the whole thing to a spreadsheet, say to B2 (reserve B1 for a field heading, say Words). Next, make way for an old trick: enter 1 in A2, 2 in A3, jot a field header in A1 (e.g. Number), and run an autofill down the column (though you’ll probably have to drag the fill handle in lieu of the standard double-click of the handle, because the latter fill mode will come to rest alongside any blank cell in B, and blanks there should be). We’re numbering here because we want to dispose of the empty cells in B (ones that have likely received the source text’s Enter/line break codes, transported to the spreadsheet as blanks). Sort B by A to Z and let the blanks drop to the floor of the column. Insert a blank row between the bottom word and first blank, and then sort by the numbers in A, in order to restore the primeaval word sequence of the text.

Now you’re ready to word count via the COUNTIF deployments we described next week. But again, apart from the non-alpha/punctuation sticking point we’ve unstuck here, COUNTIF had already begun to do this work last post. So what’s new? This: a pivot table, an elementary one:

Row Labels: Words

Values: Words

And that rudimentary construction delivers a global word frequency count of each and every word in the document, an as such widens an encompassing look at every potential keyword.  Of course you’ll have to look past the irrepressibly  populous “the”,” and”, “or”, “of”, and kindred gear teeth before you look any substantive keyword in the eye, but that’s part of the cost of doing business.

But there’s something else to consider. We’ve done a pretty good job of readying individual words for the counting. But what if we want to count a pair of them, conjoined in a phrase? We can count “budget” and count “deficit”now, but what about “budget deficit”? And no; it won’t do to count each of the two and simply sum the results, because that totalling won’t speak to instances of their adjacency. So what to do?

I think I have a workaround, so hold on.  I know you love a teaser.

The Conservative Manifesto: Word Counts of the World Unite

16 Apr

David Cameron wants your vote, but he can’t have mine; neither can Ed Milliband, Nick Clegg, Nigel Farage, or Natalie Bennett, if they and you must know. I’m a legal alien, you, see, and can’t exercise the franchise here in the UK, where they tell me there’s an election May 7, and I believe them (they – and they know who they are – are perfectly happy to take my taxes, though; complaint desk, please).

Apropos all of the above, if you’ve been waiting for the Conservative Manifesto it’s here at last, minted in all its PDF, ethnically-sautéed glory, and ready to head your way from the (curiously) small Read it here link bordering the accompanying video. And in this connection the Guardian has put its abacus to the manifesto and  keyword-counted the document:

 man1

(Don’t be misled by the “mentioned on speech” title clause. The Guardian states elsewhere in its story that the counts are in fact drawn from the actual manifesto.)

Question then: could you and I do the same? Answer,: yes, or something pretty much like it.  After copying and pasting the Manifesto into a spreadsheet, we might be able to throw a function or two – more particularly COUNTIF, or depending on the way in which the data break, a FIND, at the text and get an enumerative handle on the keyword counts.

And so in the interests of pre-testing I cracked upon the Manifesto and dragged, copied, and pasted a swatch of text into a sheet, looking something like this:

man2

Surprise. Surprise, because pasted text typically narrows itself into a single column, its row/lines protruding therefrom, e.g. this copy and paste:

 man3

That shot isn’t a paragon of lucidity but trust me; all these lines and all of their text are seated in the A column; and undifferentiated data of this stripe explains the need for a Text to Columns feature, the enabler of choice for parsing the lines into smaller, meaningful fields.

But our data are nothing of the sort. The Manifesto text has already done its parsing, breaking each and every word into a cell all its own, for reasons I can’t immediately explain.

And I can’t explain this either: when I later attempted a select-copy-paste replication on the Manifesto, it hemmed itself back into a single column, as per the screen shot directly above. What had I done right/wrong the first time? Or had the Conservatives shipped a reformat while I wasn’t looking? I really don’t know, at least not yet; but I do have the unlovely word-to-a-cell version for you, and it’s here:

Conservative manifesto

Mysteries aside, one might ask about the respective analytical potential inhering in the two motifs exemplified above. I would submit that the rendition before us – in which each Manifesto word owns an unshared cell address – is superior, because its discrete word segregation empowers us to make productive use of the relatively simple COUNTIF.

And for an understanding of that productivity think about how COUNTIF works, in its ultimately binary inclination. If you enter

Day after day it rained

in A10 and ask COUNTIF to look for instances of the word “day” (and let’s hold off on the syntactical issues for the moment) you’ll return 1, not the 2 for which you were hoping. COUNTIF merely tells you if the searched-for item is there at all: if it is, it’ll serve up a 1. If it isn’t, you’ll get a 0 (and not an error message, by the way).

Thus if the text data are conventionally organized, that is, if they’re massed into a single column, COUNTIF can’t accurately tell you how many times a word appears in the excerpt, telling you instead only whether or not the word appears in the line, frequency notwithstanding. But our data, for whatever reason, grant each word a cell; and so the 1 that COUNTIF can deliver will suffice (consider also that, unlike standard database features, COUNTIF can conduct cross-column/field counts).

And by way of further extended preamble let’s try and identify where in the sheet the Manifesto itself actually starts. It seems to me that the Manifesto proper inaugurates itself on row 114, and closes at 3741 (row 3742 features a personal attribution that is clearly extra-Manifesto). I thus named the range sprawling across A114:AD3741 (AD being the outermost text-bearing column) Man (and that name should accompany your download). Note as well that the copy-paste as we see it may have inflicted a measure of apparent text relocation here and there, but if we’re conducting individual word searches, that shouldn’t much matter.

In any case, if we want to count instances of the word “tax”, for example, we could repair to the out-of-range cell A3744, type tax therein, and enter, in B3744

=COUNTIF(MAN,A3744)

You should get 85; I do, at any rate. But we see above that the Guardian counts 132 occurrences of the term, a departure that one assumes could be explained at least in part if the newspaper admitted word variants – e.g. “taxes”, “taxation”, along with “taxes.” or “taxes,” etc. – into its equation. One way we could approximate toward that widened space:

=COUNTIF(MAN,A3744&”*”)

The above construction looks in effect for tax*, and so should assemble most of the kindred references to the central term (though it’ll also find and count “taxi”, and of course won’t find “levy” or “duty”. The science is inexact, though it’s still probably more science than art). And with that criterion in place, I get 146, a good deal closer to the Guardian total.

Note the syntax above calls for a string concatenation, even if the item to be counted turns out to be numeric. Thus, for example, counting every number exceeding 100 in cells A1:A100 would require

=COUNTIF(A1:A100,”>”&100)

And that string stipulation sets COUNTIF apart from the standard IF, which welds             operators to its truly numeric values, e.g.

=IF(A6>100, “Pass”,”Fail”)

In any case, you can try COUNTIF in its several flavors on the other Guardian search terms, and see how your counts compare with theirs (I don’t know how the paper realized its counts, by the way). Thus I get just 19 for “housing”, even as the Guardian comes up with 42. “hous” somehow gets me 37, though, and “home” 79. Did the Guardian simply add totals for incomparably-spelled synonyms? It surely did for deficit/debt, for example, and when I combined the two I got 32 to the paper’s 30 – not bad. Again, the Guardian gets 73 for Jobs/employment; I get 70, but I’m pretty far off on Europe (31, 44 with Euro) and immigration (my count: 26). But I get an in-the-neighborhood 52 for “economy”, near the Guardian’s 58 (though “econo” yields me 109).

Again, I don’t know just how the Guardian arrived at its counts, which in some cases put considerable distance between themselves and my totals. Maybe I’ll Tweet them about their methodology.

And by the way – would it be possible to roll up keyword counts even if the data hew to the one-column regimen? It is, but you might have to make recourse to a formula looking something like this:

{=SUM(LEN(SUBSTITUTE(TRIM(A2:A100),” “,”  “))-LEN(TRIM(A2:A100)))+COUNTA(A2:A100)}

And explaining that one here would blast my word count sky high.

Chicago Crime Stats: Two Ways of Apprehending The Data

1 Apr

When Big Data gets too big, crack the manual and page furiously to Plan B; and if someone’s put the scissors to Plan B, then hold your head high and beat a dignified retreat. What’s that song about knowing when to fold ’em?

For spreadsheet operatives, too big arrives at record number 1,048,576 (there’s that header row, after all), where the retreat begins its pullout, and the white flag starts waving. Plan B, then, might consist of keeping the data at arm’s length – in an Access table, for example, and querying it from afar. That task is certainly practicable: click Data > From Access, identify the database, and click its desired table e.g, this shot from Sean Lahman’s must-have baseball stats repository:

chi1

Decide in the dialog box that follows how you want the data to present themselves; if the record count overruns the sheet, click the Pivot Table Report radio button in answer to the Select how you want to view the data in your workbook prompt. You’ll get your pivot table and its Field List, all right, but it’ll be flying on automatic pilot; you won’t see the actual data, because you can’t.

This works; but when I read about data visualizer Joe Lieder’s charted considerations of the 5,700,000-record dataset of crimes perpetrated in Chicago between 2001 and 2015 I began to search for the data, along with the nearest white flag. I guessed right, and found the former in the city’s Data Portal. But what now?

With its familiar American open-data interface, the Portal does open a number of possible download routes (start exploring by clicking the blue Export button in the far right of the screen) including the ones taking us through Excel territory:

chi2

But 5,700,000 records just can’t be squeezed through those channels above. There’s also that OData possibility, up there, a data mode likewise affording at-a-distance access to Excel, provided you can download the free Power Query engine add-in; but it appears that my plebeian version of Excel doesn’t qualify for the utility (but don’t worry; I bought my copy).

In the interests of dignified retreat then, I asked the Chicago Data Portal to filter but two years’ worth of data, 2013 and 2014 instead – but at 577,000 records and nearly 100 MB, no small excerpt. Have your laptop take a deep breath before it attempts to force-feed it to your screen.

Now allow me to ask: what can a pure, un-charted spreadsheet bring to the analysis? How about, for starters, learning something about crime by hour of the day? Among other things, Lieder charts monthly crime fluctuations, but goes no farther. An hourly breakout, then, could be engendered by throwing the Date field data into a pivot table Row Labels area (but not into Column Labels; with around 228,000 unique date-times, they won’t fit in there); but a preliminary Oldest-to-Newest sort of the field turns up nearly 350,000 non-date – that is, text-formatted – entries, an apparent casualty of the AM/PM suffixes clinging to these.

I don’t know why some, and only some, entries should be so embellished (and note I’m asking after truly labelled AMs and PMs. Those time-of-day qualifiers can also associate themselves with certain actual, numeric date formats). After a good deal of dithering over the matter, I pried open a new column between C and D, called it Hour, formatted its cells to zero decimal points, and entered, in what was now D2:

=HOUR(C2)

And copy down the column, of course.

This works, but I don’t entirely know why. Again, there’s a whole lot of (apparent) labels in C, and yet HOUR manages to secure the number it’s seeking from each and every cell in the column format notwithstanding. Requires additional scrutiny, I think.

Then I’d step through this pivot table:

Row Labels: Primary Type (of crime, that is. I than continued with a Filter > Top 10, in view of the 33 types)

Column Labels: Hour

Values: Primary Type (Count; turn Grand Totals off)

And because the outcome reads densely, I went ahead and grouped the hours in bundles of four hours each:

chi3

(Deceptive Practice appears to denote crimes of business deception, by way. Look here, for example. And note that the grouped hours above are merely numbers, and not derivates from certified time-formatted values. Note in addition that HOUR retrieves just that – an hourly reading only – and as such will treat 7:58 as 7:00; and that means that the 8-11 really sections off 8:00 through 11:59, for example).

Among other things, the breakout totals almost exactly 40% of all narcotics offenses in the broad-daylight hours of 8:00 AM through3:00 PM (in effect), outnumbering the 35% it counts between 8:00 PM and 3:00 AM, when one might have stereotypically anticipated a spike in this kind of activity. Of course the greater daytime population and its widened customer base may offset the facilitating, clandestine ecologies of nocturnal settings.

No less surprising, perhaps is the 50% fraction of motor vehicle thefts reported between 8:00 AM and 7:00 PM, when cover of darkness would likewise have been expected to abet the illicit repossessions.

But of course it’s dark at 7:00 AM some of the year, so let’s try this: insert another column, this to the right of Hour, call it Month, and post in what is now E2:

=MONTH(C2)

And copy down. Next, engineer this pivot table:

Row Labels: Hour (in the interests of precision, group by bins of two hours each)

Column Labels: Month

Report Filter: Primary Type (select Motor Vehicle Theft)

Values:  Primary Type (Count, necessarily; the data are textual. Select % of Column Total; turn of Grand Totals)

 chi4

Do auto thefts mount in early morning and evening winter hours? Not decisively, but the percentages for the 16-17 (4:00 to 6:00 PM) tranche are slightly supportive. The winter month percentages (remember that we’ve called for intra-month hour/crime percentages, not proportions across the months for a given hourly frame). In fact the three highest theft rates for the 16-17 bin do emerge in November, December and January. Indeed, the 6-7 (6:00 to 8:00 AM) reading for January, when’s it’s still dark, stands notably higher than for any other month, but April and May discernibly top October and November here, and so the results aren’t emphatic.

And for a presentational talking point, it could be asked of this chart by Mr. Lieder:

chi5

In which in effect, three-variable are charted – Year, arrest percentage, and primary crime type – might be better captured by

Row Labels: Primary Type

Column Labels: Arrest (coded either True or False for an arrest or the failure to effectuate it)

Values: Arrest (% of Row Totals; Grand Totals turned off).

It’s just a thought.

And one more thought: the pivot table above turns up these labels:

chi6

And that means the same crime has been recorded with a pair of spellings, a classic no-no. While the absolute total of the two – 33 – isn’t enormous, you’d want to select just one of these entries for the authorized spelling and put the others through a Find and Replace.

Then refresh the pivot table.

 

And by the way, I have vacation next week. That’s the advantage of self-employment – you get to make your own hours.

Travelin’ Man: David Cameron Trip Stats, Part 2

24 Mar

It was somewhere in the closing miles of the first leg of our own journey – a skein of uncertain steps shuffling across the David Cameron overseas trip data – that I red-flagged my hardy companions – that’s you – about a bump in the road, a set of roadblocks blockading what, under other data-design circumstances, could have been a direct route toward calculating the Prime Minister’s trip durations. To wit: had the designer been so minded, a Start and Finish Date field would have been made to come along for the ride on the sheet at the outset, these posting embarkation and return dates for each of the PM’s 116 trips. Subtract Start from Finish (plus one; see last week’s post), then, and trip duration is right there for all to see.

But the data don’t break that way, not with 91 of the trip dates read into the manifest in label terms, e.g.

3-5 July 2011

Question, then: can we get there from here? Can we wrest three days from that miserly text construction above, and its 90 fellow travellers?

Good question, though transparency impels me to confess I ‘m the one who asked it. For one thing, and as previewed by the previous post, it seems to me that the Finish Dates pose the far easier ask – because the text on the other side of the date-offsetting dash own the month and year information, and the specifying properties attaching to both.

That incentive took me to cell E5, the first cell in the Finish Date field we had mapped in the last post. And there I wrote:

=IF(ISNUMBER(A5),A5,(DATEVALUE(RIGHT(A5,LEN(A5)-FIND(“-“,A5)))))

Ok – so what’s this one all about? I goes something like this: if the entry in A5 (in the Date of Trip field) is a number – and again (previous post, again), all the Cameron trips comprising exactly one day assume numeric form, then simply return that number/date in A5. If, on the other hand, A5 holds one of those pertinacious labels with a forged date passport, we need to isolate all the text to the right of its dash, which, if nothing else, looks like a date; then, appearances notwithstanding, we’d brace the isolate with a DATEVALUE, which in fact turns a label imposter into a real live, measurable date.

This formula works in E5; and because it does, I entitled myself to copy it to all other the Finish Date cells, wherein they work, too – until E57, where David Cameron’s trip to Strasbourg hits the runway and comes to rest right by the #VALUE! message – and he was expecting “Bienvenue, Monsieur Prime Minister”.

What happened in E57, and about a dozen other Finish Date cells that resound with the same error message? My standard suspicion – that these faltering cells had been saddled with superfluous spaces in ruinous places that were jamming the formulas – proved wrong.

But what proved right was a small – a very small – difference that typified, and beset, the errant formulas. It’s in the dash, of all things; if you look closer, you’ll note that the problem cells inscribe a dash just a touch lengthier than the one nestled in the formulas that do work. My nested FIND up there asks for this

 =FIND(“-“,A5)

 And the delinquent cells don’t have that; they have this:

 And that isn’t the same dash – trust me.

And because it isn’t you can select the Date of Trip range and drive this Find and Replace through it:

 trav21

(You can actually select the respective dashes in any cells in which they featured, and copy-and-paste them into the appropriate Find and Replace fields.)

And that works, too. Every Finish Date cell now boasts a finish date.

But then there is the matter of the Start Date(s).  Extracting whatever is there from the left of every dash in the Date of Trip entries and turning these out in numeric terms is conceivable, e.g. for A5:

=VALUE(LEFT(A5,FIND(“-“,A5)-1))

 That is, find the dash (after the makeover endorsed above), grasp all the characters to its left, and reintroduce them to their numeric provenance via VALUE. For A5 we’d get 20. Moving into the Trip Duration field in F, you could press ahead with

=DAY(E5)-D5+1

That is, use DAY to skim the day value from the full-fledged data in Finish Date (and that’s 21 in E5), subtract 20 from it, and add 1 to account for both the 20th and 21st in the duration (again, 21-20=1).

Now rewrite the expression in D5 to read:

=IF(ISNUMBER(A5),A5,VALUE(LEFT(A5,FIND(“-“,A5)-1)))

 Again, that IF statement leaves the actual, native dates in the A column as they are; they require no retooling. And remember that the real dates always return 1 as their trip duration (see above), requiring in turn this amendment to the expressions in the Trip Duration field:

=IF(ISNUMBER(A5),1,DAY(E5)-D5+1)

If you’re still with me, let’s understand that this all works – most of the time. But it won’t work for the Date of Trip plunked down in A19 – 30 November- 2 December 2010. In fact, on the one hand, none of the formulaic activity on that row concedes an error message. On the other, what we get for Trip Duration in F19 is -42335 – because the Start Date formula in D19 treats 30 November as an authentic date – from 2015, because that’s precisely what happens when you actually enter 30 November in a cell. Absent a concerted year reference, Excel defaults the year to the one during which you’ve entered it. And the same thing happens in A88-A90. (What’s remarkable here is that the VALUE –not DATEVALUE – function is prepared to turn 30 November into a number – and it does).

Question, then: given the determined inconsistency of the data, is it possible to write a formula that could contend with them all, and succeed in realizing Trip Duration in every case? I’m not exactly sure; but if it is, you’d have to do battle with a rather variegated – and long – set of nested IF contingencies before you could walk away from it with head held high. What, for example, are we do with

31 December 2013-2 January 2014 ?

And if you want to reply that in fact none of the Prime Minister’s trips overarched two years, is that quite the point? You want a formula that handles all comers – actual or potential.  Can it be written?

Over to you, learned colleague.

Travelin’ Man: David Cameron’s Trip Stats, Part 1

17 Mar

Does David Cameron have to take off his shoes when he passes through a scanner? I don’t know; but the answer to that question, you’ll agree, rather depends on whether the Prime Minister has to pass through a scanner at all.

They don’t take my calls at 10 Downing anymore, so I am left to speculate. But his international itinerary is a bit more confirmable, having in fact been committed to the public record, at least for his trips through March 2014 by the Guardian’s Data Blog in tabular form, its current iteration (that’s to be explained) duly copied-and-pasted and slightly reorganized by yours truly into this worksheet:

David Cameron’s international trips

(And if you can get that right-clicked Export to Microsoft Excel menu command to work here on the Guardian site – or anywhere else- identify yourself and report here immediately to claim your reward.)

Three columns and but 116 rows worth of official trips should prove eminently workable, but let’s see. As intimated above, the data underwent a change or two from after their coming out some days ago. The companion Guardian story reports 47 countries among the Prime Minister’s line-of-duty destinations, but if you had run this pivot table on the original data (though if you hadn’t see those you’ll have to trust me here):

Row Labels: Country

the table would have draped a directory of 50 countries down the Row Label column. The reason for the discrepancy: the duplicative spellings United States/USA and United Arab Emirates/UAE. Excel doesn’t know any better, or course, and as such treated those four usages as four countries, an impermissibly misleading redundancy. In the interests of deep background, I brought these excesses to the attention of article author Ami Sedghi via a Tweet, and the repairs were made (I never received a reply from Ms. Sedghi for my citizen-journalistic contributions, though we could wonder if the Guardian had made the discoveries independently).

But even with all that vetting I still get 48 countries, and haven’t found any third instance of a same-country-entry inconsistency  – not the biggest deal to be sure, but 48 isn’t 47.

In any event, now that the one country/one name ratio has been restored, a country count follows pretty straightforwardly:

Row Labels: Country

Values: Country (Count, and sort Largest to Smallest)

I get, in excerpt:

trav1

As the Guardian explains, the Belgian predominance has everything to do with European Union-seat Brussels, and Cameron’s recurring need to deal with and justify his coolness toward that organization to its other members.

But the real fun happens over in the Date of Trip field, particularly if you’re keen to calculate trip durations. First note the this-way-and-that alignments of its entries, e.g.

 trav2

The scattershot left-rightness of the data isn’t borne of some random indecision; rather, it attests a pair of data types asserting themselves in the field’s cells. 20-21 May 2010 is clearly a label, for example, and in its present balky state thus resistant to any numeric services we might ask it to perform. The right-aligned 11-Mar-11, on the other hand, is just as clearly a date/number; click its cell and observe the Custom (as in custom date format) legend registered in the Number format field in the Home > Number button group for verification .

Be that as it may, if I do want to get to those trip durations (even without thinking about whether to regard the very first pair of trips, the ones to Berlin and Paris and their identical dates, as one or two visits), I’d first want to insert three columns somewhere, these to be respectively titled Start Date, Finish Date, and Trip Duration, or something like that, the first two formatted in date terms, the third in numeric, no-decimal mode. I’ve chosen columns D through F for those remits.

But we’re not having fun yet, are we? Ok; so let’s start with the actual date-formatted trip data (understanding as a first, or last, principle that any encompassing, formulaic workaround must be ultimately be IF-driven, so that all data-type contingencies could be enabled to yield a trip duration). By apparent definition these dates comprise one-day journeys – because a longer trip would have expressed themselves in label terms, e.g. 20-21 May 2010.

Thus for starters, and for example’s sake, I’d click in D29 – because A29 stores the 11-Mar-11 date, and because I assigned the Start Date field to the D column  – and enter

=IF(ISNUMBER($A29),$A29,””)

Here we’re testing the entry in A29 for its numerical pretensions. If a number it is – that is, an authentic number formatted in date terms – the formula simply returns that date/number. If A29 is a non-number, then “”  moves into the cell – although we need to understand that the double-quotes are nothing but a holding action, mounted in anticipation of a more substantive riposte to any label that might have laid its claim to A29 or any other cell in the data set.

I’d then copy precisely the above formula to E29, wherein the End Dates have set up shop. Why? Because again, any truly date-formatted entry stands for a one-day trip, and so Start and End Dates must of necessity converge. And in F29 – next door, in the Trip Duration field – I’d offer

=E29-D29+1

a formula that simply subtracts the End Date from the Start Date, adding a 1 because we want a trip beginning and concluding on the same day to evaluate to one day, and not no days. The formula’s universal, sure – but only if and once the other trip entries – right now inactive labels – could somehow be made to free their inner, usable numeric status.  And that’s what I’m working on. (Hint: I think the Start Date is a lot harder to nail down than the End Date.)

And unlike the Prime Minister, I’m working in economy class; and the guy in front of me is leaning back into my coffee, my laptop has 12 minutes worth of charge left, and we’re still only halfway over the Atlantic.  And I still can’t figure out to play that bowling game onscreen.

Tracking Crime: British Transport Police Data, Part 2

9 Mar

Striving for cheap dramatic effect is my expository stock in trade, and I did myself proud last post. I had closed the first installment of our survey of British Transport Police crime data with a boffo teaser of a homework assignment, of sorts: namely, what needs to be done about the aggregate station passenger numbers (when available) which restate themselves in every station-specific record. Again, for example, Abbey Wood’s 1515106 passenger figure – an overall tally for the station – stakes all the Abbey Wood entries, even as these count individual crime totals for a particular month in a particular year (I should add that I’m still not unassailably sure if the passenger numbers have rolled up the traffic across the entire February 2011-Jauary 2013 interval with which the data work, but I assume they do. I have emailed data.og.uk about this).

So we’re left with a confusion of strata, as it were; month-specific crime totals always accounted against what appear to be 24 months’ worth of passenger traffic. We seem buying an apple mixed among the oranges, then, when in fact we’re looking for a given month’s crime figure packaged with that month’s – and only that month’s – ridership total. And that doesn’t seem to be there.

Still, there is a bit of useful work to be done here, I’d submit. You could divide each record’s Crime Count by its passenger number, churning out a miniscule but meaningful percentage that could be added to all the other same-station percentages. Because each crime count would divide itself by the same, unvarying station passenger number, the combined percentages should build a supervening percentage of all station crimes, as a fraction of its passenger traffic.

And so those marching orders would direct me to name a new field in the M column, say Percent of Passenger Traffic (again, I’m assuming, as per the previous post, that the All Crime and ASB records have been shown the door), and step down into M2 and enter, with all due mindfulness of the fact that some stations have no passenger numbers:

=IF(J2=”NULL”,””,E2/J2)

Then copy down the column. Because of their smallness, a good many of the numbers evaluate to apparent zeros or scientifically-notated cocoons; and while all of these could be reformatted into legible, multi-decimal-pointed values we’re really interested in their combined effect, once they’ve been channelled into this pivot table:

Row Labels: Station_Name

Values: Percent of Passenger Traffic (sum, formatted in Percentage terms, say to five decimal points. Remove the Grand Totals).

Once you’ve gotten this far you could, for example, right click among the station names and conduct a Filter > Top 10 run-through, opting for say the top 20 (you’ll have to  go ahead and sort these descendingly by yourself, though). I get:

btp1

Confession: I don’t think I’ve heard of any of these stations, even if Google and Wikipedia have. The Ardwick outpost is somewhere in or near Manchester, and it won’t be pleased to know that in purely percentage terms its 6 reported crimes – reckoned against a reported passenger traffic total of only 334 – puts it atop the list. Do those numbers sound right? No, but they apparently are; Wikipedia tells us the station is unstaffed, adding that “In [the] 2004-2005 financial year only 285 passengers used the station, or fewer than one per day, increasing to 358 in 2005-2006“. Interesting place, and yet Ardwick rocks with activity when sized against number two New Clee, in Northeast Lincolnshire (I don’t know where that is, either), and its 149 straphangers, though at 149 no one is hanging onto any straps in its cars.

On the other hand, Mr. Wikipedia counts 616 commuters for Ardwick across the 2012-2013 financial year, and 334 for New Clee’s for 2011-12, and those departures from the BTP data are bothersome, begging the question about other under or overcounted stations.  That claim on our investigatory attentions is in the first instance to be put to the data gatherers; but whether the question is at the same time slightly political is for the powers that be, and those who write about them.

In any case if you add Crime Count (Sum, and formatted to decimal-free Number mode) to the Values area and rerun a top 20 (by Sum of Crime Count), you’ll get:

 btp2

I’ve heard of most of these. In purely literal, quantitative terms the Victoria station (excluding its underground stop, which ranks 18th above) heads the enumeration, even as its ranking here by percent – 13th – does not. Note as well the unrelieved zeros lining up by the St. Pancras International (that is, Eurostar) station, a consequence of its unreported passenger figures.

We can begin to wind this discussion up with this small, quick table:

Row Labels: Location_Type

Values: Location_Type(Count)

I get:

btp3

A subtle measure, the figures perhaps reflect the crime-“facilitating” properties of duration and place. Train journeys typically extend longer than the waits for them to commence, and so in theory avail more time for wrongdoing; but stations afford swifter egress from the scene. We see that 64% of the crimes were station-bound, a proportion that needs to be thought about with some due diligence.

Then reset the table thusly:

Row Labels: Crime_Type

Column Labels: Location_Type (moved here for presentational reasons)

Values: Location_Type (Count, Show Values As > % of Column Total (and retract Grand Totals):

btp4

Note the likenesses and the disparities. Public Disorder and Weapons (however defined) account for about the same distributions across the two sites, but Other Thefts (again we need a definition) proliferate on trains (drug activity, on the other hand, is by far the more likely in stations, where perhaps rapid transactions can be consummated, followed by rapid disappearances).

And now I don’t know about you, but I’m tempted to book my ticket to New Clees. Good seats still available.

Tracking Crime: British Transport Police Data, Part 1

1 Mar

It’s a big country, the UK, but not so big that a unitary police force can’t stand watch over the nation’s railways, from London to Aberdeen and the still higher latitudes. The British Transport Police has 2600 stations to patrol, and it’s made its crime data a matter of public knowledge on the data.gov.uk website, more specifically here:

http://data.gov.uk/dataset/btp-crimes-recorded-february-2012

Don’t be fooled by the link’s name; you’ll see that the relevant download bears the date 2013, and indeed – the data themselves are end-pointed by February 2011 and January 2013, an ungainly, but precisely two-year, interval that needs to be dealt with one way or another. And at 15MB it’s a big file; but it’s a big country.

And once you’ve stationed yourself in front of the data you’ll find their 106,000 records read pretty crisply, and exhibit the right type of organization , more or less (e.g., the three reported years themselves a pack single field, rather than fare-beating their way into separate columns). But those records have been made to shoulder a weighty, onerous excess. About 40,000 of the 106,000 are of the All Crime and ASB (Anti-social Behaviour, as they spell it in the UK) stripe, and in my view these are simply dispensable – because each one realizes the aggregate crime total for each station – for each reported month and year. The upshot: a massive potential double-count of the numbers.

For example: The data for the first-alphabetized station, Abbey Road (no, not that one; this Road is way over in London’s east, near Canary Wharf), comprises 13 records – seven of which report a specific crime-type incidence for a specific month of a specific year, with the other six  tabulating all crimes  for each month/year. (Because two crime types – Robbery and Pubic Disorder and Weapons – were committed in May of 2012, both fold into the May 2012 Abbey Road All Crime and ASB record.)

Thus if you pivot table the data thusly:

Row Labels: Station_Name

Column Labels: Crime_Type

Values: Crime Count

You’ll discover that the 14 Grand-Totalled crimes ascribed to the station really enumerate seven crimes, because the seven offenses counted by All Crime and ASB item merely duplicate the actual total – and that’s a problem.

We’ve gone toe-to-toe with double-counts before (see here, for example), but those earlier confrontations brandished avowed, self-described sub/grand totals rows that presented themselves as such, but wormed into the larger data set just the same. Here, the All Crime and ASB records are unmarked cars, as it were – they simply back themselves into their appointed rows like any other record, with no subtotal pretensions whatsoever. (Note: the March station in Cambridgeshire is oddly sorted atop the station list because Excel reads the name as a month, which in fact and in effect possesses numeric standing. If March’s peculiar positioning bothers you, the fix is surprisingly begrudging; the classic label-casting strategies, e.g., prefacing the text with an apostrophe or tapping a space after the “h”, just won’t work. I’d simply sort the field, aim a Find command at the first instance of March, add the word” Station” to it and copy down. Refresh the pivot table and now you’ll find March where it belongs.)

What do to about this supererogation? Something simple, I think. Because it turns out that the All Crime and ASB rubric happens to appear earliest in the alphabet in its field, I’d run a Z to A sort, race down to row 66045, the present lodging of the first All Crime and ASB, and insert our tried-and-true blank row. While as always I’m happy to be persuaded otherwise, I don’t think the remaining data and the analyses they encourage will be in any way impaired. And had the All Crime and ASB designation not cooperated and sorted itself somewhere in the middle of the pack instead, I’d have introduced a new, temporary column somewhere and loaded its first row with something like this:

=IF(A2=”All Crime and ASB”,”zzz”,A2)

I’d have then sorted all the zzzs to the bottom, and deleted accordingly.

But because neurosis must be served, I’d leave those 40,000 outcasts in the sheet, one row away – just in case someone decides they’re needed after all.

Now think about another issue, this one likewise raising a double-count concern, albeit bearing a slightly distinct provenance. Most stations report a passenger number, what I take to be an aggregate station traffic estimate (and estimates some of them clearly are, what with their zeroes-laden totals, e.g., the Acton Town and Aldershot stations). Have these numbers gathered themselves across all 24 of the recorded months? I don’t know, but the larger point is that each station number appears to be invariant, featuring in each and every station record. Thus the 84 rows of Abbey Wood (not Road, and my count assumes you’ve sorted out and shelved the All Crime and ASB records) data all disclose the same 1515106 passenger number, and therein lays the complication. Because even as each record marks out a subset of all Abbey Wood crime, each passenger number upholds the same overall figure, record after record. This granular disconnect needs to be refit too, because, for example, this pivot table

Row Labels: Station_Name

Values: Crime Count (Sum)

Passenger numbers (Sum)

won’t work, because a station’s constant Passenger Number will, by default, be added again and again, along with the crime counts. And that’s not a double-count there – it could be a 100-count.

So let’s all think about that one.

Follow

Get every new post delivered to your Inbox.

Join 175 other followers