Tracking Crime: British Transport Police Data, Part 1

1 Mar

It’s a big country, the UK, but not so big that a unitary police force can’t stand watch over the nation’s railways, from London to Aberdeen and the still higher latitudes. The British Transport Police has 2600 stations to patrol, and it’s made its crime data a matter of public knowledge on the data.gov.uk website, more specifically here:

http://data.gov.uk/dataset/btp-crimes-recorded-february-2012

Don’t be fooled by the link’s name; you’ll see that the relevant download bears the date 2013, and indeed – the data themselves are end-pointed by February 2011 and January 2013, an ungainly, but precisely two-year, interval that needs to be dealt with one way or another. And at 15MB it’s a big file; but it’s a big country.

And once you’ve stationed yourself in front of the data you’ll find their 106,000 records read pretty crisply, and exhibit the right type of organization , more or less (e.g., the three reported years themselves a pack single field, rather than fare-beating their way into separate columns). But those records have been made to shoulder a weighty, onerous excess. About 40,000 of the 106,000 are of the All Crime and ASB (Anti-social Behaviour, as they spell it in the UK) stripe, and in my view these are simply dispensable – because each one realizes the aggregate crime total for each station – for each reported month and year. The upshot: a massive potential double-count of the numbers.

For example: The data for the first-alphabetized station, Abbey Road (no, not that one; this Road is way over in London’s east, near Canary Wharf), comprises 13 records – seven of which report a specific crime-type incidence for a specific month of a specific year, with the other six  tabulating all crimes  for each month/year. (Because two crime types – Robbery and Pubic Disorder and Weapons – were committed in May of 2012, both fold into the May 2012 Abbey Road All Crime and ASB record.)

Thus if you pivot table the data thusly:

Row Labels: Station_Name

Column Labels: Crime_Type

Values: Crime Count

You’ll discover that the 14 Grand-Totalled crimes ascribed to the station really enumerate seven crimes, because the seven offenses counted by All Crime and ASB item merely duplicate the actual total – and that’s a problem.

We’ve gone toe-to-toe with double-counts before (see here, for example), but those earlier confrontations brandished avowed, self-described sub/grand totals rows that presented themselves as such, but wormed into the larger data set just the same. Here, the All Crime and ASB records are unmarked cars, as it were – they simply back themselves into their appointed rows like any other record, with no subtotal pretensions whatsoever. (Note: the March station in Cambridgeshire is oddly sorted atop the station list because Excel reads the name as a month, which in fact and in effect possesses numeric standing. If March’s peculiar positioning bothers you, the fix is surprisingly begrudging; the classic label-casting strategies, e.g., prefacing the text with an apostrophe or tapping a space after the “h”, just won’t work. I’d simply sort the field, aim a Find command at the first instance of March, add the word” Station” to it and copy down. Refresh the pivot table and now you’ll find March where it belongs.)

What do to about this supererogation? Something simple, I think. Because it turns out that the All Crime and ASB rubric happens to appear earliest in the alphabet in its field, I’d run a Z to A sort, race down to row 66045, the present lodging of the first All Crime and ASB, and insert our tried-and-true blank row. While as always I’m happy to be persuaded otherwise, I don’t think the remaining data and the analyses they encourage will be in any way impaired. And had the All Crime and ASB designation not cooperated and sorted itself somewhere in the middle of the pack instead, I’d have introduced a new, temporary column somewhere and loaded its first row with something like this:

=IF(A2=”All Crime and ASB”,”zzz”,A2)

I’d have then sorted all the zzzs to the bottom, and deleted accordingly.

But because neurosis must be served, I’d leave those 40,000 outcasts in the sheet, one row away – just in case someone decides they’re needed after all.

Now think about another issue, this one likewise raising a double-count concern, albeit bearing a slightly distinct provenance. Most stations report a passenger number, what I take to be an aggregate station traffic estimate (and estimates some of them clearly are, what with their zeroes-laden totals, e.g., the Acton Town and Aldershot stations). Have these numbers gathered themselves across all 24 of the recorded months? I don’t know, but the larger point is that each station number appears to be invariant, featuring in each and every station record. Thus the 84 rows of Abbey Wood (not Road, and my count assumes you’ve sorted out and shelved the All Crime and ASB records) data all disclose the same 1515106 passenger number, and therein lays the complication. Because even as each record marks out a subset of all Abbey Wood crime, each passenger number upholds the same overall figure, record after record. This granular disconnect needs to be refit too, because, for example, this pivot table

Row Labels: Station_Name

Values: Crime Count (Sum)

Passenger numbers (Sum)

won’t work, because a station’s constant Passenger Number will, by default, be added again and again, along with the crime counts. And that’s not a double-count there – it could be a 100-count.

So let’s all think about that one.

Re-counting the Conseil de Paris Election Data

22 Feb

With its Gallic titular panache Les Conseillers de Paris could pass as a movie title, perhaps the sequel to the Umbrellas of Cherbourg; but alas, it’s only a spreadsheet, and the big screen on which it features is probably the one glassed inside your Samsung.

OK, so it isn’t coming to a theater near you, but the sheet has its moments, nevertheless, once you decide  you want to know something about the  “deliberative body responsible for the governing of Paris” and its 163 members, all of whom throw themselves at the mercies of the city’s voters every six years, including 2014.

With members apportioned from the city’s 20 arrondissements , or districts, in numbers reflective of  arrondissement size (the 15th is the largest), the Paris Open Data site lists each and every serving conseiller from 1977 through 2014, and I’ve sent it your way here:

Les Conseillers de Paris 1977-2014

Once you’ve smoothed all that wrapped text curled inside the worksheet you can proceed to a couple of first general-interest findings for starters. Try this pivot table:

Row Labels: Mandature (or Term)

Values: Mandature

conseil1

Thus we see that the 1977-1983 complement comprised only 120 conseillers, and nowhere does that defining 163 appear, even among the later mandates. The surplus figures are likely attributable to the deaths of conseillers (note the deces field in column J) and their replacements, all of whom are listed, and perhaps mid-term leavers and their successors (see the remplacement field in L). Note, on the other hand, that some deaths appear to have post-dated a conseiller’s tenure. For example, both Claude Avisse and Louis-Henri Baillot died in 2007, well after their service on the conseille had elapsed.  Moreover, because conseillers who served multiple terms (e.g., Baillot) appear in the data set for each of their mandatures, dates of death appear as many times. And I can’t explain why date-of-death entries are fashioned in text format, e.g., décédée le 25 mars 1998, when they could have been made available as actual dates – particularly when dates of birth are expressed in bona fide numeric terms.

Note as well the curiously extended seven-year mandature 2001-2008,a governmental workaround “designed to avoid an overloading of the electoral calendar in 2007“, a presumable concession to the presidential and parliamentary contests of 2007.

And if you want to learn something about the Conseil’s gender representation, try this:

Row Labels: mandature

Column Labels: conseiller

Values: conseiller (Count, and % of Row Total) (and turn off Grand Totals for Rows)

conseil2

A most interesting and emphatic incline toward gender-count parity informs the table.  Remember, however, that the above accords multiple counts to the same conseillers serving across multiple mandatures; if you want to break out discrete, actual conseillers by gender, then, the process gets trickier, because if you go with:

Row Labels: Nom

Row Labels: Prenom (to disambiguate conseiller’s with the same last name)

Column Labels: conseiller

Values: conseiller (Count)

You’ll pile up multiple counts for the multiple-term servers (some conseillers in fact have served in all 6 cited mandatures), and thus do little more than restate the outcomes of the screen shot right above. There are several resolutions to this pivot-table-peculiar search for unique field constituents that appear repeatedly, and here’s but one: Return to the data set, and conduct a Data > Remove Duplicates, by ticking prenom and nom as the duplicate-bearing columns. Then pivot table the 532 prevailing, definitively unique  records (i.e., any and all conseillers now receive exactly one entry) again, precisely as per the instructions immediately above; and down by the Grand Totals role you’ll find 348 Conseillers – men – totalled along with the 184 Conseillères. Gender percentage breakdown: 65.4%-34.6%.

(Keep in mind of course that a removal of duplicates imposes would could be a permanent decrement in the dataset count, depending how you play it. You could of course save multiple versions of the data, pre and post-paring.)

You’ll also want to think about the dataset’s 57 fields, and wonder why the last six of them – out there in columns AZ through BE- aren’t merely entitled empty, but are empty.  There’s no evident reason not to delete them,  and I suspect for purely operational reasons you might well throw a good many of the other fields overboard too, as they’re only occasionally populated (and I’m still working on what their headers actually mean, truth be told. J’ai besoin a native French speaker), and the fields often have been distended to enormous column widths besides.

Still, the more self-evident data do have something to say. What if, for example, we wanted to calculate the average conseiller age of at the point of election, i.e., date of election minus date of birth? That’s a most practicable task – begin by parting the fields with a new column, say between A and B, call it Age at Election or something like it, and, in what is now B2, enter

=(J2-AT2)/365.25

That is, subtract the number of days separating a conseiller’s birthdate from the day on which he/she was elected, and divide the total by the average day-length of a year. Then head towards this pivot table:

Row Labels: mandature

Values: Age at Election (Average, round to two decimals).

I get:

conseil3

The consistency is notable, though the average age ascent – about four years, between 1977-2001 and 1995-2001 – may reflect large numbers of triumphing  incumbents, all of whom of course aged about six years between elections.

Then swing conseiller into Column Values:

conseil4

We see women members are the consistently younger, perhaps comporting with their relative, aggregated newcomer statuses.

So check it all out; vive la feuille de calcul. That’s spreadsheet to the rest of us.

Snow in the Forecast, and in the Data: Winnipeg 311 Calls

14 Feb

Call it a disciplinary tilt, or just another blurt of dime-store iconoclasm, but I’m prepared to label much of this thing we call data journalism as nothing more – or less – than a take on sociology writ digital, and with fewer  prepossessions. Ok -I’ll own up to a graduate credential or two in the field if my view needs defending or explaining, but tilt aside, the case is there to be made.

I can’t help noting that the early University of Chicago sociologist Robert Park – whose first job description actually read “journalist” before he turned into the academy  – famously made a go (with colleague Ernest Burgess) at mapping  the city’s sectors into differentiated, concentric  zones of activity, and with nary an Android in hand or satellite above.

The quest for pattern directs much of what passes for sociological exploration, but that what’s what data journalists often do, too; and their tools for aggregating and plotting the data mean to paint the big pictures affording the step-backs that let us see what’s really going on, we hope. Sounds pretty sociological.

Either way, if it’s pattern you want, albeit of a fairly obvious sort, download the 2014 311 Service call records for the Canadian city of Winnipeg here:

 311 Service Requests Winnipeg 2014

Drawn from its open data site, the resulting workbook most neatly archives the 64,000 calls put to the service that year in nine nicely laconic fields.  A few very small design carps could be placed in the record (these should sound familiar, too), namely the redundant Year field, streaming its 64,000 2014-only entries down its column, and the fact that for parsimony’s sake the Month and Day fields could have been left aside as well, because these could be coaxed from Date as needed. But that’s a very small carp indeed. A slightly larger one: if you want to do something about the latitudes and longitudes pressed together in text format in the Location 1 field  you’ll have to detach these into usable readings, via something like a Find and Replace to rid the open and closed parentheses (replaced in both cases with nothing), and followed up with a Text to Columns (the comma’s your delimiter).

In any event, it snows in Winnipeg, about 45 inches a year (including 4.2 inches last April), and as such you’ll want to find something out about the 311 dialers having something to say about that meteorological shakeout. Start with this time-sensitive pivot table:

Row Labels: Service Request

Column Labels: Date

Values: Service Request (Count)

In fact, snow-relevant requests come in two varieties, Snow Removal Roads and Snow Removal Sidewalks, the former engaging far more callers. Pattern? Clear, if pretty obvious; the calls predominate in the January-March quarter, time-stamping more than 92% of all such requests on both matters.

Note that you can group the data by quarter, provided you’re prepared for a small detour in the step-through. If you wheel Date into Column Labels to prime your quarter-grouping intention, you’ll find you can’t get there from here:

win1

While it’s true that 2014 had only 365 days, the dates here micro-refine themselves by time of call as well, and the count of unique entries top the no-go 16,384 as a result. The means for stanching the overflow is recommended in the above prompt: drag Dates into Row Labels first, click PivotTable Tools > Options > Group Selection > Quarters, and then drag the grouping back into Column Labels.

You, could, on the other hand, dust off a Plan B quarter breakout by positioning Month in Column Labels instead and grouping those thusly:

win2

Because the Month data comprises nothing but numbers landing between 1 and 12, you won’t come anywhere near the 16,384 item limit.

Quartering aside, next try showing the data as % of Column Total (we’re back to the Month breakout). Here you can click Design > Grand Totals > On for Rows Only, because the column totals all necessarily return 100% and there’s nothing to learn there. You’ll see:

win3

Pattern, or something like “systematic” variation at least, surely guides the Snow Removal calls, these cresting predictably in the January-March interval (note the December numbers, though which really poke into the following winter); but variation bobs up and down through the quarter nevertheless. OK, you may suppose that the February peak in calls probably square with a precipitation uptick in the month, but let us see.  The Winnipeg Weather Stats site throws out these respective precipitation overalls:

Jan 2014 16.30 mm

Feb 2014 8.40 mm

Mar 2014 9.00 mm

There’s perhaps something noteworthy, then, in February’s most-calls/least-precipitation disjunction. Looking for a story? There might be one there. You might want to wonder as well about September and October’s top-outs in Garbage Collection calls, another type of removal of a more perennial sort.  In fact Collection and Recycling complaints are the most likely to bend the ears of Winnipeg’s 311 responders, comprising over 45% of all calls.

And for a final gaze at pattern, try:

Row Labels: Month

Values: Service Request (Count)

Service Request (again, this time by % of Column Total)

win4

Once charted (say by Month and one of the Values fields), the curious near-linearity of months and calls describes itself:

win5

So there’s some pattern. Now try and explain it, too.

The Skinny on UK Obesity Data: Work to Do

8 Feb

The reported levelling of the rates of childhood obesity in the UK has naturally been accounted a good thing (for the article that made the news in full, consult this URL), although for dieters who like their glass half empty the news should perhaps be imbibed with slow, deliberate slips.  After all, the study’s abstract prefigures its findings with this bit of curbed enthusiasm:

“More than a third of UK children are overweight or obese, but the prevalence of overweight and obesity may have stabilized between 2004 and 2013.”

It’s possible, then, that the discovered equilibration has more to do with some saturation effect that has already visited its full-on impact upon the cohort of obesity-vulnerable youngsters, but I take no credit for the worthiness of that conjecture, only blame for the conjecture itself.

Be that as it may, the National Child Measurement Programme (NCMP) has something to say about the development as well, here for the years 2007/8 through2012/13, in the form of this workbook available for download here:

http://www.noo.org.uk/visualisation

(Click the Download link for Child Obesity and excess weight prevalence by Clinical Commissioning Group.)

The book breaks out its populations by both “excess weight” and “obesity”  measures  (the former standing as a superset of the two) for two school-year age cohorts, what the UK calls Reception (ages 4-5) and Year 6 (10-11) (look here for the BMI formulas, both for kilograms and pounds; the workbook’s NOTES tab sketches additional methodological background. This study’s sample sizes also operate within the confidence interval limits limned in the Excess weight and Obese tabs. Remember again that our spreadsheet doesn’t invoke the same data compiled in the journal piece. Remember too to review the Notes tab’s explication of data cells marked “s”).

Now to those data, about which there is a good deal to say. Or orrection: now to the data organization.  First, and perhaps most prominent, are the fields. We’ve seen this before, and the question begs reiteration: why are data which, for all analytical intents and purposes, belong to the same parameter – e.g. Reception – stationed in a plurality of columns? The resulting barrier to analysis is formidable, though not insurmountable; but surmounting requires us to cut and paste kindred data to the same field, copy and properly align the CCG (Clinical Commissioning Group) identifiers down columns A and B via a duly-diligent copy and paste, and make room for a year field in which that differentiating datum needs to be copied down, too. And you’l probablyl need to cut-and-paste the Year 6 data, too, and presumably to a worksheet all its own the better to disentangle these from the Reception figure, though that larger positional issue needs to be thought through – because in fact the workbook draws itself around two pairs of variables – the Reception/Year 6 axis, and the Excess weight/Obese binary. (It could at the same time be possible to jam all of these into a single grand sheet, by introducing, say, new School Year and Weight Category fields, thus lending themselves to additional breakouts along those lines, but I’m holding my jury out on this unifying tack.) The larger point, though: if you’re serious about doing something serious with the data, you’ll need to think about all these prior somethings, too before your analysis proceeds.

But quite apart from those necessary considerations you’ll also have to do something about the superabundance of merged cell in rows 1 through 4. These can’t coexist with your pivot-tabling intentions, and neither can the blank row

obesity1

That prohibitively separates those potential field headers from the data. The simplest rejoinder, of course: move the putative field names above down a row, but only after you’ve made sure that the now-unwelcome text entries still higher above these perch at least one blank row away from what you’re envisioning as the finalized data set (and if you’re going this far you’ll find that, once unmerged, the cell now invested with the “Number measured” header, for example, has been bumped two rows above its eventual place atop the dataset). And, for crowningly good measure, you’ll likewise have to unmerge the sheet’s title wrapped across the A1 supercell. (I also don’t see the immediate pertinence to our data of the 32,000 LSOA-CCG lookup records in that so-named tab, apart from the “The analysis uses a best-fit 2001 LSOA to 2013 CCG lookup created by PHE” declaration in the Notes tab. LSOA abbreviates lower layer super output area; for more, look here.)

To repeat – if you’re not prepared to do the reconstitutive work schematized above I don’t think you’ll be able to take the data very far; and why the analysts and/or story-seekers should be made to steel themselves for a rough ride across this obstacle course makes for one very good rhetorical question, though of course we’ve asked it before.

In fact, though, the data look pretty interesting, if now untamed. Your mission – should you choose to accept it – is to domesticize it all.

But don’t worry – the workbook won’t self-destruct.

Some Q and A about A and E Waiting Times

31 Jan

We don’t think that way, but field headers are a species of first-line documentation. We need to know what fields mean before we work with them, and quite apart from the brute operational need for fields to be headed at all (and header-less worksheets are out there to be found; I can think of a couple).

That is a none-too-arresting truism to be sure (most truisms aren’t too arresting), but when headers put understanding on hold, the truism suddenly matters. I’m thinking about the UK’s National Health Service Accident and Emergency (A&E) data on waiting times entrants to the Service’s emergency facilities endure before they’re attended to. It’s a perennially newsworthy concern, and some pretty current data are here:

 2015.01.25-AE-Timeseries4w3Rl

But before we think about those aforementioned headers some other, more standard spreadsheet considerations require a look, too. Begin with the oldest-to-newest sort ordering the Week Ending data in the Period field, a curiosity you may or not find nettling depending on the intentions you’re bringing to the sheet. If you’re doing nothing but reading the records you’d likely have opted for a recency-driven sort, in which the latest entries top the field; but if, on the other hand, you’re keen to mortar and pestle the data with a pivot table or some other analytical instrument you shouldn’t care at all, because you’ll be sorting as and when you wish. Note as well, of course, that those W/Es have seen to it that the presumptive dates filling the Period column are labels, a substantive devaluation that could and should have been forestalled. If you run a Find for W/E[space] and Replace with nothing, you should be able to recover the data’s rightful quantitative standing (though be prepared for some issues if your Dates and Times honor the U.S. month-first protocol). And when that rework does its job you can rename the field Week Ending.

Moreover, if you are going to pivot table the data set, you need to do something about the blank row 18 – namely, by deleting it – and treat the grand-totalling row 17 to much the same. Again, you don’t want 17’s totals uncorking a double-count all over your tables. And the merged superordinate headings in 15 have to go the way of 17 and 18, or else your pivot tables won’t go at all (just delete the entries; the row can be left alone).

And now for my opening pronouncement. Perhaps it’s me, but I experienced more than a jot of difficulty in my strivings to understand whence the percentages in columns M and N came. You see how the two metrics are meant to be differentiated: one conveys the percentage of all A& E patients attended (that verb probably requires scrutiny) in four hours or less, the other the fraction of type 1 incidents seen within that time (Type 1 denotes  “A consultant led 24 hour service with full resuscitation facilities and designated accommodation for the reception of accident and emergency patients”, we’re informed here).

The problem – to which my obtuseness is probably central – is in the clarity of the headers’ implied instructions to divide this field by that field.  What field, that is, denominates the Percentage in 4 hours or less (type 1) data?  That datum in cell M18, for example, reports 95.6%, and after a round of head scratching I returned my fingers to the keyboard and at last prevailed with this triumphant formulation:

=1-I18/E18

Considered in field terms, I divided

Type 1 Departments – Major A&E/Type 1 Departments – Major A& E

Subtract that result from 1 and you do get 95.6%, in single-decimal terms. But you got it – I seem to have divided two fields by the same name, and in fact I have. Of course they’re topping different columns, but…I wouldn’t have done that, quite apart from my early inclination to think about dividing I18 by H18, or Total Attendances.  That 95.6% then needs to be understood as confining itself to Type 1 on both sides of the divisor, and not all attendances.

And now – now – that other percentage, the 96.9% in N18, reads more sensibly:

=1-L18/H18

Or,

1-Percentage in 4 hours or less (all)/Total attendances

(remember that L18 counts attendances that took longer than four hours, and so the division computers the greater-than-four-hour data, which need to be subtracted from 1, or 100%, in order to develop the less-than-four-hour percentage). But I might have also replaced Total with All, so that the adjective reads identically across those two fields. Why should we have to wonder about the difference between Total and All?

Another point: the merged-cell heading in I15, A&E attendances > 4 hours from arrival to admission, transfer or discharge, seems to point to patient receptions in excess of four hours; but the percentage recounted above computes attendances of four hours or less. But as I said, maybe it’s me.

A concluding recommendation, even as no one asked me for it: click anywhere in M and N and see 10-or-so decimal points congesting its column. Might not all confusion have been allayed if the actual formulas occasioning the M and N results have been left in their places, instead of the hard-coded replacements that we’re seeing? We’d then know exactly what was divided by what, to the relief of all those bloggers confessing their obtuseness in shameless public view.

Ebola Data: Worth Thinking About

22 Jan

If the subject is Ebola, you want the data to be equal to the epidemiological tasks to which they’re likely to be applied – that is, as timely and accurate as they can be. The Humanitarian Data Exchange (HDX) site, an outpost of the United Nations Office for the Coordination of Humanitarian Affairs (OCHA), has been tracking the disease’s incidence across nine countries (as of this writing) from which Ebola reports have issued, and in nearly to-the-minute fashion (my data posts records dated January 20). You can download it here (access the csv rendition; I couldn’t get the Excel offering to work).

Once inside, you’ll be able to get to work on the sheet’s 3200+ rows of time-stamped data (as of this writing), and a parameter – Indicator – that will force your attention span to… snap to attention. Indicator comprises 36 different items, a good many of them same-sounding; and if you’re program calls for a scrupulous run through the data, you’ll have to make sure you understand what each of these uniquely mean. For example, the eight discrete items headed by the phrase “Cumulative number of…” simply demand clarification, as do the 12 “Proportion of… ” rubrics. Put alternatively, each country receives 36 parameters worth of records about its Ebola status.

Moreover, keep in mind that the Proportion data are registered in whole-number terms, and as such would be more meaningfully comprehended by dividing each by 100 (and thanks to HDX’s email reply to this question).  Thus a proportion of 43 is to be quantitatively read 43%, for example.

That obligation to clarify might be illustratively served by pivot-tabling Ebola cases by Indicator and one country, say Sierra Leone:

Row Labels: Indicator

Values:  Value (Sum)

Slicer (or Report Filter): Sierra Leone

And here’s where I made my first mistake. I saw that the country had apparently incurred 257,286 confirmed cases of Ebola as of January 20th, but that figure isn’t remotely true-, and fortunately not. My pivot table had added each successive, as-of-the-specified date case total – a textbook case of double, triple, and quadruple counting – when in fact of course each figure tabulates the cumulative case total until just then.

And that makes for a rather cautionary item. After all, review an unassuming datset of sales recorded by say, salesperson, and Jane or Rob’s transactions are there for the summing. But adding successive cumulative totals makes for one very wrong answer.  It’s as if the data have already submitted to something like Excel’s Show Values As > Running Total In… option. It’s the records that are doing the aggregating for us, as the dates proceed.

And all that means that for certain reads the data should to be lined up one date at a time. For example, the Sierra Leone data for October 8, 2014 total 2455 cumulative, confirmed cases of Ebola. The number of confirmed cases in the previous 21 days as of that date at stood at 924, which indeed yields the 38 (really 38%) in the “Proportion of confirmed Ebola cases that are from the last 21 days” cell – that is, 924/2455.

Note at the same time the empty cells that, for whatever reason, have no data to report, and doubtless data collection in affected regions has its challenges. October 3 counts no confirmed cases of Ebola, even though it precedes the 8th by only five days and even as it follows Oct 1 and its 788 cases by but two. And because the dataset’s 51 dates aren’t identically spaced in time, one needs to reflect on the mathematics of those “…last 21 days” items. As with the cumulative data above, these numbers are rolling, but here they confine their locomotion to a moving 21-day band. That means of course that days in the data set fewer than 21 days apart share many of the same cases in their respective enumerations (as another HDX e-mail acknowledged), certainly not a bad thing by itself. After all, the recency and magnitude of Ebola case distribution is a perhaps literally a vital concern for researchers. But the journalist needs to appreciate the necessary resort to a kind of double-counting in the process. Still, the variable distancing of the 51 dates from one another does ask a question about how case counts should be properly interpreted.

And there’s something else. Cumulative figures should naturally increment across time, or at the very least remain unchanged; but the cumulative death totals (for all countries) for October 29 and 31 2014 appear as 9 for both days, following a total of 2990 for October 25. And November 5 presents a figure of 1739, still notably smaller than the October 25 report. It seems that only certain countries were considered for the 29th and 31st, a undercount that seriously disrupts the accumulating total. To see what I mean, try:

Row Labels: Indicator (Filter for Cumulative number of confirmed Ebola deaths only)

Column Labels: Date

Values: Value (Sum)

This one requires follow-up, I’d allow.

And at the risk of baring my medical ignorance I’ll put one more table on the table:

Row Labels:  Indicator (filtered for the four “Case Fatality rate” items

Values: Value (Average).

I get

 Eb1

Why should probable Ebola cases exhibit so markedly higher a fatality rate? Could the differential be ascribed at least in part to post-mortem assessments of individuals whose conditions went undiagnosed or to whom medical attention was delayed? I don’t know, but the rate differentials are notable and investigation-worthy.

Trumped Cards: Consumer Credit Card Complaint Data

15 Jan

The topical suggestions on the US www.data.gov home page flash across its master search field like a battery of rotating advertisements, and it was the Credit Card Complaints catchphrase that road-blocked my fleeting attention span, if you must know.

Ok – yours will likely resonate to a more exotic theme, but it must have been the cool alliteration that got me. Either way, if you’re looking for 22.5 megabytes and 334,000 records worth of consumer discontent about their plastic you’ve come to the right place:

http://catalog.data.gov/dataset/consumer-complaint-database

(You’ll also want to review the companion Data Dictionary in there, at its own linked destination for elucidations of and field references for the above workbook. I have a bit more to say about these later.) Click at the Comma Separated Values Download link, and when the hurly burly’s done save it all as an Excel sheet. (I’m now beginning to think that a workbook’s CSV derivation is what explains the un-fit columns that burrow their way into so many spreadsheets across the net; here too, our data call for the redoubtable autofit.)

Once you actually get to the data (which are rather current; beginning with December, 2011, my download called up complaints through January 8) you’ll be enraptured by their organization and fitness. The dates are actually dates (that is, they’re verifiably numeric), and their fields are awash in pivot-table friendliness – lots of meaningful permutations in there to be spun. Let’s start basic, with a state-by-state complaint count:

Row Labels: State

Values: State (Count, necessarily; then by Show Values As > % of Column Total)

Sort the numbers by largest to smallest and you won’t be struck by California’s petition to call itself the credit card complaint capital of the US. That soubriquet, of course, merely hooks adventitiously into the state’s largest population but still, my Wikipedia census (circa 2012) fractionates Californians into 12.1% of the country’s total – and they’re registering 15.16% of the complaints, doubtless a meaningful disparity. (And, blank identities aside, you’ve taken note of the 62 records spilled onto the pivot table, taking the data rather over and above America’s 50-state inventory. Those excess rows point to US territories, whose 2-character codes are deciphered here; and so California’s complaint quotient is in actuality a bit higher, if you filter out the territories.) They’re laid back out there, but unhappy.

And other state discrepancies break towards statistical significance, too. Florida’s second-place complaint count – 9.79% – doesn’t really square with its 6.2% of the US population, for example. Perhaps they’ve experienced an influx of Californians.

And how about the problems with Product, i.e., the kinds of credit card services most likely to put the pique in the public?

Row Labels: Product

Values: Product (again, Show As > % of Column Total). Never mind the Grand Totals.

 credit1

Mortgage complaints, however defined, trouble complainants most widely (though these evince considerable inter-state variation), with debt collection – presumably remonstrances against dunning tactics – running a few laps behind in the second position.  (It goes without saying, however, that the numbers here need can’t be confidently understood without sizing up the respective customer base for each product, though I just said it.)s  The few complaints about student and payday loans might surprise Britons accustomed to recurrent news coverage of both types of transactions.

Of course the data could also be broken out by time, e.g.

Row Labels: Date received (Grouped by Years and Months)

Values: Date received (Count, Show Values As > % of Parent Row Total)

I get (in excerpt)

credit2

That % of Parent Row Total command is new to this blog, and merits some detailing. It works something like this:  Because the Years/Month grouping naturally accords priority to years, that chronological unit is vested with a Parent status, with the kindred month rows sired as a kind of progeny. The above screen shot fits subtotals within subtotals, then; 2013’s 32.36% marks that year’s slice of all consumer complaints, but its monthly percentages proportion themselves not to all months in turn, but only those contributory to 2013’s totals. Thus January’s 9.07% is 2013-specific, even as the strikingly scant 4.50% for the previous January binds itself to 2012 alone. It’s what % of Parent Row Total does.

When it’s all unrolled before you onscreen you’ll observe the very substantial uptick in complaints for 2014, though it’s difficult, though not unimaginable, to suppose that service quality faltered so badly from but one year to the next. More plausible, though again not definitively so, it’s possible that more consumers were made aware, or perhaps made themselves aware, of the complaint filing process. This one might be worth researching as a result.

Now about that Data Dictionary. Jump down to the Field reference section and see that, most, but not all, of the text fields are termed categorical, what are elsewhere called nominal – that is, fields whose items can’t be meaningfully ordered or measured, e.g., gender, race, or nationality. What’s slightly odd is that the zip (or postal) code field is likewise understood here as plain text, even as each and every available code (there are some blanks as well) is a certifiable number. What the sheet wants us to presumably understand, then, is that these numbers aren’t for adding or otherwise manipulating. They’re “text” in the way that telephone and social security numbers are text.

But ok; those are quibbles, and does the worksheet really need a 334,565th complaint?

Follow

Get every new post delivered to your Inbox.

Join 174 other followers