US Railroad Accident Data: Getting Them On Track, Part 1

7 May

Falling asleep on the job – literally – can be an occupational hazard – literally. Some evidence, for example: the incidences of fatigue-associated accidents among railroad staffers, tracked by the US Department of Transportation and compiled here:

 HF_Deidentified_FRA_Work_Histories

Encountered straight on, the data seem analyst-compliant and only too happy to offer their services to interested pivot tablers and other deep thinkers; all the reported dates and incident times, for example, sport verifiably numeric status. But there are some sticking points in there that need to blunted before they’re smoothed into talking points.

First, a data-organizational alert.  The records for the year have been sliced into quarter-specific sheets, thus placing your permit for pivot table construction on momentary hold; but by copying and pasting the four sheets’ worth of data into a solitary one you’ll be able to start building, and without bribing the inspector besides (the time-warped Alt-D, P-conjured PivotTable and PivotTable Chart Wizard consolidation routine we’ve discussed in an earlier post won’t work here, for reasons with which I won’t bore you).

And then we’re left to account for the 490 rows in Quarter 4 starting at 10024, all of whose 2,450 cells store data that, diverse formats notwithstanding, evaluate to zero. Speculation aside they should be deleted, and not merely separated from the substantive pack by our celebrated interpolated row. Rather, delete them simply because they’re effectively empty.

And for another curiosity – about which I may not have to inform you if you’ve already downloaded the workbook – note that the sheets are protected, a defensive mode I don’t recall having encountered in a net-available file before (and remember we’ve culled this one from an open data site).The fact of the protection is acknowledged in this supplementary page, that step taken “to prevent accidental data corruption”.  Given the workbook’s placement and public availability on the Data.gov site I’m not sure where that corruption might be perpetrated; but in any case a simple tapping of the Review > Unprotect Sheet button will release the cells to your beneficent custodianship.

I was also temporarily mystified by the two Accident Types sharing that eponymous column, until it occurred to me that HF almost surely initials some human factor contributory to the accidents, corroborating the sheets’ title; and by extension NHF points to non-human causatives. And I don’t know whether the matched entries in the Subject ID  field always denote the same railroad worker; we see, for example, that subject ID 051500.0101 in the Quarter 2 sheet records 17 incidents. Same person?

And what about that title, in the wrapped, merged A1? That composited address comprises worksheet titles and field headers, and that won’t do. The very simple workaround: click the Merge & Center button; that’ll restore the headers to their appointed cells in – and remember to either delete the title proper, or insert our redoubtable blank row between what are now rows 1 and 2. (But of course if you plan on carrying out that copy-and-paste multi-sheet consolidation, you need perform this  cell decoupling only once.)

The On and OffDuty times, on the other hand, seem unproblematic; they appear to mean what they say – that is, recall the start and end points of a worker’s daily stint, although I’m assuming – and not necessarily rightly – the associated accident in column E may have necessarily shortened any given day in the data set. But even here my confidence in that reading is not doubt-free..

To explain: in the prospective interest in correlating accident times with its point of occurrence during the given work shift, I’d move into the F column and enter, in F3:

=C3-B3

That simple subtraction should turn a workday length in hours and minutes, discernible as such if you Custom format the results this way:

h:mm

(Note that only one h will nevertheless properly capture a double-figured hourly duration.)

Copy the formula down the column and, among other things, (and provided you haven’t copied-and-pasted yet everything into a single sheet yet) you’ll find very many unaccountably lengthy shifts; in Quarter 2, for example, a 23:30 turn materializes in row 5989. Does that sound possible, or legal?

Moreover, the year references in the sheets appear to hark to 2003, and I’m not sure the data aren’t more current than that. The workbook was first made public in 2008, but we’re told here that an update was patched in this March 31. If in fact the years are anachronisms and as such nothing but a grand, aggregated mistake, they could perhaps at the same time be defensibly ignored, precisely in virtue of their constancy. If, after all, the data emanate from the same year, the intrepid pivot tabler will concentrate the analysis on months and days alone.

And for another discrepancy: look, turn for example, to Quarter 1’s row 650. It cites an accident at 12:30 am, even as the associated work shift outlined in B650 and C650 spanned 2:55 am to 9:55 am.

rail1

That can’t check out; 12:30 am fell outside that worker’s day, and quite apart from the 2/15 date (see below).

And while you’ve probably brought this one to your own attention already (and if you haven’t it’s been brought there right above), just for the record we note the unvarying dates accompanying the accident times. I think we can safely ignore these without much commentary or controversy; I suspect a format misfire explains them all.

But I’d appreciate some answers to the other questions. I hope to get back to you on them, if the Federal Railroad Administration gets back to me.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: