Going Back to Your Routes: NYC Bus Data, Part 1

12 Jan

There are more expedient ways for getting around New York than on its buses, but if you like the slow, scenic route – provided you know what you’re looking at – the deal isn’t half bad. And nowadays the data miners are hopping on too, their sightseers’ oohs and aahs inspired by the buses themselves and their GPS-driven must-sees.

And now those attractions are even available to armchair flaneurs worldwide at  theopenbus.com, the depot into which trip data for New York’s buses are parked, and in felicitous spreadsheet spaces. Click the home-paged Raw Data link and again on the next-up Bus Data button for its bank of monthly numbers waiting at the next stop. I clicked Sept. 2015, drilled down into the q44 rawdata.csv  link, and downloaded away:


I’ve probably spent more time on the Q44 bus – its drunkard’s walk shambling there and here from its square one in deepest Queens and funnelling northward across the Whitestone Bridge until it comes to rest at various Bronx termini – than all the elite band of readers of this blog put together. A very long, money’s worth of a route, then, its 139 columns are each appointed to its like number of stops (though stop 103900 in column BV is utterly empty).

You’ll want to download the important three-page Methodology booklet from theopenbus site, a detailing of what the spreadsheets mean to tell us by their outpourings of 1’s and 0’s. For starters, though, we can agree that a 1 records the arrival of a bus at (or near) a route’s stop as coded in the respective stop columns, and at the approximate time reported in column A. But what do the 10-digits crowding A have to do with arrival times?

Appearances notwithstanding, quite a bit. The numbers in A mask durations expressed in UNIX time, a dating metric conveying the number of seconds elapsed from 12:00 AM, January 1, 1970 to any given date/time. Apart from the question of why the bus time-tracking here is so conducted (a question not taken up by the Methodology booklet), it’s clear that both presentational and analytical dictates call for a translation of the UNIX readings into something rather more recognizable to you and me.

And that’s something we can do. Here’s what I did: In readiness for some future pivot tabling of the data set I introduced two blank rows atop row 1, posting in A1 I the value 86400 – that is, the equivalent in a 24-hour day diced into seconds (in fact the astronomical cognoscenti have slipped an occasional leap second into the day reckoning, but UNIX ignores these). I then range-named A1 sec. The intent, then: to divide every entry in A by sec, thus returning the number of days separating the entries from the January 1, 1970 baseline. I then marked out a new column between A and B, named that incipient field datetime, and entered, in what is now B4:


Copy down the column and you’ll have realized the number days distancing each A datum from the January 1, 1970 point of inception. B4, for example, evaluates to 16679.4163 – and if you divide B4 by 365.25, the approximate duration of the average year, you’ll get 45.66575, or the number of years counted off from the UNIX touchstone (and the decimalized half of that value represents B4’s time of day, and most certainly can’t be ignored or rounded away).

But 45.66575 years doesn’t deliver the actual date and time divulged in B4; again, our result keys itself to January 1, 1970, and not Excel’s January 1, 1900 date of embarkation against which all other dates are counted. As such, B4 requires a new additive – Excel’s numeric equivalent of January 1, 1970 itself. Combining that with B4, then, should yield the cell’s actual date and time, because all the days elapsed from January 1, 1900 through January 1, 1970 plus the original value in B4 should then be accounted for and return the actual, desired date (and time).

In that light I entered 1/1/1970 in B1 and called it yr. Remember that 1/1/1970 is merely a formatting guise; click the down arrow on the Number Format drop-down menu in the Number button group and observe the date’s numerical equivalent, 25569, what really matters. I then amended the expression in B4 to:


(Yes, the order of operations will take care of itself here.) I get 42248.42, and by applying this slightly customized format


 (I added the :ss to the off-the-shelf format catalogued above by Excel, because we may want to think about seconds here) to the cell, I wind up with

That looks good, and so I can copy it down B, after which you’ll probably have to autofit the now-constricted column. You’ll also note that, even as the Methodology booklet describes a typical three-minute gap between tracking observations, our data are overwhelmingly standardized on a 2:08 interval, with exceptions tending toward a doubled, 4:16 timing.

Now the bus route tracking intervals read intelligibly. Ok – there’s more to think about, but I know you have a bus to catch.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: