# Going Back To Your Routes: NYC Bus Data, Part 2

26 Jan

With its buzzing confusion of 1’s and 0’s, the spreadsheet ledger of the stops and starts of the Q44 looks for all the world like the leaked innards of some secret binary code, but the numbers are assuredly public; each 1 remembers the time and the place when and where a Q44 passed by or came to momentary rest at a scheduled stop.

And if you sum all the 1’s (add the 0’s too, if you like) your formula is going to report 507,218 back to you, rather a large total that counts the number of times the Q44 traversed a stop in September 2015. And do the subsequent math: divide the sum by 138, the number of apparently operative stops on the route (remember that stop 103900, wherever it is, is empty and apparently isn’t serving riders) and you get the approximate number of Q44 runs (about 3,675) across the month. Divide that result in turn by September’s 720 hours and you’ll find a Q44 heading your way around five times an hour, and that sounds about right.
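For anyone who’d rather check that arithmetic outside the spreadsheet, here’s a minimal Python sketch. The three figures are the ones quoted above; the variable names are simply my own labels.

```python
# Figures quoted in the post; the names are illustrative labels only.
total_ones = 507_218          # sum of every 1 in the Q44 sheet for September 2015
active_stops = 138            # stops that appear to be in service
hours_in_september = 30 * 24  # 720

# Each complete run should leave one 1 at every active stop.
approx_runs = total_ones / active_stops
runs_per_hour = approx_runs / hours_in_september

print(f"approximate runs in the month: {approx_runs:,.0f}")
print(f"approximate runs per hour: {runs_per_hour:.1f}")
```

That comes to a little over five runs an hour, averaged around the clock.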

But companion analyses of the stops aren’t quite as forthcoming. Take note of the 1’s informing any particular row, for example 14959, a pinpointing of the 44’s daily (or nightly) round at 4:39:28 AM. A sideways scan along the row turns up several rivulets of consecutive 1’s staking contiguous stops, one of them four stops long.

Now even those among you who’ve never stepped aboard a 44 will reject the likelihood that four of them would find themselves queued along adjacent stops at 4:39 in the morning. Far more probable, of course, is a tracking of a very solitary bus sailing along its appointed round at that time with all due speed, rolling through the four stops in the two-or-so minutes culminating in the next interval – 4:41:36 (thanks to Open Bus contact Justin Tillman for corroborating the above). As a result of that sort of double and quadruple count, an unerring read of the number of different buses motoring through a particular interval’s row fails to resolve, at least not without a deep and lengthy think.
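One crude way to see the size of that overcount, sketched in Python, is to treat each unbroken run of 1’s in a row as a single bus. That’s an assumption of mine and nothing more: a run could still hide two buses travelling close together, so this is a first approximation and not the unerring read. The row below is invented for illustration.

```python
from itertools import groupby

def count_runs_of_ones(row):
    """Count maximal runs of consecutive 1's in a sequence of 0/1 flags."""
    return sum(1 for value, _ in groupby(row) if value == 1)

# Hypothetical slice of one time-interval row, one flag per stop.
row = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]

naive_count = sum(row)               # every 1 counted as its own bus
run_count = count_runs_of_ones(row)  # each run of adjacent 1's counted once

print(naive_count, run_count)
```

For this made-up row the naive sum is 7, while the run count is 3.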

But there may be practicably more to learn by thinking vertically instead – that is, for example, by asking about the timing activity of buses pulling into, or past, any given stop (thanks again to Justin for commending the strategy); and those comings and goings are headed down the columns.

But if stop-specific activity is what you want to track you won’t want to pursue an average time-between-buses metric; that’s because averages can’t do justice to the commuter’s daily scourge – the bunching of buses in maddening fraternity that should have been properly spaced apart (something I’ve actually seen today; I’m in New York and have witnessed several bunched 44s). After all – think of 5 buses in an hour wheeling their way past stop A, each arrival lock-stepped precisely 12 minutes after the previous one. Then think of one bus traveling through A and succeeded 55 minutes later by four buses packed into the next 300 seconds. Either way you get 5 buses an hour, or one every 12 minutes – and that average isn’t telling your story.
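The two scenarios can be put side by side in a few lines of Python. This is a sketch under one stated assumption: I treat the hour as a repeating cycle, so that each of the five buses has a gap in front of it and the gaps sum to 60 minutes. I’ve also spread the four bunched buses evenly across their five minutes, which the scenario doesn’t specify.

```python
from statistics import mean, pstdev

# Gaps between buses at stop A, in minutes, over one repeating hour.
evenly_spaced = [12, 12, 12, 12, 12]
# A 55-minute wait, then four buses packed into the remaining 5 minutes
# (assumed to be evenly spread, 1.25 minutes apart).
bunched = [55, 1.25, 1.25, 1.25, 1.25]

for label, gaps in (("evenly spaced", evenly_spaced), ("bunched", bunched)):
    print(f"{label}: mean gap {mean(gaps):.1f} min, SD {pstdev(gaps):.1f} min")
```

Both lists average 12 minutes, but the spread of the gaps is 0 for the evenly spaced buses and 21.5 minutes for the bunched ones.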

The stat that you need, then, is a standard deviation, a measure that’ll say something about the variation in the bus spacing, and looks past the deceptive flattening that averages perpetrate upon the stop times. And indeed it’s standard deviations on which Open Bus relies in their Final Averages spreadsheets (click on the 2015/9 data link to see one; you’ll have to parse these data with a Text to Columns routine once you download them, by the way), and for two reasons, I think: the one I’ve just cited, and the scheduling realities of the routes, some of which simply run more often than others, and any one of which runs more or less often depending on the time of day. Again, then, an average can only mislead and fail to properly compare the numbers, both between and within different bus lines.

But the route toward standard deviation is paved with issues, not the least of which is the assignment of each bus stop to its own field, and not to the item status I’ve espoused many times in past posts. On the other hand, with 2,000,000 cells of data the sheet simply can’t reinvent itself into a like number of record rows, so we have to make do with what we have, and my hopeful suspicions notwithstanding I’ve been unable to set forth a single array formula that could be copied beneath the columns of each stop. So what I’m pushing now with all due diffidence is a pivot-table-fronted “solution”, one that requires that only one stop be considered at a time – and that’s imperfect to be sure. But it goes like this:

Row Labels: Any stop (I’ve selected the first, stop_503964; filter for its 1’s only)
Time

Values: Time (Show Values As > Difference From > Base Field: (time), Base Item: (previous))

In an unoccupied cell – I’ve chosen I1 – enter =STDEV.P(B5:B10000)/60
I get, for stop 503964, a standard deviation of 10.28 minutes between bus appearances.

Now to translate: I’ve selected the Time field – comprising the rawest of raw UNIX data, counting the number of seconds elapsed between January 1, 1970 and each reading – because we’re concerned to figure times elapsed between buses sighted at the particular stop. We’re not now interested in the date of the events, but rather only their timings relative to all the other bus sightings. We need to filter for the 1’s populating a given stop simply because that value represents the appearance of a bus at the given time, and so we want to compare these and only these. The Time field is made to reappear in the Values area because we need to subject Time’s numbers to a mathematical operation – each time’s difference from the immediately previous time – and these won’t appear unless they’re called up, as it were, by the times in Row Labels. Thus each time at which a bus appears is subtracted from the next bus appearance, returning the intervals out of which the standard deviation is fashioned. The result is then divided by 60, because all the times are expressed in seconds and we want minutes. The standard deviation formula is itself external to the pivot table, because while you can ask a pivot table to perform standard deviations, here it’ll calculate the SD for each item, and there’s no standard deviation for one value.
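For readers who think more easily in code than in pivot tables, the same steps can be sketched in Python: keep the times flagged 1, take each time’s difference from the previous one, take the population standard deviation (the counterpart of STDEV.P) and divide by 60. The function name and the sample data are hypothetical; the 128-second spacing merely echoes the gap between 4:39:28 and 4:41:36.

```python
from statistics import pstdev

def gap_sd_minutes(times, flags):
    """Population SD, in minutes, of the gaps between sightings at one stop.

    times: UNIX timestamps in seconds, one per sheet row.
    flags: the stop's column, 1 where a bus was sighted, else 0.
    """
    # Filter for the 1's only, as the pivot table's filter does.
    sightings = sorted(t for t, flag in zip(times, flags) if flag == 1)
    # Difference from the previous sighting, in seconds.
    gaps = [later - earlier for earlier, later in zip(sightings, sightings[1:])]
    if not gaps:
        return None  # fewer than two sightings, so no interval to measure
    return pstdev(gaps) / 60

# Invented sample: ten rows spaced 128 seconds apart.
times = [1441080000 + 128 * i for i in range(10)]
flags = [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]

print(gap_sd_minutes(times, flags))
```

Here the gaps are 512, 256 and 384 seconds, and the result is roughly 1.74 minutes.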

To learn the standard deviation for any other stop you need to drag the active stop field name off the table, substitute the next desired stop, and again filter for the 1’s. Imperfect indeed, because the process is necessarily sequential. But it works, more or less.
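Outside the spreadsheet that one-stop-at-a-time constraint goes away, since a loop can visit every stop column in turn. The sketch below assumes the sheet has already been read into a list of times plus a dictionary of stop columns; that layout, the function names and all the flag values are my own invention, with only the stop names 503964 and 103900 taken from the data.

```python
from statistics import pstdev

def gap_sd_minutes(times, flags):
    """Population SD, in minutes, of gaps between sightings at one stop."""
    sightings = sorted(t for t, flag in zip(times, flags) if flag == 1)
    gaps = [later - earlier for earlier, later in zip(sightings, sightings[1:])]
    return pstdev(gaps) / 60 if gaps else None

def gap_sd_by_stop(times, stop_columns):
    """Apply the one-stop calculation to every stop column."""
    return {stop: gap_sd_minutes(times, flags)
            for stop, flags in stop_columns.items()}

# Invented sample: eight rows, 128 seconds apart.
times = [1441080000 + 128 * i for i in range(8)]
stop_columns = {
    "stop_503964": [1, 0, 0, 1, 0, 0, 1, 0],     # evenly spaced sightings
    "stop_example_b": [1, 1, 0, 0, 0, 0, 0, 1],  # hypothetical bunched stop
    "stop_103900": [0, 0, 0, 0, 0, 0, 0, 0],     # the empty stop
}

for stop, sd in gap_sd_by_stop(times, stop_columns).items():
    print(stop, "no intervals" if sd is None else f"{sd:.2f} min")
```

A stop with fewer than two sightings, like the empty 103900, comes back as None instead of a number.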

Got that? While you work on it, I’m trying to figure out how many rides I get on my MetroCard – now that’s a challenge.