
NBA Field Goal Data, Part 2: More Than 3 Points to Make

11 Feb

Among the metrics figured and plotted by the Medium look at NBA shot-making (and missing) is a two-way-player analysis, a comparison of the points scored by a given player to the points surrendered in his defensive capacity. As the study authors allow, the measure's validity needs to be qualified by several cautions, e.g. the fact that an offensive dynamo might be assigned to guard a scorer of lesser prowess, thus padding his points differential. In any case, we can ask how a spreadsheet might be applied to the task.

And that task is encouraged by the data’s CLOSEST_DEFENDER field, which identifies the player nearest a shooter at the point when he launched the shot. Thus by totalling a player’s points and subtracting the sum scored “against” him in his closest-defender role, the metric is realized. (Remember that the data issue from three-quarters of the games comprising the 2014-15 season.) But in view of the way in which the data present themselves, calculating that difference is far from straightforward.

It’s simple enough to drop this pivot table into the equation:

Rows: player_id

player_name

Values: PTS

That resultant – indubitably straightforward – apprises us of the number of points (scored via field goals, but not foul shots) credited to each player, and we’ve earmarked player_id here for inclusion in the table in order to play the standard defense against the prospect of multiple players with identical names. (Subsidiary point, one that’s been confronting my uncomprehending gaze for quite some time: fashioning a pivot table in tabular layout mode substitutes the actual data source field names in the header for those dull “Row Label” defaults. Thanks to Barbara and her How to Excel at Excel newsletter.)

But it turns out that another, data-set-specific requirement for player_id imposes itself on the process. In fact, the player names in CLOSEST_DEFENDER are ordered last name first, the surname separated from the first name by a comma, yet the entries in player_name hew to the conventional first name/surname protocol, daubing a viscous blob between the two fields. Excuse the pun, but properly comparing the fields would call for a round of hoop-jumping that won't propel me off my couch – not when I can make a far simpler resort to both player_id and CLOSEST_DEFENDER_PLAYER_ID, which should encourage a more useful match-up (though I can't account for the mixed caps/lower-case usages spread across the field headings).

That understanding in tow, we can plot a second pivot table, one I’ve positioned on the same sheet as its predecessor, set down in the same row:

Rows: CLOSEST_DEFENDER_PLAYER_ID

Values:  PTS

The paired tables should look something like this, in excerpt:

nba21

Once you've gotten this far you may be lightly jarred by an additional curiosity scattered across the data: namely, that the closest defender outcomes comprise far more players – a few hundred more, in actuality. I looked at the stats for two of the players who appear in CLOSEST_DEFENDER_PLAYER_ID only – ids 1737 and 1882, i.e. Nazr Mohammed and Elton Brand – and learned that their offensive stats for 2014-15 were rather sparse (check out www.basketball-reference.com for the data); Mohammed averaged 1.3 shots per game that year, with Brand checking in at 2.6. It may be, then, that the data compilers decided to omit field goal stats for players falling beneath an operationalized threshold, but that conjecture is precisely that.

You may also wonder why two pivot tables need to be impressed into service, when both train their aggregating gaze at the same PTS field. It's because we've directed different parameters to the respective Row Label areas – one identifying the scorers, the other naming what are in effect the same players but in their capacity as defenders, there lined up with someone else's points, so to speak.

In any case, once we’ve established the tables, we can dash off a column of relatively simple lookup formulas alongside the first pivot table, one that searches for the equivalent player id in the second. I’ve named the data range in the second table defense, and can enter, assuming the first receiving cell is stationed in A4 (I’ve named the budding field Points Surrendered in A3. Remember of course that the field is external to the actual pivot table):

=VLOOKUP(A4,defense,2,FALSE)

And copy down the column.

(The FALSE argument is probably unnecessary, as the ids in both pivot tables should have been sorted as a matter of course.)
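Should a scorer's id somehow fail to turn up in the defense range – not unthinkable, given that the two tables draw on slightly different player populations – the formula will return #N/A. One hedged alternative:

=IFERROR(VLOOKUP(A4,defense,2,FALSE),0)

though a default of 0 will of course flatter that player's differential a bit, so treat it as a placeholder rather than a finding.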

The lookups track down the ids of the players listed in the first pivot table, and grab their points surrendered totals, culminating in a joint scenario resembling this shot in excerpt:

nba22

And once engineered you can, among other things, subject the lookup results to a simple subtractive relation with the sum of pts to develop the offense/defense differential on which the Medium piece reports. You could also divide players’ points by points surrendered instead, developing a ratio that would look past absolute point totals.
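In formula terms – and the column letters here are my assumptions about where your pivot table and its adjuncts happen to sit, so adjust accordingly – a player whose PTS total lands in C4 and whose Points Surrendered lookup lands in D4 would earn a differential of

=C4-D4

or, for the ratio reading,

=C4/D4

copied down alongside the lookups.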

Remember, however, that the Points Surrendered "field" and the suggested follow-on formulas are grafts alongside, but not components of, the pivot table, and as such you could unify all the fields' status by selecting the first pivot table and running a Copy > Paste Values upon the results, thereby sieving the pivot data into a simple data set now of a piece with Points Surrendered and kindred formulas.

If we go ahead and divide pts by points surrendered and sort the results highest to lowest we see, in excerpt:

nba23

The findings are both interesting and cautionary. Dwyane Wade's and LeBron James' enormous differentials may have more to do with their offensive puissance than their preventive talents, offset by the understanding, on the other hand, that a good offense may well be the best defense. What's really needed, however, is a finer scrutiny of the players to whom they've been assigned – and the same could be said about those who cede far more points than they score on the other end of the sort.

Of course, with all those parameters there’s no shortage of looks you can cast at the data. For example, try this pivot table:

Rows:  CLOSEST_DEFENDER

CLOSEST_DEFENDER_PLAYER_ID

Values: PTS

ShotPct (the calculated field we hammered together in the previous post).

I get in excerpt:

nba24

The intent here is to compare players' points surrendered – an absolute measure – and the shooting percentages of the players they've guarded. Scan the list and you'll see that LeBron James "held" his shooters to a middling .442 percentage, but Dwyane Wade restricted his opponents to a .394 mark, suggesting his defensive goods are for real. But again – the numbers need to be checked against the overall percentages of the shooters. It may be that Mr. Wade has been issued a light workload.

And for a concluding, graphical touch, the Medium piece offers a scatter plot pairing players’ points scored by and against:

nba25

I don’t know with what tool the authors plied the chart, but I very much doubt it was Excel. In any case I managed to achieve something very similar with that application:

nba26

How? Well first, recognize that Excel simply can’t put together a scatter plot from a pivot table. If you try, you’ll be told “Please select a different chart type, or copy the data outside the Pivot Table.”

Opting for the latter counsel, I copied these data, for example:

nba27

And pasted them into a blank sheet area via Copy > Paste Values. I then selected the two columns of data, and headed toward Insert > Insert Scatter (X, Y) or Bubble Chart (to add data labels, see this YouTube video).

I did all this stuff without a programming language in sight. Does that make me a philistine?

 


NBA Field Goal Data, Part 1: More Than 3 Points to Make

21 Jan

Spreadsheets can’t do everything, but don’t sell them short; there’s plenty a spreadsheet can teach us about the National Basketball Association shot data garnered by Kaggle right here, even though the data have already been vetted with some due diligence by the appropriately five-to-a-side team of Michael Arthur, Caleb Johnson, Aimun Khan, Nimay Kumar, and Reid Wyde in this look, shaking and baking on the Medium web site. The authors learned much from the sheet and its 128,000 records of player j’s, hooks, and jams, with particular reference to the vaunted, if mythical, hot hand – the perceived, stepped-up likelihood that any given shot finding nothing but net will somehow inspire the next one to make the same discovery.

That’s something we can look at, too, along with a number of other analytical scenarios we can fold into our playbook.

But first a few words about the data, which, as the above-linked piece allows, were in fact accumulated for the 2014-15 season (owing to availability issues for subsequent years).  In addition, the 30 NBA teams play an 82-game season, or an aggregate of 1230 unique contests; yet the dataset’s GAME_ID field carries 904 game identifiers, affirming a shortfall of about one-quarter of all the games that actually took the floor. I can’t account for the fractional representation, but those are the data we have. You may also want to think, and do something, about the GAME_CLOCK field, whose times mistakenly offer themselves in hour/minute format when they really mean minutes and seconds. The first entry, expressed as 1:09 AM, really wants to archive one minute and nine seconds, and so if you want to work with these data you may need to reach for a recalibration (remember that each quarter in the NBA extends for 12 minutes), which could entail breaking open a new temporary column alongside GAME_CLOCK, subjecting it to the mm:ss Custom format, and entering in what is now I2:

=H2/60

Because of course an hour comprises 60 minutes, dividing what are the hourly totals in GAME_CLOCK (ignore the AM suffix) by 60 miniaturizes the values to minute/second magnitudes. Once you copy the formula down I, you can copy those outcomes back to GAME_CLOCK via the Paste > Values protocol and send column I back to the bench – and make sure GAME_CLOCK assumes the mm:ss Custom format, too.
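If the divide-by-60 sleight of hand feels too cute, an equivalent, slightly more self-documenting conversion could lean on TIME instead, recasting the hour component as minutes and the minute component as seconds:

=TIME(0,HOUR(H2),MINUTE(H2))

Either way the 1:09 AM reading resolves to one minute and nine seconds.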

That conversion having been plied, we can proceed to and get past the time-honored column auto-fit exercise and then begin to run-and-gun some actual questions at the data, including variations on the themes drawn by the Medium study. One such question asks about league shooting percentages, varied by what the authors term “areas of the [playing] floor”, a slightly misinforming alias for what are in fact distances from the basket, whose data in feet register themselves in the SHOT_DIST field. In fact a given distance can be fixed at any number of positions on a semi-circumference on the floor, e.g. the foul line or another point nearer the floor perimeter, even as both are plotted 15 feet from the backboard, for example. Two different “areas”, then, same distance. In any case, we could ask if shooting percentages push downwards with increased distance from the hoop. Common sense of course suggests they do, but proof awaits.

And here the Medium study performs something of a personnel substitution, by  turning to a new dataset for the shot-distance data (which again were gathered from the 2014-15 season) – even as it seems to me that those data could have been derived from the Kaggle workbook via its SHOT_DIST field. (In addition, the NBA data linked in this paragraph offer data for the entire 2014-15 season, disrupting a precise like-for-like comparison of our data which again report the numbers for but three-quarters of the year).

I can think of two or three means toward a shot-percentage-by-distance set of answers, the first cooking up a messy stew of FREQUENCY and COUNTIFS formulas, the second a relatively more elegant pivot table that necessitates an important tweak just the same.
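For the record, one serving of that stew – here computing the percentage for shots launched from at least 0 but less than 5 feet, and assuming for illustration's sake that SHOT_DIST occupies column Q and FGM column R – might read:

=SUMIFS(R:R,Q:Q,">=0",Q:Q,"<5")/COUNTIFS(Q:Q,">=0",Q:Q,"<5")

You'd then need to clone and re-edit that formula for every bin, which is precisely the messiness the pivot table spares us.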

Remember we’re interested in correlating shooting percentages with distance from the basket, and as such we could try

Rows: SHOT_DIST (grouped by units of 5 feet)

Values:  FGM (stands for Field Goals Made, Sum)

FGM (again, this time Count)

We want Count for that second invocation of FGM, because a counting of all the elements in the field – the 1's for shots made and the 0's for those missed – delivers the total of all shots attempted.

I get:

nba1

Note first of all the presentational imprecision besetting the row labels, by which the upper number in each bin is reprised in the lower value for the bin that follows. The numerical actuality imputes the accurate reading to the lower value, e.g. the first bin really tops out at 4.9 feet, and all truly five-foot shots contribute to the second bin. Remember, though, that you can hand-modify the labels, in the service of clarification, for example:

nba2

(Note also that the grouping by 5 really builds bins comprising six values, e.g. 0 through 5.)

But aesthetics aside, we still need to calculate the respective shooting percentages binned by those grouped distances – a simple mathematical proposition by itself, asking us merely to divide field goals made by field goals attempted. That intention sounds like a call for a calculated formula that could look something like this:

nba3

But guess what – as Excel savant Debra Dalgleish reminds me, calculated fields work exclusively with summed fields; try sneaking a COUNT in there and that nice try will be rejected.

A second suggestion, this one external to the pivot table proper, would be to compose a simple formula alongside the pivot table, e.g. =B4/C4 for the first bin, and copy it down, each formula sidling its bin. That'll work, but if you regroup the data by a new interval, say 10 feet, the pivot table's now-fewer rows will kick up a clutch of #DIV/0! errors that cling to the now-nonexistent bins.
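You could armor the formulas against that eventuality, e.g.

=IFERROR(B4/C4,"")

which at least blanks the orphaned cells, though it leaves them in place; the deeper problem – formulas that don't travel with the pivot table – remains.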

But this next alternative seems to work, even if it’s redolent of a kludge: make room for a new column in the dataset (say alongside FGM in S), call it something like ShotsAttempted, enter a 1 in S2, and copy that meek value down the column. What’s this curious maneuver doing? Glad you asked. It enables this calculated field:

nba4
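In formula terms, the calculated field presumably amounts to nothing more than

=FGM/ShotsAttempted

since a calculated field works with the sums of its constituent fields.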

ShotsAttempted’s endless litany of 1’s will be summed, and will divide themselves into the FGM values and break out by SHOT_DIST in Row Labels. Format appropriately (I tried a Custom format keyed to the .000 motif) and you’ll get something like this:

nba5

(Note that the Values area need only consist of the calculated field results.)

Of course the surpassingly high field goal percentage for shots in the 0-5 feet range won’t surprise (a zero-foot shot is presumably a dunk or a layup); what’s at least slightly surprising is that once shooters move out beyond five feet, the percentages sink markedly, in part at least a function of the more assiduous defense applied to the shooters out there. After all, if you’re positioned to dunk the ball you’ve already lost the man assigned to guard you (I cannot speak first-hand, you understand). Indeed – the aggregate shooting percentage for shots equaling or exceeding five feet is .393.
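That .393, by the way, submits to the same SUMIFS/COUNTIFS tandem sketched earlier – again assuming SHOT_DIST and FGM in Q and R:

=SUMIFS(R:R,Q:Q,">=5")/COUNTIFS(Q:Q,">=5")

which should corroborate the figure.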

And once you’ve made your way this far into the analysis you can select the pivot table, copy it, and perform a Paste > Values nearby, thus establishing a fixed individual player baseline. Then draw up a Slicer earmarking the player_name field, and you can check out your favorite hoopster’s percentages by distance, e.g.

nba6

Moral of the story: don't let LeBron get too close to the basket – nudge him out past 25 feet. At least that's what I try to do with him.

New York City Restaurant Inspection Data: Tips Included

9 Jan

Expressing an interest in getting a bite to eat in New York calls for a bit of narrowing down. You will need to get a little more specific about your preferences, in light of the 26,000 or so restaurants in the city happy to fill your mouth, and evacuate your wallet.

Indeed – advising your New York crew “Let’s go to a restaurant” reminds me of the woman who stood in front of me in a Starbucks and requested a cup of coffee, the kind of order that drives baristas into a hand-crafted frenzy.

But once you’ve finally sat yourselves down you may want to learn a little more about what exactly it is you’ve gotten yourself into – literally – and the restaurant inspection data reorganized by the Enigma public data site may go a ways towards telling you more than you wanted to know (the data are free to you, but they want you to sign up first. Remember that the lower the inspection score, the more salubrious.)

I say "reorganized" – although Enigma will tell you they've "curated" – the data, because the inspection outcomes have presumably been culled from New York's remarkably near-real-time and far larger official data set, available on the city's open data site (and Enigma's too, though their version is three months old). The revision opens an interesting point of entry, then, to an understanding of how someone else's data have been re-presented by someone else.

In what, then, does Enigma's remake of the original data consist? For one thing, they've proposed to distill the source data set down to a unique entry for each restaurant (keep that stratagem in mind), each of which, after all, has been subjected to several inspections. By way of verification I aimed a Remove Duplicates check at the camis field comprising restaurant ids, and came away with but six redundancies – not too bad for a compendium of nearly 25,000 records.
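A formulaic second opinion on those redundancies is available, too. Assuming camis holds down column A, park this in a spare column alongside row 2 and copy it down:

=COUNTIF(A:A,A2)>1

A filter for TRUE should then isolate the half-dozen culprits.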

And once having completed that chore we can run a simple but revealing pivot-tabled census of New York’s eateries by borough:

Rows: boro

Values: boro (count)

boro (again, by % of Column Total)

I get:

resto1

No one will be surprised by Manhattan’s restaurant plurality, though it should be added that the residential populations of both Brooklyn and Queens far exceed that of the storied island. In addition, keep in mind that the endless turnover of restaurants (the Quora article linked above declares an annual restaurant closure rate of 26%, though that assertion should probably be researched), turns the count into an implacably moving target.

And for another thing, the Enigma set has padded the progenitor data with each restaurant's geo-coordinates (latitude-longitude), thus priming a mapping capability. But they've also, befitting one of Enigma's enigmatic apparent first principles, reformatted the inspection dates into text mode.

And Enigma's alternate take has also put the scissors to some of the set's original fields. The Critical Flag field – naming restaurants that incurred what the Department of Health and Mental Hygiene terms critical violations, "…those most likely to contribute to food-borne illness" – is gone, and I'm not sure why. Those data sound like something you'd want to know about, and analyze.

But there's a pointedly more serious issue besetting the data that I haven't quite figured out. Because Enigma determined to squeeze the data into a one-record-per-restaurant yield, it had to decide exactly which record would be earmarked for retention; and common analytical sense would commend the latest such record, conveying the current inspection standing for each restaurant. But it appears that Enigma hasn't always nominated the latest record. A spot comparison of the records across the two datasets turned up some Enigma selections that predate more current inspections for the same restaurant in the official New York workbook. And if those kinds of discrepancies riddle the Enigma data, then we need to wonder about the decision rule that authorized their inclusion – and I don't know what it is. What would an aggregate averaging of inspection scores purport to say, if some of the scores have been superseded by newer ones? (My emailed query to Enigma about the matter remains unanswered as of this writing.)

Moreover, because the one-record stipulation is in force, Enigma was impelled to collapse disparate violation codes in that eponymous field. The very first record, for example, for the Morris Park Bake Shop, reports two violations coded 10F and 8C, both filed on May 11, 2018. But New York’s precedent dataset has assigned a distinct record to each of the two, easing a pivot table breakout by code.

And those code consolidations – an ineluctable follow-on of the one-record-per-restaurant decision – probably explains Enigma’s omission in turn of the original Violation Description field. Boxing multiple violations in the space of one cell might confound legibility for both researchers and readers, and so Enigma likely concluded the whole field was best expurgated – at a price of course, because now we don’t know what the violation codes mean.

Now to be fair, Enigma also furnishes a worksheet-housed directory of those codes, which make for a most serviceable lookup array; but the multiple-code cell structure of its inspection data makes for an exceedingly messy prospect for 24,000-plus lookup values, which must be individuated somehow.
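One pragmatic, if partial, tack would be to pry the first code out of each multi-code cell and aim the lookup at that – something like the following, on the assumption that the violation codes inhabit, say, column J and are comma-separated:

=TRIM(LEFT(J2,FIND(",",J2&",")-1))

The appended comma keeps FIND happy when a cell carries only a single code; but of course any second and third codes in the cell remain unaccounted for.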

But all these cogitations have given me the munchies. Where do you want to eat? You want Chinese? Fine – that pares the choices to around 2,400. Your treat.

Walking Across the Brooklyn Bridge: Stepping Through the Data, Part 2

24 Dec

The decision to walk across the Brooklyn Bridge is a distinctly multivariate one, even if the internal equation that sets the walk in motion doesn’t chalk its terms on the walker’s psychic blackboard.

That preamble isn't nearly as high-falutin' as it sounds. Nearly all social activities negotiate trade-offs between these and those alternatives, and a promenade over the bridge is no different. We've already observed the great, and expected, variation in bridge crossings by hour of the day in the previous post, and we could next consider the impact – if it's proper to think about the matter in those causal terms – of month of the year on journey distribution (remember that our data record bridge crossings from October 1 2017 through July 31 of this year).

That objective calls for this straightforward pivot table:

Rows: hour_beginning (grouped by Year and Month. You need to put both of those grouping parameters in place in order to properly sequence the months, which straddle parts of two years.)

Values: Sum

Sum (again, here by % of Column Total)

I get:

bb1

The differentials are formidable, for me surprisingly so. One would have expected bridge foot traffic to crest in the summer, but a July-January walker ratio of 2.8 comes as a surprise, at least to me (remember again that the above totals compute one-way trips). It's clear that meteorology has a lot to do with the decision to press ahead on the bridge, in addition to, or in conjunction with, chronology, i.e. the hour of day and day of the week. What we can't know from the findings is whether the walkers had to get from Brooklyn to Manhattan or vice versa and chose to walk, or whether the trips were wholly discretionary.

And would one expect a spot of rain to discourage walkers? One suspects as much, of course, but confirmation or denial should be but a few clicks away. We could, for example, write a simple CORREL formula to associate precipitation with pedestrian turnout, provided we understand what it is we're correlating. Here we need to remind ourselves that because, in the previous post, we subjected the pedestrian data source to a Get & Transform routine that replicated it, the copy rolled out twice as many rows as we found in the original, assigning a record each to the Towards Manhattan and Towards Brooklyn hourly totals. As a result each hourly precipitation figure is counted twice there, and so simplicity would have us look at the rainfall data in the original dataset, if you still have it. If you do, this CORREL expression, which assesses precipitation by hour:

=CORREL(B2:B7297,H2:H7297)

Delivers a figure of -.0093, or a rather trifling fit between rain/snow and the determination to walk the bridge. Now that doesn’t look or sound right; but that perception is my way of saying it doesn’t comport with my commonsensical first guess.

But the correlation is “right”, in light of the manner in which I’ve framed the relationship. Because the formula considers precipitation with hourly pedestrian totals, most of the rainfall entries are overwhelmingly minute, and indeed, in over 4800 cases – almost two-thirds of all the hourly readings – amount to zero. The correlation appears to capitulate to what is, in effect, a host of unrelated walk/rainfall pairs.
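That two-thirds claim is itself a one-formula check – assuming the hourly precipitation figures occupy H, per the CORREL above:

=COUNTIF(H2:H7297,0)

should return the 4800-plus zero-rainfall hours.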

But if you correlate walk numbers with aggregate rainfall by entire days the numbers read very differently. Continuing to work with the original dataset, try this pivot table:

Rows: hour_beginning (grouped by Days)

Values: Pedestrians

Precipitation (both sum)

(Note that the row labels naturally nominate 1-Jan as the first entry, even as that date isn’t really the earliest. Remember the demonstration project got underway on October 1, 2017. But chronological order – really a lowest-to-highest numeric sort – is in no way a correlational necessity.)

Running a correlation on the above outcomes I get an association of -.333, which makes my common sense feel better about itself. That is, as calibrated here, rain “affects” pedestrian turnout to a fairly appreciable extent – the more rain, the fewer walkers, more or less. Again the (negative) correlation reflects the precipitation aggregated by days, not hours. Indeed – just 47 of the recorded 304 days report no precipitation at all.
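Mechanically, that daily correlation is just CORREL re-aimed at the pivot table's two value columns, e.g.

=CORREL(B4:B307,C4:C307)

assuming the table's 304 data rows happen to begin in row 4 of columns B and C – adjust the ranges to wherever your table actually sits.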

And how does temperature figure in the decision to traverse the bridge? Again working with the original data set (and not the pivot table), in which each hourly instance appears once, we can rewrite the correlation, this time introducing the temperature field, which I have in the G column:

=CORREL(C2:C7297,G2:G7297)

I get .391, another persuasive, if partial, relationship. With higher temperatures comes stepped-up foot traffic – to a degree, pun intended – but that finding induces a couple of hesitations. For one thing, the Fahrenheit system to which the temperatures are here committed rests on an arbitrary zero point, as it were – the temps aren't keyed to an absolute zero. And so it occurred to me that a second correlation, this one redrawn with the temperatures pitched in Centigrade mode, might crunch out a different result. That statistical hunch had me open a new temporary column (in my case in H), in which I refigured the temps with the cooperative CONVERT function, e.g.

=CONVERT(G2,"F","C")

Copying down H and reprising the CORREL, this time with the C- and H-column ranges in tow, I wound up with… .391, at one with the first result, at least if you're happy with a 3-decimal round-off, and I think I am.

But in fact the two .391s depart from one another by an infinitesimal sliver. The first, associating walk totals with temperatures expressed in Fahrenheit, comes to .390815536. The correlation with the temperatures in Centigrade (Celsius) calculates to .390959393. In principle a Pearson correlation should be wholly indifferent to a linear rescaling of one of its variables – and Fahrenheit-to-Celsius is exactly that – so the sliver presumably owes to round-off somewhere in the calculation rather than to the measurement systems themselves, a difference about which few are likely to care, to be sure. But that discrepancy does mean I need to learn more about the workings of correlations.

The other caution about the correlation, whichever one chooses, asks about its linearity. While we could reasonably anticipate a swell in bridge crossings as the mercury ascends, it's entirely possible, on the other hand, that pedestrian activity could be inhibited by temperatures forbiddingly high – in the 90s, for example.

And that conjecture could be put to a pivot table (small note: two rows among the data record no temperatures), e.g., assuming again we've remained with the original dataset, which features each temperature only once:

Rows: temperature (grouped, say, in bins of five degrees)

Values: Pedestrians (sum)

Pedestrians (count)

I get:

bb2

(The blanks reference the two empty temperature cells, and can be filtered out.)

The pedestrian count in effect totals the number of days populating each grouped temperature bin. After having filtered the blanks, move into the next-available D column – a space external to the pivot table – and enter in D4:

=B4/C4

Round to two decimal places and copy down D (I don't think a calculated field can muster this result). I get:

bb3

We’ve disclosed a strong if imperfect (and unsurprising) upward association between pedestrian hour averages and temperature. But the highest reading – 89-94 degrees – does seem to drive a pull-back in traffic. Note in addition the leap in hourly crossings from the 74-78 to 79-83 bins, as if 80 degrees or so lifts the inclination to walk to its tipping point.

So there. Didn’t I tell you the decision to lace up those high-heeled sneakers was multivariate?

P.S. In response to my previous post’s curiosity about a few missing Towards Brooklyn/Manhattan data, New York’s Department of Transportation wrote me that the empty entries might be attributable to a weather-induced snarl.

Walking Across the Brooklyn Bridge: Stepping Through the Data, Part 1

11 Dec

Ask New Yorkers what they were doing at 4 o’clock in the morning on the night of October 12, 2017 and they’ll tell you they a) don’t remember or b) won’t answer on the advice of counsel. But no matter – we know exactly what one intrepid iconoclast was doing at that hour – walking towards Manhattan on the Brooklyn Bridge.

That solitary – and courageous – wayfarer will be happy to know we have no more information about him, or her; all we know is that he – or she – was sighted and duly recorded by the Brooklyn Bridge Automated Pedestrian Demonstration Project, an initiative of New York’s Department of Transportation.

Squirreling an electronic counter somewhere on the Manhattan-side approach to the storied bridge, the project gathered footfall data for both directions through the October 1 2017-July 31 2018 span (I'll own up to the pun) and walked them over to New York's open data site here (click the Export button on the screen's far right and tick the relevant CSV option).

Our solo nightwalker presages a dataset that is nothing if not intriguing, even as it exhibits a few organizational curiosities. Apart from column A’s all-but-standard need for a widening via auto-fit (its dates/times are authentically numeric, though) you’ll join me in declaring the contents of column B – comprising 7,296 citations of the phrase Brooklyn Bridge – slightly superfluous, and so eminently dispensable.

And much the same could be offered about the lat and long fields, each of whose cells delivers the same coordinate, presumably positioning the Brooklyn Bridge in its earthly locus. So too, the Location1 field restates its data all the way down, and in textual terms, no less. We've seen this sort of thing in any number of open datasets, and it's proper to wonder why. One assumes some industry-wide download routine batches out these relentless uniformities, but whatever the accounting, the data they produce aren't needed and could be either deleted or ignored.

And there's another corrective that merits a claim on our attentions. The numbers in the Pedestrian field for November 9 at 7 and 8 PM read 411 and 344 respectively, but the companion data in the Towards Manhattan and Towards Brooklyn cells – which when summed should equal the Pedestrian figures – report nothing but zeroes. And in view of the fact that the pedestrian numbers for November 9 at 6 and at 9 PM read 455 and 300, it seems clear that the counts for 7 and 8 could not have amounted to nothing at all. I broached the discrepancy via emails to both the Department of Transportation and the New York open data site, but have yet to hear from either. For the moment, we have to proceed with four empty cells.

And there’s something else, a data-organizational failing that could, and should, be righted, and one we’ve encountered in previous posts. Because the Towards Manhattan and Towards Brooklyn fields host data that are really of a piece and should be treated unitarily (the easier to calculate the percent of pedestrians by direction and date/time, for example) they should migrate their holdings to a single parameter (which I’ll call Direction), via the Get & Transform Data routine with which we’ve recombined  other datasets. Start by selecting both Towards Manhattan and Towards Brooklyn and continue per these instructions.

bridge1

The Get & Transform alternate take then also frees us to delete the Pedestrian field, because the new amalgamated Direction brings along with it a Value (I’ll rename it Sum) column that totals the Brooklyn/Manhattan numeric data by each date/time entry, thus superseding Pedestrian.  Of course the new dataset (presenting itself in table form) comprises twice as many records as its progenitor, because each Towards Manhattan and Brooklyn notation now populates the new Direction field with an individual record instead of the previous parallel fields populating the same record; but that trade-off is well worth the price.

Once that extended preliminary chore completes its work we can try to make some sense of all that fancy footwork on the bridge. We could start by pivot tabling pedestrian totals by hour of the day:

Rows: hour_beginning (group by Hours only)

Columns: Direction

Values: Sum

I get:

bridge2

The two obvious attention-getters here are the steep, but largely predictable, fluctuations in foot traffic by hour, and the direction of that movement. Note the grand totals as well: over 5 million unidirectional trips launched across the ten-month project period – about 16,500 a day, or approximately 8,200 discrete, carless (don't read careless) New Yorkers opting for the scenic route over the East River, assuming of course they walked the walk both ways. And if you're wondering about 4AM, we find around 5.6 average crossings ventured at that hour – even though that hardy coterie probably could have gotten a seat on the subway then, too.

And consider the literal back and forth of the walkers’ directional proportions. 3PM (that is, the hour running through 3:59), attracted the most pedestrians, closely flanked by 2 and 4PM; and though I doubt the reasons for the hours’ appeal are particularly mysterious, I don’t know what they are. An optimum time for a mid-afternoon stroll? A cadre of workers perambulating home from their early shift, or setting out on their way to a late one? I doubt it, but research on the matter awaits.

And what of the 8AM hour, in which walks toward Brooklyn far outnumber trips to Manhattan? I would have thought – wrongly – that the press of rush hour would have drawn Brooklynites toward Manhattan in predominating numbers, but we’ll have to look elsewhere for an explanation. But by 9AM the flow reverses, rather pronouncedly and curiously.

Now the above table might read more crisply by removing the grand totals for rows (while retaining them for columns) and spinning the numbers through the Show Values As > % of Row Total grinder (remember that this prospect has been enabled by the Get & Transform protocol):

bridge3

Thus we see 58% of the 8AM traffic heading toward Brooklyn, with a pull-back to 44.71% by the next hour – a dramatic, and analytically provocative, reversal.

And what about pedestrian accumulations by day of the week? Common sense allows that weekend totals should be the greater, and that conjecture dovetails with the truth. But extolling common sense won’t do much for your byline; you’ll need to substantiate the finding with a wee bit more precision. Start by introducing a new column to the immediate right of hour_beginning, call it Weekday and enter in what should be B2:

=WEEKDAY(A2)

(If you click on A2 in order to emplace that cell reference in the formula, Excel will respond with a table structured reference; but the result will be identical.)

And because the dataset has assumed table form, the formula will instantly copy itself down the B column.
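If the 1-through-7 coding feels austere, a more legible alternative – at the cost of labels that will sort alphabetically rather than chronologically in the pivot table – would be:

=TEXT(A2,"ddd")

which spells out Sun, Mon and so on.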

Then try:

Rows: Weekday

Values: Sum

Sum (again, this time % of Column Total)

I get:

bridge4

(You’ll want to rename the headers. Remember as well that 1 signifies Sunday, 7 Saturday.)

Common sense prevails, though I hadn’t foretold Saturday’s substantial edge over Sunday.

But nor did I expect to find someone on the Brooklyn Bridge at 4 in the morning. I want to see the selfie.

 

 

Top 100 Toy and Game Manufacturers: Playing Around With the Data

28 Nov

A worksheet is a big thing. If Excel savant Francis Hayes has read his surveyor’s map properly, he’s discovered that a sheet’s 17 billion cells (give or take a few hundred million) marks out a square mile’s worth of territory, and that’s an awful lot of lawn to mow. And while your dataset isn’t likely to push its lot of fields into the three-lettered column nether (column AAA is a sheet’s 703rd), it’s nice, as they used to say about the Sunday New York Times, to know it’s all there.

But that gigantic tract of cells means there’s enormous room for a spreadsheet’s design to assume this or that conformation, a reflection that’s cued by a look at a Statista workbook that lists and details the corporate skinny for planet Earth’s top 100 toy and game companies. It’s here:

Top_100_Toys&Games – Statista

We can agree that the Toplist sheet is iconoclastically organized; it has assigned five discrete datasets to its space, each reporting on a different corporate parameter and extending its overall reach to column AX. By itself, of course, that layout program is perfectly legal, but it controverts the conventional wisdom that would commend each dataset to a sheet all its own.

On the other hand, one would be entitled to ask exactly what's ultimately "wrong" with the all-to-one-sheet scheme. One answer could point to a compromising of navigational ease across the sheet; we're accustomed to finding our data waiting for us somewhere in the upper reaches of the A column, and only one among our quintet can answer to that description, of course. Here, you'll need to do your share of scrolling if you want to view the other four. (Another note: appearances to the contrary, the datasets are just that, and not the tables a first look at the set might suggest. The banded-row effects – a default staple of tables – were here hued by a series of conditional formats that assign contrasting colors to odd and even-numbered rows respectively.)

But I'd agree, on the other hand, that the navigational objection could be quashed as less-than-substantive, and merely presentational. There are other demurrals, though, that could be aimed at the datasets, ones that might prove a jot more incriminating.

For example, you’ll note that the actual names of the 100 firms appear only in column B in the first dataset along with their rankings in A. Yet the actual ranking determinant – presumably the revenue-declaring   “latest value” field in the AG column – finds itself in the fourth dataset only.

Moreover, you’ve probably also observed the filter buttons wedged between the datasets in the empty columns that separate them, buttons that of course have nothing to filter and thus provoke the necessary follow-on question: why are they there at all?

toy1

The answer, it seems to me, insinuates a likely pre-history of the datasets: that the five were originally one, and that in the interests of thematic reorganization the designer chose to chip the data into the fractionated wholes we’re viewing now. By way of proof, try drawing up a primitive dataset comprising three fields. Turn on the filter buttons (Data ribbon > Filter in the Sort & Filter button group) and proceed to interpolate a new column between any two of the existing columns. You should find yet another filter button topping the new, completely empty column/field, even as you leave it untitled.

So that’s what Excel does, affirming an operational reality that goes some way toward proving my point – that the data in the Toplist tab were originally of a piece, split only later into the five sets about which Statista wants us to know. And indeed – the very placement of all five in the same sheet corroborates the point.

What advantage, then, redounds to the breakup of the data? As intimated above, I suspect Statista wanted to sharpen the data’s readability and focus by subdividing them into the headings featuring in each set’s upper-left-corner cell (e.g. Rank, End of Fiscal Year, etc. And by the way, those headings – in row 4 – will not obstruct a sorting of the data, which will rather, and properly, draw their upper boundary at row 5. We met up with a similar curiosity in one of my posts on the American Sociological Association data. And pivot tables will likewise recruit their field names from the entries in row 5, which means that something will have to be done about A1, in which the name is represented by a dash.)

But the presentational gains realized by the data partitions cloud before the losses in functionality. If, for example, one wanted to group employee counts by manufacturer ranking you’d be stopped at the door, because those counts, in the third dataset, aren’t accompanied by the rankings there. That information is exclusive to the first set, and the two sets can’t be subjected to a relational query – even should they be converted into bona fide, prerequisite tables – because datasets one and three share no field.

Such is the problem, but the way out seems simple – delete the corridors of empty columns and reunite all the data into one grand, overarching monolith of a set. Remember that you’re an analyst here, not an aesthetician; you want the data to behave themselves and cooperate with your investigative intentions; looks come in a distinct second.

But that consolidative act doesn't conclude the remake. You'll likely want to attend to the fields topped by year figures – there are four 2017s, for example, renamed successively by Excel 20171, 20172, 20173, and 20174 in the interests of differentiation. I'm also not sure why the fields reserved for 2018 data are in place here at all, as they're unpopulated and could be deleted. While we're at it, we could ask what analytical utility devolves upon what was the fifth dataset, Revenue: Reported Currency (in millions); its income figures, expressed in country-specific denominations, thwart a like-for-like comparison. Thus one could tempt oneself to delete that erstwhile fifth set, save for the fact that Reported Currency preserves time-sensitive equivalences to the dollar. Thus, for example, tenth-ranked Bandai Namco Holdings, a Japanese concern, reported a 2017-2016 earnings ratio in its reported currency of 1.19; the proportion for the same years expressed in dollars comes to 1.08. Is that discrepancy worth pursuing? Maybe.

In any event, once you’ve fused the datasets into the greater whole you’ll be able to assess employee force size as grouped by the ranks of firms, say in bins of 5, e.g.

Rows:  Rank (grouped by the interval 5)

Values: latest value (the one in column U, that is. Column AD bears an identical heading, at least by default. Average, formatted to two decimals with commas. And true, latest value here might not edify a reader.)
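For the record, a Custom format along these lines should answer the two-decimals-with-commas description:

#,##0.00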

I get:

toy2

Among other things, we learn that employee complements don’t correlate neatly with ranks (which are derived from revenue), and surprise – the list descends to number 108.

But maybe that’s a top 100, accounting for an inflationary quarter.

Atlantic Ocean Hurricanes: Brain-storming the Data, Part 2

13 Nov

Obvious questions, as we learned in the previous post, don’t always facilitate obvious answers; and if you’re seeking additional cases in point, consider this up-front question we could put to our hurricane-data worksheet: has the average length of storms, measured in days, moved about across the 167 years of data filling the sheet?

By way of a first consideration, what we know won't answer the question is a dividing of the sheet's 50,000-plus records by the 1848 individual storms we've identified. The quotient that emerges – about 27 – can't propose itself as the storm duration average in days, because each storm elicited multiple observations of its movement and progress. The very first storm in the data set, coded AL0111851, triggered 14 observations across its four-day span. And its daily observation average of 3.5 tells us that the numbers of recorded observations weren't constant across days; and so while 27 represents the average number of observations performed per storm, that figure cannot be divided by some unvarying value to yield an average day count.
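Incidentally, both of those inputs can be conjured formulaically. Assuming, for illustration's sake, that the storm ids inhabit B2:B50001 with no blanks in the range – and mindful that the distinct-count piece is a sluggish calculation across 50,000-plus rows – something like

=COUNTA(B2:B50001)/SUMPRODUCT(1/COUNTIF(B2:B50001,B2:B50001))

should restore the average-observations-per-storm figure of about 27.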

We're also roadblocked from an imaginable alternative for calculating the average duration. We can't subtract each storm's start from its finish date, at least not directly, because the pre-1900 storms informing the list resist Excel's best date-formatting efforts. Enter a date antedating January 1, 1900, and the notation is forced into text mode, i.e. a non-numeric entity that can't be added, subtracted, and the like.

That famous limitation doesn’t close the door on the task, however, and I can think of a couple ways of poking my foot beneath the transom before it slams. The more elegant of the two rides the idea that a unique count of the dates associated with each storm is coterminous with its duration.

For example – storm AL0111851's 14 entries post four different dates to the Observation_Date field in column E. (We need to tap into Observation_Date and not the dates vested in the Date field in A, because those latter data are visited by times as well, and thus confound the search for unique dates alone.) Four different dates – assuming they're consecutive for all the storms, a not imprudent assumption – must signal a storm of four days' duration.

That the pre-1900 entries among the data don’t qualify as genuine, grade-A certified dates doesn’t matter here. All that concerns us is our capacity for culling one instance of each date, its format notwithstanding.

And to broadcast that possibility to our screens we can once again make productive use of a pivot table’s Distinct Count operation, mobilized by a tick of the Add this data to the Data Model box:

weath1

Set the pivot table in motion and organize it thusly:

Rows: Year

Values: Observation_Date (Distinct Count)

Storm_ID (Distinct Count)

I get, in excerpt:

weath2

The table thus grabs only one instance of each date and storm ID. From here, however, it appears as if we need to supplement the results with some simple but external formulas, because a calculated field invoking distinct counts – in which the annual counts of the observation dates might be divided by the yearly count of storm IDs – isn’t available to us.

We can opt for a none-too-graceful workaround, then, by entering this extra-pivot table formula in D3 (entering the title Average in D2), lining itself up with the first data row in the pivot table:

=B3/C3

Format the result to two decimals and copy down D (through the year 2017; omit the Grand Total row).

We could also, however, replace Year with the overarching Grouped Year field we had extemporized in the previous post. Try that, and delete the now-excess formulas in D that descend past the final 2010 grouped year.

I get:

weath3

Of course the Average field abuts, but cannot enroll in, the pivot table, and as such can't be party to a pivot chart. But by selecting A3:A20 and D3:D20 with the cooperation of the Ctrl key – which lets you select the non-contiguous ranges – you can insert a scatter chart with straight lines and markers and power up this chart:

weath4

You can right-click the horizontal axis, click Format Axis, and set the Major units interval for 10 as you see above, a decision that will disclose all the grouped years into the chart (note that the scatter chart seems to be preferred here because it honors the numeric status of the X axis data, here the grouped years; a line chart engineers a default treatment of X-axis data as labels. Look here for an instructive discussion of the issue. It’s true, on the other hand, that a conventional line chart could be put into place in this case as well, because the grouped years happen to be equally spaced and so would be indifferent to their label status; but you’d need to edit the axis in the Select Data dialogue box in less-than-obvious ways).

The chart delineates a dip of sorts in average storm duration across time, and so dashes a lurking, laymen’s speculation of mine – that the upheavals wrought by global warming would have served to prolong storm lengths. But ok – I guess that’s why we look at the evidence. (Again, of course, we assume that storm measurement criteria and instrumentation enter the equation as constants across the 167 years, a premise that could be vetted.)

Now for a next task, we could examine the average maximum wind velocities across the grouped years, again with an eye toward any material change. Because the job appears to require a distinctly stepwise solution, in which the wind maximum for each storm needs to be calculated and then followed by an average of the maxima by years, I don’t think a pivot table can deliver an immediate, conclusive result. Here’s what I’d do, then: commission this pivot table:

Rows: Grouped Year

Storm_ID

Values: Observation_Max_Wind

That table looks something like this:

weath5

I’d next turn the above results into a data set of my own, whose records could be plowed back into a second pivot table. I’d thus

  1. Redesign the pivot table into Tabular mode and likewise tick Repeat All Item Labels (both options stored in the PivotTable Tools > Design > Report Layout button in the Layout button group). Turn off Grand Totals.
  2. Click anywhere inside the above table, click Ctrl-A to select it in its entirety, and perform a simple Copy> Paste Values upon itself.

You’re left with an unprepossessing data set, divested of all its pivot table appurtenances. But we then proceed to pivot table the data anew:

Rows: Grouped Year

Values: Observation_Max_Wind (Average)

weath6

And if we're charting the above we're forced to return to a conventional line chart, because you can't derive a scatter chart from a pivot table, and Excel will duly inform you. Proceeding with the line chart option, then, and indulging in some by-the-book tweaks, something like this emerges:

weath7

The plummeting of maximum speeds commencing with the 1960 interval surely demands a closer look, particularly in view of the subsequent movement upwards approximating toward pre-1960 averages. How is the dip to be explained? Observational error or methodological rethink? Actual diminution in velocities?

I don’t know, but just don’t call this post long-winded.