Archive | Uncategorized RSS feed for this section

World Cup Data: Goal Seeking- and Finding

31 Jul

Like American presidential elections, the World Cup doesn’t seem to end; the four-year interregnum separating those two events seems ever more ostensible; and because some prognosticators have instated Brazil as winners of the 2022 Cup it may already be time to wonder if the tournament should be held at all.

But my latest information is that it’s all systems go, and anyway, Qatar is lovely in November; so in the interests of limning some helpful deep backgrounding of the picture, you may want to download and kick around the data set of all World Cup match outcomes extending through the 2014 go-round, niched here in Kaggle’s repository of holdings.

The set then records all 836 matches contested from the Cup’s inception in 1930 through the immediately previous competition, in relatively self-evident fields, with the exceptions of the Round and Match IDs in columns Q and R. Kaggle maintains that the identifiers are unique, but the Round IDs exhibit a qualified uniqueness, i.e., they appear to signify a certain stage in the tournament (e.g. semi-final) that by definition would have been reached by multiple teams bearing the same id. And the match ids display curiously variable lengths, suggesting a mid-data shift in their coding protocol. The 2014 matches, for example, sport nine-digit identifiers; in 1998 their lengths have shrunk to four characters.

More troublesome is the small but measurable number of redundant game records, signaled by shared match ids. A Remove Duplicates routine earmarking MatchID as the offending field discovered 16 doubled records, which were promptly shown a red card.

Once you’ve stretched all the columns via the requisite auto fit, you can begin to think about what’s interesting in here. What, for example, about putative home field advantage? That vaunted residential edge is something of a legal fiction here; the first record, for example, names France as the home team and Mexico as the visitors, in a 1930 game set in host country Uruguay. But that only nominal imputation spices the question; might even a desultory home team standing impact game outcomes?

Let’s see. Any formula presuming to calculate win percentages needs to reckon with a common soccer/football eventuality – the fact that many games culminate in a draw. As such, we can take over next-available column U, title it Home Win Pct., and enter in U2:


That simple expression means to ask: if the goal entry in G exceeds the figure in H – that is, if the home team wins, then enter 1 the appropriate U cell. If, however, the values in G and H are identical – signifying a draw – then assign .5 to the cell, the standard evaluation of an outcome in which each team in effect arrogates half a win. Once we copy the formula down U, we can simply add all the numbers and divide the total by 836, the number of records in the data set (remember we deleted 16 of them). The result: a home-team win percentage of 68.42, a disproportion that piques the question as to exactly how home teams are nominated.

For example: in the 1930 debut Cup, Argentina received home-team standing for four of its five matches, its sole “away” status assigned to its role in the final, which it lost to the authentic home team, Uruguay. Mexico, on the other hand, competed under the away rubric for all three of its games that year. And the home team in 1930 – however denoted – won all 18 matches during the tournament.

Explanations notwithstanding – though they do need to be pursued – we can calculate Cup-by-Cup home-team winning percentages via a rather neat deployment of the AVERAGEIFS function.

First, enter a bare section of the spreadsheet and list the years in which the Cup was held, starting with 1930 and coming to a stop at 2014 (I’m commencing in Y3). Once you type 1934, of course, you can drag the remaining years down their column by autofilling their four-year interval, remembering, however, that the Cup was suspended in 1942 and 1946. Then name the Year field in A yr, the winner field in U win, and enter in Y3:


And copy down the Y column.

How does simply averaging the win data – which after all, comprise either a 1, a .5, or a 0 – forward our objective? Contemplate this example: a team winning two games and losing one receives 1, 1, and 0 points for its exertions. Average the three values and the resulting .6667 returns the winning percentage for two wins and one loss.

If we’re happy with that understanding and then proceed to format the results in percentage terms, I get:


It is clear that somewhere, perhaps in the 70s, the idea of a home team underwent a rethink; I’m not sure what drove the apparent definitional overhaul, but it seems to have been put into place (for a possible partial accounting see this discussion). We even see an away-team edge gained in the 2010 Cup. I’m happy to entertain surmises about these disparities.

In any case, what about goals – e.g., have their outputs ebbed or surged across the Cups? If we want to figure a winning-to-losing team metric, say the average winning and losing goal total – or really, the average score  – by game by Cup, we’ll have to improvise, because those data aren’t expressed in existing fields. A couple of simple formulas should be able to answer our question, however. I’ve moved into column V, called it Win Goals, and jotted in V2:


That expression simply declares that if the goal total in G exceeds the one in the corresponding H cell, then return the value in G; otherwise report the number in H. If a game was drawn the logical test will not have been met, of course, but no matter; since in such a case the G and H figures are identical it matters not which one the formula returns.

I next head into to column W, label it Lose Goals, and write what is in effect the flip side of the above formula in W2:


Both formulas are copied down their respective columns, of course, and conduce toward this pivot table:

Rows: Year

Values: Win Goals (average, formatted to two decimals)

Lose Goals (same treatment as above)

I get:


The marked downturn in goal scoring is associated with the recency of the Cups; indeed, the overall winning-game average of 2.18 goals was last attained in the 1970 tournament, and the average victory margin of three goals in the 1954 contests exceeds the average per-game combined goal total for the last 14 Cups. Average winning margin for all games: 1.51 goals.

And let’s see VAR verify that .51 goal.


Hacks of Stacks of Wax: Billboard Top 100 Data, Part 2

16 Jul

The hits keep coming on the Billboard 100 dataset, and its mighty chorus of voces populi (it’s the plural; I checked) sounds an arpeggio of questions our spreadsheet is prepared to answer. Topping the survey, perhaps, is one that is both obvious and most compelling: who’s the most prodigious hit maker? The answer, again, should emerge from the trenchant Data Model Distinct Count query we described last week. It informs a pivot table that should look something like this:

Rows: Performer

Values: Song (Distinct Count)

Sort the results Highest to Lowest.

The listings, at least to this way-behind-the-curve listener, were cause for surprise:


Pulling away from the pack, and by a couple of orders of magnitude, is the vast vocal catalogue of ditties crooning your way from the Glee television show, its cover versions of other person’s hits splattering all over the charts, but with a curious aggregate brevity. Its 183 unique hits resounded through the rankings for a total of but 223 weeks, if I’ve gotten my filter right; not one-hit wonders, then, but one-week.

But those counts call for a measure of refinements. In addition to the generic Glee Cast appellation, a filtered scan of the data for the artists bearing the name Glee somewhere in their handle reports:


Filter-confining our pivot table to that expanded Glee complement, I get


Apart from the fact that I haven’t heard of half of the above collaborators, we’ve boosted the Glee count to 206 unique tracks that some time, somehow, succeeded in booking their place in the top 100.

And of course, the multi-name problem is no idiosyncrasy of the Glee phenomenon. You’ll note a Mr. Presley, whose 53 chart visits essayed in conjunction with his antiquarian colleagues the Jordanaires combine with his 49 solo efforts (we’re apparently not counting his backup singers here). That’s 102 appearances for the troubadour from Tupelo, but we’re not finished. Filter for Elvis Presley, and


I’m all shook up. (And like you, I have no idea who the Carol Lombard Trio/Quartet was. The movie star was spelled Carole, but so is one of the listings up there.) And by the way, remember that the Billboard 100 data tracks back to August, 1958; but Elvis’ debut hit, “Heartbreak Hotel”, bears a time stamp of February 1956, and so won’t be found here (though four renditions of a song with the same name by others will).

Aim a like filter at the Beatles – that is, the words Beatles, and –


Or take James Brown. Soul brother number 1 has 47 entries per his name in stand-alone mode, but filter for all instances of the original disco man and we see:


You’ll appreciate the problem; a proper census of each and every artist’s top 100 total would appear to require a filter of the sort we’ve applied above, a necessity that, if plied, can’t be directly pivot tabled, in part because a great many songs would need to be counted more than once. You’d need to allot an entry, after all, to each artist enumerated in a tandem hit, e.g. you’d be bidden to assign one hit each to the Beatles and Billy Preston for “Don’t Let Me Down” and “Get Back”. Remember them?

Now the route to another need-to-know metric, the total number of weeks an artist’s offerings have informed the top 100, offers a smoother ride, particularly if you simply need the total:

Rows: Performer

Values: Songs (Count)

Each appearance of a song in the data set amounts to a 1, after all, or one week’s visit to the top 100. Sort the outcomes by Highest to Lowest, and I get, in excerpt:


Your cultural weltanschauung will dictate your adjective, i.e., the results are interesting, surprising, confirmatory, or dismaying. I am in addition either embarrassed or proud to say I’ve never heard of Kenny Chesney, Keith Urban, and Brad Paisley; that these titans are country and western singers explains my arrant illiteracy in this character-defining matter.

But the complication we uncovered earlier reappears here. If you’re asking after the missing Elvis Presley in the above screen shot, for example, run the Performer filter for the generic Elvis Presley – again, filter for all instances of his name:


And you’ll see:


That’s 19 years’ worth of time spent in the top 100. Next filter for all mentions of Elton John:



A remarkably comparable durability, but again, we haven’t accredited Presley’s pre-August 1958 incursions into the chart.

And just for the record, here’s some other all-mentions-of-an-artist/top-100 week tenures:

The Beatles: 608

Michael Jackson: 726 (none of which factor in the Jackson Five’s 212 weeks, however)

James Brown: 682

U2: 397

Kelly Clarkson: 542

Diana Ross: 626 (but the Supremes qua Supremes, sans any official allusion to Ross, contribute another 299)

Barry White: 175

The erstwhile Beatles in solo-act capacity:

Paul McCartney: 344 (but Wings brings another 201 weeks to the count)

John Lennon: 161

George Harrison: 161

Ringo Starr: 129

But just don’t ask me what any of it means.

And still another need-to-know follow-on takes our analysis to its interrogative crescendo: Which tracks have enjoyed the longest stays (welcome or not) on the Billboard 100?

That question seems to admit of a pretty straightforward answer:

Rows: SongID

Values: SongID (count, of necessity; the field is text)

(Remember that SongID, and not Song, need be applied to the pivot table. SongID imparts a unique identifier to each song, in order to disambiguate multiple versions of the same song.)

I get, in excerpt:


Remember that SongID concatenates title and artist; and so oblivious am I to all these next big things that I wasn’t sure if the week leader above is entitled Radioactive Imagine by the Dragons, or Radioactive by the Imagine Dragons. I have since learned the latter formulation properly parses song and group; described by Wikipedia as a sleeper hit, Radioactive nevertheless somnambulated across the charts for 87 weeks (a figure Wikipedia corroborates), or about  1 2/3 years. That’s a long snooze; but don’t overlook their Demons, chart-resident for another 61 weeks. In fact, a scan down the list counts 55 songs that persisted somewhere in the top 100 for at least a year.

And I think the only one I know is Unchained Melody, by the Righteous Brothers. Are you amused?

Hacks of Stacks of Wax: Billboard 100 Data, Part 1

2 Jul

You don’t read big data, you analyze it. No one unrolls themselves into their hammock, reaches for their mint julep, and thrills to that page-turner of a 300,000-row data set they’ve been craving to get at all winter. Big data is meant to revel in its bigness, favoring the curious with its possibilities for aggregated, patterned and macro-leveled largesse, and largeness.

But sometimes the revel is in the details. Now and then a big data set comprises a gigantic compound of molecular bits whose very protons might be of sufficient interest to make you put your julep on hold – and I’m thinking about the 59-years of variously memorable hits filling 309,000 rows of the Billboard top 100 workbook, playing its enormous medley here on the site.

As indicated, the Billboard set recollects its chart-toppers all the way back to August, 1958, and if you’re just bursting to know for exactly how many weeks “She Loves You” oooed its way into the listings – and you probably are – or precisely when the epochal “Rapper’s Delight” first hip-hopped onto the rankings and your consciousness (15, and the week of November 10, 1979, respectively; but remember that the Beatles’ own German cover version “Sie Liebt Dich” also checked in for a week at 97 in June, 1964), you’ve assuredly come to the right place.

I don’t know about you, but I think the Billboard data – all 21.6 megabytes of it (i.e., you’ll have to download it yourself) – makes for a cracking good read – but it’s a spreadsheet, after all, and so some intriguing global findings should be in there, too. But as usual, the data need some inspecting before the work gets underway.

Note, for example, that the Peak Position and Weeks on Chart fields installed in columns I and J are, at least in theory, dispensable; one could derive both findings from a pivot tabling of the songs, subjecting Peak Position to a Min in Values, and then applying the song titles themselves to Values, realizing a count that would deliver a Weeks on Chart equivalent. That sparer approach would relieve the data of a slew of redundant entries, e.g. a song’s peak position appears identically for each week in which it appears.

If you’re wondering about the Instance field and what it means, you’re not alone. I originally supposed that it counts the number of times the same chart-bound song was performed by different artists (I use the term loosely), but that conjecture proved a false take. Rather, Instance seems to number a given version’s non-contiguous revisits to the charts. For example, Nicky Jam’s El Amante – a performer and song whose identities draw a pair of blanks in my uncomprehending mind – exhibits six instances; its debut at position 99 in the week of February 18, 2017 was succeeded by its disappearance the following week, only for the tenacious ditty to stage a three-week comeback dating from the week of March 4. Continuing to loll in the high 90s, El Amante submerged once again, before clambering back into 98 on April 4, etc. It last held its place in the rankings until the week of September 2, 2017, concluding its sixth instance – before it fell back into the oblivion it likely deserved.

Note in addition the SongID field, a unique identifier crafted by a concatenation of the entries in Song and Performer. Slightly curious is the retention of the formulas in their cells; their work has been completed, and could be paved over with a Paste > Values routine, an austerity move that serves to reduce the file’s size to 19.5 MB.

And if you’re wondering what purpose a song id might fulfill – that is, what analytical need would spur the assignment of an id to each song – I can think of at least one, one that returns us to an exigency with which I’ve contended before, and not optimally, as it turns out.

If we want to learn how many discrete songs clambered into the top 100 for any particular year we need – again – to do something about the recurring weekly appearances of the same songs, songs we want to count exactly once. I had expressed a similar wish, for example, in my posts on the Supreme Court Voting data, in which I wanted to count unique cases heard by the Court per year. I developed the count by embedding case data into the Rows area, where of course they’re enumerated but one time each. I then moved to analyze that satellite table instead.

But I’ve since learned that the above exertion is unnecessary, thanks to Excel frontliner Chandoo. He inclined my attention to an unassailably more elegant maneuver, that works like this:

But before I demonstrate, recall what I’m aiming to do: I want to pivot table a tabulation of the number of unique songs crashing the charts by year, and as such a prior step need be enacted upon the data before I set the table – I need to release year information from the WeekID field in B. That intention can be effected in several ways, but in the interest of simplicity I’ll scamper to next-available column K, call it Year, and enter in K2:


And copy down the column. That simple device releases the first four characters from each week id, which in every case offers up the year of the song’s chart entry (WeekId is text-formatted, by the way).

When the venerable Create Pivot Table dialog box opens, tick the Add this data to the Data Model box at its lower left (and someone tell the folks in Redmond it should be these data):


That tick activates Excel’s Data Model (which first made itself freely available in the 2013 release), an add-in that enables a number of data-querying enhancements, including the potential for building relational pivot tables. But our interest here is in those unique song titles, and so once you’ve executed the tick and the Data Model loads, advance to the pivot table (notice the slightly modified field list drawn up by the Data Model) and earmark Year for the Rows area. Next show SongID into Values, right-click into Summarize Values by, click More Options… scroll down and…


Wow – Distinct Count; what a concept. Click it, click OK, and I get (in excerpt):


(Note that the 1958 data are partial, encompassing only the last five months of that year. The 2017 listings extend to the end of October.) Subjected to a rudimentary line chart, the slope looks like this:


I’m not sure what sociological conclusions beg our attention, but the peak in song numbers in the 60s is marked, as is the decided slump in the ensuing years.

Put it this way: There’s something happening here/What it is ain’t exactly clear.

“For What It’s Worth”; hit the charts the week of January 28, 1967, and stayed there for 15 weeks. Peak position: 7.

The Verdict on Cook County Court Sentencing Data, Part 2

18 Jun

There’s plenty to learn from the feature-rich Cook County sentencing dataset, and so let the learning commence. We could, by way of first look at the data, examine what the set calls SENTENCE_TYPE, a self-evident header whose entries could be additionally broken out by another field, SENTENCE_DATE:


Columns: SENTENCE_DATE (grouped by year; see the previous post’s caution about the date-grouping challenges peculiar to these data.

Values: SENTENCE_TYPE (% of Column Total, as you turn Grand Totals off for the table’s column, which must necessarily come to 100%.)

I get:


(Bear in mind that the numbers for the 2010 and 2018 are very small, a concomitant of their fraction-of-the-year data representation.) Not surprisingly, Prison – the lengths of its terms unmeasured in this field, at least – accounts for the bulk of sentences, and rates of imprisonment verge towards virtual constancy across the represented years.

Probations, however, describe a bumpier curve, peaking in 2014 but exhibiting markedly lower proportions in both 2010 and 2018 (to date). Those wobbles, doubtless both statistically and criminologically significant, call for a look beyond the numbers we have.

I’ve been unable to turn up a denotation for the Conversion sentence type, but I can state that the zeros, first appearing in 2011, are actual nullities, and not an artifact of the two-decimal roundoff. Slightly more bothersome to this layman, though, was the Jail sentence; I didn’t know if that event discreetly named a particular mode of punition, or merely, and wrongly, reproduces Prison sentences under a second heading. It turns out that the former conjecture is in point, and that jail time is typically imposed for shorter sentences or individuals awaiting trial (see this clarification, for example. For an explication of the Cook County Boot Camp, see this piece).

The natural – and plausible – follow-on would associate sentence types and their distributions with types of crimes, but the presentational challenge proposed by a keying of the sentences to more than 1500 offense titles very much calls for a considered approach, indeed, one that could perhaps essayed with a Slicer populated with the offense titles. And while that tack will “work”, be prepared to scroll a long way until the offense about which you want to learn rises into the Slicer window. But such is the nature of the data.

And in view of that profuseness one could, perhaps, engineer a more practicable take on the matter, for example, by inaugurating a new pivot table, filling the Rows area with say, the top 20 offense types (right-click among Rows and click Filter > Top 10, entering 20 in the resulting dialogue box). Sort the types highest to lowest, line the Columns area with the Sentence Types, drop Offense_Type again into Values, and view them via the % of Column Total lens.

A screen shot here would be unacceptably fractional in view of the table’s width, but try it and you’ll see, for example, that 93.41% of the convictions for unlawful use or possession of a weapon by a felon resulted in a prison sentence, whereas perpetrators guilty of theft – however defined in this judicial context – incurred prison but 36.41% of the time, along with a 45.69% probation rate.

More troubling, however, is the small but measurable number of death sentences that appear to have been imposed on individuals sentenced for crimes not typically deemed capital. For example, .04% of the convictions for the possession of cannabis with intent to deliver/delivery of cannabis have drawn the death penalty, as have .05% of forgery convictions. These legal reprisals don’t ring true, and surely obligate the journalist in a redoubled scrutiny, if only to confirm their accuracy, and/or their propriety.

If you’re looking for other fields to irrigate, sentence length, expressed here as COMMITMENT_TERM, likewise begs for correlation with other fields; but here a new roadblock stalls the journey. Inspect the field and the adjoining COMMITMENT_UNIT column and the obstruction will loom large. Because the units present themselves in a raft of different durations, the correlations can’t proceed until some manner of unit reconciliation is brought to the data.

My early sense about the COMMITMENT_UNIT field told me that the disparities merely juxtaposed years to months; and were that the case, a simple formula could be stamped atop an new field, one in which either years could in effect be multiplied by or months divided by that value, e.g. a six-year sentence could be translated into 72 months.
But in fact, the units are a good deal more numerous and qualitatively varied than that. Turn a filter on the data set and click the arrow for COMMITMENT_UNIT. You’ll see:


While it would be possible to construct an equivalence lookup table for the chronological units enumerated above, e.g., one month rephrased as 720 hours, a sentence delivered in monetary terms – dollars – can’t be subjected to a like treatment. And a sentence of Natural Life – presumably an indefinite, open-ended prison stay – is similarly unavailable for equating. Moreover, I have no idea what the two records declaring sentences of “pounds” – 30 of them for a criminal trespass of a residence, and 2 for driving with a suspended or revoked license, and both pronounced in Cook County District 5 – Bridgeview – can mean. And you may note that 19 sentences comprising 365 days each were issued as well; how these distinguish themselves from one-year terms is unclear to me. Nor do I understand the 1526 sentences consisting of what are described as Term.

On the one hand, of course, the data set can’t be faulted for admitting all these “discrepancies” into its fold; they’re perfectly valid and pertinent records. On the other hand, they cannot, by definition, be forced into comparability with the other entries; they’re oranges to the predominating crop of apples.

The simple way out, of course, would be to sort out and excise the non-chronologicals and proceed, and on a blunt practical level that stratagem might work. But it would work here for the simple empirical reason that those incongruities are few, and as such, would not compromise the greater body of data. But what if these irregulars were formidably populous, and hence unavoidable? What would we do with them?

That is a good and interesting question.

The Verdict on Cook County Court Sentencing Data, Part 1

29 May

You can trace the course of justice in Chicago, including the direction and speed at which it travels, at the fledgling Cook County Government Open Data portal, a site brought to its URL at the behest of Kimberly Foxx, Illinois State’s Attorney for the county in which the city of the big shoulders shrugs. Four of the portal’s holdings –  Initiation, Dispositions, Sentencing, and Intake – chronologize the dispositions of cases processing through the system; I chose Sentencing for my look here.

It’s a big data set for a big city, recalling as it does sentencing records dating back to January 2010 and pulling through December 2017. With 189,000 cases and a field complement stretching to column AJ, don’t even think about calling it up in Google Sheets (the data-supporting capacity there: two million cells), but Excel is agreeable to its 41 megabytes if you are, and it’s available for download from the second above.

And at 41 megs, the minimalist in you will be avid to put your scissors to fields that might be rightly deemed dispensable. Cases in point: the four ID parameters fronting the data set in columns A through D, none of which are likely to advance your reportorial cause (note, by the way the interjection of commas into the large-sized identifiers, an unusual formatting fillip). Deleting the fields and their750,000 or so entries actually slimmed my workbook down to a lithe 29.7 mb, and that’s a good thing.

You may also note the slightly extraneous formatting besetting the INCIDENT_BEGIN_DATE, RECEIVED_DATE, and ARRAIGNMENT_DATE fields, their cells bearing time stamps all reading 0:00. I suspect these superfluities owe their unwanted appearances to the data in the ARREST_DATE field, which do exhibit meaningful times of suspect apprehension. We’ve seen this kind of excess before, but again it’s proper to wonder if any of it matters. If, after all, it’s your intention to re-present the data in pivot table form, for example, you’ll attend to any formatting disconnects there, and not here. If so, a reformatting of the data source may be no less superfluous.

But whatever you decide we can proceed to some analysis, acknowledging at the same time the scatter of blank cells dotting the records. Given the welter of substantive fields in there, quite a few possibilities beckon, and we could start by breaking out types of offenses by year, once you answer the prior question submitting itself, i.e. which of the available date parameters would be properly deployed here? I’d opt for ARREST_DATE, as it affords a kind of read on Chicago’s crime rate at the point of commission – or at least the rate of crimes culminating in arrest, surely a different and smaller-sized metric.

But if you’re thinking about installing the arrest dates into the column area, think twice – because the dates accompanied by their time-stamps are sufficiently granulated that they surpass Excel’s 16,384- column frontier. You’ll thus first have to swing these data into the Rows area, group them by Year, and only then can you back them into Columns, if that’s where you want them stationed.

And that’s what I did, only to be met up with a surprise. First, remember that Excel 2016 automatically decides upon a (collapsible) default date grouping by year, like it or not; and when I corralled the arrest dates into Rows I saw, in excerpt:


Now that ladder of years seems to be fitted with a column of rickety rungs. Remember that the sentence data appear to span the years 2010-2017, and so the aggregates above hint data entry typos, and at least some of them – e.g. the 1900 and 1915 citations – doubtless are.

The additional point, however, is that some of these putative discrepancies might tie themselves to crimes that were in fact brought to the attention of the justice system well in the past, and that took an extended while before they were actually adjudicated. Remember that our data set archives sentences, and some criminal dispositions take quite some time before a sentence is definitively pronounced.

For example, the 12 sentences associated with arrests made in 1991 reference serious crimes – seven murder or homicide charges, one armed robbery, one unlawful use of a weapon charge, one robbery and two thefts. One of the thefts, however, records an incident-began date (a separate field) of November 17, 2013, and thus appears to be erroneous.

But in any event, since our immediate concern is with arrests carried out in the 2010-17 interval I could click anywhere among the dates and proceed to group the data this way:


Note that I’ve modified the Starting at date to exclude the pre-2010 arrests, be they errantly captured or otherwise. Now after I click OK I can drag the years into the Columns area, after filtering out the residual <1/1/2010 or (blank) item.

Now I can drag OFFENSE_TITLE into Rows.

Surprise. With 1268 Offense categories cascading down the area you’ll have your work cut out for you, once you decide what to do next. Do you want to work with the data as they stand, or collapse near-identical types, and vet for misspellings along the way? Good questions – but in the interests of exposition we’ll leave them be.

How about something more immediately workable then, say age at incident? Exchange AGE_AT_INCIDENT for OFFENSE_TITLE, filter out the 2300 blanks, and group the ages by say, 3 years. Reprise AGE_AT_INCIDENT into Values (count). I get:


We see an extremely orderly negative association between age and arrests, with only the 20-22 tranche exceeding its predecessor bracket among the grand totals and only slightly. You’ll also observe that the numbers for 2017 are far smaller than the previous years, a likely function of incomplete data. In addition, track down to the Grand Totals row and behold the very significant ebbing of overall arrest totals from 2013 to 2016. Again, our table records arrest, and not crime totals, but the two likely point the same way – unless one wants to contend that the downturn in the former owes more to policing inefficiencies that any genuine diminution in crime – a not overwhelmingly probable development.

I’d then move to a Show Values As > % of Column Total look to learn how the brackets contribute differentially to arrests:


(The zeroes at the lowest reaches of the table reflect round-offs.)

Among other things, note the considerable, relative pull-back in arrests of suspects in the 17-19 range.

No, I don’t have an explanation at the ready for that, but perhaps you do.

Airbnb Data, Part 2: A Tale of Three Cities

14 May

There’s such a thing as overstaying your welcome, but Airbnb landlords – and in effect that’s what they sometimes appear to be – may be prepared to happily abide your long-term self, or selves, in their abode.

And that heartening show of hospitality may be illegal. They take a dim view of Airbnb’s good neighbor policy in London, for but one example, where the powers-that-be aren’t thrilled about the kind of lucrative serial subletting some Airbnbers perpetrate, straitening as it does the market for folks who’d prefer to hang tight in an actual, lease-driven apartment.

The long and the short of it then, is that it a review of Airbnb property availabilities – defined as the number of days in a year in which a given room remains on offer – could prove instructive, and our data for New York, London, and Paris devotes a field to just that question.

The analysis, then, should proceed pretty straightforwardly, once we do something about the startingly sizeable count of rooms – about 55,000 – that presently aren’t to be had. That is, their value in their dataset’s availability_365 field states 0, indicating that, for now at least, that room has been withheld from the market. An email from Inside Airbnb compiler Murray Cox informed me that zero means the property’s next 365 days in its calendar (presumably its start date is a moveable inception point, which Cox finds in the the scrape_date field in a different dataset) aren’t in play, at least temporarily.
And as such, those zeros – which are, after all, values that would contribute to and very much confound any formula result – have to be barred from the data set. Here I rely on my venerable highest-to-lowest sort of the availability_365 field, relegating the zeros to the bottom of the set; and once put in their place, an interpolated blank row immediately above the first zero will detach them from the usable data, for now (of course they can be recalled if needed via a simple deletion of the blank row).

And all that enables us to proceed here:

Rows: City

Values: availability_365 (Average, formatted to two decimals)

I get:


Real city variation is in force; and recall the linked article above, the one reporting the December 2016 London-approved bill “limiting Airbnb hosts to renting their property for only 90 days”. Looks as if a few thousand room owners in that city haven’t read the news lately.

We could next cross-tab the results by room type, by rolling room_type into Columns:


All the cities trend in the same direction, though not identically – itself a differentiation worth pursuing, perhaps. Availability widens as the rental space constricts, with shared rooms – defined by Airbnb as those in which “Guests sleep in a bedroom or a common area that could be shared with others”, presumably humans, who presumably might or might not be actually residents of the property – freed up for a considerably larger fraction of the year.

And the results make sense – even common sense, perhaps. Entire homes and apartments need be empty, by definition, and if so, where would their owners be expected to go for the duration of the rental?

That’s a good question, one that directs itself to one of flashpoints of the Airbnb controversy. Are its hosts the kinds of proprietors who hoard multiple listings that might otherwise be released to the conventional rental/purchase market?

A few sure-footed steps toward an answer would require us to divide all the rentals in a city by its number of hosts, i.e., an average of properties per host; and that simple division exercise needs to fill its denominator with a unique count of hosts, thus returning us to a problem with which we’ve tangled before. To reiterate it: an owner of multiple properties will naturally appear that many times in the data set, that is, once each for each holding, even as we want him/her here to appear once. In light of that complication I think the neatest way out this time is to conduct a Remove Duplicates maneuver (Data ribbon > Data Tools), and ticking the host_id field, the parameter whose entries contains the duplicates we want to shake out (again, you may want to save these results to a new workbook precisely because you’re shrinking the original data set).

But how do the host ids, once in their respective, solitudinous states facilitate a calculation of the number of properties they own, on average? Here’s how: once we’ve identified each host id singly, we can average the calculated_host_listings_count in column P via a city pivot table breakout. That field, which restates the number of each host’s holdings in each of his/her record entries, is one I would have deemed redundant to the data set’s design. After all, the owner-property count could otherwise be derived when needed, for example, via a pivot tabling of the hosts id, delivering the field to both the Rows and Values areas. But because we’ve removed all host id duplicates, that plotline has to be red-penciled – and that’s where the calculated_host_listings_count comes to salvage the script:

Rows: Country

Values: calculated_host_listing_count (Average, to two decimals)

I get:


We see then, that Airbnb hosts are for the most part single-property marketers, at least for the cities we’ve gathered. For those interested in more detail, we could try this:

Row Labels: calculated_host_listing_count

Columns: City

Values: calculated_host_listing_count (Count, % of Column Total, formatted in percentage terms to two decimals)

I get, in excerpt:


You get the idea, though we see London owners are notably more likely to offer multiple properties.

Speaking of which, croll to the bottom of the table above and you’ll find a 711, signifying the voluminous, apparent holdings of a fellow named Tom in London. But When I returned to our original, entire Airnbnb dataset, including the rooms for which availability was set at 0 days, I discovered but 350 properties strewn about London associated with his name.

Now Tom’s the kind of person Murray Cox wants us to know about; he owns so many properties that he’s lost track of half of them.

Airbnb Data, Part 1: A Tale of Three Cities

27 Apr

Would you rent your apartment to me? I have references from real people, don’t smoke, clean up after myself (is my nose growing?), and probably can be counted on not to trash your living room and adjoining vicinities.

Still don’t want to take my scratch for your pad? See if I care; there are plenty of other flats out there where yours came from.

Too many, in fact, according to Murray Cox, the self-identified “data activist” whose researches into Airbnb’s rental listings opened the door on a warehouse of dodgy practices, in numerous localities, e.g. property owners who market multiple dwellings, a clear and present violation of New York state law. Cox maintains that, among other things, the outsized scale of Airbnb offerings can worrisomely constrict a city’s available rental stock, and has published a city-by-city inventory (brought to my attention by a student) of Airbnb listings that you and I can download in most convenient spreadsheet form (look for the Summary Information and metrics for listings in… link attaching to each city).

It occurred to me that, among other potential takeaways, an intercity comparison of Airbnb activity might advance the journalistic cause a mite. I thus downloaded the data for New York, London, and Paris, all nicely exhibiting the same fields. With the intention of copying and pasting the data to a single workbook I cleared out a new column to the left of A, called it City, and entered and copied down the respective names of the three locations, properly lined up with their data once pasted, culminating in 162,701 rows of data, its 20 megabytes just itching to tell me what Airbandb has been up to.

Of course, the three-city amalgamation means to prime the data for a range of comparisons, but some comparisons don’t avail. I’m thinking in the first instance about the price field in what is now column K. These entries presumably cite daily rental rates, but express themselves in disparate currencies – dollars, pounds, and euros. One supposes an exceedingly determined investigator could mobilize and apply a round of currency equivalences to the records, a prospect that would require a vast compiling of date-specific rate fixes in short, a challenge likely to a real-world, deadline-mindful journo. I’ve thus elected to leave the numbers as they stand, and if that touch of laissez-faire works against the analysis I have no one to blame but myself. The buck stops here – and maybe the euro, too.

In any case, before we get fancy, we can think about this self-effacing pivot table:

Rows: City

Values: City (Count, by definition for a textual field)

I get:


We see that Paris – by far the smallest of the three cities – nevertheless advertises the largest number of Airbnbs. An accounting for that disjuncture would probably justify a deeper look. Might tourist cachet or friendlier legalities spur the Paris margin? I don’t know. But consider that, juxtaposed to Paris’ population of around 2.25 million and its average household size of approximately 2.3 persons, the city’s Airbnb stock could house around 6% of its residents – with the point, of course, that the inventory is apparently being withheld from the permanent-residence rental market.

Other incomparables have their place among the data, too. There’s little comparing to do as such among the three cities’ neighborhoods, and indeed – the neighbourhood group (UK spelling) field for Paris and London is utterly empty (the field for New York comprises the city’s five boroughs).

But of course other workable comparisons are available. What, for example, about average minimum stay requirements by city and type of rental? We could try this:

Rows: City

Columns: room_type

Values: minimum_nights (Average, formatted to two decimals)

I get:


We see that diffident London Airbnbers expect notably briefer stays at their places on average, with those uppity Parisians insisting that you agree to set down your knapsack – and debit card – more than a day-and-a-half longer before they let you in. At the same time, New York’s shared-room minimum is disruptively anomalous.

And for more evidence of cross-cultural heterogeneity – if that’s what it is – flip the values into Count mode and hand them over to the Show Values As > % of Row Total, ratcheting the decimals down to zero and switching the Grand Totals off (because the rows must invariably figure to 100%). I get:


The overwhelming Paris proportion devoted to the Entire home/apt offering is not, I would submit, proof positive of yet one more Gallic quirk, but rather a simple function of the Paris housing stock, in which apartments predominate.
For additional, if glancing, corroboration, try this pivot table:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (tick New York)

Values: neighborhood_group (Count, % of Row Total)

I get:


Recall that New York is the only city among our trio whose neighborhood group field is actually occupied with data – the names of its five boroughs. Note the relative Manhattan tilt towards Entire home/apt, even as the other boroughs, whose housing range features far more private homes, incline towards Private room – that is, presumably one private room among the several bedrooms in a home.

And what of daily price by city, keyed as it doubtless is to room type? It looks something like this:

Rows: City

Columns: room_type

Values: price (Average, to two decimals)

I get:


Again, that imperative qualification – namely, that the prices reflect evaluations per their indigenous currencies – need be kept in mind. As such, the New York tariffs verge much closer to the London figures when the appropriate, albeit variable, pound-to-dollar conversion is applied. With those understandings in place, the Paris Entire home/apt average seems strikingly low – because the Euro consistently exhibits a “weaker” relation to the pound, the former as of today equaling .88 of the latter. Yet at the same time, Paris’ private room charge would appear to be effectively higher.

Now again, because the data are there, we could compare average prices for New York’s boroughs:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (New York)

Values: price (Average)

I get:


No astonishments there, not if you’re a New Yorker. Manhattan expectably heads the rate table, though Staten Island’s second-place Entire home/apt standing may issue a momentary pause-giver, along with its basement-dwelling (pun intended) shared room rate.

That’s $7,300 a month for an entire place in Manhattan. And wait until you see the interest on the mortgage.