spreadsheetjournalism

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 2

Have Olympic athletes gotten taller and heavier? Common sense, the hypothesis builder of first, but not last, resort, submits a yes. After all, competitors in other sports seem to have bulked up and out; so why wouldn’t entrants to the global quadrennial enlarge as well?

The conjecture makes sense, common or otherwise, to me, but fortunately more persuasive corroboration is at hand. The Kaggle Olympic athlete data set features height and weight data, thereby adjudging my sense-making a false start.

Nevertheless, a few important caveats need be pronounced upon the data. First, a notably large portion of the height and weight entries – around 60,000 in both cases, or around 22-23% of all athlete records – are missing, and I haven’t paced my way through the data to learn if those ciphers are systematically distributed (and by the way, about 9,500 age records, or 3.5% report no ages either). Second, one must take some care to think about the like-for-like proviso, in view of the turnover of events appearing in this, but not that, Olympic year. There may be little analytical gain, for example, by according a place to the weight and heights of the combined 3,578 aspirants in the Art Competitions last conducted in 1948. One assumes their vital statistics contributed little to their medal fortunes, though one never knows (and since you asked, their average weight came to 165.64 pounds, with a mean height reaching to about 5’9”); but because the demographics for those pastimes simply aren’t there in the later Games their confounding effects must be considered, and probably discarded.

In addition, because boxing necessarily cordons its contestants by weight classes, its 6,047 contestants should be expected to evince little variation along that parameter across Olympics, and they don’t (they haven’t gotten much taller, either).

With those cautions in mind, investigative prudence might justify a concentrated look at those durable sports expected to win a continual place in the Games. We could then proceed to pivot tables buffered by a Sports Slicer that would more afford a cleaner look at the question, one sport at a time.

If we’re satisfied with that recommendation, any pivot table that would spring therefrom sounds like a rather straightforward proposition, but grouping Olympic years by a chosen interval – say a 12-year tranche that would parse three Olympics at a time imposes a small presentational disquiet.

My initial intent, for example, was to group the years into bins of 12 in the interests of averaging heights and weights for three Olympics at a time (but recall that the eventual rethreading of Winter Games into alternative even-numbered years would step up the number for any 12-year stretch). If, then, we move Year into Rows and frame the 12-year grouping, we’ll see:

oly21

Now that bit of consolidation works, but note the disclosure above of some years in which no Games were held (e.g. 1907), and which in fact don’t even appear in the data set. But the pivot table insists on interpolating those years in order to preserve the grouping’s consecutive character, however misleading to the reader. But remember that Row Labels can be edited for display purposes, that is to say you can rewrite the above spans 1896-1904, 1908-1916, etc.

On the other hand, because the data set counts a manageably grand total of 35 modern Olympics, we may be happy to leave the years as we find them in their singular, ungrouped state. If so, we can begin this way:

Rows: Year

Values: Height

Weight (both computing Average and formatted to two decimals)

Columns: Sex

Slicer: Sport

And turn off Grand Totals.

With all those material variables in play – Year, Sex, Height and Weight – the pivot table may tend toward a certain untidiness and hyper-density. Truncating the Average of… labels to something like Avg Height should sharpen its legibility.

Because I’ve delimited my interest to those mainstream sports likely to have been fielded across most or all Olympics, I can aim my first Slicer tick at Gymnastics, mustering these outcomes in excerpt:

oly22

Note, my fellow Americans, that height and weights are expressed in centimeters and kilograms, respectively. (To restate the former in inches you’d need to multiply each entry by .39701, one formulaic way or another. Poundage is expressed by multiplying a weight by 2.20462 (I had long thought the increment was exactly 2.2)). As for the #DIV/0! messages besetting the table that allude to sliced fields containing no data, these can be air-brushed out of sight via the PivotTable Tools > Analyze > Options > Options > Layout & Format tab and leaving the relevant field below blank:

oly23

Returning to the gymnasts, note the decided lightening of women gymnasts from 1948 through 1980, an average weight reduction, so to speak, from 57 to 41.42 kg, or well over 30 pounds. Indeed, women gymnasts were on the average 5.89 kg, or 13 pounds, lighter in the 1980 Games than in the 1976 competition; but the poundage reverts upward again from 1980.

Explaining the variations requires something more than a top-of-the-head, knee-jerk, or any other such anatomically-driven response. Remember for one thing that many women gymnasts are counted multiple times among the data, and so their weights are submitted to the averages as often. Secondly, a survey of the particular gymnastic events in force for any given Games would be in order as well (the Gymnastic heading spreads over a variety of events), as new or abandoned events could have skewed the totals. In addition, the boycott by 66 nations of the 1980 summer Games may have likewise wrenched the average down.

On the other hand, the weight trajectory for male gymnasts is considerably smoother, tossing but one more complication into the mix.

We could next slice for swimming, another Olympic perennial. I get, in excerpt:

oly24

Here we see both men and women getting taller and heavier, with the paired slopes lifting upwards determinedly. That male average weights have burgeoned 16 pounds between 1972 and 2016 and the male-female weight differential has gradually ascended surely means something, though again some deep thinking about conclusions would be in point.

For a look at a winter sport, let’s slice for speed skating, opened to women in 1960:

oly25

Here the fluctuations are gentler, perhaps remarkably so, along with a small but measurable closing of the gender weight differential. You’ll also note that gymnasts tend to be shorter than both swimmers and speed skaters. But swimmers and skaters need to get to their finish lines sooner.

Comments Leave a Comment
Categories Uncategorized

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 1

26 Aug

In the interests of enhancing my medal prospects, I’m lobbying hard to have freestyle array formula writing approved as an Olympic sport. Preposterous and self-serving, you say? Well you’re surely correct on the second count, but a scan across the events that have, at one time or another, qualified for Olympic standing – all remembered for you on the Olympic athlete data set shelved on the Kaggle web site (sign up for free) – will reestablish my credibility. The set unfurls 270,000 entry records throughout its 21 MB of athletes who’ve mined for the gold and other precious metals and the events in which they hoped to find them, and those events make for a motely collection indeed.

Who’s in training nowadays for the Art Competitions Mixed Literature event, for example? Well, no one, in actuality, as the competitions for the best sport-themed literary submissions were last conducted in 1948, the year when all four of art events categories – art, architecture, music, and literature – were finally edited out of the Olympics. But the Kaggle data set also records all the events of which you’ve heard, too, and its medley of columns provide for a host of interesting findings.

Start with a primordial consideration: the relation of Olympic events to their competitors’ average age and gender. Here a methodological caution need be sounded. Because athletes very often compete in different events and different Olympics, a coarsened look at average ages will of course come to admit the same athletes many times. While it would be possible, on the one hand, to account for athletes uniquely via a Remove Duplicates routine, that recourse would leave the data set with one record per athlete – and only one of his/her ages, a reduction a sight too far. It seems to me, rather, that a plural counting of the athletes – each instance of which nevertheless uniquely permutes each entry per age and event – would work here.

That’s my take anyway, but you’re welcome to download the data and subject them to your own assumptions. In any case, I’d start the analysis with this pivot table:

Rows: Sport

Columns: Sex

Values: Age (average, formatted to two decimals)

(Note that Sport is the superordinate rubric beneath which particular events are then subsumed. Thus the sport Swimming spreads its umbrella above 55 different events, for example.)

I get, in excerpt:

oly1

Yes, Aeronautics was an Olympic sport – once, in 1936, including a gliding event, though apparently only men took part (I don’t know, however, if the sport was gender-specific). There’s lots of sporting exotica in there (Basque Pelot – an assortment of raquet sports – was featured as a demonstration sport in 1924, 1968, and 1992 but had bona fide standing in the 1900 Games), but a close look evinces a very general age parity across genders, if not sports. We do see, and for what it’s worth, that men competitors in sailing and shooting are significantly the older (and the numbers are pretty sizable here, if you retool the averages into counts); and it’s probably noteworthy that the men are nearly two years older than women in the very global Swimming sport, in which the men-women participant numbers stand at 13,345 and 9,850 respectively.

And what about average age by gender by Olympic year, and gender distributions (keep in mind that until 1992 the Winter and Summer games were conducted in the same year)? Substitute Year for Sport and I get in excerpt:

oly2

The strikingly higher averages for the 1932 games need to be explored; Wikipedia points out the economic privations wrought by the Depression pared the 1928 athlete complement by a half four years later. It nevertheless remains to be understood why those who did make their way to Los Angeles, where the games were contested, were the older ones. Perhaps they had to pay their own way, and could better afford the trip.

An additional curiosity is the age trough bottoming for women in the 1972 games (again, the numbers above reflect both Games that year). The screen shot clips these data, but in fact their average age of 25.57 for the 2016 (Summer) games pushes nearly five years higher than that for the Games 44 years earlier. Explanations, anyone? And you’ll note the far flatter arc for mens’ ages across the same span.

In connection with the above, you can also drill down the numbers by Season, the heading beneath which the Summer and Winter games are distinguished. Slide Season under Year and you’ll see, in part:

oly3

Again, a close look is called for here. The pronounced Summer/Winter women’s disparity in the selfsame 1932 competitions may be reconcilable in part by the grand total of 22 female participants in the latter Games, a figure not likely to gladden a statistician’s heart. Yet the impressive men’s Summer margin for that year of more than eight years is founded upon a more workable base of 2,622 and 330 athletes, the latter count compiled for the Winter Games. As for the watershed 1972, the women’s Summer-Winter averages come to 20.53 and 21.79. And if you’re interested in the overall, undifferentiated aggregate gender-age averages, they look like this: Women-23.73, Men-25.56. Of course, those computations have controlled for nothing (e.g. event category), so to speak, but probably mean something just the same. And the total number of entries by gender (remembering that these gather many instances of the same athlete): Women -74,522, Men-196,594.

Nevertheless, those numbers should and could be associated with the years of the competition. The obvious intimation here is that womens’ rates of Olympic involvement could have been expected to rise. Thus, we could proceed:

Row: Year (grouped in tranches of five years)

Columns: Sex

Values: Sex (% of Row Total; turn off Grand Totals, which must invariably yield 100%).

I get:

oly4

Not particularly shocking, but a detailed confirmation of the “obvious” can’t hurt just the same. The proportion of female Olympic athletes has about doubled in the past 40 years, a development heavily abetted by the expansion of events open to women; committed to primitive chart mode, the gender curves look like this:

oly5

And now I need to get back to that letter I’m sending to Nike, the one requesting sponsorship for my array-formula training. Look – it’s either that or crowdfunding.

Comments 2 Comments
Categories Uncategorized

Data Stories: NY City Property Sales

14 Aug

You’re wondering what to do with that discretionary $2,000,000 you have laying around – you know, that sheaf of bills that’s been angling your pillow up to untenable heights. In the interests of a good night’s sleep, then, how about dropping those two big ones on some New York real estate?

It sounds like a prudent bet; even the president of the United States has been known to commit some rocks to the idea, after all, and if that example doesn’t inspire you you may find yourself on the wrong end of an all-caps tweet from you-know-who. So now that you’re properly inspired and keen to know what you can get for your seven digits, you’ll want to consult the data set of the New York’s property sales for the year Feb 1, 2017 through January 31, 2018 – 60295 of them, or 165 a day – brought to you by the folks at Enigma public data, who’ve joined the sales records from a pair of City agencies. You can download the set over there, once you’ve signed into their site (for free).

Shift the default Save As format from CSV over to standard Excel and you’ll wind up with a 13 MB workbook – not a prohibitively weighty file, but with 62 fields that pull all the way into column BJ you may find yourself grabbing onto the scroll bar with prohibitive pique. You may thus want to abolish fields not likely to serve your analysis, starting with borough_code in Column A; you’d doubtless prefer to work instead with the five actual borough names filed in the curiously named borough_code_definition field in B. Other candidates for excision, at least in my view and at least for starters: community_district_definition in column Z, school_district_defintion, and city_council_district_definition. All three in effect reiterate the New York neighborhood names promulgated in the neighborhood field in what was, at least at the outset, housed in column B. Those curtailments kick back a megabyte to my hard drive, and every million counts, after all.

We can next move on to some standard, obvious need-to-knows, e.g, the number of properties sold in each of the city’s boroughs, understanding at the same time that obvious questions don’t always yield obvious answers. We could try this:

Rows: borough_code_definition

Values: borough_code_definition

I get:

nyprop1

You may be surprised by the smallish sales total for Manhattan, as I was. But the boroughs feature different kinds of properties in different proportions. Drag borough_code_definition into Columns and substitute building_class_category_code_definition in Rows, and you’ll get the picture, which looks something like this in excerpt:

nyprop2

Observe the One Family Dwellings entries, and their deeply understated Manhattan representation. While these standard residential homes contribute few jags to the island’s storied skyline they loom rather larger among the byways of New York’s more suburban boroughs.

But what about property values? We could plausibly guess that Manhattan sales will lift themselves atop the listings, even as a thorough appraisal would control for building class, but for a preliminary assay:

Rows: borough_code_definition

Values: sale_price (average, formatted to two decimals with the comma)

That yields:

nyprop3

Our guess is ratified, and unsurprisingly. Now augment the Values area with these two fields, both formatted as above: gross_square_feet and floors_number_of:

nyprop4

Manhattan properties are clearly larger – and higher (note, however, the blanks in the floor field). A 19-floor apartment can have a penthouse; the top one in a three-floored one is called the third floor.

But it next occurred to me that those averages may have absorbed multiple apartment purchases in the same properties, a possibility that could skew the numbers – provided one is concerned to isolate unique parcels.

With that thought in mind I ran a Remove Duplicates at the data, identifying address and zip code as sources of potential redundancy (zip code, in the event properties in two different boroughs had the same address, a possibility). 12251 records were thus uprooted, after which I reran the above pivot table. This time I got:

nyprop5

I was startled by the Manhattan decrement, so much so that I initiated the table twice for verification. Sort the floor numbers largest to smallest in the original data set, however, and you’ll indeed discover repeated sales in the same, very tall properties. Note as well the heightened average sale price for the borough, a figure that needs to be considered carefully; after all, if a single property enjoyed multiple sales at conceivably various price points, only one of them remains here.

And since the data span precisely twelve months, a look at sales by month might prove instructive (I’m back in the original, full dataset here). But here, again, we’ve run across – or afoul – of an Engima protocol, the decision to portray dates, the ones stored in the sale_date field, in textual terms, e.g.

2017-02-01T00:00:00

In fact, because all the dates possess equal character length, a formula such as =VALUE(LEFT(W2,10)) could be put to the task, refashioning the entries into authentic values that could then be formatted in date terms. But a slightly swifter if less elegant workaround might be the thing to do here: since each entry in sale_date bears the suffix T00:00:00 (for whatever reason), selecting the sale_date column and targeting the data with a Replace All Find and Replace routine, e.g.

nyprop6

And that restores quantitative standing to the field’s labels. Subtract T00:00:00 from 2017-02-01T00:00:00, and you’re left with 2017-02-01, a usable date. And it seems to work all 60,295 times.

And once that deed is done, try this:

Rows: sale_date (grouped by Month)

Values: sale_date (count).

sale_date (again, count > % of Column Total)

(Note that January data emanate from 2018.)

I get:

nyprop7

Note the conspicuous shortfall of sales in January, which again, is the most recent of the months. Apart from that discrepancy I’m not sure if a pattern could be properly declared, though the tycoons among you may have something to say on that count.

Now to take the analysis slightly off-piste, what about the average antiquity of the properties sold, that is, years of construction? The information is there, courtesy of the year_built field. Here’s one look at the question:

Rows: borough_code_definition

Values: year_built (average, to two decimals, sans comma)

I get:

nyprop8

Note the relative recency of the Manhattan structures, befitting the borough’s protean impulses. Yet the collective provenance of Staten Island builds are far newer, a likely signifier of its relatively late-in-coming appeal as a residential habitat. By extension, the notably older properties of the Bronx appear to reflect that area’s comparative loss of appeal – but being New York, watch that space.

But bear in mind that the year_built data calls for a measure of review. Sort the field Lowest to Highest, and the commercial garages on 7^th Avenue in Manhattan’s Chelsea district rise to the top. If in fact the facilities were completed in the reported year 1111 – about 800 years before the advent of the car – we need to commend the owners for their foresight.

Also clearly wrong is the attribution for the property in sort position 2 – the two-story home hammered together at 509 Seguine Avenue in Staten Island. Its 1680 time stamp should qualify it for landmark status, but a couple of web sites let us know that the building in fact became available in 1920.

In addition, some pause should be given to the 28 sales of properties dated 1800, most of these ascribed to a batch of condos on offer at 365 Bridge Street in Brooklyn Heights – an elevator building. In fact, it seems it was built in 1929.

And I thought the elevators were autographed by Mr. Otis himself.

Comments Leave a Comment
Categories Uncategorized

World Cup Data: Goal Seeking- and Finding

31 Jul

Like American presidential elections, the World Cup doesn’t seem to end; the four-year interregnum separating those two events seems ever more ostensible; and because some prognosticators have instated Brazil as winners of the 2022 Cup it may already be time to wonder if the tournament should be held at all.

But my latest information is that it’s all systems go, and anyway, Qatar is lovely in November; so in the interests of limning some helpful deep backgrounding of the picture, you may want to download and kick around the data set of all World Cup match outcomes extending through the 2014 go-round, niched here in Kaggle’s repository of holdings.

The set then records all 836 matches contested from the Cup’s inception in 1930 through the immediately previous competition, in relatively self-evident fields, with the exceptions of the Round and Match IDs in columns Q and R. Kaggle maintains that the identifiers are unique, but the Round IDs exhibit a qualified uniqueness, i.e., they appear to signify a certain stage in the tournament (e.g. semi-final) that by definition would have been reached by multiple teams bearing the same id. And the match ids display curiously variable lengths, suggesting a mid-data shift in their coding protocol. The 2014 matches, for example, sport nine-digit identifiers; in 1998 their lengths have shrunk to four characters.

More troublesome is the small but measurable number of redundant game records, signaled by shared match ids. A Remove Duplicates routine earmarking MatchID as the offending field discovered 16 doubled records, which were promptly shown a red card.

Once you’ve stretched all the columns via the requisite auto fit, you can begin to think about what’s interesting in here. What, for example, about putative home field advantage? That vaunted residential edge is something of a legal fiction here; the first record, for example, names France as the home team and Mexico as the visitors, in a 1930 game set in host country Uruguay. But that only nominal imputation spices the question; might even a desultory home team standing impact game outcomes?

Let’s see. Any formula presuming to calculate win percentages needs to reckon with a common soccer/football eventuality – the fact that many games culminate in a draw. As such, we can take over next-available column U, title it Home Win Pct., and enter in U2:

=IF(G2>H2,1,IF(G2=H2,0.5,0))

That simple expression means to ask: if the goal entry in G exceeds the figure in H – that is, if the home team wins, then enter 1 the appropriate U cell. If, however, the values in G and H are identical – signifying a draw – then assign .5 to the cell, the standard evaluation of an outcome in which each team in effect arrogates half a win. Once we copy the formula down U, we can simply add all the numbers and divide the total by 836, the number of records in the data set (remember we deleted 16 of them). The result: a home-team win percentage of 68.42, a disproportion that piques the question as to exactly how home teams are nominated.

For example: in the 1930 debut Cup, Argentina received home-team standing for four of its five matches, its sole “away” status assigned to its role in the final, which it lost to the authentic home team, Uruguay. Mexico, on the other hand, competed under the away rubric for all three of its games that year. And the home team in 1930 – however denoted – won all 18 matches during the tournament.

Explanations notwithstanding – though they do need to be pursued – we can calculate Cup-by-Cup home-team winning percentages via a rather neat deployment of the AVERAGEIFS function.

First, enter a bare section of the spreadsheet and list the years in which the Cup was held, starting with 1930 and coming to a stop at 2014 (I’m commencing in Y3). Once you type 1934, of course, you can drag the remaining years down their column by autofilling their four-year interval, remembering, however, that the Cup was suspended in 1942 and 1946. Then name the Year field in A yr, the winner field in U win, and enter in Y3:

=AVERAGEIFS(win,yr,Y3)

And copy down the Y column.

How does simply averaging the win data – which after all, comprise either a 1, a .5, or a 0 – forward our objective? Contemplate this example: a team winning two games and losing one receives 1, 1, and 0 points for its exertions. Average the three values and the resulting .6667 returns the winning percentage for two wins and one loss.

If we’re happy with that understanding and then proceed to format the results in percentage terms, I get:

Cup1

It is clear that somewhere, perhaps in the 70s, the idea of a home team underwent a rethink; I’m not sure what drove the apparent definitional overhaul, but it seems to have been put into place (for a possible partial accounting see this discussion). We even see an away-team edge gained in the 2010 Cup. I’m happy to entertain surmises about these disparities.

In any case, what about goals – e.g., have their outputs ebbed or surged across the Cups? If we want to figure a winning-to-losing team metric, say the average winning and losing goal total – or really, the average score – by game by Cup, we’ll have to improvise, because those data aren’t expressed in existing fields. A couple of simple formulas should be able to answer our question, however. I’ve moved into column V, called it Win Goals, and jotted in V2:

=IF(G2>H2,G2,H2)

That expression simply declares that if the goal total in G exceeds the one in the corresponding H cell, then return the value in G; otherwise report the number in H. If a game was drawn the logical test will not have been met, of course, but no matter; since in such a case the G and H figures are identical it matters not which one the formula returns.

I next head into to column W, label it Lose Goals, and write what is in effect the flip side of the above formula in W2:

=IF(G2<H2,G2,H2)

Both formulas are copied down their respective columns, of course, and conduce toward this pivot table:

Rows: Year

Values: Win Goals (average, formatted to two decimals)

Lose Goals (same treatment as above)

I get:

Cup2

The marked downturn in goal scoring is associated with the recency of the Cups; indeed, the overall winning-game average of 2.18 goals was last attained in the 1970 tournament, and the average victory margin of three goals in the 1954 contests exceeds the average per-game combined goal total for the last 14 Cups. Average winning margin for all games: 1.51 goals.

And let’s see VAR verify that .51 goal.

Comments 1 Comment
Categories Uncategorized

Hacks of Stacks of Wax: Billboard Top 100 Data, Part 2

16 Jul

The hits keep coming on the Billboard 100 dataset, and its mighty chorus of voces populi (it’s the plural; I checked) sounds an arpeggio of questions our spreadsheet is prepared to answer. Topping the survey, perhaps, is one that is both obvious and most compelling: who’s the most prodigious hit maker? The answer, again, should emerge from the trenchant Data Model Distinct Count query we described last week. It informs a pivot table that should look something like this:

Rows: Performer

Values: Song (Distinct Count)

Sort the results Highest to Lowest.

The listings, at least to this way-behind-the-curve listener, were cause for surprise:

top1001

Pulling away from the pack, and by a couple of orders of magnitude, is the vast vocal catalogue of ditties crooning your way from the Glee television show, its cover versions of other person’s hits splattering all over the charts, but with a curious aggregate brevity. Its 183 unique hits resounded through the rankings for a total of but 223 weeks, if I’ve gotten my filter right; not one-hit wonders, then, but one-week.

But those counts call for a measure of refinements. In addition to the generic Glee Cast appellation, a filtered scan of the data for the artists bearing the name Glee somewhere in their handle reports:

top1002

Filter-confining our pivot table to that expanded Glee complement, I get

top1003

Apart from the fact that I haven’t heard of half of the above collaborators, we’ve boosted the Glee count to 206 unique tracks that some time, somehow, succeeded in booking their place in the top 100.

And of course, the multi-name problem is no idiosyncrasy of the Glee phenomenon. You’ll note a Mr. Presley, whose 53 chart visits essayed in conjunction with his antiquarian colleagues the Jordanaires combine with his 49 solo efforts (we’re apparently not counting his backup singers here). That’s 102 appearances for the troubadour from Tupelo, but we’re not finished. Filter for Elvis Presley, and

top1004

I’m all shook up. (And like you, I have no idea who the Carol Lombard Trio/Quartet was. The movie star was spelled Carole, but so is one of the listings up there.) And by the way, remember that the Billboard 100 data tracks back to August, 1958; but Elvis’ debut hit, “Heartbreak Hotel”, bears a time stamp of February 1956, and so won’t be found here (though four renditions of a song with the same name by others will).

Aim a like filter at the Beatles – that is, the words Beatles, and –

top1--5

Or take James Brown. Soul brother number 1 has 47 entries per his name in stand-alone mode, but filter for all instances of the original disco man and we see:

top1006

You’ll appreciate the problem; a proper census of each and every artist’s top 100 total would appear to require a filter of the sort we’ve applied above, a necessity that, if plied, can’t be directly pivot tabled, in part because a great many songs would need to be counted more than once. You’d need to allot an entry, after all, to each artist enumerated in a tandem hit, e.g. you’d be bidden to assign one hit each to the Beatles and Billy Preston for “Don’t Let Me Down” and “Get Back”. Remember them?

Now the route to another need-to-know metric, the total number of weeks an artist’s offerings have informed the top 100, offers a smoother ride, particularly if you simply need the total:

Rows: Performer

Values: Songs (Count)

Each appearance of a song in the data set amounts to a 1, after all, or one week’s visit to the top 100. Sort the outcomes by Highest to Lowest, and I get, in excerpt:

top1007

Your cultural weltanschauung will dictate your adjective, i.e., the results are interesting, surprising, confirmatory, or dismaying. I am in addition either embarrassed or proud to say I’ve never heard of Kenny Chesney, Keith Urban, and Brad Paisley; that these titans are country and western singers explains my arrant illiteracy in this character-defining matter.

But the complication we uncovered earlier reappears here. If you’re asking after the missing Elvis Presley in the above screen shot, for example, run the Performer filter for the generic Elvis Presley – again, filter for all instances of his name:

top1008

And you’ll see:

top1009

That’s 19 years’ worth of time spent in the top 100. Next filter for all mentions of Elton John:

top10010

A remarkably comparable durability, but again, we haven’t accredited Presley’s pre-August 1958 incursions into the chart.

And just for the record, here’s some other all-mentions-of-an-artist/top-100 week tenures:

The Beatles: 608

Michael Jackson: 726 (none of which factor in the Jackson Five’s 212 weeks, however)

James Brown: 682

U2: 397

Kelly Clarkson: 542

Diana Ross: 626 (but the Supremes qua Supremes, sans any official allusion to Ross, contribute another 299)

Barry White: 175

The erstwhile Beatles in solo-act capacity:

Paul McCartney: 344 (but Wings brings another 201 weeks to the count)

John Lennon: 161

George Harrison: 161

Ringo Starr: 129

But just don’t ask me what any of it means.

And still another need-to-know follow-on takes our analysis to its interrogative crescendo: Which tracks have enjoyed the longest stays (welcome or not) on the Billboard 100?

That question seems to admit of a pretty straightforward answer:

Rows: SongID

Values: SongID (count, of necessity; the field is text)

(Remember that SongID, and not Song, need be applied to the pivot table. SongID imparts a unique identifier to each song, in order to disambiguate multiple versions of the same song.)

I get, in excerpt:

top10011

Remember that SongID concatenates title and artist; and so oblivious am I to all these next big things that I wasn’t sure if the week leader above is entitled Radioactive Imagine by the Dragons, or Radioactive by the Imagine Dragons. I have since learned the latter formulation properly parses song and group; described by Wikipedia as a sleeper hit, Radioactive nevertheless somnambulated across the charts for 87 weeks (a figure Wikipedia corroborates), or about 1 2/3 years. That’s a long snooze; but don’t overlook their Demons, chart-resident for another 61 weeks. In fact, a scan down the list counts 55 songs that persisted somewhere in the top 100 for at least a year.

And I think the only one I know is Unchained Melody, by the Righteous Brothers. Are you amused?

Comments Leave a Comment
Categories Uncategorized

Hacks of Stacks of Wax: Billboard 100 Data, Part 1

2 Jul

You don’t read big data, you analyze it. No one unrolls themselves into their hammock, reaches for their mint julep, and thrills to that page-turner of a 300,000-row data set they’ve been craving to get at all winter. Big data is meant to revel in its bigness, favoring the curious with its possibilities for aggregated, patterned and macro-leveled largesse, and largeness.

But sometimes the revel is in the details. Now and then a big data set comprises a gigantic compound of molecular bits whose very protons might be of sufficient interest to make you put your julep on hold – and I’m thinking about the 59-years of variously memorable hits filling 309,000 rows of the Billboard top 100 workbook, playing its enormous medley here on the data.world site.

As indicated, the Billboard set recollects its chart-toppers all the way back to August, 1958, and if you’re just bursting to know for exactly how many weeks “She Loves You” oooed its way into the listings – and you probably are – or precisely when the epochal “Rapper’s Delight” first hip-hopped onto the rankings and your consciousness (15, and the week of November 10, 1979, respectively; but remember that the Beatles’ own German cover version “Sie Liebt Dich” also checked in for a week at 97 in June, 1964), you’ve assuredly come to the right place.

I don’t know about you, but I think the Billboard data – all 21.6 megabytes of it (i.e., you’ll have to download it yourself) – makes for a cracking good read – but it’s a spreadsheet, after all, and so some intriguing global findings should be in there, too. But as usual, the data need some inspecting before the work gets underway.

Note, for example, that the Peak Position and Weeks on Chart fields installed in columns I and J are, at least in theory, dispensable; one could derive both findings from a pivot tabling of the songs, subjecting Peak Position to a Min in Values, and then applying the song titles themselves to Values, realizing a count that would deliver a Weeks on Chart equivalent. That sparer approach would relieve the data of a slew of redundant entries, e.g. a song’s peak position appears identically for each week in which it appears.

If you’re wondering about the Instance field and what it means, you’re not alone. I originally supposed that it counts the number of times the same chart-bound song was performed by different artists (I use the term loosely), but that conjecture proved a false take. Rather, Instance seems to number a given version’s non-contiguous revisits to the charts. For example, Nicky Jam’s El Amante – a performer and song whose identities draw a pair of blanks in my uncomprehending mind – exhibits six instances; its debut at position 99 in the week of February 18, 2017 was succeeded by its disappearance the following week, only for the tenacious ditty to stage a three-week comeback dating from the week of March 4. Continuing to loll in the high 90s, El Amante submerged once again, before clambering back into 98 on April 4, etc. It last held its place in the rankings until the week of September 2, 2017, concluding its sixth instance – before it fell back into the oblivion it likely deserved.

Note in addition the SongID field, a unique identifier crafted by a concatenation of the entries in Song and Performer. Slightly curious is the retention of the formulas in their cells; their work has been completed, and could be paved over with a Paste > Values routine, an austerity move that serves to reduce the file’s size to 19.5 MB.

And if you’re wondering what purpose a song id might fulfill – that is, what analytical need would spur the assignment of an id to each song – I can think of at least one, one that returns us to an exigency with which I’ve contended before, and not optimally, as it turns out.

If we want to learn how many discrete songs clambered into the top 100 for any particular year we need – again – to do something about the recurring weekly appearances of the same songs, songs we want to count exactly once. I had expressed a similar wish, for example, in my posts on the Supreme Court Voting data, in which I wanted to count unique cases heard by the Court per year. I developed the count by embedding case data into the Rows area, where of course they’re enumerated but one time each. I then moved to analyze that satellite table instead.

But I’ve since learned that the above exertion is unnecessary, thanks to Excel frontliner Chandoo. He inclined my attention to an unassailably more elegant maneuver, that works like this:

But before I demonstrate, recall what I’m aiming to do: I want to pivot table a tabulation of the number of unique songs crashing the charts by year, and as such a prior step need be enacted upon the data before I set the table – I need to release year information from the WeekID field in B. That intention can be effected in several ways, but in the interest of simplicity I’ll scamper to next-available column K, call it Year, and enter in K2:

=LEFT(B2,4)

And copy down the column. That simple device releases the first four characters from each week id, which in every case offers up the year of the song’s chart entry (WeekId is text-formatted, by the way).

When the venerable Create Pivot Table dialog box opens, tick the Add this data to the Data Model box at its lower left (and someone tell the folks in Redmond it should be these data):

bill1

That tick activates Excel’s Data Model (which first made itself freely available in the 2013 release), an add-in that enables a number of data-querying enhancements, including the potential for building relational pivot tables. But our interest here is in those unique song titles, and so once you’ve executed the tick and the Data Model loads, advance to the pivot table (notice the slightly modified field list drawn up by the Data Model) and earmark Year for the Rows area. Next show SongID into Values, right-click into Summarize Values by, click More Options… scroll down and…

bill2

Wow – Distinct Count; what a concept. Click it, click OK, and I get (in excerpt):

bill3

(Note that the 1958 data are partial, encompassing only the last five months of that year. The 2017 listings extend to the end of October.) Subjected to a rudimentary line chart, the slope looks like this:

bill4

I’m not sure what sociological conclusions beg our attention, but the peak in song numbers in the 60s is marked, as is the decided slump in the ensuing years.

Put it this way: There’s something happening here/What it is ain’t exactly clear.

“For What It’s Worth”; hit the charts the week of January 28, 1967, and stayed there for 15 weeks. Peak position: 7.

Comments Leave a Comment
Categories Uncategorized

The Verdict on Cook County Court Sentencing Data, Part 2

18 Jun

There’s plenty to learn from the feature-rich Cook County sentencing dataset, and so let the learning commence. We could, by way of first look at the data, examine what the set calls SENTENCE_TYPE, a self-evident header whose entries could be additionally broken out by another field, SENTENCE_DATE:

Rows: SENTENCE_TYPE

Columns: SENTENCE_DATE (grouped by year; see the previous post’s caution about the date-grouping challenges peculiar to these data.

Values: SENTENCE_TYPE (% of Column Total, as you turn Grand Totals off for the table’s column, which must necessarily come to 100%.)

I get:

cook21

(Bear in mind that the numbers for the 2010 and 2018 are very small, a concomitant of their fraction-of-the-year data representation.) Not surprisingly, Prison – the lengths of its terms unmeasured in this field, at least – accounts for the bulk of sentences, and rates of imprisonment verge towards virtual constancy across the represented years.

Probations, however, describe a bumpier curve, peaking in 2014 but exhibiting markedly lower proportions in both 2010 and 2018 (to date). Those wobbles, doubtless both statistically and criminologically significant, call for a look beyond the numbers we have.

I’ve been unable to turn up a denotation for the Conversion sentence type, but I can state that the zeros, first appearing in 2011, are actual nullities, and not an artifact of the two-decimal roundoff. Slightly more bothersome to this layman, though, was the Jail sentence; I didn’t know if that event discreetly named a particular mode of punition, or merely, and wrongly, reproduces Prison sentences under a second heading. It turns out that the former conjecture is in point, and that jail time is typically imposed for shorter sentences or individuals awaiting trial (see this clarification, for example. For an explication of the Cook County Boot Camp, see this piece).

The natural – and plausible – follow-on would associate sentence types and their distributions with types of crimes, but the presentational challenge proposed by a keying of the sentences to more than 1500 offense titles very much calls for a considered approach, indeed, one that could perhaps essayed with a Slicer populated with the offense titles. And while that tack will “work”, be prepared to scroll a long way until the offense about which you want to learn rises into the Slicer window. But such is the nature of the data.

And in view of that profuseness one could, perhaps, engineer a more practicable take on the matter, for example, by inaugurating a new pivot table, filling the Rows area with say, the top 20 offense types (right-click among Rows and click Filter > Top 10, entering 20 in the resulting dialogue box). Sort the types highest to lowest, line the Columns area with the Sentence Types, drop Offense_Type again into Values, and view them via the % of Column Total lens.

A screen shot here would be unacceptably fractional in view of the table’s width, but try it and you’ll see, for example, that 93.41% of the convictions for unlawful use or possession of a weapon by a felon resulted in a prison sentence, whereas perpetrators guilty of theft – however defined in this judicial context – incurred prison but 36.41% of the time, along with a 45.69% probation rate.

More troubling, however, is the small but measurable number of death sentences that appear to have been imposed on individuals sentenced for crimes not typically deemed capital. For example, .04% of the convictions for the possession of cannabis with intent to deliver/delivery of cannabis have drawn the death penalty, as have .05% of forgery convictions. These legal reprisals don’t ring true, and surely obligate the journalist in a redoubled scrutiny, if only to confirm their accuracy, and/or their propriety.

If you’re looking for other fields to irrigate, sentence length, expressed here as COMMITMENT_TERM, likewise begs for correlation with other fields; but here a new roadblock stalls the journey. Inspect the field and the adjoining COMMITMENT_UNIT column and the obstruction will loom large. Because the units present themselves in a raft of different durations, the correlations can’t proceed until some manner of unit reconciliation is brought to the data.

My early sense about the COMMITMENT_UNIT field told me that the disparities merely juxtaposed years to months; and were that the case, a simple formula could be stamped atop an new field, one in which either years could in effect be multiplied by or months divided by that value, e.g. a six-year sentence could be translated into 72 months.
But in fact, the units are a good deal more numerous and qualitatively varied than that. Turn a filter on the data set and click the arrow for COMMITMENT_UNIT. You’ll see:

cook22

While it would be possible to construct an equivalence lookup table for the chronological units enumerated above, e.g., one month rephrased as 720 hours, a sentence delivered in monetary terms – dollars – can’t be subjected to a like treatment. And a sentence of Natural Life – presumably an indefinite, open-ended prison stay – is similarly unavailable for equating. Moreover, I have no idea what the two records declaring sentences of “pounds” – 30 of them for a criminal trespass of a residence, and 2 for driving with a suspended or revoked license, and both pronounced in Cook County District 5 – Bridgeview – can mean. And you may note that 19 sentences comprising 365 days each were issued as well; how these distinguish themselves from one-year terms is unclear to me. Nor do I understand the 1526 sentences consisting of what are described as Term.

On the one hand, of course, the data set can’t be faulted for admitting all these “discrepancies” into its fold; they’re perfectly valid and pertinent records. On the other hand, they cannot, by definition, be forced into comparability with the other entries; they’re oranges to the predominating crop of apples.

The simple way out, of course, would be to sort out and excise the non-chronologicals and proceed, and on a blunt practical level that stratagem might work. But it would work here for the simple empirical reason that those incongruities are few, and as such, would not compromise the greater body of data. But what if these irregulars were formidably populous, and hence unavoidable? What would we do with them?

That is a good and interesting question.

Comments 1 Comment
Categories Uncategorized

The Verdict on Cook County Court Sentencing Data, Part 1

29 May

You can trace the course of justice in Chicago, including the direction and speed at which it travels, at the fledgling Cook County Government Open Data portal, a site brought to its URL at the behest of Kimberly Foxx, Illinois State’s Attorney for the county in which the city of the big shoulders shrugs. Four of the portal’s holdings – Initiation, Dispositions, Sentencing, and Intake – chronologize the dispositions of cases processing through the system; I chose Sentencing for my look here.

It’s a big data set for a big city, recalling as it does sentencing records dating back to January 2010 and pulling through December 2017. With 189,000 cases and a field complement stretching to column AJ, don’t even think about calling it up in Google Sheets (the data-supporting capacity there: two million cells), but Excel is agreeable to its 41 megabytes if you are, and it’s available for download from the second above.

And at 41 megs, the minimalist in you will be avid to put your scissors to fields that might be rightly deemed dispensable. Cases in point: the four ID parameters fronting the data set in columns A through D, none of which are likely to advance your reportorial cause (note, by the way the interjection of commas into the large-sized identifiers, an unusual formatting fillip). Deleting the fields and their750,000 or so entries actually slimmed my workbook down to a lithe 29.7 mb, and that’s a good thing.

You may also note the slightly extraneous formatting besetting the INCIDENT_BEGIN_DATE, RECEIVED_DATE, and ARRAIGNMENT_DATE fields, their cells bearing time stamps all reading 0:00. I suspect these superfluities owe their unwanted appearances to the data in the ARREST_DATE field, which do exhibit meaningful times of suspect apprehension. We’ve seen this kind of excess before, but again it’s proper to wonder if any of it matters. If, after all, it’s your intention to re-present the data in pivot table form, for example, you’ll attend to any formatting disconnects there, and not here. If so, a reformatting of the data source may be no less superfluous.

But whatever you decide we can proceed to some analysis, acknowledging at the same time the scatter of blank cells dotting the records. Given the welter of substantive fields in there, quite a few possibilities beckon, and we could start by breaking out types of offenses by year, once you answer the prior question submitting itself, i.e. which of the available date parameters would be properly deployed here? I’d opt for ARREST_DATE, as it affords a kind of read on Chicago’s crime rate at the point of commission – or at least the rate of crimes culminating in arrest, surely a different and smaller-sized metric.

But if you’re thinking about installing the arrest dates into the column area, think twice – because the dates accompanied by their time-stamps are sufficiently granulated that they surpass Excel’s 16,384- column frontier. You’ll thus first have to swing these data into the Rows area, group them by Year, and only then can you back them into Columns, if that’s where you want them stationed.

And that’s what I did, only to be met up with a surprise. First, remember that Excel 2016 automatically decides upon a (collapsible) default date grouping by year, like it or not; and when I corralled the arrest dates into Rows I saw, in excerpt:

cook1

Now that ladder of years seems to be fitted with a column of rickety rungs. Remember that the sentence data appear to span the years 2010-2017, and so the aggregates above hint data entry typos, and at least some of them – e.g. the 1900 and 1915 citations – doubtless are.

The additional point, however, is that some of these putative discrepancies might tie themselves to crimes that were in fact brought to the attention of the justice system well in the past, and that took an extended while before they were actually adjudicated. Remember that our data set archives sentences, and some criminal dispositions take quite some time before a sentence is definitively pronounced.

For example, the 12 sentences associated with arrests made in 1991 reference serious crimes – seven murder or homicide charges, one armed robbery, one unlawful use of a weapon charge, one robbery and two thefts. One of the thefts, however, records an incident-began date (a separate field) of November 17, 2013, and thus appears to be erroneous.

But in any event, since our immediate concern is with arrests carried out in the 2010-17 interval I could click anywhere among the dates and proceed to group the data this way:

cook2

Note that I’ve modified the Starting at date to exclude the pre-2010 arrests, be they errantly captured or otherwise. Now after I click OK I can drag the years into the Columns area, after filtering out the residual <1/1/2010 or (blank) item.

Now I can drag OFFENSE_TITLE into Rows.

Surprise. With 1268 Offense categories cascading down the area you’ll have your work cut out for you, once you decide what to do next. Do you want to work with the data as they stand, or collapse near-identical types, and vet for misspellings along the way? Good questions – but in the interests of exposition we’ll leave them be.

How about something more immediately workable then, say age at incident? Exchange AGE_AT_INCIDENT for OFFENSE_TITLE, filter out the 2300 blanks, and group the ages by say, 3 years. Reprise AGE_AT_INCIDENT into Values (count). I get:

cook3

We see an extremely orderly negative association between age and arrests, with only the 20-22 tranche exceeding its predecessor bracket among the grand totals and only slightly. You’ll also observe that the numbers for 2017 are far smaller than the previous years, a likely function of incomplete data. In addition, track down to the Grand Totals row and behold the very significant ebbing of overall arrest totals from 2013 to 2016. Again, our table records arrest, and not crime totals, but the two likely point the same way – unless one wants to contend that the downturn in the former owes more to policing inefficiencies that any genuine diminution in crime – a not overwhelmingly probable development.

I’d then move to a Show Values As > % of Column Total look to learn how the brackets contribute differentially to arrests:

cook4

(The zeroes at the lowest reaches of the table reflect round-offs.)

Among other things, note the considerable, relative pull-back in arrests of suspects in the 17-19 range.

No, I don’t have an explanation at the ready for that, but perhaps you do.

Comments Leave a Comment
Categories Uncategorized

Airbnb Data, Part 2: A Tale of Three Cities

14 May

There’s such a thing as overstaying your welcome, but Airbnb landlords – and in effect that’s what they sometimes appear to be – may be prepared to happily abide your long-term self, or selves, in their abode.

And that heartening show of hospitality may be illegal. They take a dim view of Airbnb’s good neighbor policy in London, for but one example, where the powers-that-be aren’t thrilled about the kind of lucrative serial subletting some Airbnbers perpetrate, straitening as it does the market for folks who’d prefer to hang tight in an actual, lease-driven apartment.

The long and the short of it then, is that it a review of Airbnb property availabilities – defined as the number of days in a year in which a given room remains on offer – could prove instructive, and our data for New York, London, and Paris devotes a field to just that question.

The analysis, then, should proceed pretty straightforwardly, once we do something about the startingly sizeable count of rooms – about 55,000 – that presently aren’t to be had. That is, their value in their dataset’s availability_365 field states 0, indicating that, for now at least, that room has been withheld from the market. An email from Inside Airbnb compiler Murray Cox informed me that zero means the property’s next 365 days in its calendar (presumably its start date is a moveable inception point, which Cox finds in the the scrape_date field in a different dataset) aren’t in play, at least temporarily.
And as such, those zeros – which are, after all, values that would contribute to and very much confound any formula result – have to be barred from the data set. Here I rely on my venerable highest-to-lowest sort of the availability_365 field, relegating the zeros to the bottom of the set; and once put in their place, an interpolated blank row immediately above the first zero will detach them from the usable data, for now (of course they can be recalled if needed via a simple deletion of the blank row).

And all that enables us to proceed here:

Rows: City

Values: availability_365 (Average, formatted to two decimals)

I get:

airb1

Real city variation is in force; and recall the linked article above, the one reporting the December 2016 London-approved bill “limiting Airbnb hosts to renting their property for only 90 days”. Looks as if a few thousand room owners in that city haven’t read the news lately.

We could next cross-tab the results by room type, by rolling room_type into Columns:

airb2

All the cities trend in the same direction, though not identically – itself a differentiation worth pursuing, perhaps. Availability widens as the rental space constricts, with shared rooms – defined by Airbnb as those in which “Guests sleep in a bedroom or a common area that could be shared with others”, presumably humans, who presumably might or might not be actually residents of the property – freed up for a considerably larger fraction of the year.

And the results make sense – even common sense, perhaps. Entire homes and apartments need be empty, by definition, and if so, where would their owners be expected to go for the duration of the rental?

That’s a good question, one that directs itself to one of flashpoints of the Airbnb controversy. Are its hosts the kinds of proprietors who hoard multiple listings that might otherwise be released to the conventional rental/purchase market?

A few sure-footed steps toward an answer would require us to divide all the rentals in a city by its number of hosts, i.e., an average of properties per host; and that simple division exercise needs to fill its denominator with a unique count of hosts, thus returning us to a problem with which we’ve tangled before. To reiterate it: an owner of multiple properties will naturally appear that many times in the data set, that is, once each for each holding, even as we want him/her here to appear once. In light of that complication I think the neatest way out this time is to conduct a Remove Duplicates maneuver (Data ribbon > Data Tools), and ticking the host_id field, the parameter whose entries contains the duplicates we want to shake out (again, you may want to save these results to a new workbook precisely because you’re shrinking the original data set).

But how do the host ids, once in their respective, solitudinous states facilitate a calculation of the number of properties they own, on average? Here’s how: once we’ve identified each host id singly, we can average the calculated_host_listings_count in column P via a city pivot table breakout. That field, which restates the number of each host’s holdings in each of his/her record entries, is one I would have deemed redundant to the data set’s design. After all, the owner-property count could otherwise be derived when needed, for example, via a pivot tabling of the hosts id, delivering the field to both the Rows and Values areas. But because we’ve removed all host id duplicates, that plotline has to be red-penciled – and that’s where the calculated_host_listings_count comes to salvage the script:

Rows: Country

Values: calculated_host_listing_count (Average, to two decimals)

I get:

airb3

We see then, that Airbnb hosts are for the most part single-property marketers, at least for the cities we’ve gathered. For those interested in more detail, we could try this:

Row Labels: calculated_host_listing_count

Columns: City

Values: calculated_host_listing_count (Count, % of Column Total, formatted in percentage terms to two decimals)

I get, in excerpt:

airb4

You get the idea, though we see London owners are notably more likely to offer multiple properties.

Speaking of which, croll to the bottom of the table above and you’ll find a 711, signifying the voluminous, apparent holdings of a fellow named Tom in London. But When I returned to our original, entire Airnbnb dataset, including the rooms for which availability was set at 0 days, I discovered but 350 properties strewn about London associated with his name.

Now Tom’s the kind of person Murray Cox wants us to know about; he owns so many properties that he’s lost track of half of them.

Comments Leave a Comment
Categories Uncategorized

Airbnb Data, Part 1: A Tale of Three Cities

27 Apr

Would you rent your apartment to me? I have references from real people, don’t smoke, clean up after myself (is my nose growing?), and probably can be counted on not to trash your living room and adjoining vicinities.

Still don’t want to take my scratch for your pad? See if I care; there are plenty of other flats out there where yours came from.

Too many, in fact, according to Murray Cox, the self-identified “data activist” whose researches into Airbnb’s rental listings opened the door on a warehouse of dodgy practices, in numerous localities, e.g. property owners who market multiple dwellings, a clear and present violation of New York state law. Cox maintains that, among other things, the outsized scale of Airbnb offerings can worrisomely constrict a city’s available rental stock, and has published a city-by-city inventory (brought to my attention by a student) of Airbnb listings that you and I can download in most convenient spreadsheet form (look for the Summary Information and metrics for listings in… link attaching to each city).

It occurred to me that, among other potential takeaways, an intercity comparison of Airbnb activity might advance the journalistic cause a mite. I thus downloaded the data for New York, London, and Paris, all nicely exhibiting the same fields. With the intention of copying and pasting the data to a single workbook I cleared out a new column to the left of A, called it City, and entered and copied down the respective names of the three locations, properly lined up with their data once pasted, culminating in 162,701 rows of data, its 20 megabytes just itching to tell me what Airbandb has been up to.

Of course, the three-city amalgamation means to prime the data for a range of comparisons, but some comparisons don’t avail. I’m thinking in the first instance about the price field in what is now column K. These entries presumably cite daily rental rates, but express themselves in disparate currencies – dollars, pounds, and euros. One supposes an exceedingly determined investigator could mobilize and apply a round of currency equivalences to the records, a prospect that would require a vast compiling of date-specific rate fixes in short, a challenge likely to a real-world, deadline-mindful journo. I’ve thus elected to leave the numbers as they stand, and if that touch of laissez-faire works against the analysis I have no one to blame but myself. The buck stops here – and maybe the euro, too.

In any case, before we get fancy, we can think about this self-effacing pivot table:

Rows: City

Values: City (Count, by definition for a textual field)

I get:

air1

We see that Paris – by far the smallest of the three cities – nevertheless advertises the largest number of Airbnbs. An accounting for that disjuncture would probably justify a deeper look. Might tourist cachet or friendlier legalities spur the Paris margin? I don’t know. But consider that, juxtaposed to Paris’ population of around 2.25 million and its average household size of approximately 2.3 persons, the city’s Airbnb stock could house around 6% of its residents – with the point, of course, that the inventory is apparently being withheld from the permanent-residence rental market.

Other incomparables have their place among the data, too. There’s little comparing to do as such among the three cities’ neighborhoods, and indeed – the neighbourhood group (UK spelling) field for Paris and London is utterly empty (the field for New York comprises the city’s five boroughs).

But of course other workable comparisons are available. What, for example, about average minimum stay requirements by city and type of rental? We could try this:

Rows: City

Columns: room_type

Values: minimum_nights (Average, formatted to two decimals)

I get:

air2

We see that diffident London Airbnbers expect notably briefer stays at their places on average, with those uppity Parisians insisting that you agree to set down your knapsack – and debit card – more than a day-and-a-half longer before they let you in. At the same time, New York’s shared-room minimum is disruptively anomalous.

And for more evidence of cross-cultural heterogeneity – if that’s what it is – flip the values into Count mode and hand them over to the Show Values As > % of Row Total, ratcheting the decimals down to zero and switching the Grand Totals off (because the rows must invariably figure to 100%). I get:

air3

The overwhelming Paris proportion devoted to the Entire home/apt offering is not, I would submit, proof positive of yet one more Gallic quirk, but rather a simple function of the Paris housing stock, in which apartments predominate.
For additional, if glancing, corroboration, try this pivot table:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (tick New York)

Values: neighborhood_group (Count, % of Row Total)

I get:

air4

Recall that New York is the only city among our trio whose neighborhood group field is actually occupied with data – the names of its five boroughs. Note the relative Manhattan tilt towards Entire home/apt, even as the other boroughs, whose housing range features far more private homes, incline towards Private room – that is, presumably one private room among the several bedrooms in a home.

And what of daily price by city, keyed as it doubtless is to room type? It looks something like this:

Rows: City

Columns: room_type

Values: price (Average, to two decimals)

I get:

air5

Again, that imperative qualification – namely, that the prices reflect evaluations per their indigenous currencies – need be kept in mind. As such, the New York tariffs verge much closer to the London figures when the appropriate, albeit variable, pound-to-dollar conversion is applied. With those understandings in place, the Paris Entire home/apt average seems strikingly low – because the Euro consistently exhibits a “weaker” relation to the pound, the former as of today equaling .88 of the latter. Yet at the same time, Paris’ private room charge would appear to be effectively higher.

Now again, because the data are there, we could compare average prices for New York’s boroughs:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (New York)

Values: price (Average)

I get:

air6

No astonishments there, not if you’re a New Yorker. Manhattan expectably heads the rate table, though Staten Island’s second-place Entire home/apt standing may issue a momentary pause-giver, along with its basement-dwelling (pun intended) shared room rate.

That’s $7,300 a month for an entire place in Manhattan. And wait until you see the interest on the mortgage.

Comments Leave a Comment
Categories Uncategorized

← Older Entries

Newer Entries →

Search

spreadsheetjournalism

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 2

Tracking the Fields in Track and Field: Olympic Athlete Data, Part 1

Data Stories: NY City Property Sales

World Cup Data: Goal Seeking- and Finding

Hacks of Stacks of Wax: Billboard Top 100 Data, Part 2

Hacks of Stacks of Wax: Billboard 100 Data, Part 1

The Verdict on Cook County Court Sentencing Data, Part 2

The Verdict on Cook County Court Sentencing Data, Part 1

Airbnb Data, Part 2: A Tale of Three Cities

Airbnb Data, Part 1: A Tale of Three Cities

Follow Spreadsheet Journalism via Email

Monthly Posts

My Pivot Table Chapter