Airbnb Data, Part 2: A Tale of Three Cities

14 May

There’s such a thing as overstaying your welcome, but Airbnb landlords – and in effect that’s what they sometimes appear to be – may be prepared to happily abide your long-term self, or selves, in their abode.

And that heartening show of hospitality may be illegal. They take a dim view of Airbnb’s good neighbor policy in London, for but one example, where the powers-that-be aren’t thrilled about the kind of lucrative serial subletting some Airbnbers perpetrate, straitening as it does the market for folks who’d prefer to hang tight in an actual, lease-driven apartment.

The long and the short of it, then, is that a review of Airbnb property availabilities – defined as the number of days in a year in which a given room remains on offer – could prove instructive, and our data for New York, London, and Paris devote a field to just that question.

The analysis, then, should proceed pretty straightforwardly, once we do something about the startlingly sizeable count of rooms – about 55,000 – that presently aren't to be had. That is, their value in the dataset's availability_365 field is 0, indicating that, for now at least, the room has been withheld from the market. An email from Inside Airbnb compiler Murray Cox informed me that zero means the property's next 365 days in its calendar (presumably its start date is a moveable inception point, which Cox finds in the scrape_date field in a different dataset) aren't in play, at least temporarily.
And as such, those zeros – which are, after all, values that would contribute to and very much confound any formula result – have to be barred from the data set. Here I rely on my venerable highest-to-lowest sort of the availability_365 field, relegating the zeros to the bottom of the set; and once put in their place, an interpolated blank row immediately above the first zero will detach them from the usable data, for now (of course they can be recalled if needed via a simple deletion of the blank row).
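
(A formula check is available here, too. A minimal sketch, assuming the combined data run down to row 162702, with City in A and availability_365 in – presumably – Q; your own column letters will adjudicate. Then

=COUNTIF(Q2:Q162702,0)

should return the roughly 55,000 withheld rooms, while

=AVERAGEIFS(Q2:Q162702,A2:A162702,"London",Q2:Q162702,">0")

averages London's availability with the zeros ignored outright – an alternative that sidesteps the sort-and-blank-row maneuver altogether.)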

And all that enables us to proceed here:

Rows: City

Values: availability_365 (Average, formatted to two decimals)

I get:

airb1

Real city variation is in force; and recall the linked article above, the one reporting the December 2016 London-approved bill “limiting Airbnb hosts to renting their property for only 90 days”. Looks as if a few thousand room owners in that city haven’t read the news lately.

We could next cross-tab the results by room type, by rolling room_type into Columns:

airb2

All the cities trend in the same direction, though not identically – itself a differentiation worth pursuing, perhaps. Availability widens as the rental space constricts, with shared rooms – defined by Airbnb as those in which "Guests sleep in a bedroom or a common area that could be shared with others", presumably humans, who might or might not actually be residents of the property – freed up for a considerably larger fraction of the year.
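
(To spot-check any cell of that cross-tab against the source data, an AVERAGEIFS with stacked criteria will serve – again assuming City in A, room_type in, presumably, J, and availability_365 in Q:

=AVERAGEIFS(Q2:Q162702,A2:A162702,"Paris",J2:J162702,"Shared room",Q2:Q162702,">0")

The ">0" term keeps the result faithful to the zero-purged data feeding the pivot table.)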

And the results make sense – even common sense, perhaps. Entire homes and apartments need be empty, by definition, and if so, where would their owners be expected to go for the duration of the rental?

That’s a good question, one that directs itself to one of flashpoints of the Airbnb controversy. Are its hosts the kinds of proprietors who hoard multiple listings that might otherwise be released to the conventional rental/purchase market?

A few sure-footed steps toward an answer would require us to divide all the rentals in a city by its number of hosts, i.e., an average of properties per host; and that simple division exercise needs to fill its denominator with a unique count of hosts, thus returning us to a problem with which we've tangled before. To reiterate: an owner of multiple properties will naturally appear that many times in the data set – once for each holding – even as we want him/her to appear here just once. In light of that complication I think the neatest way out this time is to conduct a Remove Duplicates maneuver (Data ribbon > Data Tools), ticking the host_id field, the parameter whose entries contain the duplicates we want to shake out (again, you may want to save these results to a new workbook, precisely because you're shrinking the original data set).
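
(A formula-driven alternative that leaves the data set intact: the venerable unique-count idiom. Assuming host_id occupies column D of the combined sheet – check your own layout – then

=SUMPRODUCT(1/COUNTIF(D2:D162702,D2:D162702))

counts each host exactly once (give it a moment; that's a heavy computation across 162,000 rows, and any blank host_id cells would need clearing first), while

=COUNTA(D2:D162702)/SUMPRODUCT(1/COUNTIF(D2:D162702,D2:D162702))

divides listings by unique hosts – though it pools all three cities, leaving the city-by-city breakout to the pivot table below.)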

But how do the host ids, once in their respective, solitudinous states, facilitate a calculation of the number of properties they own, on average? Here's how: once we've identified each host id singly, we can average the calculated_host_listings_count in column P via a city pivot table breakout. That field, which restates the number of each host's holdings in each of his/her record entries, is one I would have deemed redundant to the data set's design. After all, the owner-property count could otherwise be derived when needed, for example, via a pivot tabling of the host ids, delivering the field to both the Rows and Values areas. But because we've removed all host id duplicates, that plotline has to be red-penciled – and that's where calculated_host_listings_count comes to salvage the script:

Rows: City

Values: calculated_host_listings_count (Average, to two decimals)

I get:

airb3

We see, then, that Airbnb hosts are for the most part single-property marketers, at least for the cities we've gathered. For those interested in more detail, we could try this:

Rows: calculated_host_listings_count

Columns: City

Values: calculated_host_listings_count (Count, % of Column Total, formatted in percentage terms to two decimals)

I get, in excerpt:

airb4

You get the idea, though we see London owners are notably more likely to offer multiple properties.
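
(Any of those percentages can be audited with a COUNTIFS ratio run against the deduplicated sheet – the single-listing share for London, for example, assuming City still sits in A and calculated_host_listings_count in P:

=COUNTIFS(A:A,"London",P:P,1)/COUNTIF(A:A,"London")

Format the cell as a percentage and it should reproduce the table's first London entry.)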

Speaking of which, scroll to the bottom of the table above and you'll find a 711, signifying the voluminous, apparent holdings of a fellow named Tom in London. But when I returned to our original, entire Airbnb dataset, including the rooms for which availability was set at 0 days, I discovered but 350 properties strewn about London associated with his name.

Now Tom’s the kind of person Murray Cox wants us to know about; he owns so many properties that he’s lost track of half of them.


Airbnb Data, Part 1: A Tale of Three Cities

27 Apr

Would you rent your apartment to me? I have references from real people, don’t smoke, clean up after myself (is my nose growing?), and probably can be counted on not to trash your living room and adjoining vicinities.

Still don’t want to take my scratch for your pad? See if I care; there are plenty of other flats out there where yours came from.

Too many, in fact, according to Murray Cox, the self-identified “data activist” whose researches into Airbnb’s rental listings opened the door on a warehouse of dodgy practices, in numerous localities, e.g. property owners who market multiple dwellings, a clear and present violation of New York state law. Cox maintains that, among other things, the outsized scale of Airbnb offerings can worrisomely constrict a city’s available rental stock, and has published a city-by-city inventory (brought to my attention by a student) of Airbnb listings that you and I can download in most convenient spreadsheet form (look for the Summary Information and metrics for listings in… link attaching to each city).

It occurred to me that, among other potential takeaways, an intercity comparison of Airbnb activity might advance the journalistic cause a mite. I thus downloaded the data for New York, London, and Paris, all nicely exhibiting the same fields. With the intention of copying and pasting the data to a single workbook I inserted a new column to the left of A, called it City, and entered and copied down the respective names of the three locations, properly lined up with their data once pasted, culminating in 162,701 rows of data, its 20 megabytes just itching to tell me what Airbnb has been up to.

Of course, the three-city amalgamation means to prime the data for a range of comparisons, but some comparisons don't avail. I'm thinking in the first instance about the price field in what is now column K. These entries presumably cite daily rental rates, but express themselves in disparate currencies – dollars, pounds, and euros. One supposes an exceedingly determined investigator could mobilize and apply a round of currency equivalences to the records, a prospect that would require a vast compiling of date-specific rate fixes – in short, a challenge likely to deter a real-world, deadline-mindful journo. I've thus elected to leave the numbers as they stand, and if that touch of laissez-faire works against the analysis I have no one to blame but myself. The buck stops here – and maybe the euro, too.
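
(Were one nevertheless determined, the mechanics are simple even if the rate-gathering isn't. A sketch in a spare column, with placeholder rates – the 1.41 and 1.24 below are illustrative stand-ins, not researched figures – converting the column-K prices to dollars:

=IF(A2="New York",K2,IF(A2="London",K2*1.41,K2*1.24))

Copy it down and the prices speak a single currency; the hard part, again, is defending the rates.)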

In any case, before we get fancy, we can think about this self-effacing pivot table:

Rows: City

Values: City (Count, by definition for a textual field)

I get:

air1

We see that Paris – by far the smallest of the three cities – nevertheless advertises the largest number of Airbnbs. An accounting for that disjuncture would probably justify a deeper look. Might tourist cachet or friendlier legalities spur the Paris margin? I don’t know. But consider that, juxtaposed to Paris’ population of around 2.25 million and its average household size of approximately 2.3 persons, the city’s Airbnb stock could house around 6% of its residents – with the point, of course, that the inventory is apparently being withheld from the permanent-residence rental market.
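
(The back-of-envelope arithmetic:

=59000*2.3/2250000

evaluates to just about 6%, with the 59,000 standing in for whatever Paris listing count your pivot table reports.)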

Other incomparables have their place among the data, too. There's little comparing to do as such among the three cities' neighborhoods; indeed, the neighbourhood_group (UK spelling) field for Paris and London is utterly empty (the field for New York comprises the city's five boroughs).

But of course other workable comparisons are available. What, for example, about average minimum stay requirements by city and type of rental? We could try this:

Rows: City

Columns: room_type

Values: minimum_nights (Average, formatted to two decimals)

I get:

air2

We see that diffident London Airbnbers expect notably briefer stays at their places on average, with those uppity Parisians insisting that you agree to set down your knapsack – and debit card – more than a day-and-a-half longer before they let you in. At the same time, New York’s shared-room minimum is disruptively anomalous.

And for more evidence of cross-cultural heterogeneity – if that's what it is – flip the values into Count mode and hand them over to Show Values As > % of Row Total, ratcheting the decimals down to zero and switching the Grand Totals off (because the rows must invariably figure to 100%). I get:

air3
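
(Here too a COUNTIFS ratio can audit any cell – Paris' Entire home/apt share, for instance, with City in A and room_type in J:

=COUNTIFS(A:A,"Paris",J:J,"Entire home/apt")/COUNTIF(A:A,"Paris")

Formatted as a percentage, it should match the table.)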

The overwhelming Paris proportion devoted to the Entire home/apt offering is not, I would submit, proof positive of yet one more Gallic quirk, but rather a simple function of the Paris housing stock, in which apartments predominate.

For additional, if glancing, corroboration, try this pivot table:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (tick New York)

Values: neighbourhood_group (Count, % of Row Total)

I get:

air4

Recall that New York is the only city among our trio whose neighborhood group field is actually occupied with data – the names of its five boroughs. Note the relative Manhattan tilt towards Entire home/apt, even as the other boroughs, whose housing range features far more private homes, incline towards Private room – that is, presumably one private room among the several bedrooms in a home.

And what of daily price by city, keyed as it doubtless is to room type? It looks something like this:

Rows: City

Columns: room_type

Values: price (Average, to two decimals)

I get:

air5

Again, that imperative qualification – namely, that the prices reflect valuations per their indigenous currencies – needs to be kept in mind. As such, the New York tariffs verge much closer to the London figures when the appropriate, albeit variable, pound-to-dollar conversion is applied. With those understandings in place, the Paris Entire home/apt average seems strikingly low – the euro consistently exhibits a "weaker" relation to the pound, the former as of today equaling .88 of the latter. Yet at the same time, Paris' private room charge would appear to be effectively higher.

Now again, because the data are there, we could compare average prices for New York’s boroughs:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (New York)

Values: price (Average)

I get:

air6

No astonishments there, not if you're a New Yorker. Manhattan expectably heads the rate table, though Staten Island's second-place Entire home/apt standing may give momentary pause, along with its basement-dwelling (pun intended) shared room rate.

That’s $7,300 a month for an entire place in Manhattan. And wait until you see the interest on the mortgage.

NY Regents Exam Data, Part 2: Multiple Choices

12 Apr

Numbered among the additional conclusions we can draw from the New York Regents data is a natural next question, one following from the aggregate test averages we reckoned last week and answered by a kind of reciprocal finding: namely, the aggregate fail rates. Guided by the concerns about weighting we sounded in the previous post, I'd allow that a calculated field need be applied to the task here too, a field I've called PctFail.

But before we proceed we again need to contend with the not insignificant number of records that, for apparent reasons of confidentiality, won't count their fewer-than-five students, replacing the totals with an "s". Thus I tapped into column N, called it NFail, and entered in N2:

=IF(J2="s",0,K2)

The formula assays the relevant cell in J for an “s”; if it’s there, a 0 is supplied. Otherwise the value in K – the Number Scoring Below 65 – is returned.

Again, we’ll copy that formula down N and proceed to calculate the PctFail field:

reg1
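
(The calculated field itself, presumably, amounts to a single quotient – something like

=NFail/'Total Tested'

the single quotes being Excel's convention for a calculated-field reference to a field name containing a space.)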

Once effected, this simple pivot table, abetted by the All Students Slicer selection we ticked last post, opens the story:

Rows: Year

Values: PctFail (formatted here in Percentage mode to three decimals)

I get:

reg2

The failure rates are substantial, a near-ineluctable follow-on from the overall score averages settling in the 68 range (remember that 65 passes a Regents exam).

But you’ll want to know about failures by Regents subjects, too. Sight unseen, you’d expect appreciable variation among the test areas, and that drill down can tool its way into the data via the Regents Exam field, e.g. something like this:

Rows: Regents Exam

Columns: Year

Values: PctFail (formatted similarly to the table above)

I get:

reg3

And variation there is, some more provocative than others. You’ll note the massive leap in failure rates for English and Geometry from 2015 to 2016, a determined, ascending slope of failures for Algebra2/Trigonometry, and a restitutive, noteworthy shrinkage in failures for Common Core Algebra. (The Common Core tests are controversial, in part because of their redesign; see this report, for example).

You’ll also want to do something about those error messages. In some cases, the #DIV/0! outcomes simply key the absence of data for the exam, owing to an exam’s discontinuation or later introduction, while the (blank) label appears by virtue of the ten rows that bear no exam name. Should you want to pave over the errors, and you probably do, click anywhere in the pivot table and proceed to PivotTable Tools > Analyze > Options > Options > Layout & Format tab > and tick For error values show:. Enter some appropriate stand-in for #DIV/0!, e.g. — , and click OK. Because the dashes in the 2015 column push far left and look almost as unseemly as the original error message, you may want to select all the Values and align them right. (You could also filter out the blanks.)

Now if you want to crunch failure rates by ethnicity, for example, you’ll again have to reconcile the double-counting character of the fields we described last post. The ethnicities – Asian, Black, Hispanic, Multiple Race Categories Not Represented, and White – have been quartered in the Demographic Variable field, but so have a potpourri of other, disconnected items bound to other Categories, e.g. Female to Gender, English Proficient to ELL Status.

We’ve stubbed our toe against this odd problem in the previous post, in which Excel’s million-record limit has forced records otherwise deserving of their own field into one, messy cosmopolitan column – the one called Demographic Category, itself dispersed into no-less-heterogeneous items in Demographic Variable. It’s a confusing issue, but I think we need to tick Ethnicity in the Slicer now and slide Demographic Variable – confined by the Slicer to its ethnic item entries – into Rows. Sweep Year into Columns and you get:

reg4

The disparities here are dramatic, and rather self-explanatory – the results, that is, not the accountings of them.

Next opt for Gender in the Slicer:

reg5

Women outdo men, a finding that more-or-less jibes with current understandings of gender performance differentials. The female margin, nearly equivalent across 2015 and 2016, pulls away slightly in the following year.

And what of gender outcomes by exam? Slip Regents Exam atop Demographic Variable (with the Slicer still set to Gender) in Rows, and (in excerpt):

reg6

And (in second, remaining excerpt):

reg7

You’re looking for palpable divergences, of course, but palpable congruences mean something here, too. The decisive female advantages in the Common Core English scores are perhaps notable but not arresting; but their edge in Common Core Algebra does a fair share of stereotype busting, even as males emerge the stronger in Common Core Algebra2. (An important note: the Grand Total fail rates vary by Demographic Variable even as the Total Tested remains neatly constant across the Demographic Categories. That’s because the distribution of “s” entries across the Categories isn’t constant.)

There are plenty of other permutations in there, but let’s try one more. Column Q quantifies the number of students in the record whose score achieves what the Regents calls College Readiness (CR), i.e., a 75 in the English Language Regents or an 80 on any Math exam in the system.

And here’s where I have to own up to a bit of spreadsheeting excess. Earlier in this post I implemented what I termed an NFail field, embodied by its foundational formula – an expression that replaced “s” entries with a 0, the better to factor these into a calculated field. I now understand that those exertions were unnecessary, because Excel will completely ignore an “s” or any other label in any case. Thus here (and last week, too) we can work directly with the Number Scoring CR field in Q. But because we do need to acknowledge the large number of “s” and “NA” entries in Q (NA, because only some Regents exams qualify as CR exams) – entries that would otherwise distort any denominator – we also need what I call here a CRCounted field, put in place in the next available column and powered by this formula, copied down:

=IF(ISNUMBER(Q2),I2,0)

We then need to compose that calculated field, which I’m calling CRPass:

reg11
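
(Here the calculated field presumably divides the column-Q aggregates by the new denominator:

='Number Scoring CR'/CRCounted

the same single-quote convention applying to the multi-word field name.)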

Remember here, and for the first time, we’re computing pass rates. This pivot table, among others, awaits, under the aegis of the Demographic Category Slicer set to Gender:

Rows: Demographic Variable

Columns: Year

Values: CRPass

I get:

reg9

Provided I’ve done my homework correctly the results point to a striking spike in CR attainments, a set of findings that calls for some journalistic deep-backgrounding. (Note that the absolute CR numbers are far smaller than the global Total Tested figures, because as indicated above only certain exams march under the CR banner.) We see a small-scaled but real stretching of the female advantage in the pass rates between 2015 and 2017, one that also needs to be sniffed by some nose for news.

Now let me take a break while I double-check my homework. I hear this blogster is one nasty grader.

NY Regents Exam Data, Part 1: Multiple Choices

28 Mar

We haven’t met, but I can say with a jot of confidence that I’ve likely done something you haven’t – taken a Regents exam. Exams.

I’m not clipping the achievement to my lapel as a badge of honor, you understand, just stating a biographical matter of fact. The Regents – a staple of the New York State educational system in which I spent more than a little time – comprise a series of what are termed exit exams; pass enough of them and you walk away with a high school diploma of the same name. Your correspondent took his share of Regents, his scores embodying the “scatter” in scattergram, but no matter; I took them, and I’m a better person for having done so.

But before you file a Freedom of Information Act request to verify those abnormally-curved results you may want to review a larger, ultimately more interesting record of Regents attainments, the dataset supplied by the New York City open data site that summarizes in grand form the Regents scores of students statewide for the years 2014-17. It’s a big file, needless to say – so big you’ll need to download it yourself – and its 212,000 or so records have a lot to say about the testing profile of New York’s high schoolers.

It also has a lot to say about spreadsheet organization, more particularly the juxtaposition of column G, Demographic Category, to H, Demographic Variable. Those columns/fields in fact identify a series of putative fields and field items respectively; and as such, G’s contents could, at least in theory, have been more conventionally structured into discrete parameters, each owning a column all its own.

But that prescription calls for an elaboration. Consider this fledgling pivot table drawn from the Regents data:

Rows: Demographic Category

Demographic Variable

Values: Total Tested

I get:

regents

Note that all the Total Tested subtotals are equivalent, intimating that the five Demographic Categories cleave the same population into an assortment of cross-cutting attributes, thus summing the same student count five times. The by-the-book alternative, again, would have assigned each Category to an independent field, such that an interest in test achievements by Gender, for example, would require the user to simply drag Gender into Rows. As it stands, however, a Slicer (or filter) would have to grease the process, e.g.:

regents2
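
(That five-fold equivalence is easy to verify with ordinary formulas. With Demographic Category in G and Total Tested in I, each of these should return the same grand total:

=SUMIF(G:G,"All Students",I:I)

=SUMIF(G:G,"Gender",I:I)

Swap in Ethnicity or any other Category and the sum shouldn't budge.)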

Moreover, casting the potential field-bearers into mere item status beneath that singular Demographic Category banner appears to foreclose a good many cross-tabulating possibilities, e.g. a breakout of tests by both Ethnicity and Gender. How, for example, are we to learn how black female students score on the Regents when both attributes are lodged in the same field, and so must occupy the same label area?

But at the same time, the spreadsheet designers had to contend with a supplementary problem that overrides my challenges – namely, that the upgrading of, say, Ethnicity and Gender to field standing would appear to require that the data present themselves in individual record form, e.g. each student’s performance on each test; and those 2,000,000 scores/records would burgeon beyond Excel’s data-accommodation space.

In any case, there is indeed lots to learn, structural complications notwithstanding, and we could begin by starting coarsely – by calculating the average overall Regents scores by year:

Row: Year

Values: Mean Score (Average, formatted to two decimal points)

I get:

regents3

(Note that one record, attaching to the Island School in Manhattan, exhibits a nonsensical entry for its year. By filtering and comparing the Island School data, it appears that the record belongs to 2017.)

The averages are remarkably similar, though I’d venture that, given the 2,000,000-pupil universe, the one-point differential distancing the 2015 and 2017 scores is significant. Remember that the Regents passing score is pegged at 65, suggesting that the test designers got their threshold right.

But those averages aren’t quite definitive, and for a couple of connected reasons, one subtler than the other. The first recognizes that the student double-count pinpointed above proceeds in effect to compute the average scores multiple times, because the records operate under the steam of different demographic categories and numbers per record. Thus the mean average for the Gender category alone – which nevertheless contains all students – is likely to depart at least slightly from the mean average for Ethnicity, which likewise contains all students. If, for example, we reintroduce the Slicer for Demographic Category for the current pivot table and tick All Students, we’ll get:

regents4

The differences from the initial pivot table are very small but evident, again because the Total Tested numbers per the All Students records don’t perfectly line up with the Total Tested per-record numbers for Ethnicity, for example – because each record receives an equal weight, irrespective of its Total Tested value.

And it is the matter of weighting that points its arrow directly at the second question, one we’ve seen elsewhere (here, for example). The per-record mean averages ascribe an identical mean score input to each record, even as the test taker numbers vary. And that bit of record democratization vests greater relative influence in the smaller numbers. The result, again: a possible skewing of the averages.

But because Column I enlightens us about the actual test taker numbers, we should be able to derive a simple calculated field to impart a corrective to the weighting problem – once we deal with the very large number of records – about 75,000 – that just don’t report their score results. These are the entries dotted with the “s” code, and our formula needs to ignore them as well as their test taker numbers, which are stated, after all.

So here’s what I did. I headed the next-available column S TotalPts and entered, in S2:

=IF(J2="s",0,I2*J2)

That expression means to assign 0 for any “s” datum, and otherwise multiply the record’s mean score by its number of students. (Important note: absent Mean Scores almost always associate themselves either with a Total Tested number of five or less, or with entries possessing the SWD or Non-SWD values. (SWD stands for Students With Disabilities.) One assumes that both types of exclusions are justified by reasons of confidentiality; but remember that the five-way count of students in the data set should subsume most of the SWD takers anyway, via their inclusion in the All Students item.)

After copying the formula down I titled the T column CountedStu and entered, in T2:

=IF(J2="s",0,I2)

The formula asks if the relevant row in J contains an s. If so, a zero is returned; otherwise, the number of test takers cited in I is returned.

I next devised a simple calculated field, ActAvg (for actual):

regents5
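
(In all likelihood the field simply divides the one helper column by the other:

=TotalPts/CountedStu

and because a pivot table calculated field sums each constituent field before applying the formula, the quotient delivers exactly the weighted average we're after.)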

That field can now be made to work with any other field whose items are banked in Row Labels, e.g. by substituting ActAvg for Mean Score in Values (and leaving the All Students Slicer selection in place). I get:

regents6

It’s clear that our weighting refinements have uncovered a “true”, higher set of averages, ones that continue to sustain the near-point improvement from 2015 to 2017.

But again, there’s quite a bit more to learn from the scores. But I’m asking now to recess for lunch – and yes, I raised my hand.

U. S. Vaccine Data: A Different Treatment

14 Mar

It’s alarmingly philistine of me to say so, but I know what I like; and I like the heat maps that track the vaccine-driven advances in the United States wrought against infectious diseases, schematized by graphics boffins Tynan DeBold and Dov Friedman and featured in the Wall Street Journal in 2015. One of them looks like this:

Debold1

These are of course heat maps in the broadened sense, plotting epidemiological movement across time instead of territory, and tiling their data into tightly-bound mosaics, whose colors blanch as the diseases recede. In short, a well-told story of stirring medical progress, the data for which is contributed by Project Tycho of the University of Pittsburgh and available for download there (you can sign into it for free if you’re affiliated with an educational institution). By contrast, think what a line chart bearing 50 data series would look like.

It should be added that the Tycho site itself hardwires a heat-mapping utility into its pages, in what could be construed as an ancestral precursor of the DeBold-Friedman outputs, e.g.

debold2

It is worth asking, then, about the refinements DeBold and Friedman applied to the Tycho charts, and the respective reasons why.

But in any event the map designers don’t need my everyman’s encomium; their depictions have won at least two awards – one conferred by the Global Editors Network (GEN), the other by the Kantar Information is Beautiful judges. The former’s site declares the maps were “…wildly popular on social media as well as with statisticians and graphics editors weighing in on how they would’ve approached this project”.

Sounds good to me, but note for the record that the Y axis above records 26 state names, pulling up 24 (really 25; the District of Columbia – the nation’s capital that possesses extra-state status – is likewise counted) short of the full American complement. Those names – at least as they’re represented above – have been invested with a font size that misaligns them with the 51 rows of data; the names are simply too large for the row heights that capture the data, a mismatch that conduces toward the follow-on question as to why the particular states above were earmarked for display, to the exclusion of the others.

You’ll also observe that the maps’ cell widths vary. Contrast the DeBold-Friedman above with this one:

Debold3

Because the above map – and its 26 states – time-stamps its data from a later point of inception, its cells are dilated, a refit that widens it equivalently to the other maps. Do these discrepancies matter (along with the ten-year intervals marked in the first map that have been halved in the second)? I’m not sure.

And something else about the maps provoked a thought or two. On his website Dov Friedman tells us that he used Excel “to aggregate over 100,000 data points. The data was then plotted on heatmaps using Highcharts. All sections were templatized with handlebars [sic].”

Now Highcharts and Handlebars are two applications with which I am not familiar (I told you I was a philistine); but in the course of perusing and admiring the maps a renegade idea gatecrashed my cerebellum: could the maps be emulated – at least more or less – with Excel alone?

I think the answer is yes, at least more or less; it seems to me that, with the disease data dropped into the Values area, the heat maps could be made to emerge from a pivot table that would crosstab state names and years against the data, whose numbers could be comparatively scaled through a series of conditional formats.

In receipt of that self-issued marching order, I proceeded to download the measles data for five states to see how the heat maps might be framed in Excel-only mode. The numbers were redirected into a pivot table (not a straightforward exercise; the download assigned each state’s figures to an independent field, and as such they had to be harnessed inside a single State parameter via the Get & Transform utility. In addition the “-“ cell entries marking absent data had to be quantified, but I’ll spare you the particulars).

Once the above reconstructive work was carried out the resulting pivot table proceeded pretty straightforwardly, something like (depending on how you’d name the fields):

Rows: States

Columns: Year

Values: Incidence

And once the table was put in place, the cells would be subjected to the color-graded conditional formats described above (in tandem with white borders), and while I didn’t do a primo job of replicating the DeBold-Friedman scheme it’s all about the concept, after all.

Now of course the pivot table will line its upper border with the year (Column area) entries, even as the maps in question underscore the values with those data. The simplest route toward emulation would call for a blue-on-blue formatting of the years above, or a simple hiding of their row, after which the years could be typed below. The Vaccine introduced line was almost surely drawn in by hand, and that’s what I did. The color legend

Debold4

is a tricky one; it could perhaps be rendered through an equally simple coloring of consecutive cells per the conditional formats, though it appears that the DeBold-Friedman band comprises tints of varying width. By way of resolution, four cells or so could be allotted per 1k, to support multiple colors spanning the same interval (note that the data for the states I selected don’t report numbers in the 2-4k range).

In light of the preceding, I’ve gotten this, by way of a demo:

Debold5

The larger point has been made, one hopes. Proper equivocations aside, the map approximates the representations of the DeBold-Friedman efforts, but again without recourse to applications beyond Excel. Indeed, for me the phase of the task that threw up the most resistance was the foundational pivot table itself. But once you take up that gauntlet, the map beckons – because the map is the pivot table (and there’s no need, by the way, to resize the numbers in Values for visibility’s sake, because the conditional formats will obscure them anyway).

Thus it seems to me that my infographical proposal works, albeit at a slightly lower end than the DeBold-Friedman portrayals. And while I’m not claiming it’s the stuff for which awards are bestowed, I’m working on my acceptance speech anyway.

US Politicians’ Worth: A Wealth of Disclosures

20 Feb

If the citizen’s right to know shades into a gossipy prurience, the right to know nevertheless prevails; thus if the disclosures of the estimated net worth of American lawmakers compiled by the Center for Responsive Politics and reposed in the Enigma public data site titillate or outrage, one assumes the greater good is being served just the same. Those estimates are here:

LawmakerNetWorth-2014

First, don’t be fooled by the 2014 date stamp, as I was at first blush. After all, the worth list, sorted as it is highest-to-lowest, is headed by the irrepressibly well-heeled Mr. Trump, who if memory serves only pulled into the White House in 2017, whether he wanted to or not. The 2014, then, must point to the date of the estimates instead (and as such, the year field and its relentlessly uniform 2014s could be deleted, though keep in mind that Enigma stores Center worth data for previous years as well), and if a good many of the numbers appear to you in scientific-notational form (e.g. 1.07E+08) that’s just one more way by which Excel means to tell you that their columns are presently too narrow. Just auto-fit away, as usual. (Note that the origin field in H is likewise uninformative, comprising nothing but the Center for Responsive Politics entry, and the source field in I is even sparser, amounting to nothing but its header.)

Get past those preliminaries and you’ll need to remind yourself about the codes informing the chamber field: E – Executive, S – Senate, H – House of Representatives, and J – Judiciary. That latter code estimates the worth of the nine now-serving Supreme Court Justices, an interesting set of disclosures about a coterie one might deem slightly above the financial fray; the S and H signifiers apparently name and enumerate each and every current Senator and Congressperson, a nicely comprehensive complement. The twelve executive entries, however, stand in a different relation to the dataset; they identify presidential hopefuls for the 2016 election and not office holders, with the exception of the winner.

Perhaps the first attention-grabber in the data to seize your lapels is the trio of worth estimates lined up for each politician. The Center for Responsive Politics explains that the data spring from disclosure forms completed by office holders which “do not require exact values”, but rather “the range of value into which an asset falls”. The Center thus filled ranges with lower and upper estimates and averaged the two into the avgvalue field, those computations in turn exhibiting considerable plus-minus variation. Indeed, 77 of the politicians declared a negative lower-end asset value, with another 11 submitting negative upper estimates. Of course, our data can’t take that question further.
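
(Both the averaging and the negative-value census can be replicated with elementary formulas. A sketch with hypothetical column assignments – say the lower-bound estimates in E and the upper bounds in F; your actual headers will adjudicate:

=(E2+F2)/2

reconstitutes avgvalue for row 2, and

=COUNTIF(E:E,"<0")

should turn up those 77 negative lower-end declarations.)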

In any case, working with the averages we could broach a first obvious question, namely how worth associates with chamber:

Rows: chamber

Values: avgvalue (Average, formatted to two decimals with a comma)

I get:

worth1

Remember of course that the Executive universe is small, numbering but 12 cases. As such the 1.482-billion-dollar average estimate for Donald Trump slingshots the mean skyward, in spite of negative-asset claimers Scott Walker, Rick Santorum, and Martin O’Malley. Whether that means erstwhile Maryland governor O’Malley truly has less money than you and I is perhaps a query better put to his accountant, but the data maintain as much.

But the larger, if ingenuous, conclusion urged upon us is the reality that American national politicians – including Supreme Court justices – are overwhelmingly multi-millionaires. If the averages of the averages are credible, Senators possess over $10,000,000 worth of assets, with Representatives checking in at $7,500,000, accumulations that of course can’t be ascribed to their $174,000 salaries. At the same time, however, variation within the Congress is massive, including outlier Darrell Issa, a Republican Congressman from southern California worth, per his average estimate, over $436,000,000.

For a sense of the dispersion of wealth, then, you can drag avgvalue into Values again, setting it to Stddevp (for the operational difference between Stddev and Stddevp, look here):

worth2

When the standard deviations exceed the averages you’re dealing with a lot of variation, even after discounting the ten-figured Mr. Trump.

Now for perhaps the juicier question: might net worth vary palpably by party affiliation? Conventional wisdom would respond in the affirmative, i.e. Republicans, with their putative sympathies for the haves, could be expected to have more than the latter-day Blues. But either way, the dataset should enable an answer of sorts to the question.

The marching order here is to isolate party, as it’s referenced by its initial in the name field. Because that initial appears (almost) in every case immediately after the open parenthesis alongside each surname, we can take over the empty source field, rename it Party, and write in I2 (assuming you haven’t deleted the extraneous year field):

=MID(B2,FIND("(",B2)+1,1)

That expression seeks the character position of the open parenthesis and adds one to it, enabling MID to grab the next single character, which denotes political party. Copy the formula down the field and that should work, with nine exceptions – the Supreme Court justices, whose party memberships have been withheld from the data, an exclusion that imposes a #VALUE! message upon their formulas. The workaround: rewrite the formulas thusly:

=IFERROR(MID(B2,FIND("(",B2)+1,1),"J")

The nifty IFERROR function returns the default, nested formula result unless it evaluates to an error, in which case it’ll interject the defined if-false result, in the above case J, for Judiciary.

Once that revamped expression is copied down the field the pivot table follows:

Rows: Party

Values: avgvalue (Average, with appropriate formatting)

I get:

worth3

 

The thesis is confirmed, apparently (the I represents the two independent legislators informing the list, by the way, one of whom is Bernie Sanders). Republicans are worth on average 47% more than the (very) relatively proletarian Dems; but toss Stddevp back into the table and we see:

worth4

The mammoth standard deviation among GOP sectarians has a good deal to do with the Trump billion, and Issa’s $436,000,000, too. Thus, if one deletes the row bearing Trump’s data and refreshes the pivot table, you’ll see:

worth5

Check it out. The Republicans, we see, in actuality average a smaller net worth and standard deviation than Democrats. Go further and delete all 12 Executive entries – none of whom, with the exception of the incumbent, are elected officials, after all – and run the pivot table once more:

worth6

Here, Democrats possess 9.94% more worth on average than Republicans.

Now that’s unconventionally wise.

Tweets of a Stable Genius

11 Feb

Notch November 7, 2017 as a watershed tickpoint on your presidential timeline. It was the day on which Twitter doubled its message character allotment to 280, an allowance that induced shudders both in the twittersphere and across vast terrestrial swaths: it mightily enriched the opportunities for the President of the United States to tweet his mind to you and me.

I’ll leave the rest of the editorials to you and me, but the character enhancement poses a few novel questions we could instructively put to the downloadable spreadsheet of Mr. Trump’s communiques offered up (for a small fee) by twdocs.com, the go-to site to which I’ve gone on more than one occasion. My latest requisition of @realDonaldTrump tweets (accept no substitutes) pumped out the incumbent’s latest 2,805 entries through February 8, affording us exactly three months’ worth of post-280 streams of consciousness, with the earliest of these bearing the time stamp October 24, 2016. Again, I think you’ll have to pay for the sheet yourself if you want to click along.

A first question, then, an obvious one: has the widened message space encouraged Trump to issue longer tweets? And a second, slightly subtler follow-on: might 280 characters inspire longer, but fewer of them? (Note that this submission considers features of the commander-in-chief’s Twitter proclivities over and above his preferred key words, a concern which predominates in my previous posts on the matter).

An answer to the former hinges on the LEN function, with which I’ve overwritten the contents of the H column, a field declaring nothing but the name Donald J. Trump in each of its populated cells. Renaming the field LEN, I entered in H7:

=LEN(H7)

and copied down. (I’m not quite sure what the field name a in the otherwise empty G column is doing. On the one hand, the name suffices to enable the pivot table to enfold the fields following G, but I’m not sure why the field need be there at all.)

The next move is slightly trickier, because the defining November 7, 2017 breakpoint doesn’t lend itself to a graceful pivot table grouping, or at least I don’t think it does. That is, grouping the date data in the A column by Months and Years will naturally turn out conventionally-demarcated months, e.g. Nov and Dec, and won’t group the months beginning with the 7th. A more elegant realization would tap into a pair of array formulas, i.e.: name the LEN field range Length and name the data in Created At (in the A column) Date. Enter the 11/7/2017 boundary date anywhere, say H1, and enter, say in G1:

{=AVERAGE(IF(Date<H1,Length))}

The formula seeks out all dates prior to the benchmark November 7 and averages their tweet lengths. Extended to two decimal points, I get 126.80. Then enter in G2:

{=AVERAGE(IF(Date>=H1,Length))}

This companion expression averages the character length of the President’s tweets dating from November 7, 2017 and evaluates to 193.29 – a none-too-electrifying differential, I’d allow; ask Twdocs.com to download your tweets, after all, and your stats might look much the same. Survey a few million post-November 7 tweets and then, and only then, might we be able to say something distinctive about Trump’s relative outputs.
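
(Those allergic to the Ctrl-Shift-Enter ritual can conscript AVERAGEIFS for the same two answers, no array entry required:

=AVERAGEIFS(Length,Date,"<"&H1)

=AVERAGEIFS(Length,Date,">="&H1)

Same named ranges, same 126.80 and 193.29.)
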
But what about the second question posed above – namely, if the heightened loquacity promoted by the 280 characters thus curtails Trump’s need for as many tweets? That finding could again be essayed in a number of ways, including some messier array formulas that would count the available tweets on either side of November 7, 2017 and divide those respective tallies by the number of days elapsed prior to and following that date. In fact, however, a slightly more imprecise but swifter pivot table could do a representative job:

Rows: Created At (UTC + 1h), grouped by Years and Months (the parenthetical notation points to the French provenance of twdocs, which records tweet transmission times per the clock in France – an important qualification. The User Created At field, however, merely reports and repeats the date/time at which Trump initiated his Twitter account).

Values: Created At (UTC + 1h) (Count)

I get:

BlogT1

True, the Nov 2017 above conflates the pre- and post-November 7 numbers a bit, but the larger thrust of the counts pushes ahead. We see an inconclusive range of distributions: while tweets across the last few months have indeed downturned in relation to October of last year (and that month stands as the most profuse among the table’s holdings), tweet counts in the Feb-Jun 2017 span were decisively lower than the latter-day entries. It’s thus not clear if Trump’s tweeting inclinations have been stunted by the new opportunity to say more at any one time.
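
(For the record, the per-day rates mooted above needn't be messy either; a pair of COUNTIFs, divided by the days elapsed on either side of the H1 boundary date, gets there without arrays:

=COUNTIF(Date,"<"&H1)/(H1-MIN(Date))

=COUNTIF(Date,">="&H1)/(MAX(Date)-H1)

The Created At values carry times, so the day counts emerge with fractional parts; format accordingly.)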

And what about Trump’s average retweet count, proposed here as a proxy popularity index? Just substitute Retweet Count for Created At (UTC + 1h) in Values and tick Average (with suitable formatting). Here I’d also turn off any subtotals, thus uprooting one higher level of slightly confounding aggregation from the counts. I get:

BlogT2

Commit the above to a rudimentary line chart, and I get:

BlogT3

Is that something worth writing home about, or at least submitting on deadline? Maybe. There’s something at least suggestive about the line’s fitful locus, and it might be worth investigating April’s retweet dip and/or the more recent spike in RTs, which does roughly correlate with the opening of the 280-character gateway, though I’m hard-pressed to advance beyond a simple correlation – unless one wants to maintain that longer tweets bespeak deeper policy insights that warrant stepped-up circulation across the twittersphere. Sounds like a reach.

Then add Favorite Count to Values (again, Average):

BlogT4

Favorites correlate most significantly with retweets (I get .8997), an unsurprising resonance. One supposes the prior question, rather a generic one, would ask why favorites reliably outnumber retweets so enormously (look here, for example, for one take on the matter).
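
(That coefficient comes by way of CORREL – something like

=CORREL(RetweetRange,FavoriteRange)

where the two names stand in for whatever ranges hold the Retweet Count and Favorite Count columns in your download.)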

And to fit a final, relatively novel frame atop the Trump tweets, what about the president’s means of transmission? That is:

Rows: Created At (UTC + 1h) (again, grouped by Years and Months)

Columns: Source

Values: Source (Count)

I get:

BlogT5

Note among other things Trump’s late abjuring of Twitter for Android, and a corresponding uptick in his resort to Twitter for iPhone, his clear source of choice.

Does that mean that Mr. Trump will be recruited to spearhead Apple’s next Think Different campaign?