US Politicians’ Worth: A Wealth of Disclosures

20 Feb

If the citizen’s right to know shades into a gossipy prurience, the right to know nevertheless prevails; thus if the disclosures of the estimated net worth of American lawmakers compiled by the Center for Responsive Politics and reposed in the Enigma public data site titillates or outrages, one assumes the greater good is being served just the same. Those estimates are here:

LawmakerNetWorth-2014

First, don’t be fooled by the 2014 date stamp, as I was at first blush. After all, the worth list, sorted as it is highest-to-lowest, is headed by the irrepressibly well-heeled Mr. Trump, who if memory serves only pulled into the White House in 2017, whether he wanted to or not. The 2014, then, must point to the date of the estimates instead (and as such, the year field and its relentlessly uniform 2014s could be deleted, though keep in mind that Enigma stores Center worth data for previous years as well), and if a good many of the numbers appear to you in scientific-notational form (e.g. 1.07E+08) that’s just one more way by which Excel means to tell you that their columns are presently too narrow. Just auto-fit away, as usual. (Note that the origin field in H is likewise uninformative, comprising nothing but the Center for Responsive Politics entry, and the source field in I is even sparser, amounting to nothing but its header.)

Get past those preliminaries and you’ll need to remind yourself about the codes informing the chamber field: E – Executive, S – Senate, H – House of Representatives, and J – Judiciary. That latter code estimates the worth of the nine now-serving Supreme Court Justices, an interesting set of disclosures about a coterie one might deem slightly above the financial fray; the S and H signifiers apparently name and enumerate each and every current Senator and Congressperson, a nicely comprehensive complement. The twelve executive entries, however, stand in a different relation to the dataset; they identify presidential hopefuls for the 2016 election and not office holders, with the exception of the winner.

Perhaps the first attention-grabber in the data to seize your lapels is the trio of worth estimates lined up for each politician. The Center for Responsive Politics explains that the data spring from disclosure forms completed by office holders which “do not require exact values”, but rather “the range of value into which an asset falls”. The Center thus recorded lower and upper estimates for each range and averaged the two into the avgvalue field, those computations in turn exhibiting considerable plus-minus variation. Indeed, 77 of the politicians declared a negative lower-end asset value, with another 11 submitting negative upper estimates. Whether those negative figures bespeak genuine indebtedness is a question our data can’t take further.
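
The arithmetic behind that avgvalue field is simple enough to reproduce yourself. A minimal sketch, assuming purely for illustration that the lower and upper estimates occupy columns E and F (the actual positions in your download may differ):

=(E2+F2)/2

copied down a free column, should reconstruct the Center’s averages row by row.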

In any case, working with the averages we could broach a first obvious question, namely how worth associates with chamber:

Rows: chamber

Values: avgvalue (formatted to two decimals with a comma)

I get:

worth1

Remember of course that the Executive universe is small, numbering but 12 cases. As such the 1.482-billion-dollar average estimate for Donald Trump slingshots the mean skyward, in spite of negative-asset claimers Scott Walker, Rick Santorum, and Martin O’Malley. Whether that means erstwhile Maryland governor O’Malley truly has less money than you and I is perhaps a query better put to his accountant, but the data maintain as much.

But the larger, if ingenuous, conclusion urged upon us is the reality that American national politicians – including Supreme Court justices – are overwhelmingly multi-millionaires. If the averages of the averages are credible, Senators possess over $10,000,000 worth of assets, with Representatives checking in at $7,500,000, accumulations that of course can’t be ascribed to their $174,000 salaries. At the same time, however, variation within the Congress is massive, including outlier Darrell Issa, a Republican Congressman from southern California worth, per his average estimate, over $436,000,000.

For a sense of the dispersion of wealth, then, you can drag avgvalue into Values again, setting it to StdDevp (for the operational difference between StdDev and StdDevp, look here):

worth2

When the standard deviations exceed the averages you’re dealing with a lot of variation, even after discounting the ten-figured Mr. Trump.
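
For the record, the worksheet cousins of those two pivot table options could be put to the avgvalue figures directly – a quick sketch, assuming for illustration that avgvalue occupies D2:D550 (check the actual column and row count in your copy):

=STDEV.P(D2:D550)

=STDEV.S(D2:D550)

The former treats the politicians as a complete population – apt here, since the list names everyone – while the latter treats them as a sample and so divides by n-1.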

Now for perhaps the juicier question: Might net worth vary palpably by party affiliation? Conventional wisdom would respond in the affirmative, i.e. the Republicans, with their putative sympathies for the haves, could be expected to have more themselves than the latter-day Blues. But either way, the dataset should enable an answer of sorts to the question.

The marching order here is to isolate party, as it’s referenced by its initial in the name field. Because that initial appears in (almost) every case immediately after the open parenthesis alongside each surname, we can take over the empty source field, rename it Party, and write in I2 (assuming you haven’t deleted the extraneous year field):

=MID(B2,FIND("(",B2)+1,1)

That expression seeks the character position of the open parenthesis and adds one to it, enabling MID to grab the next single character, which denotes political party. Copy the formula down the field and that should work, with nine exceptions – the Supreme Court justices, whose party memberships have been withheld from the data, an exclusion that imposes a #VALUE message upon their formulas. The workaround: rewrite the formulas thusly:

=IFERROR(MID(B2,FIND("(",B2)+1,1),"J")

The nifty IFERROR function returns the default, nested formula result unless it evaluates to an error, in which case it’ll interject the defined if-false result – in the above case J, for Judiciary.

Once that revamped expression is copied down the field the pivot table follows:

Rows: Party

Values: avgvalue (Average, with appropriate formatting)

I get:

worth3

The thesis is confirmed, apparently (the I represents the two independent legislators informing the list, by the way, one of whom is Bernie Sanders). Republicans are worth on average 47% more than the (very) relatively proletarian Dems; but toss StdDevp back into the table and we see:

worth4

The mammoth standard deviation among GOP sectarians has a good deal to do with the Trump billion, and Issa’s $436,000,000, too. Thus, if you delete the row bearing Trump’s data and refresh the pivot table, you’ll see:

worth5

Check it out. The Republicans, we see, in actuality average a smaller net worth and standard deviation than Democrats. Go further and delete all 12 Executive entries – none of whom, with the exception of the incumbent, are office holders, after all – and run the pivot table once more:

worth6

Here, Democrats possess 9.94% more worth on average than Republicans.

Now that’s unconventionally wise.

Tweets of a Stable Genius

11 Feb

Notch November 7, 2017 as a watershed tickpoint on your presidential timeline. It was the day on which Twitter doubled its message character allotment to 280, an allowance that induced shudders both in the twittersphere and across vast terrestrial swaths: it mightily enriched the opportunities for the President of the United States to tweet his mind to you and me.

I’ll leave the rest of the editorials to you and me, but the character enhancement begs a few novel questions we could instructively put to the downloadable spreadsheet of Mr. Trump’s communiques offered up (for a small fee) by twdocs.com, the go-to site to which I’ve gone on more than one occasion. My latest requisition of @realDonaldTrump tweets (accept no substitutes) pumped out the incumbent’s latest 2805 entries through February 8, affording us exactly three months’ worth of post-280 streams of consciousness, with the earliest of these bearing the time stamp October 24, 2016. Again, I think you’ll have to pay for the sheet yourself if you want to click along.

A first question, then, an obvious one: has the widened message space encouraged Trump to issue longer tweets? And a second, slightly subtler follow-on: might 280 characters inspire longer tweets, but fewer of them? (Note that this submission considers features of the commander-in-chief’s twitter proclivities over and above his preferred key words, a concern which predominates in my previous posts on the matter.)

An answer to the former hinges on the LEN function, with which I’ve overwritten the contents of the H column, which declares nothing but the name Donald J. Trump in each of its populated cells. Renaming the field LEN, I entered in H7:

=LEN(H7)

and copied down. (I’m not quite sure what the field name a in the otherwise empty G column is doing. On the one hand, the name suffices to enable the pivot table to enfold the fields following G; on the other, I’m not sure why the field need be there at all.)

The next move is slightly trickier, because the defining November 7, 2017 breakpoint doesn’t lend itself to a graceful pivot table grouping, or at least I don’t think it does. That is, grouping the date data in the A column by Months and Years will naturally turn out conventionally-demarcated months, e.g. Nov and Dec, and won’t group the months beginning with the 7th. A more elegant realization would tap into a pair of array formulas, i.e.: name the LEN field range Length and name the data in Created At (in the A column) Date. Enter the 11/7/2017 boundary date anywhere, say H1, and enter, say in G1:

{=AVERAGE(IF(Date<H1,Length))}

The formula seeks out all dates prior to the benchmark November 7 and averages their tweet lengths. Extended to two decimal points, I get 126.80. Then enter in G2:

{=AVERAGE(IF(Date>=H1,Length))}

This companion expression averages the character length of the President’s tweets dating from November 7, 2017 and evaluates to 193.29, a none-too-electrifying differential, I’d allow; ask Twdocs.com to download your tweets, after all, and your stats might look much the same. Survey a few million post-November 7 tweets and then, and only then, might we be able to say something distinctive about Trump’s relative outputs.
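
And if the curly-brace syntax feels fussy, a pair of garden-variety AVERAGEIF expressions should return the same two numbers – a sketch, again leaning on the Date and Length named ranges and the benchmark date in H1:

=AVERAGEIF(Date,"<"&H1,Length)

=AVERAGEIF(Date,">="&H1,Length)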

But what about the second question posed above – namely, whether the heightened loquacity promoted by the 280 characters curtails Trump’s need for as many tweets? That finding could again be essayed in a number of ways, including some messier array formulas that would count the available tweets on either side of November 7, 2017 and divide those respective tallies by the number of days elapsed prior to and following that date. In fact, however, a slightly more imprecise but swifter pivot table could do a representative job (a formulaic sketch of that tack follows the table below):

Rows: Created At (UTC + 1h), grouped by Years and Months (the parenthetical notation points to the French provenance of twdocs, which records tweet transmission times in French local time – an important qualification. The User Created At field, by contrast, merely repeats the date/time at which Trump initiated his Twitter account).

Values: Created At (UTC + 1h) (Count)

I get:

BlogT1

True, the Nov 2017 above conflates the pre- and post-November 7 numbers a bit, but the larger thrust of the counts comes through nonetheless. We see an inconclusive range of distributions: while tweets across the last few months have indeed downturned in relation to October of last year (and that month stands as the most profuse among the table’s holdings), tweet counts in the Feb-Jun 2017 span were decisively lower than the latter-day entries. It’s thus not clear if Trump’s tweeting inclinations have been stunted by the new opportunity to say more at any one time.
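
(For the record, the formulaic tack alluded to above might look something like this – a rough sketch only, reusing the Date named range and the benchmark in H1, and treating elapsed time as fractional days because the field stores times as well as dates:

=COUNTIF(Date,"<"&H1)/(H1-MIN(Date))

=COUNTIF(Date,">="&H1)/(MAX(Date)-H1)

Each expression counts the tweets on its side of November 7 and divides by the days elapsed, yielding a tweets-per-day rate for either era.)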

And what about Trump’s average retweet count, proposed here as a proxy popularity index? Just substitute Retweet Count for Created At (UTC + 1) in Values and tick Average (with suitable formatting). Here I’d also turn off any subtotals, thus uprooting one higher level of slightly confounding aggregation from the counts. I get:

BlogT2

Commit the above to a rudimentary line chart, and I get:

BlogT3

Is that something worth writing home about, or at least submitting on deadline? Maybe. There’s something at least statistically significant about the line’s fitful locus, and it might be worth investigating April’s retweet dip and/or the more recent spike in RT’s, which does roughly correlate with the opening of the 280-character gateway, though I’m hard-pressed to advance beyond a simple correlation – unless one wants to maintain that longer tweets bespeak deeper policy insights that warrant stepped-up circulation across the twittersphere. Sounds like a reach.

Then add Favorite Count to Values (again, Average):

BlogT4

Favorites correlate most significantly with retweets (I get .8997), an unsurprising resonance. One supposes the prior question, rather a generic one, would ask why favorites reliably outnumber retweets so enormously (look here, for example, for one take on the matter).
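
That coefficient needs no pivot table, by the way; name the two columns retweets and favorites, say (the names are mine, not the workbook’s), and a simple

=CORREL(retweets,favorites)

returns the figure.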

And to fit a final, relatively novel frame atop the Trump tweets, what about the president’s means of transmission? That is:

Rows: Created At (UTC + 1) (again, grouped by Years and Months)

Columns: Source

Values: Source (count)

I get:

BlogT5

Note among other things Trump’s late abjuring of Twitter for Android, and a corresponding uptick in his resort to Twitter for iPhone, his clear source of choice.

Does that mean that Mr. Trump will be recruited to spearhead Apple’s next Think Different campaign?

UK Immigration Raids: Migrating the Data

16 Jan

Designing a dataset falls on the complexity band somewhere between child’s play and rocket science, though I’m not sure to which of the poles the enterprise is closer. Rows and columns, you say; what else do you need, after all? And your rhetorical smirk is right, so far as it goes.

But sometimes the rows and columns require a think-through, preferably somewhere in that design stage – and particularly, for example, if the data are farmed out to more than a single sheet for starters.

Case in point: the look, or pair of looks, at what the Bureau of Investigative Journalism calls Immigration encounters and arrests, a breakdown by city and nationality of individuals either questioned or arrested by law enforcement officials between January 2012 and January 2017, presumably for a putative trespass of immigration law. The Bureau makes much of the large numbers of British citizens bunched with the foreign detainees, stoking the suggestion that these nationals, by definition entitled to be where they were when they were apprehended, were ethnically profiled, an inference the UK government discounts. The BIJ data are gathered here:

FOI 42354 – Annex A

(Observe the notes stationed beneath the data. If I understand the qualifications of note 2 on the ENCOUNTERED NOT ARRESTED sheet, the “total number of persons” seems to provide for the possible counting of the same person multiple times, should he/she be questioned more than once. Thus, the data appear to tabulate all encounters, not discrete individuals. And note 2 on ENCOUNTERED ARRESTED apparently makes analytical room for multiple arrests of the same person.)

And those data comprise two identically organized worksheets; and so, it seems to me, the pair should be pared. The likenesses between the two argue for a consolidation, but with a particular end in mind: to calculate the ratios marking the encountered and arrested figures, both by city of apprehension and origin of detainee (note: the misspelled city names Cardiff and Sheffield in the ENCOUNTERED ARRESTED sheet must be emended).

But before the consolidation comes the reconstitution. It’s something we’ve seen many times before, but because the worksheets assign field status to the several cities in the data set, these really need to be reined by the yoke of a single field; and to achieve that end we can review and apply Excel’s Power Query capability to the task, more particularly its doubtfully-named Unpivot feature.

First, and this script speaks to both data sets, delete the Grand Total column and row, the better to stem a double-count of the numbers. Then click anywhere in the data set and click Data > From Table/Range in the Get & Transform Data button group (remember I’m working in Excel 2016).

Next, select the fields to be incorporated into what Excel terms attribute-value pairs (use the shift-arrow keyboard combination to select the fields), e.g. in excerpt:

imm1

Then click Transform > Unpivot Columns:

imm2

And follow with Home > Close & Load. The outcome should look like this, more or less:

imm3

But that doesn’t conclude the process, even when you’d do much the same as above for the second data set. We need to differentiate the records per their determination – that is, we need to clarify whether or not the stop culminated in an arrest. Thus we need to append a Status field to each set of results, coding non-arrest outcomes Enc (for Encounter, or something like it) and arrest dispositions Arr, and copying the appropriate code down each table.

And once all that work has been carried out we need to merge the data, by simply copying one set immediately beneath the other – but that amalgamation stirs a small problem, too.

Note that when Power Query succeeds in unpivoting the data (and that verb won’t concede that the data it unpivots was never a pivot table to begin with), it refashions them into a table; and so when we perform the copy described above the copied table continues to retain its now-irrelevant header row, threaded among the genuine records. But attempt a Delete > Table Rows of that header and Excel resists; the delete option is grayed out, because the second table – the one we’ve pasted beneath the first – remains a table very much in its own right, its positioning notwithstanding, and a table requires a header row, apparently.

A workaround is at hand, though: click successively anywhere within the two tables and click Table Tools > Design > Convert to Range, and reply Yes to the prompt. Apart from gifting you with the answer to a question that has recurrently rustled your sleep – namely, when would you ever want to downgrade a table back to what Excel judgmentally calls a normal range – we’ve just learned the answer to a companion question: why? Here, converting the tables into ranges enables the user to delete the second, now superfluous header row, and agglomerate the two erstwhile, contiguous tables into a single unitary data set. And there are pivot tables in there for the gleaning. (And once these deeds have been done we could rename the default Attribute field City, and Value could be identified anew as Number.)

And one such table, per our stated intention about 500 words ago, looks to discover the proportion of detainees actually arrested, broken out here by city:

Rows: City

Columns: Status

Values: Number (Sum, % of Row Total). Turn off Grand Totals.

I get:

imm5

We see, for reasons that remain to be expounded, some notable differentials. It appears that Bristol’s cohort of suspects – 1143 in absolute terms – was by far the most likely to experience arrest, more than 120% more vulnerable to arraignment than Sheffield’s 4687. But accountings for those disparities await, of course. It is at least possible, for example, to imagine that Bristol’s immigration authorities proceeded with a greater precision, perhaps being guided by more trusted information. Far larger London exhibits a detention rate of 24.24%, a touch lower than the aggregate 25.30%.

Substituting Nationality for City in the table and restoring Grand Totals yields a dossier of mixed findings, owing in part to the broad variation in the numbers of citizens for the respective countries. Sort by Grand Totals (the sort will work, even though they all display 100%; Excel sorts the values skulking behind the percentages), and the uppermost finding will give extended pause:

British citizens are practically never arrested after being stopped for questioning, returning us to the proper question of profiling, and the kindred fact that the native British are stopped far more often – over 19,000 times, more than any other nationality. On the other hand, we’re left to similarly explain Italy’s low arrest-to-encounter rate, or the fact that Norwegian nationals were stopped 32 times and never arrested.
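
For a single rate of that kind you needn’t even trouble the pivot table. A sketch, assuming you’ve named the relevant columns Number, Nationality, and Status (the names are mine), and that the British rows are actually labelled BRITISH – verify the spelling against the sheet:

=SUMIFS(Number,Nationality,"BRITISH",Status,"Arr")/SUMIFS(Number,Nationality,"BRITISH")

The numerator totals the arrests ascribed to British citizens, the denominator all of their recorded stops, arrests included.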

Indeed, the mighty variation in rates begs a new round of questions; but the spreadsheet develops the first round of answers.

US Police Shooting Data, Part 2: A Civilian’s Review

29 Dec

There’s more tweaking to do inside the Vice data set of big-city police-shooting incidents, though some tweaks are more concessionary than ameliorative. The abundance of NAs flecking the sheet’s fields is not a matter of slipshod data collection; Vice’s preface to the data recalls the issues the site experienced with respondent cooperativeness and data quality. But missing data are missing, and so the fact that shooting victim ages (SubjectAge, in column H) don’t appear for more than 53% of incident cases remains a fact; and while of course the available ages of the other approximately 2000 victims amounts to a sizeable cohort, one wonders if the absent data inflict a sampling skew on the ages. I suspect they don’t, but the question should be recorded.

Pivot-tabling the age data, then, will deliver a mixed yield, e.g. NA along with N/A, U and UNKNOWN, two instances of the label Juvenile, 29 hard-coded #REF! entries that seem to mark a formula gone wrong, and a few hundred records reporting estimated ages in spanned swaths that are no less textual – 0-19 and 20-29, for example. Again, the data sources are uncontrollably heterogeneous, but a decision about these data needs to be made – which might consist of a determination to simply ignore the age parameter, or alternatively of course to proceed with the usable information in hand.
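
If one did want to salvage something numeric from the spanned entries, a helper column could take the midpoint of each range and pass genuine numbers through untouched – a sketch only, assuming SubjectAge sits in H with data from row 2; the midpoint is of course an estimate, and the NAs, Us, and #REF! oddities simply fall out as blanks:

=IFERROR(IF(ISNUMBER(FIND("-",H2)),(VALUE(LEFT(H2,FIND("-",H2)-1))+VALUE(MID(H2,FIND("-",H2)+1,9)))/2,VALUE(H2)),"")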

And missing data likewise compromise the SubjectRace field. Submitting this pivot table for consideration:

Rows: SubjectRace

Values: SubjectRace (Count)

SubjectRace (again, % of Column Total)

Churns out these distributions:

vice1a

The disproportion of Black victims is clear, but of course remains to be explained by a larger context framed by prior variables, e.g. number of overall incidents, number of subjects alleged to have been carrying a weapon, etc., but again note the raft of NAs, Us, and blanks disqualifying more than 30% of the entries from the analysis (the blanks simply aren’t counted above at all). (I’m assuming, by the way, that A signifies Asian and O Oriental. But even if so, I don’t know if those categorizations denote distinct or problematically redundant ethnic niches.)

But in any case, there are things to be learned from the data, their shortcomings notwithstanding. For example, we could look at the relationship of victims’ race to shooting incidents resulting in a fatality. In that connection we could assemble this pivot table:

Rows: SubjectRace

Columns: Fatal

Values: SubjectRace (% of Row Total; filter out (blank) and turn off Grand Totals)

I get:

vice2a

The results are fairly startling. We have already learned that the absolute number of black victims far outpaces the other groups, but the ratio of fatal police shootings by race involving a white victim – 51.27% – is far higher than that for black or Latino victims, a less-than-“intuitive”, literally provocative outcome.

How might these proportions be accounted for? If in fact police may make swifter resort to their weapons in confrontations with black suspects – if – then perhaps white suspects are shot and killed for compelling, truly defensible reasons, i.e., more of them might be armed, or at least seen as such?

In view of that question we could thus introduce a third variable, SubjectArmed. Because three-variable pivot tables complicate readability, we could assign a Slicer to the field instead:

Rows: SubjectRace

Columns: SubjectArmed

Values: SubjectRace (count, % of Row Totals)

Slicer: Fatal

Ticking F in the Slicer for Fatal, I get:

vice 3a

Remember that the column entries above comprise the SubjectArmed items – No, Not Available, Unknown, and Yes.

We see that, so far as the data can take the question, white fatalities are actually more likely than black or Latino victims to have been recorded as unarmed. But because the figures for B incur so many NAs and Unknowns, the conclusions are obscured; indeed, as a consequence of the missing data, white victims also register a higher proportion recorded as armed.

It could also be noted that 153 fatal shootings received an NA race code, and another 80 were assigned a U. One wonders if these could be inferentially mapped against and compared with the known-race victim distributions, though that supposition is a highly adventurous one.

But turn to the OfficerRace and OfficerGender fields and you’ll encounter a new set of challenges, only one of which is the 450 OfficerRace cells containing no officer information at all; another is the not-inconsiderable number of cells bearing the code H. Given the cities with which the code is associated I’d guess H stands for Hispanic, but how or if that letter means to distinguish itself from L I’m not sure, particularly in light of the absence of H data from the SubjectRace field.

OfficerRace and OfficerGender accompany the straightforwardly quantified NumberOfOfficers field, and attempt to code the ethnicities of the incident-participant officers – in a single cell. That is, a cluster of items, each of which would typically populate its own record, here shares one cell’s space.

Moreover, a good many cells replace the elliptical W and B entries with WHITE or BLACK respectively, a set of inconsistencies that have to be dealt with as well. We could dispose of these latter anomalies, however, via a batch of finds and replaces, e.g. finding WHITE and replacing it with W.

But it’s the multi-item cells that call for a deeper think. If, for example, we wanted to break out the officer totals by ethnicity, say white officers for starters, we could select OfficerRace, name it offrace and enter, somewhere:

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,"W","")))}

I get 2329 – but how? Those tell-tale curly braces can only mean one thing: an array formula is at work, and what it’s doing is first summing the aggregate length of the cells committed to offrace, and from that default baseline subtracting a revised aggregate length – achieved by substituting nothing, as it were, for every instance of a W throughout the offrace range (remember that SUBSTITUTE replaces every instance of the searched-for text, not merely the first one), and totalling these lengths. The math, then, yields the 2329, and to empower the process you could do something like this: enter these race identities in Q2:Q7 respectively: W, B, L, A, H, NA. Follow with

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q2,"")))}

in R2, and copy down accordingly. I get

vice 4a

First, of course, the numbers need to be weighted alongside the aggregated ethnic representations of the police forces studied, but we don’t have that information here. But a more proximate operational concern looms above: At least two of the results are wrong.

For one thing, the formula’s search for NA (assuming that search term is in Q7) must read instead

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q7," ")))}

What’s different about the above emendation is the space (" ") supplying the final argument in the expression. Because we need to substitute here for two characters – the N and the A – their replacement needs to comprise one character, in order to foster a differential of one for each original NA. Thus my redone substitution returns 1174, or precisely half the original, errant NA total.

But the other false start is considerably trickier. The count of 1299 for A enormously overstates the incidence of that code, because my formula also registers appearances of NA (and other entries as well) which, after all, contain that selfsame A. The task, then, is to account only for isolated A’s, disassociated from any other entries in the field that also happen to contain the letter – and there are other entries, for example AI/AN, and even one instance of Multi-Racial. The formula as it stands will substitute those A’s too, and they’ll be duly counted.

One far-from optimum end run around the problem would be something like this:

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q5,"")))-R7}

That formula in effect calculates all A appearances and simply subtracts from them the NA total, which holds down cell R7. It yields a far more plausible 125 but remains inexact, owing to the other A-freighted cells that continue to be counted above. What could have been done, then: seeing to it that every coded race/ethnicity consisted of a unique letter or letters.

But I’d allow I’ve done enough deep thinking for one post.

US Police Shooting Data, Part 1: A Civilian’s Review

18 Dec

You’ll want to read the fine print attending Vice.com’s database on US police shootings perpetrated between 2010 and 2016 by the country’s 50 largest police departments before you turn to the data themselves here, exported into Excel from their native Google sheet:

ViceNews_FullOISData

That expository lead-in on the Vice site recounts the provenance of the data, the mixed cooperativeness of departments Vice encountered when requesting incident information, and some of the coding decisions that shaped the workbook as it has emerged. In fact, Vice reports the figures for 47 police departments and for what it tabulates as 4,099 episodes, though the data set itself comprises 4,381 records. That discrepancy may owe to the fact that Vice appears to have assigned one record to each civilian party to an incident, even if multiple civilians were participative (note the NumberofSubjects field in column B records 1 for each row; that unrelieved redundancy suggests the field is dispensable).

Of course, the spreadsheet can’t tell us what truly happened in any of these 4,000-plus episodes, but that should discourage neither Vice nor us from evaluating its contents; as usual, though, those contents need be adjudged for their data-worthiness first.

And a few judgements need to be made about some authentic issues. When I attempted to sort the NumberofSubjects field I was detained by this message:

vice1

In other words, some merged cells have edged their way into the set, congesting the sort prohibitively. The swiftest means for unmerging (lovely Microsoft verb) the culprit cells: click the empty Select all button to the immediate left of the A column header, thus selecting the entire sheet, and click Home ribbon > Unmerge Cells, the final option waiting behind the Merge & Center button in the Alignment group. (If you need to learn the specific addresses of merged cells, look here. The merged-cell adulteration is particularly pronounced among the data for Austin, Texas, by the way, these commencing in row 105).

There’s more. My customary date-validity check of the Date field, i.e. submitting a =COUNT(A2:A4382) formula to the column in search of a total of 4381 – the count of entries in the range that would roll up should all of them prove to be numeric – presented me with 4355 instead, netting, or nettling, a shortfall of 26. And that means that 26 date bluffers – textual flim-flammers hoping to con you and me into believing they’re for real – are up to no good somewhere in there.

If you click in an actual date in A and sort the date data Oldest to Newest, those scruffy 26 settle to the bottom of the field. They’re all phrased in European regional syntax (dd/mm/yyyy) (as they are in the native Google Sheet), and they all emanate from Baltimore County, as it turns out. But origins aside, their numeric character needs to be reclaimed.

Something you can do, then: clamber into Q4357 – the next free cell in the first row bearing the phony dates – and enter:

=DATE(RIGHT(A4357,4),MID(A4357,4,2),LEFT(A4357,2))

DATE is the function that gathers individuated year, month, and day references and compounds them into a usable date. Here I get 41134; format the result in Short Date terms (use the Number format dropdown in the Number button group) and you wind up with a for-real 8/13/2012. Copy down Q, then Copy > Paste Value the results to A4357:A4382, and format accordingly (and now delete the Q data).

That works, but now ride the elevator to A2 and consider the value – and it is a value – that sits in that cell. That 2010 you see is not some stripped-down cipher for a fuller expression, e.g. 6/5/2010. Rather, it’s nothing but, and only, 2010; and if you format those four digits into Short Date terms you get 7/2/1905 – which, after all, is the 2010th day distanced from Excel’s date baseline of January 1, 1900. Without the supplementary day and month data, that 2010 won’t be transposed into anything later in the calendar, and there are 593 cells in the A column – or 13.5% of all of its holdings – comprising four-digit entries that won’t evaluate to anything like the years they purport to portray. And even if you pivot table the dates and hope to group them by year only, that won’t work.
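
You can corroborate that bit of arithmetic from any blank cell:

=TEXT(2010,"m/d/yyyy")

which indeed hands back 7/2/1905 – the serial number 2010 dressed up as a date.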

These four-digit discordances inflict both a spreadsheet and a journalistic/ethical problem of sorts upon the analysis. The former is spoken to above, but the latter asks about a workaround. What can we do with these data? If we adjoin some place-holding day and month – e.g. 1/1 – to the numbers in the interests of lifting them to usability, is that something we’re allowed to do? Perhaps we could, if we indeed want to group the dates by year, and only by year. In that case, since our singular interest is in grouping the years that we in effect already know, any fictional days and months would play a merely technical, facilitating role in reconciling all the year data to the table. But ascribing any substantive role to the days and months, e.g., grouping the data by the latter parameter, would appear to abuse the system.

If that rationale is deemed supportable, we could then remake the 2010 in A2 via

=DATE(A2,1,1)

And do the same for its fellow 592 cells (followed by Copy > Paste Values and reformatting to Short Date). Complete that task and you could engineer this pivot table:

Rows: Date (grouped by Years only)

Values: Date (Count)

 I get:

vice2

The preliminary count points to a marked diminution in police shootings since 2013 – encouraging, perhaps, but a suggestion that needs to be inspected for the completeness and comparability of the data across the 2010-2016 span.

Now if you want to enumerate shooting incidents by city – an aim bulking high on any investigative agenda (factoring in population size, of course) – you could proceed here:

Rows: City

Values: City (Count)

The values could be sorted Largest to Smallest. I get, in excerpt:

vice3

Chicago – a city less than half as populous as New York – records a surpassingly high incident total, in fact accounting for 12% of all the reported shootings among the cities. The city’s spike in gun violence has been no stranger to media coverage, and so resort by the police to that weapon is not shocking in and of itself. Return Date to the pivot table, this time to the Columns area, and

vice4

Note Chicago police have actually pulled back on gun interventions (but note the missing data for Detroit for 2014-16, contributory in part to the lowered recent numbers).

One other point, here: I think the BaltimoreCity/BaltimoreCounty and MiamiDade/City of Miami entries are properly distinguished, and not instances of spelling inconsistencies. But there are other data issues, too.

London Transport Expenses: Overhead of the Underground

6 Dec

London’s straphangers don’t want to be taken for a ride, at least not one of the nasty, metaphorical variety. The city’s underground system costs a touch more than your model train set, and you’re probably not charging the neighbors £3.80 to hop on your Lionel O-Gauge either – even if it’s powered by Bluetooth, no less.

Londoners who want to know what their system is spending on their rough and ready commute, then, can turn to this ledger of Transport for London’s expenses (including payments for buses and other means of getting about) for the fiscal year April 2016-March 2017 (there’s a curious measure of bean-counting ambiguity at work here; the above link announces a restriction of the data to expenses exceeding £500, even as the web page on which it’s situated declares a floor of £250. Yet the data themselves divulge sums falling beneath even the latter stratum). Once you save the native-CSV workbook and its 250,000 records into Excel mode, you’ll be charged 10 MB (and not the 26.5 MB the link declares) for the acquisition – a not unreasonable expense.

And as usual, a public dataset of this size isn’t issue-free, even over and above the standard column auto-fit chore you’ll need to perform (you may want to reel in the width of the Expenditure Account field in C if you want to see all the columns onscreen simultaneously, though). The expenses themselves come sorted largest to smallest, but cells E251874:F251874 shouldn’t be there. The former cell hard-codes a count – that is, a COUNT – of the date entries in E and comes away with 251873, corroborating the genuine date-value (as opposed to textual) standing of all the cells in the field. That’s a very good thing to know, but once that finding is disseminated it needs to be deleted, or else E251874 will be treated as an item in another record, along with the sum of all expenditures in the adjoining cell – and that number needs to go, too. Moreover, the Merchant Account category in G is so occasionally populated with unexplained codes that it descends into purposelessness.

There’s more, but in the interests of making a start we could break out payments by what the workbook calls Entity, presumably the agency doing the paying:

Rows: Entity

Values: Entity (Count)

Amount (£) (Sum, taken to two decimals with commas)

Amount (£) (Average, identical formatting)

I get:

exp1

Note that London Bus Services accounts for nearly a third of all expenses, even as the London Underground writes almost ten times more checks – or cheques, if you’re the one signing them.

But look again, in the table’s lower reaches. TUBELINES LTD seems to have imposed two variant spellings in the Row Labels, and that’s the data equivalent of fare beating. The restitution: select the Entity column and engineer a Find and Replace:

exp2

But substitute Vendor Name for Entity in the pivot table in order to commence a dissection of expenses by supplier instead, and the problem multiplies. Quite apart from the sheer 8,000 traders with whom Transport for London did business last year, a flickering scan of the names turns up what could be much more of the same, e.g.:

exp3

The first entry is surely mistaken, and the last has incurred a superfluous trailing space. Or

exp4

Don’t know about those, or these:

exp5

And that’s only the W’s, proceeding upward. Thus a responsible and sober breakout of vendor income would bid (pun slightly intended) the investigator to comb the entire gallery of names, should that prove practicable. If it isn’t, don’t try this one at home.
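
The trailing-space cases, at least, submit to a quick formulaic scrub. A sketch, assuming for illustration that Vendor Name occupies column B (check its actual position in your copy): park this in a free column, copy down, and then Copy > Paste Values the results back over the originals:

=TRIM(B2)

The misspellings proper, alas, still demand the eyeball.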

More promising, however, are the prospects for a breakout of the expenses by say, month (remember, the data pull across one fiscal year). Just replace Vendor Name with Clearing Date and group by Months and Years:

exp8

Note the month with the largest aggregate outlay – the final one, March, running through the 31st of 2017. Perhaps an accumulation of unspent, budgeted monies at the year’s close spurred the outlays, but only perhaps. Note on the other hand that the number of purchases then is comparatively small – but that very understated figure pumps the average expenditure to a month-leading 33,486.45. It is April, the financial year’s inaugural month, that surpassingly records the most expenses – perhaps because budgetary allotments may have been in place by then and immediate-need purchases were transacted sooner rather than later – but at the same time with the lowest per-expense average. I sense a story angle there, but reasonable people may choose to disagree.

Because the expense amounts have been sorted highest to lowest, you’ve also likely noticed the cluster of negative expenses gathering at the foot of the sort. Presumably these quantify overpayments, or perhaps late fees and/or other penalties. In any event, if you want to sum the negatives two formulaic approaches come to mind. First, I named the expense range F2:F251873 amnt and played this by-the-book strategy:

=SUMIF(amnt,"<"&0)

But I also kicked out this dedicated array formula:

{=SUM(IF(amnt<0,amnt))}

Either way, I get £-44,928,905.53, and that calls for an awfully large petty cash drawer. Call me the iconoclast, but the array formula actually seems to me to make a bit more “sense”, in spite of its resort to the fabled Ctrl-Shift-Enter keystroke triad; it’s that text-based “<”&0 concatenation in SUMIF that gets me. How the function reads numeric value into a text-concatenated operator – something the array formula manages to avoid – is just one of those things, I suppose. But it isn’t “intuitive” to me, whatever that means.
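
A third route sidesteps both the concatenation and the Ctrl-Shift-Enter ritual – a sketch, using the same amnt range:

=SUMPRODUCT((amnt<0)*amnt)

Here the comparison yields an array of 1s and 0s that, multiplied back into the amounts, zeroes out everything at or above £0 before the summing.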

And for one more journey through the data, might there be an association between disbursements and day of the week? You may impugn the question as silly or trifling, but I prefer to view it in blue-sky research terms. If you’re still with me, one (but not the only) route toward an answer is to careen into the next-free column H, name it Weekday or something like it, and enter in H2:

=WEEKDAY(D2)

And copy down (remember that Sunday returns a default value of 1). Next, try

Rows:  Weekday

Values: Weekday (Count)

Weekday (again, this time % of Column Total; note that here too you’ll need to define the field in Count terms first)

I get:

exp6

Those numbers are more odd than silly, or at least notable. Monday accounts for 44% of all expenses issued to vendors, with the following days exhibiting a strangely alternating up-down jitter.

Now those outcomes count the number of expenses administered, not their aggregate value. Drop Amount (£) into Values too and cast it into % of Column Total terms and:

exp7

That’s pretty notable too. The Monday – and Friday – expense counts are by no means proportioned to their monetary outgoings (that 0.00% sum for day 7, or Saturday, is an illusory, two-decimal round-off).

Why the Monday-Friday inversions? Fare question.


Burlington, Vermont Property Sales: Buying into the Data

24 Nov

House hunting? Then why not think about setting up stakes in Burlington, Vermont? Bucolic but sophisticated, a university town festooned with postcard-perfect New England autumnal foliage, lovely this time of year… and the very city in which Ben and Jerry’s Ice Cream was founded.

What more could you want? But the real reason I’m so keen on Burlington – a city I’ve never actually visited, after all – is its spreadsheet-archived history of property transactions, a veritably sociological, 100-plus-year account of the residential flux of its citizenry, available to you on Burlington’s open data site, and here, too:

Burlington Property Sales

All of which begs a first question about those 100-plus years: how complete is the sheet? Once you’ve done the column auto-fit thing, if you sort the data A to Z by column G – Sale_Date, whose holdings presently appear before us in text mode – you’ll observe that the three transactions in the data set lifted to the top of the sort are dated January 1, 1800, and succeeded by another seven sales, each of which was apparently conducted on July 14, 1897. Add the exactly twelve property purchases agreed upon in the 50s and the 21 more ratified in the following decade, and the market is starting to look a little sluggish, isn’t it?

I’m no developer, but I’m a speculator; and I’m speculating that the sales record as we have it must be partial, for whatever reason. I’m also conjecturing that the Sales Prices (column H) amounting to $0 denote a property bestowal of sorts, perhaps from a charitable donor to a governmental entity. In fact, I put that $0 conjecture to the Burlington open data site, and await their reply.

And in light of those sporadic sale representations it might be sensible to confine our analytical looks to the records commencing with 1990: purchases in the 70s number a none-too-profuse 41, and the sales count for the 80s – 137 – appears questionably scant as well, so a first analytic go-round, at least, might sensibly push off at 1990, the inception of the decade in which the data uncork a quantum leap in activity. But needless to say, a reportorial consideration of the property sales would force a closer look into the pre-1990 lacunae.

That’s one issue. For another, inspect the Sale_Date entries in column G. As advertised, each and every one of these appears to be textual, encumbered by the T00:00:00.000Z suffix that entraps the real, immanent dates in their cells. But a fix is available (really more than one is out there): select column G and apply this Find and Replace:

Burl1

Replacing the suffix with nothing at all frees the actual, quantifiable dates within each cell, and even restores the results to my American regional date settings, in spite of their year-first formats.

But you may notice nine recalcitrants to the above workaround – the sale dates preceding 1900. They’ve shed their T00:00:00.000Z appendages but remain steadfastly textual; and that’s because Excel can’t directly process, or quantify, dates falling before January 1, 1900 (for some suggestions for allaying the problem look here). In light of the discussion above, however, that sticking point won’t adhere here, because our intention is to ignore sales predating 1990 anyway. But at the same time leaving those insistent textual entries in the data set will effectively stymie any attempt to group the dates in a pivot table; throw even one text datum into a date mix and you just can’t group it, or at least I can’t. Thus if you click on any date in the Sale_Date field and sort Oldest to Newest, the nine textual charlatans drop to the bottom of the column, whereupon one can thread a blank row immediately above them, and break them off from the remainder of the data set.

The larger question – albeit an academic one here – is to ask about a strategy for contending with say, a few thousand pre-1900 dates among the data, had they been there.  With a number that large, we would want to use them – but then what? The simplest tack, I think: not to run the Find and Replace, then, but rather draw up a new field in the next-available N column, call it something like Year, and enter in N2:

=VALUE(LEFT(G2,4))

and copy down N.

The reason we’d look past the Find and Replace here is that the above formula will return the pre-1900 postings as simple numbers, e.g. 1800; and 1800 doesn’t represent the year 1800. It stands in fact for December 4, 1904, the 1800th day of the sequence commencing with the date baseline January 1, 1900 (and we’ll ignore Excel’s February 29, 1900 willful error taken up in the article linked above). In the interests of consistency, we’d want to regard all the data in G one way or the other: either as bona fide dates or unvarnished numbers – and if we found ourselves working with pre-1900 dates we’d have to resort to the latter plan.

But again, because in our case the pre-1900 data are so negligible we’re ignoring them, and reverting to the actual date-making Find and Replace scheme. All of which at last enables the answer to an obvious research question:  how many Burlington properties have been purchased by year?

Start with this pivot table:

Rows: Sale_Date (grouped by Years)

Values: Sale_Date (Count). Then click the Filter button by Row Labels, tick Date Filters, and enter

Burl2

I get:

Burl3

Note that the above Grand Total and its record complement date only from 1990, but nevertheless encompass 28790 of the 29032 entries packing the entire data set (note in addition that the 2017 data take the sales through September 15). The activity curve trends this way and that, with the numbers peaking in 2003 and scaling down erratically in the intervening 14 years (though all this of course assumes the data are complete for the respective years in hand, an unsubstantiated but not implausible guess).

We next might want to pursue a natural follow-up: an association of average purchase prices by year. But here too some forethought is needed – because by admitting the raft of $0 “purchases” into the calculations the averages would skew badly. We could thus apply a simple fix: take over the next free column (probably N; remember that the earlier data settlement in that column was hypothetical), call it Nonzero, and enter in N2:

=IF(H2>0,H2,"")

H stores the Sales_Price data, and the "", or length-less text double-quotes, will be ignored in any average computation. Once we expand the pivot table data source to N we can garnish the existing table with Nonzero (Average, figured to two decimals). I get:

Burl4

Unlike actual sales, the average prices proceed more or less upwards, an unsurprising ascent syncopated by the spike in 2012. Filter that year in the data set and you’ll find four identically described and dated purchases of $6,485,000, conducted on December 18 of that year by the Bobbin Mill Building Company and sold to the Burlington Housing Authority. A sweeping property acquisition by the powers that be, or a data-entry redundancy? I don’t know. And I haven’t come across any Land_Use and Deed_Type legends that would decode the entries in those fields, and that could inspire some additional breakouts of the data.
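
One quick way to flag that kind of suspected redundancy: a helper column that counts how many rows share a record’s sale date and price – a sketch, leaning on the Sale_Date data in G and the prices in H, with the data again assumed to start in row 2:

=COUNTIFS(G:G,G2,H:H,H2)

Copy it down and anything returning more than 1 merits a second look, the four $6,485,000 entries among them.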

C’mon, Burlington, help me out here; give me those answers and there’s a Double Caramel Brownie Ben and Jerry’s cup in it for you. With an extra scoop – my treat.