UK Immigration Raids: Migrating the Data

16 Jan

Designing a dataset falls on the complexity band somewhere between child’s play and rocket science, though I’m not sure to which of the poles the enterprise is closer. Rows and columns, you say; what else do you need, after all? And your rhetorical smirk is right, so far as it goes.

But sometimes the rows and columns require a think-through, preferably somewhere in that design stage – and particularly, for example, if the data are farmed out to more than a single sheet for starters.

Case in point: the look, or pair of looks, at what the Bureau of Investigative Journalism calls Immigration encounters and arrests, a breakdown by city and nationality of individuals either questioned or arrested by law enforcement officials between January 2012 and January 2017, presumably for a putative trespass of immigration law. The Bureau makes much of the large numbers of British citizens bunched with the foreign detainees, stoking the suggestion that these nationals, by definition entitled to be where they were when they were apprehended, were ethnically profiled – an inference the UK government discounts. The BIJ data are gathered here:

FOI 42354 – Annex A

(Observe the notes stationed beneath the data. If I understand the qualifications of note 2 on the ENCOUNTERED NOT ARRESTED sheet, the “total number of persons” seems to provide for the possible counting of the same person multiple times, should he/she be questioned more than once. Thus, the data appear to tabulate all encounters, not discrete individuals. And note 2 on ENCOUNTERED ARRESTED apparently makes analytical room for multiple arrests of the same person.)

And those data comprise two identically organized worksheets; and so, it seems to me, the pair should be pared. The likenesses between the two argue for a consolidation, but with a particular end in mind: to calculate the ratios marking the encountered and arrested figures, both by city of apprehension and origin of detainee (note: the misspellings of Cardiff and Sheffield in the ENCOUNTERED ARRESTED sheet must be emended first).

But before the consolidation comes the reconstitution. It’s something we’ve seen many times before: because the worksheets assign field status to the several cities in the data set, these really need to be reined in under the yoke of a single field; and to achieve that end we can review and apply Excel’s Power Query capability to the task, more particularly its doubtfully named Unpivot feature.

First, and this script speaks to both data sets, delete the Grand Total column and row, the better to stem a double-count of the numbers. Then click anywhere in the data set and click Data > From Table/Range in the Get & Transform Data button group (remember I’m working in Excel 2016).

Next, select the fields to be incorporated into what Excel terms the attribute-value pairs (use the Shift-arrow keyboard combination to select the fields), e.g. in excerpt:

imm1

Then click Transform > Unpivot Columns:

imm2

And follow with Home > Close & Load. The outcome should look like this, more or less:

imm3

But that doesn’t conclude the process, even once you’ve done much the same as above for the second data set. We need to differentiate the records per their determination – that is, we need to clarify whether or not the stop culminated in an arrest. Thus we need to append a Status field to the results, coding and copying non-arrest outcomes Enc (Encounter, or something like it), and arrest dispositions, e.g. Arr, down the tables that require this or that status.

And once all that work has been carried out we need to merge the data, by simply copying one set immediately beneath the other – but that amalgamation stirs a small problem, too.

Note that when Power Query succeeds in unpivoting the data (and that verb won’t concede that the data it unpivots were never a pivot table to begin with), it refashions them into a table; and so when we perform the copy described above the copied table retains its now-irrelevant header row, now threaded among the genuine records. But attempt a Delete > Table Rows on that header and Excel resists; the delete option is grayed out, because the second table – the one we’ve pasted beneath the first – remains a table very much in its own right, its positioning notwithstanding, and a table requires a header row, apparently.

A workaround is at hand, though: click successively anywhere within the two tables and click Table Tools > Design > Convert to Range, and reply Yes to the prompt. Apart from gifting you with the answer to a question that has recurrently rustled your sleep – namely, when would you ever want to downgrade a table back to what Excel judgmentally calls a normal range – we’ve just learned the answer to a companion question: why? Here, converting the tables into ranges enables the user to delete the second, now-superfluous header row, and agglomerate the two erstwhile, contiguous tables into a single, unitary data set. And there are pivot tables in there for the gleaning. (And once these deeds have been done we could rename the default Attribute field City, and identify Value anew as Number.)

And one such table, per our stated intention about 500 words ago, looks to discover the proportion of detainees actually arrested, broken out here by city:

Rows: City

Columns: Status

Values: Number (Sum, % of Row Total; turn off Grand Totals)

I get:

imm5

We see, for reasons that remain to be expounded, some notable differentials. It appears that Bristol’s cohort of suspects – 1143 in absolute terms – was by far the most likely to experience arrest, more than 120% more vulnerable to arraignment than Sheffield’s 4687. But accountings for those disparities await, of course. It is at least possible, for example, to imagine that Bristol’s immigration authorities proceeded with a greater precision, perhaps being guided by more trusted information. Far larger London exhibits an arrest rate of 24.24%, a touch lower than the aggregate 25.30%.
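(If you want to spot-check any one of those rates outside the pivot table – and assuming, my convenience rather than anything the data require, that you’ve named the consolidated columns City, Status and Number as ranges – a pair of SUMIFS will oblige:

=SUMIFS(Number,City,"London",Status,"Arr")/SUMIFS(Number,City,"London")

That quotient should come within rounding distance of the 24.24% reported above, provided the city label matches the sheet’s spelling.)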

Substituting Nationality for City in the table and restoring Grand Totals yields a dossier of mixed findings, owing in part to the broad variation in the numbers of citizens for the respective countries. Sort by Grand Totals (the sort will work, even though they all display 100%; Excel sorts the values skulking behind the percentages), and the uppermost finding will give extended pause:

British citizens are practically never arrested after being stopped for questioning, returning us to the proper question of profiling, and the kindred fact that native Britons are stopped far more often – over 19,000 times, more than any other nationality. On the other hand, we’re left to similarly explain Italy’s low arrest-to-encounter rate, or the fact that Norwegian nationals were stopped 32 times and never arrested.

Indeed, the mighty variation in rates begs a new round of questions; but the spreadsheet develops the first round of answers.

US Police Shooting Data, Part 2: A Civilian’s Review

29 Dec

There’s more tweaking to do inside the Vice data set of big-city police-shooting incidents, though some tweaks are more concessionary than ameliorative. The abundance of NAs flecking the sheet’s fields is not a matter of slipshod data collection; Vice’s preface to the data recalls the issues the site experienced with respondent cooperativeness and data quality. But missing data are missing, and so the fact that shooting victim ages (SubjectAge, in column H) don’t appear for more than 53% of incident cases remains a fact; and while of course the available ages of the other approximately 2000 victims amount to a sizeable cohort, one wonders if the absent data inflict a sampling skew on the ages. I suspect they don’t, but the question should be recorded.

Pivot-tabling the age data, then, will deliver a mixed yield, e.g. NA along with N/A, U and UNKNOWN, two instances of the label Juvenile, 29 hard-coded #REF! entries that seem to mark a formula gone wrong, and a few hundred records reporting estimated ages in spanned swaths that are no less textual – 0-19 and 20-29, for example. Again, the data sources are uncontrollably heterogeneous, but a decision about these data needs to be made – which might consist of a determination to simply ignore the age parameter, or alternatively of course to proceed with the usable information in hand.
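(One quick way to size the usable portion, for what it’s worth – and assuming the genuine ages are stored as numbers and that the data run through row 4382, as they do elsewhere in the workbook – is to compare

=COUNT(H2:H4382)

which tallies only the numeric ages, with =COUNTA(H2:H4382), which counts every non-blank entry, NAs, spans, and #REF! debris included; the gap between the two sizes the problem.)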

And missing data likewise compromise the SubjectRace field. Submitting this pivot table for consideration:

Rows: SubjectRace

Values: SubjectRace

SubjectRace (again, % of Column Total)

Churns out these distributions:

vice1a

The disproportion of Black victims is clear, but of course remains to be explained by a larger context framed by prior variables, e.g. number of overall incidents, number of suspects alleged to have been carrying a weapon, etc., but again note the raft of NAs, Us, and blanks disqualifying more than 30% of the entries from the analysis (the blanks simply aren’t counted above at all). (I’m assuming, by the way, that A signifies Asian and O Oriental. But even if so, I don’t know if those categorizations denote distinct or problematically redundant ethnic niches.)

But in any case, there are things to be learned from the data, their shortcomings notwithstanding. For example, we could look at the relationship of victims’ race to shooting incidents resulting in a fatality. In that connection we could assemble this pivot table:

Rows: SubjectRace

Columns: Fatal

Values: SubjectRace (% of Row Total; filter out (blank) and turn off Grand Totals)

I get:

vice2a

The results are fairly startling. We have already learned that the absolute number of black victims far outpaces the other groups, but the proportion of police shootings proving fatal when the victim is white – 51.27% – is far higher than the corresponding figures for black or Latino victims, a less-than-“intuitive”, literally provocative outcome.

How might these proportions be accounted for? If in fact police may make swifter resort to their weapons in confrontations with black suspects – if – then perhaps white suspects are shot and killed for compelling, truly defensible reasons, i.e., more of them might be armed, or at least seen as such?

In view of that question we could thus introduce a third variable, SubjectArmed. Because three-variable pivot tables complicate readability, we could assign a Slicer to the field instead:

Rows: SubjectRace

Columns: SubjectArmed

Values: SubjectRace (count, % of Row Totals)

Slicer: Fatal

Ticking F in the Slicer for Fatal, I get:

vice 3a

Remember that the column entries above comprise the SubjectArmed items – No, Not Available, Unknown, and Yes.

We see that, so far as the data can take the question, white fatalities are actually less likely to be armed than black or Latino victims. But because the figures for B incur so many NAs and Unknowns, the conclusion is obscured; indeed, depending on how all that missing data resolves, white victims could yet turn out to be the more likely to be armed.

It could also be noted that 153 fatal shootings received an NA race code, and another 80 were assigned a U. One wonders if these could be inferentially mapped against and compared with the known-race victim distributions, though that supposition is a highly adventurous one.

But turn to the OfficerRace and OfficerGender fields and you’ll encounter a new set of challenges, not the least of which are the 450 OfficerRace cells containing no officer information at all, and the not-inconsiderable number of other cells bearing the code H. Given the cities with which the code is associated I’d guess H stands for Hispanic, but how or if that letter means to distinguish itself from L I’m not sure, particularly in light of the absence of H data from the SubjectRace field.

OfficerRace and OfficerGender accompany the straightforwardly quantified NumberOfOfficers field, and attempt to code the ethnicities of incident-participant officers – in a single cell.  That is, a cluster of items, each of which would typically populate its own record, here share one cell’s space.

Moreover, a good many cells replace the elliptical W and B entries with WHITE or BLACK respectively, a set of inconsistencies that have to be dealt with as well. We could dispose of these latter anomalies, however, via a batch of finds and replaces, e.g., substituting a W for each WHITE.

But it’s the multi-item cells that call for a deeper think. If, for example, we wanted to break out the officer totals by ethnicity, say white officers for starters, we could select OfficerRace, name it offrace and enter, somewhere:

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,"W","")))}

I get 2329 – but how? Those tell-tale curly braces can only mean one thing: an array formula is at work, and what it’s doing first is summing the aggregate length of the cells committed to offrace, and from that default baseline subtracting a revised aggregate length – achieved by substituting nothing, as it were, for every instance of a W throughout the offrace range (remember that SUBSTITUTE replaces every instance of the searched-for text, not merely the first one), and totalling those shortened lengths. (If a cell held, say, W;B;W – whatever the delimiter – its length of 5 would fall to 3 once the two W’s were vacated, a differential of exactly one per W.) The math, then, yields the 2329, and to empower the process you could do something like this: enter these race identities in Q2:Q7 respectively: W,B,L,A,H,NA. Follow with

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q2,"")))}

in R2, and copy down accordingly. I get

vice 4a

First, of course, the numbers need to be weighted alongside the aggregated ethnic representations of the police forces studied, but we don’t have that information here. But a more proximate operational concern looms above: At least two of the results are wrong.

For one thing, the formula’s search for NA (assuming that search term is in Q7) must read instead

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q7," ")))}

What’s different about the above emendation is the space " " supplying the final argument in the expression. Because we’re substituting here for two characters – the N and the A – their replacement needs to comprise one character, so that each original NA fosters a differential of exactly one. Thus my redone substitution returns 1174, or precisely half the original, errant NA total.

But the other false start is considerably trickier. The count of 1299 for A enormously overstates the incidence of that code, because my formula also registers appearances of NA (and other entries as well) which, after all, contain that selfsame A. The task, then, is to account only for isolated A’s, disassociated from any other entries in the field that also happen to contain the letter – and there are other entries, for example AI/AN, and even one instance of Multi-Racial. The formula as it stands will substitute those A’s too, and they will be duly counted.

One far-from-optimum end run around the problem would be something like this:

{=SUM(LEN(offrace)-LEN(SUBSTITUTE(offrace,Q5,"")))-R7}

That formula in effect calculates all A appearances and simply subtracts from them the NA total, which sits in cell R7. It yields a far more plausible 125 but remains inexact, owing to the other A-freighted cells that continue to be counted above. What could have been done, then: seeing to it that every coded race/ethnicity consisted of a unique letter or letters.
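(One modest cross-check is available in the meantime: =COUNTIF(offrace,"A") counts only the cells consisting of the letter A and nothing else – single-officer incidents, that is – and so supplies a floor beneath the true A total.)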

But I’d allow I’ve done enough deep thinking for one post.

US Police Shooting Data, Part 1: A Civilian’s Review

18 Dec

You’ll want to read the fine print attending Vice.com’s database on US police shootings perpetrated between 2010 and 2016 by the country’s 50 largest police departments before you turn to the data themselves here, exported into Excel from their native Google sheet:

ViceNews_FullOISData

That expository lead-in on the Vice site recounts the provenance of the data, the mixed cooperativeness of departments Vice encountered when requesting incident information, and some of the coding decisions that shaped the workbook as it has emerged. In fact, Vice reports the figures for 47 police departments and for what it tabulates as 4,099 episodes, though the data set itself comprises 4,381 records. That discrepancy may owe to the fact that Vice appears to have assigned one record to each civilian party to an incident, even when multiple civilians were involved (note that the NumberofSubjects field in column B records 1 for each row; that unrelieved redundancy suggests the field is dispensable).

Of course, the spreadsheet can’t tell us what truly happened in any of these 4,000-plus episodes, but that should discourage neither Vice nor us from evaluating its contents; but as usual, those contents need be adjudged for their data-worthiness first.

And a few judgements need to be made about some authentic issues. When I attempted to sort the NumberofSubjects field I was detained by this message:

vice1

In other words, some merged cells have edged their way into the set, congesting the sort prohibitively. The swiftest means for unmerging (lovely Microsoft verb) the culprit cells: click the empty Select all button to the immediate left of the A column header, thus selecting the entire sheet, and click Home ribbon > Unmerge Cells, the final option waiting behind the Merge & Center button in the Alignment group. (If you need to learn the specific addresses of merged cells, look here. The merged-cell adulteration is particularly pronounced among the data for Austin, Texas, by the way, these commencing in row 105).

There’s more. My customary date-validity check of the Date field, i.e. submitting a =COUNT(A2:A4382) formula to the column in search of a total of 4381 – the count of entries in the range that would roll up should all of them prove to be numeric – presented me with 4355 instead, netting, or nettling, a shortfall of 26. And that means that 26 date bluffers – textual flim-flammers hoping to con you and me into believing they’re for real – are up to no good somewhere in there.

If you click in an actual date in A and sort the date data Oldest to Newest, those scruffy 26 settle to the bottom of the field. They’re all phrased in European regional syntax (dd/mm/yyyy) (as they are in the native Google Sheet), and they all emanate from Baltimore County, as it turns out. But origins aside, their numeric character needs to be reclaimed.

Something you can do, then: clamber into Q4357 – the next free cell in the first row bearing the phony dates – and enter:

=DATE(RIGHT(A4357,4),MID(A4357,4,2),LEFT(A4357,2))

DATE is the function that gathers individuated year, month, and day references and compounds them into a usable date. Here I get 41134; format the result in Short Date terms (use the Number format dropdown in the Number button group) and you wind up with a for-real 8/13/2012. Copy down Q, then Copy > Paste Value the results to A4357:A4382, and format accordingly (and now delete the Q data).

That works, but now ride the elevator to A2 and consider the value – and it is a value – that sits in that cell. That 2010 you see is not some stripped-down cipher for a fuller expression, e.g. 6/5/2010. Rather, it’s nothing but, and only 2010; and if you format those four digits into Short Date terms you get 7/2/1905 – which after all, is the 2010th day distanced from Excel’s date baseline of January 1, 1900. Without the supplementary day and month data, that 2010 won’t be transposed into anything later in the calendar, and there are 593 cells in the A column – or 13.5% of all of its holdings – comprising four-digit entries that won’t evaluate to anything like the years they purport to portray.  And even if you pivot table the dates and hope to pivot them by year only, that won’t work.
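(If you want to corroborate that 593, one blunt check – resting on the assumption that no legitimate date in the column predates 1911 or so, an assumption these 2010-2016 data comfortably satisfy – is =COUNTIF(A2:A4382,"<4000"), which counts the values small enough to be bare years rather than date serials.)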

These four-digit discordances inflict both a spreadsheet and a journalistic/ethical problem of sorts upon the analysis. The former is spoken to above, but the latter asks about a workaround. What can we do with these data? If we adjoin a place-holding day and month – e.g. 1/1 – to the numbers in the interests of lifting them to usability, is that something we’re allowed to do? Perhaps we could, if we indeed want to group the dates by year, and only by year. In that case, since our singular interest is in grouping the years that we in effect already know, any fictional days and months would play a merely technical, facilitating role in reconciling all the year data to the table. But ascribing any substantive role to the days and months, e.g., grouping the data by the latter parameter, would appear to abuse the system.

If that rationale is deemed supportable, we could then remake the 2010 in A2 via

=DATE(A2,1,1)

And do the same for its fellow 592 cells (followed by Copy > Paste Values and reformatting to Short Date). Complete that task and you could engineer this pivot table:

Rows: Date (grouped by Years only)

Values: Date (Count)

 I get:

vice2

The preliminary count points to a marked diminution in police shootings since 2013 – encouraging, perhaps, but a suggestion that needs to be inspected for the completeness and comparability of the data across the 2010-2016 span.

Now if you want to enumerate shooting incidents by city – an aim bulking high on any investigative agenda (factoring in population size, of course) – you could proceed here:

Rows: City

Values: City (Count)

The values could be sorted Largest to Smallest. I get, in excerpt:

vice3

Chicago – a city less than half as populous as New York – records a surpassingly high incident total, in fact accounting for 12% of all the reported shootings among the cities. The city’s spike in gun violence has been no stranger to media coverage, and so resort by the police to that weapon is not shocking in and of itself. Return Date to the pivot table, this time to the Columns area, and

vice4

Note Chicago police have actually pulled back on gun interventions (but note the missing data for Detroit for 2014-16, contributory in part to the lowered recent numbers).

One other point, here: I think the BaltimoreCity/BaltimoreCounty and MiamiDade/City of Miami entries are properly distinguished, and not instances of spelling inconsistencies. But there are other data issues, too.

London Transport Expenses: Overhead of the Underground

6 Dec

London’s straphangers don’t want to be taken for a ride, at least not one of the nasty, metaphorical variety. The city’s underground system costs a touch more than your model train set, and you’re probably not charging the neighbors £3.80 to hop on your Lionel O-Gauge either – even if it’s powered by Bluetooth, no less.

Londoners who want to know what their system is spending on their rough and ready commute, then, can turn to this ledger of Transport for London’s expenses (including payments for buses and other means of getting about) for the fiscal year April 2016-March 2017 (there’s a curious measure of bean-counting ambiguity at work here; the above link announces a restriction of the data to expenses exceeding £500, even as the web page on which it’s situated declares a floor of £250. Yet the data themselves divulge sums falling beneath even the latter stratum). Once you save the native-CSV workbook and its 250,000 records into Excel mode, you’ll be charged 10 MB (and not the 26.5 MB the link declares) for the acquisition – a not unreasonable expense.

And as usual, a public dataset of this size isn’t issue-free, even over and above the standard column auto-fit chore you’ll need to perform (you may want to reel in the width of the Expenditure Account field in C if you want to see all the columns onscreen simultaneously, though). The expenses themselves come sorted largest to smallest, but cells E251874:F251874 shouldn’t be there. The former cell hard-codes a count – that is, a COUNT – of the date entries in E and comes away with 251873, corroborating the genuine date-value (as opposed to textual) standing of all the cells in the field. That’s a very good thing to know, but once that finding is disseminated it needs to be deleted, or else E251874 will be treated as an item in another record, along with the sum of all expenditures in the adjoining cell – and that number needs to go, too. Moreover, the Merchant Account category in G is so occasionally populated with unexplained codes that it descends into purposelessness.

There’s more, but in the interests of making a start we could break out payments by what the workbook calls Entity, presumably the agency doing the paying:

Rows: Entity

Values: Entity (Count)

Amount (£) (Sum, taken to two decimals with commas)

Amount (£) (Average, identical formatting)

I get:

exp1

Note that London Bus Services accounts for nearly a third of all expenses, even as the London Underground writes almost ten times more checks – or cheques, if you’re the one signing them.

But look again, in the table’s lower reaches. TUBELINES LTD seems to have imposed two variant spellings in the Row Labels, and that’s the data equivalent of fare beating. The restitution: select the Entity column and engineer a Find and Replace:

exp2

But substitute Vendor Name for Entity in the pivot table in order to commence a dissection of expenses by supplier instead, and the problem multiplies. Apart from the 8,000 traders with whom Transport for London did business last year, a flickering scan of the names turns up what could be much more of the same, e.g.:

exp3

The first entry is surely mistaken, and the last has incurred a superfluous trailing space. Or

exp4

Don’t know about those, or these:

exp5

And that’s only the W’s, proceeding upward. Thus a responsible and sober breakout of vendor income would bid (pun slightly intended) the investigator to comb the entire gallery of names, should that prove practicable. If it isn’t, don’t try this one at home.

More promising, however, are the prospects for a breakout of the expenses by say, month (remember, the data pull across one fiscal year). Just replace Vendor Name with Clearing Date and group by Months and Years:

exp8

Note the standout month – the final one, March, running through the 31st of 2017. Perhaps an accumulation of unspent, budgeted monies at the year’s close spurred the outlays, but only perhaps. Note on the other hand that the number of purchases then is comparatively small – but that very understated figure pumps the average expenditure to a month-leading 33,486.45. It is April, the financial year’s inaugural month, that records the greatest number of expenses – perhaps because budgetary allotments may have been in place by then and immediate-need purchases were transacted sooner rather than later – but at the same time with the lowest per-expense average. I sense a story angle there, but reasonable people may choose to disagree.

Because the expense amounts have been sorted highest to lowest, you’ve also likely noticed the cluster of negative expenses gathering at the foot of the sort. Presumably these quantify overpayments, or perhaps late fees and/or other penalties. In any event, if you want to sum the negatives two formulaic approaches come to mind. First, I named the expense range F2:F251873 amnt and played this by-the-book strategy:

=SUMIF(amnt,"<"&0)

But I also kicked out this dedicated array formula:

{=SUM(IF(amnt<0,amnt))}

Either way, I get £-44,928,905.53, and that calls for an awfully large petty cash drawer. Call me the iconoclast, but the array formula actually seems to me to make a bit more “sense”, in spite of its resort to the fabled Ctrl-Shift-Enter keystroke triad; it’s that text-based "<"&0 concatenation in SUMIF that gets me. How the function reads numeric value into a text-concatenated operator – something the array formula manages to avoid – is just one of those things, I suppose. But it isn’t “intuitive” to me, whatever that means.
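For what it’s worth, a third route – one that dodges both the concatenation and the Ctrl-Shift-Enter ritual – is SUMPRODUCT:

=SUMPRODUCT((amnt<0)*amnt)

The comparison manufactures TRUEs and FALSEs, the multiplication coerces them into 1’s and 0’s against the amounts, and the sum should land on the same £-44,928,905.53, no curly braces required.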

And for one more journey through the data, might there be an association between disbursements and day of the week? You may impugn the question as silly or trifling, but I prefer to view it in blue-sky research terms. If you’re still with me, one (but not the only) route toward an answer is to careen into the next-free column H, name it Weekday or something like it, and enter in H2:

=WEEKDAY(D2)

And copy down (remember that Sunday returns a default value of 1). Next, try

Rows:  Weekday

Values: Weekday (Count)

Weekday (again, this time % of Column Total; note that here too you’ll need to define the field in Count terms first)

I get:

exp6

Those numbers are more odd than silly, or at least notable. Monday accounts for 44% of all expenses issued to vendors, with the following days exhibiting a strangely alternating up-down jitter.

Now those outcomes count the number of expenses administered, not their aggregate value. Drop Amount (£) into Values too and cast it into % of Column Total terms and:

exp7

That’s pretty notable too. The Monday – and Friday – expense counts are by no means proportioned to their monetary outgoings (that 0.00% sum for day 7, or Saturday, is an illusory, two-decimal round-off).

Why the Monday-Friday inversions? Fare question.

Burlington, Vermont Property Sales: Buying into the Data

24 Nov

House hunting? Then why not think about setting up stakes in Burlington, Vermont? Bucolic but sophisticated, a university town festooned with postcard-perfect New England autumnal foliage, lovely this time of year… and the very city in which Ben and Jerry’s Ice Cream was founded.

What more could you want? But the real reason I’m so keen on Burlington – a city I’ve never actually visited, after all – is its spreadsheet-archived history of property transactions, a veritably sociological, 100-plus-year account of the residential flux of its citizenry, available to you on Burlington’s open data site, and here, too:

Burlington Property Sales

All of which begs a first question about those 100-plus years: how complete is the sheet? Once you’ve done the column auto-fit thing, if you sort the data A to Z by column G – Sale_Date, whose holdings presently appear before us in text mode – you’ll observe that the three transactions in the data set lifted to the top of the sort are dated January 1, 1800, and succeeded by another seven sales, each of which was apparently conducted on July 14, 1897. Add the exactly twelve property purchases agreed upon in the 50s and the 21 more ratified in the following decade, and the market is starting to look a little sluggish, isn’t it?

I’m no developer, but I’m a speculator; and I’m speculating that the sales record as we have it must be partial, for whatever reason. I’m also conjecturing that the Sales Prices (column H) amounting to $0 denote a property bestowal of sorts, perhaps from a charitable donor to a governmental entity. In fact, I put that $0 conjecture to the Burlington open data site, and await their reply.

And in light of those sporadic sale representations it might be sensible to confine our analytical looks to the records commencing with 1990: purchases in the 70s number a none-too-profuse 41, the sales count for the 80s – 137 – appears questionably scant as well, and 1990 inaugurates the decade in which the data uncork a quantum leap in activity. But needless to say, a reportorial consideration of the property sales would force a closer look into the pre-1990 lacunae.

That’s one issue. For another, inspect the Sale_Date entries in column G. As advertised, each and every one of these appears to be textual, encumbered by the T00:00:00.000Z suffix that entraps the real, immanent dates in their cells. But a fix is available (really more than one is out there): select column G and apply this Find and Replace:

Burl1

Replacing the suffix with nothing at all frees the actual, quantifiable dates within each cell, and even restores the results to my American regional date settings, in spite of their year-first formats.
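(Had we preferred to leave the Sale_Date text untouched, a formula-based route to the same end – assuming the entries keep their year-first, yyyy-mm-dd lead, as they appear to – would be to park this in a spare column and copy down:

=DATE(LEFT(G2,4),MID(G2,6,2),MID(G2,9,2))

Note, though, that DATE treats any year below 1900 as an offset from 1900 – 1800 becomes 3700 – so it offers no relief for the pre-1900 stragglers taken up next.)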

But you may notice nine recalcitrants to the above workaround – the sale dates preceding 1900. They’ve shed their T00:00:00.000Z appendages but remain steadfastly textual; and that’s because Excel can’t directly process, or quantify, dates falling before January 1, 1900 (for some suggestions for allaying the problem look here). In light of the discussion above, however, that sticking point won’t adhere here, because our intention is to ignore sales predating 1990 anyway. But at the same time leaving those insistent textual entries in the data set will effectively stymie any attempt to group the dates in a pivot table; throw even one text datum into a date mix and you just can’t group it, or at least I can’t. Thus if you click on any date in the Sale_Date field and sort Oldest to Newest, the nine textual charlatans drop to the bottom of the column, whereupon one can thread a blank row immediately above them, and break them off from the remainder of the data set.

The larger question – albeit an academic one here – is to ask about a strategy for contending with say, a few thousand pre-1900 dates among the data, had they been there.  With a number that large, we would want to use them – but then what? The simplest tack, I think: not to run the Find and Replace, then, but rather draw up a new field in the next-available N column, call it something like Year, and enter in N2:

=VALUE(LEFT(G2,4))

and copy down N.

The reason we’d look past the Find and Replace here is because the above formula will return the pre-1900 postings as simple numbers, e.g. 1800; and 1800 doesn’t represent the year 1800. It stands in fact for December 4 1904, the 1800th day of the sequence commencing with the date baseline January 1, 1900 (and we’ll ignore Excel’s February 29, 1900 willful error taken up in the article linked above). In the interests of consistency, we’d want to regard all the data in G one way or the other: either as bona fide dates or unvarnished numbers – and if we found ourselves working with pre-1900 dates we’d have to resort to the latter plan.

But again, because in our case the pre-1900 data are so negligible we’re ignoring them, and reverting to the actual date-making Find and Replace scheme. All of which at last enables the answer to an obvious research question:  how many Burlington properties have been purchased by year?

Start with this pivot table:

Rows: Sale_Date

Values: Sale_Date (Count, grouped by Years). Click the Filter button by Row Labels, tick Date Filters, and enter

Burl2

I get:

Burl3

Note that the above Grand Total and its record complement date only from 1990, but nevertheless encompass 28790 of the 29032 entries packing the entire data set (note in addition that the 2017 data take the sales through September 15). The activity curve trends this way and that, with the numbers peaking in 2003 and scaling down erratically in the intervening 14 years (though all this of course assumes the data are complete for the respective years in hand, an unsubstantiated but not implausible guess).

We next might want to pursue a natural follow-up: an association of average purchase prices by year. But here too some forethought is needed – because by admitting the raft of $0 “purchases” into the calculations the averages would skew badly. We could thus apply a simple fix: take over the next free column (probably N; remember that the earlier data settlement in that column was hypothetical), call it Nonzero, and enter in N2:

=IF(H2>0,H2,"")

H stores the Sales_Price data, and the "", or length-less text double-quotes, will be ignored in any average computation. Once we expand the pivot table data source to N we can garnish the existing table with Nonzero, averaged and figured to two decimal places. I get:

Burl4

Unlike actual sales, the average prices proceed more or less upwards, an unsurprising ascent syncopated by the spike in 2012. Filter that year in the data set and you’ll find four identically described and dated purchases of $6,485,000, conducted on December 18 of that year by the Bobbin Mill Building Company and sold to the Burlington Housing Authority. A sweeping property acquisition by the powers that be, or a data-entry redundancy? I don’t know. And I haven’t come across any Land_Use and Deed_Type legends that would decode the entries in those fields, and that could inspire some additional breakouts of the data.
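(A crude whole-column cross-check of the zero-exclusion logic, by the way: =AVERAGEIF(H:H,">0") averages only the positive sale prices – all years, not just the post-1990 contingent – and so should land in the same general neighborhood as the pivot table’s figures.)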

C’mon, Burlington, help me out here; give me those answers and there’s a Double Caramel Brownie Ben and Jerry’s cup in it for you. With an extra scoop – my treat.

World Series Winners and Losers: Playing the Percentages

13 Nov

Don’t be fooled by last week’s denouement of the World Series; the baseball season never really ends. The looks at the data from what was once boomed as the National Pastime just don’t stop, including some looks at the World Series itself, e.g. a survey put together by Joel Pozin of the regular-season winning percentages of Series participants dating back to the first series in 1903. It’s here:

Regular Season Winning Percentages of World Series Champions vs. Losers_ 1903-2017 Joel Pozin

The survey in fact contributes one of three small, public-domain Series’ datasets Pozin makes available on the collaboration-fostering data.world site (you’ll need to sign up for free for access to the other two workbooks; note that the percentage data for 1904 and 1994 aren’t there, because the World Series weren’t contested those years. In addition, I’ve appended win-percentage data for the 2017 season to the sheet here.)

The other two workbooks recount the Series winner and loser percentages in their discrete sheets, but they do something else as well: they bare the formulas that return the team winning percentages, formulas that do a slightly different math from that performed by Major League Baseball’s number crunchers. A winning percentage, after all, emerges from a rather simple act of division: Wins/All games played. But Pozin has taken account of the mini class of games that, for whatever reason, culminated in a tie or draw, games that baseball officialdom simply bars from the win-loss calculation. Pozin, on the other hand, admits them to the All-games-played denominator, and assigns a .5 for each tie to the numerator. Pozin’s outcomes thus don’t invariably comport with the canonical percentages, though the differences of course aren’t game-changing, so to speak. But differences they are.

Those formulas themselves are interesting, however. On the loser sheet, for example, the 1903 Series runner-up Pittsburgh Pirates won 91 games, lost 49, and tied one, those respective accumulations committed to cells C2:E2 in the losers’ worksheet. The formula in F2 then declares:

=ROUND((C2+E2/2)/SUM(C2:E2),3)

(Note that the sheet featuring Series winners formulates its denominators this way instead, e.g.: (C2+D2+E2) ). The single tied game recorded in E2 is halved and added to the win figure in C2 to build the formula’s numerator; but in addition, the rounding of the result to three decimals quantifies the value in F2 to exactly what we see – .649, or .6490000.

But one could opine that the cause of exactitude could have been slightly better served with

=(C2+E2/2)/SUM(C2:E2)

followed by a formatting of the result to three decimals, thus preserving the quotient’s precision. The ROUND function forces a substantive pullback in precision – because after applying ROUND, the number you see is truly and only .649. But does my nit-pick here matter? Maybe.

And while we’re deliberating about things formatting, the winning percentages expressed in the workbooks in their Excel-default, 0.649 terms could be made to assume the baseball standard .649 deportment per this custom format:

series1
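(The format code in question is .000, or something very like it – no digit placeholder ahead of the decimal point, three forced digits behind it.)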

Now back to the winners and losers in the master sheet I’ve offered for download. A simple inaugural inquiry would have us calculate and compare the average winning percentage of the winners and losers. Rounded off to the usual three decimals I get .619 and .614 respectively, a dissimilitude that is neither great nor surprising. World Series competitors, after all, are the champions of their respective leagues, and so could be regarded as more-or-less equivalently skilled. And while of course only one team can win, the best-of-seven-game motif (in fact four series proceeded on a best-of-nine basis) could be ruled as too brief to define the truly superior squad.

But additional comparisons may point to other interesting disparities. If we pivot table and group the winning percentages in say, ten-year tranches:

Rows: Year

Values: ChampWinRatio (Average)

LoserWinRatio (Average)

(Remember that no Series was played in 1904 and 1994, and that the custom format we commended above must be reintroduced to the pivot table if you want it in play here. In addition, of course, the 2013-2022 tranche, forced by our grouping instruction to embody the ten-year amplitude, comprises only five years’ worth of data).

I get:

series2

Note the rather decided scale-down in winning percentages set in motion during the 1973-1982 tranche. Do these smallish but apparently real curtailments hint at a press toward parity among baseball’s teams that dulled the advantage of elite clubs? Perhaps the advent of free agency following the 1975 season, in which teams’ contractual hold on their players was relaxed, played its part in smoothing the distribution of talent.

But another, if distantly related, exposition of the trend could also be proposed. Baseball rolled out a post-regular-season playoff system in 1969, one that now qualifies ten of its 30 teams each season; and that broadened inclusiveness overwrites any guarantee that the clubs with the best regular-season records will find themselves in the fall classic.  The 1973 National League champion New York Mets, to call up the extreme example, beached up in the Series with a regular-season winning percentage of .509. But they won the playoff.

Now let’s return to my quibble about the deployment of the ROUND function, and my counter-suggestion for simply calculating win percentages without it and formatting the numbers to three decimals instead. Working with Joel Pozin’s rounded figures, we can write an array formula that’ll count the number of World Series victors whose regular-season percentage exceeded that of the losers each year:

{=SUM(IF(C2:C114>E2:E114,1))}

The formula assigns the value of 1 to each value in the C column – the one containing Series winners’ percentages – that tops the corresponding value in E, the losers’ column, and then adds all the 1’s (note: the formula can surely be written in alternative ways). I get 57, meaning that according to Pozin’s percentages a jot more than half of all the 113 World Series wins went to the team with the higher regular-season percentage.  Again, not a shocker, but worth demonstrating.
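One such alternative, for the Ctrl-Shift-Enter-averse, is a SUMPRODUCT that performs the same census without the array-entry ceremony:

=SUMPRODUCT(--(C2:C114>E2:E114))

The double negative coerces the TRUE/FALSE comparisons into 1’s and 0’s before they’re summed, and should return the same 57.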

Now if we array-calculate the number of Series winners with the lower of the regular-season winning percentages:

{=SUM(IF(C2:C114<E2:E114,1))}

I get 53 – but there are 113 Series for which to account here, and 57 plus 53 computes to 110. It turns out, then, that in three Series – the ones played in 1949, 1958, and 2013 – the competing teams appear to have achieved the same regular-season win percentage.

And for two seasons, 1949 and 2013, the winner-loser identity is inarguable – the teams in those years had precisely the same numbers of wins and losses. But if we actuate my formulaic alternative, in which the drawn-game half-win-credit is retained but the ROUND function is shed for the reasons advanced above, we learn that the 1958 winner New York Yankees played to a .596774 percentage (rounded to six decimals), because they played a tie game that year; but the losing Milwaukee Braves steamed into the Series with a .597402. Seen that way, the 1958 Braves are the 54th losing team to best the winner’s regular-season mark.
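(The arithmetic behind those six-decimal figures, for the record: a 92-62-1 season line squares with the Yankees’ number, because (92+0.5)/155 evaluates to .596774, while the Braves’ 92 wins in 154 decisions – 92/154 – come to .597402.)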

The difference here of course is hair-splittingly slim. But if your head-of-the-class high school average was .000628 points greater than that of the runner-up, would you be willing to share your valedictorian honors by agreeing to a round-off?

Girls, Boys, and NYC Math Scores: An Addendum

31 Oct

Sloth has its limits. I had, if you must know, entertained a couple of questions about the way in which I managed the New York City math test data in my previous post, but place momentum in the service of lassitude and I decided  that for the time being I couldn’t be bothered. Now I’m bothered.

What’s bothering me, at last, is a concern we’ve certainly voiced before in other posts – namely, the fact that my score analyses to date have ascribed equal weight to each record in the data set, irrespective of the number of students populating the listed schools. That forced parity thus imparts disproportionate influence to smaller schools. But because the data set is so sizable, I had originally assumed that a proper weighting of the student populations would exert only a small corrective effect on the test averages.

That assumption isn’t terribly untoward; but because the girls’ test averages evince only slight – if tenacious – superiorities over the boys’, a second, suitably weighted look at the scores might be the responsible view to take via a calculated field, in the interests of learning if the results might materially change as a consequence.

To start, then, I visited next-available column S, titled it RealScores (of course the name is your call), and entered, in S2:

=IF(H2="s","",F2*G2)

The formula simply multiplies every school’s mean score in G by the number of its students posted to F, returning the total number of score points, as it were, to be divided by the total number of students identified by a pivot table breakout (e.g., Borough).

 The IF statement in which the formula is couched replaces each “s” entry (see the previous post below for explication) with a textual cipher (“”), and not 0. That latter numeric would have been processed by any average, thus artificially depressing the results. And the absence of a test result assuredly does not strike an equivalent to zero, as any student will tell you.  

And those “s” rows release an interesting subtlety into the equation. The calculated field, which I advanced above (I’ve called it actscore), is written thusly:

math1a

That is, the field simply divides the RealScores field we instituted earlier by the total number of students tested (the Number Tested field). The “s” records contain no scores – but they continue to cite student numbers in Number Tested. Unaccompanied as they are by scores, those numbers thus add ballast to the denominator in our calculated field, and in turn artificially drag down the averages.
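(In calculated-field terms, then, the formula amounts to something like =RealScores/'Number Tested' – the single quotes accommodating the space in the latter field’s name.)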

The way out – one way out – is to sort the Mean Scale Score (in G) Smallest to Largest, if you’ve clicked on a number in the field. The s entries should fall to the bottom of the field, the first s landing in row 36045. Hang a left to F36045, enter a 0 there, and copy it down the remainder of the column. Now the calculated field will add nothing from the s records – literally – to the Number Tested denominator, obviating the downward pull on the averages that would otherwise emerge.

Got all that? In any case, now we can go on to construct the same pivot tables that featured in the preceding post, and compare the new recalculated results here with those in that post.

Averages by Gender and Year:

math2a

Averages by Borough and Gender:

math3a

One we didn’t directly bring to the previous posts, Averages by Grade and Gender (we had added Year to the mix there):

math4a

The signal conclusion here is the preservation, per all permutations, of the girls’ edge. The averages are all a touch higher this time around, however, a concomitant of higher scores that appear to obtain in the larger schools – the ones contributing larger numbers to the averages.

Now I’m not so bothered. But I’m still missing a sock.