London Transport Expenses: Overhead of the Underground

6 Dec

London’s straphangers don’t want to be taken for a ride, at least not one of the nasty, metaphorical variety. The city’s underground system costs a touch more than your model train set, and you’re probably not charging the neighbors £3.80 to hop on your Lionel O-Gauge either – even if it’s powered by Bluetooth, no less.

Londoners who want to know what their system is spending on their rough and ready commute, then, can turn to this ledger of Transport for London’s expenses (including payments for buses and other means of getting about) for the fiscal year April 2016-March 2017 (there’s a curious measure of bean-counting ambiguity at work here; the above link announces a restriction of the data to expenses exceeding £500, even as the web page on which it’s situated declares a floor of £250. Yet the data themselves divulge sums falling beneath even the latter stratum). Once you save the native-CSV workbook and its 250,000 records into Excel mode, you’ll be charged 10 MB (and not the 26.5 MB the link declares) for the acquisition – a not unreasonable expense.

And as usual, a public dataset of this size isn’t issue-free, even over and above the standard column auto-fit chore you’ll need to perform (you may want to reel in the width of the Expenditure Account field in C if you want to see all the columns onscreen simultaneously, though). The expenses themselves come sorted largest to smallest, but cells E251874:F251874 shouldn’t be there. The former cell hard-codes a count – that is, a COUNT – of the date entries in E and comes away with 251873, corroborating the genuine date-value (as opposed to textual) standing of all the cells in the field. That’s a very good thing to know, but once that finding is disseminated it needs to be deleted, or else E251874 will be treated as part of another, phantom record, along with the sum of all expenditures in the adjoining cell – and that number needs to go, too. Moreover, the Merchant Account category in G is so sparsely populated with unexplained codes that it descends into purposelessness.

There’s more, but in the interests of making a start we could break out payments by what the workbook calls Entity, presumably the agency doing the paying:

Rows: Entity

Values: Entity (Count)

Amount (£) (Sum, taken to two decimals with commas)

Amount (£) (Average, identical formatting)

I get:


Note that London Bus Services accounts for nearly a third of all expenses, even as the London Underground writes almost ten times more checks – or cheques, if you’re the one signing them.

But look again, in the table’s lower reaches. TUBELINES LTD seems to have imposed two variant spellings in the Row Labels, and that’s the data equivalent of fare beating. The restitution: select the Entity column and engineer a Find and Replace:


But substitute Vendor Name for Entity in the pivot table in order to commence a dissection of expenses by supplier instead, and the problem multiplies. Among the 8,000 traders with whom Transport for London did business last year, a flickering scan of the names turns up what could be much more of the same, e.g.:


The first entry is surely mistaken, and the last has incurred a superfluous trailing space. Or


Don’t know about those, or these:


And that’s only the W’s, proceeding upward. Thus a responsible and sober breakout of vendor income would bid (pun slightly intended) the investigator to comb the entire gallery of names, should that prove practicable. If it isn’t, don’t try this one at home.
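The post leaves vendor reconciliation as an exercise, but the mechanical half of the chore – folding together names that differ only by case, trailing spaces, or doubled whitespace – is easy to sketch. A minimal Python version (the vendor variants below are hypothetical stand-ins, not drawn from the actual ledger):

```python
import re

def canonical(name):
    """Collapse runs of whitespace, trim the ends, and uppercase,
    so trivially variant vendor spellings fold into one key."""
    return re.sub(r"\s+", " ", name).strip().upper()

# Hypothetical variants of the kind the scan turns up:
variants = ["TUBELINES LTD", "TUBELINES  LTD", "Tubelines Ltd "]
print({canonical(v) for v in variants})  # one canonical key survives
```

Genuinely different spellings (abbreviations, typos) still demand the eyeball-by-eyeball comb the paragraph above prescribes; this only clears the mechanical underbrush first.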

More promising, however, are the prospects for a breakout of the expenses by, say, month (remember, the data pull across one fiscal year). Just replace Vendor Name with Clearing Date and group by Months and Years:


Note the modal month by outlay – the final one, March, running through the 31st of 2017. Perhaps an accumulation of unspent, budgeted monies at the year’s close spurred the outlays, but only perhaps. Note on the other hand that the number of purchases then is comparatively small – and that very understatement pumps the average expenditure to a month-leading 33,486.45. It is April, the financial year’s inaugural month, that records the most individual expenses – perhaps because budgetary allotments were already in place by then and immediate-need purchases were transacted sooner rather than later – but at the same time with the lowest per-expense average. I sense a story angle there, but reasonable people may choose to disagree.

Because the expense amounts have been sorted highest to lowest, you’ve also likely noticed the cluster of negative expenses gathering at the foot of the sort. Presumably these quantify overpayments, or perhaps late fees and/or other penalties. In any event, if you want to sum the negatives, two formulaic approaches come to mind. First, I named the expense range F2:F251873 amnt and played this by-the-book strategy:


But I also kicked out this dedicated array formula:


Either way, I get £-44,928,905.53, and that calls for an awfully large petty cash drawer. Call me the iconoclast, but the array formula actually seems to me to make a bit more “sense”, in spite of its resort to the fabled Ctrl-Shift-Enter keystroke triad; it’s that text-based “<”&0 concatenation in SUMIF that gets me. How the function reads numeric value into a text-concatenated operator – something the array formula manages to avoid – is just one of those things, I suppose. But it isn’t “intuitive” to me, whatever that means.
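Whatever one makes of SUMIF’s text-concatenated “<”&0 criterion, both Excel routes reduce to the same filter-then-sum logic, which reads more plainly in Python (the amounts below are stand-in figures, not TfL’s):

```python
amounts = [1200.00, 350.75, -49.99, 18.50, -0.01]  # stand-in expense figures

# SUMIF(amnt,"<"&0) and {=SUM(IF(amnt<0,amnt))} both boil down to:
negatives_total = sum(a for a in amounts if a < 0)
print(round(negatives_total, 2))  # -50.0
```

The array formula’s amnt<0 comparison is exactly the `a < 0` predicate here, which is perhaps why it reads as the more “intuitive” of the two.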

And for one more journey through the data, might there be an association between disbursements and day of the week? You may impugn the question as silly or trifling, but I prefer to view it in blue-sky research terms. If you’re still with me, one (but not the only) route toward an answer is to careen into the next-free column H, name it Weekday or something like it, and enter in H2:


And copy down (remember that Sunday returns a default value of 1). Next, try

Rows:  Weekday

Values: Weekday (Count)

Weekday (again, this time % of Column Total; note that here too you’ll need to define the field in Count terms first)

I get:


Those numbers are more odd than silly, or at least notable. Monday accounts for 44% of all expenses issued to vendors, with the following days exhibiting a strangely alternating up-down jitter.
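The Weekday numbers above follow Excel’s default WEEKDAY convention, Sunday=1 through Saturday=7 – a different scheme from, say, Python’s Monday=0 convention. A small sketch of the mapping, for anyone replaying the breakout outside Excel:

```python
from datetime import date

def excel_weekday(d):
    """Mimic Excel's WEEKDAY default: Sunday=1 ... Saturday=7.
    Python's date.weekday() runs Monday=0 ... Sunday=6."""
    return (d.weekday() + 1) % 7 + 1

print(excel_weekday(date(2016, 4, 3)))  # a Sunday -> 1
print(excel_weekday(date(2016, 4, 4)))  # a Monday -> 2
```

With that mapping in hand, the 44% Monday share is a one-line tally over the cleared dates.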

Now those outcomes count the number of expenses administered, not their aggregate value. Drop Amount (£) into Values too and cast it into % of Column Total terms and:



That’s pretty notable too. The Monday – and Friday – expense counts are by no means proportioned to their monetary outgoings (that 0.00% sum for day 7, or Saturday, is an illusory, two-decimal round-off).

Why the Monday-Friday inversions? Fare question.







Burlington, Vermont Property Sales: Buying into the Data

24 Nov

House hunting? Then why not think about putting down stakes in Burlington, Vermont? Bucolic but sophisticated, a university town festooned with postcard-perfect New England autumnal foliage, lovely this time of year… and the very city in which Ben and Jerry’s Ice Cream was founded.

What more could you want? But the real reason I’m so keen on Burlington – a city I’ve never actually visited, after all – is its spreadsheet-archived history of property transactions, a veritably sociological, 100-plus-year account of the residential flux of its citizenry, available to you on Burlington’s open data site, and here, too:

Burlington Property Sales

All of which begs a first question about those 100-plus years: how complete is the sheet? Once you’ve done the column auto-fit thing, if you sort the data A to Z by column G – Sale_Date, whose holdings presently appear before us in text mode – you’ll observe that the three transactions in the data set lifted to the top of the sort are dated January 1, 1800, succeeded by another seven sales, each of which was apparently conducted on July 14, 1897. Add the exactly twelve property purchases agreed upon in the 50s and the 21 more ratified in the following decade, and the market is starting to look a little sluggish, isn’t it?

I’m no developer, but I’m a speculator; and I’m speculating that the sales record as we have it must be partial, for whatever reason. I’m also conjecturing that the Sales Prices (column H) amounting to $0 denote a property bestowal of sorts, perhaps from a charitable donor to a governmental entity. In fact, I put that $0 conjecture to the Burlington open data site, and await their reply.

And in light of those sporadic sale representations it might be sensible to confine our analytical looks to the records commencing with 1990: purchases in the 70s number a none-too-profuse 41, and the sales count for the 80s – 137 – appears questionably scant as well. A first analytic go-round, at least, might thus sensibly push off at 1990, the inception of the decade in which the data uncork a quantum leap in activity. But needless to say, a reportorial consideration of the property sales would force a closer look into the pre-1990 lacunae.

That’s one issue. For another, inspect the Sale_Date entries in column G. As advertised, each and every one of these appears to be textual, encumbered by the T00:00:00.000Z suffix that entraps the real, immanent dates in their cells. But a fix is available (really more than one is out there): select column G and apply this Find and Replace:


Replacing the suffix with nothing at all frees the actual, quantifiable dates within each cell, and even restores the results to my American regional date settings, in spite of their year-first formats.
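The Find-and-Replace maneuver has a one-to-one Python counterpart: drop the suffix, then parse what remains as a year-first date. A sketch, with a hypothetical Sale_Date entry standing in for the real column:

```python
from datetime import datetime

raw = "1995-06-30T00:00:00.000Z"  # hypothetical Sale_Date entry

# Replacing the suffix with nothing frees the quantifiable date...
cleaned = raw.replace("T00:00:00.000Z", "")
# ...which then parses under the year-first format the sheet uses.
sale_date = datetime.strptime(cleaned, "%Y-%m-%d").date()
print(sale_date.year)  # 1995
```

The same parse would throw a ValueError on the pre-1900 stragglers discussed next only if the string itself were malformed; the pre-1900 limitation is Excel’s, not ISO dates’.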

But you may notice nine recalcitrants to the above workaround – the sale dates preceding 1900. They’ve shed their T00:00:00.000Z appendages but remain steadfastly textual; and that’s because Excel can’t directly process, or quantify, dates falling before January 1, 1900 (for some suggestions for allaying the problem look here). In light of the discussion above, however, that sticking point won’t adhere here, because our intention is to ignore sales predating 1990 anyway. But at the same time leaving those insistent textual entries in the data set will effectively stymie any attempt to group the dates in a pivot table; throw even one text datum into a date mix and you just can’t group it, or at least I can’t. Thus if you click on any date in the Sale_Date field and sort Oldest to Newest, the nine textual charlatans drop to the bottom of the column, whereupon one can thread a blank row immediately above them, and break them off from the remainder of the data set.

The larger question – albeit an academic one here – is to ask about a strategy for contending with, say, a few thousand pre-1900 dates among the data, had they been there. With a number that large, we would want to use them – but then what? The simplest tack, I think: not to run the Find and Replace, then, but rather to draw up a new field in the next-available N column, call it something like Year, and enter in N2:


and copy down N.

The reason we’d look past the Find and Replace here is because the above formula will return the pre-1900 postings as simple numbers, e.g. 1800; and 1800 doesn’t represent the year 1800. It stands in fact for December 4, 1904, the 1800th day of the sequence commencing with the date baseline January 1, 1900 (and we’ll ignore Excel’s February 29, 1900 willful error taken up in the article linked above). In the interests of consistency, we’d want to regard all the data in G one way or the other: either as bona fide dates or unvarnished numbers – and if we found ourselves working with pre-1900 dates we’d have to resort to the latter plan.
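That serial arithmetic is easy to verify outside Excel. The conventional epoch trick uses December 30, 1899, whose two-day offset quietly absorbs both Excel’s 1-based counting and its phantom February 29, 1900, for any serial past 59:

```python
from datetime import date, timedelta

def serial_to_date(serial):
    """Convert an Excel date serial (> 59) to a calendar date.
    The 1899-12-30 epoch absorbs Excel's phantom Feb 29, 1900."""
    return date(1899, 12, 30) + timedelta(days=serial)

print(serial_to_date(1800))  # 1904-12-04 -- not the year 1800
```

Which confirms the point: a cell displaying 1800 as a number is a date only if you let Excel mistake it for serial 1800.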

But again, because in our case the pre-1900 data are so negligible we’re ignoring them, and reverting to the actual date-making Find and Replace scheme. All of which at last enables the answer to an obvious research question:  how many Burlington properties have been purchased by year?

Start with this pivot table:

Rows: Sale_Date

Values: Sale_Date (count, grouped by Years. Click the Filter button by Row Labels, tick Date Filters, and enter


I get:


Note that the above Grand Total and its record complement date only from 1990, but nevertheless encompass 28,790 of the 29,032 entries packing the entire data set (note in addition that the 2017 data take the sales through September 15). The activity curve trends this way and that, with the numbers peaking in 2003 and scaling down erratically in the intervening 14 years (though all this of course assumes the data are complete for the respective years in hand, an unsubstantiated but not implausible guess).

We next might want to pursue a natural follow-up: a breakout of average purchase prices by year. But here too some forethought is needed – because admitting the raft of $0 “purchases” into the calculations would skew the averages badly. We could thus apply a simple fix: take over the next free column (probably N; remember that the earlier data settlement in that column was hypothetical), call it Nonzero, and enter in N2:


H stores the Sales_Price data, and the “”, or length-less text double-quotes, will be ignored in any average computation. Once we expand the pivot table data source to N we can garnish the existing table with Nonzero, Average, and figured to two decimal points. I get:


Unlike actual sales, the average prices proceed more or less upwards, an unsurprising ascent syncopated by the spike in 2012. Filter that year in the data set and you’ll find four identically described and dated purchases of $6,485,000, conducted on December 18 of that year by the Bobbin Mill Building Company and sold to the Burlington Housing Authority. A sweeping property acquisition by the powers that be, or a data-entry redundancy? I don’t know. And I haven’t come across any Land_Use and Deed_Type legends that would decode the entries in those fields, and that could inspire some additional breakouts of the data.
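The Nonzero screen underpinning those averages – blank out the $0 rows and average only what remains – can be sketched in a few lines of Python (the prices below are stand-ins, not Burlington’s):

```python
prices = [185000, 0, 249500, 0, 310000]  # stand-in Sale_Price figures

# Excel's AVERAGE skips the "" entries the IF formula plants;
# in Python we simply filter the zeros out before dividing.
nonzero = [p for p in prices if p != 0]
average = sum(nonzero) / len(nonzero)
print(round(average, 2))  # 248166.67
```

Averaging the raw column instead would have counted each giveaway as a $0 sale and dragged every yearly mean toward the floor.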

C’mon, Burlington, help me out here; give me those answers and there’s a Double Caramel Brownie Ben and Jerry’s cup in it for you. With an extra scoop – my treat.

World Series Winners and Losers: Playing the Percentages

13 Nov

Don’t be fooled by last week’s denouement of the World Series; the baseball season never really ends. The looks at the data from what was once boomed as the National Pastime just don’t stop, including some looks at the World Series itself, e.g. a survey put together by Joel Pozin of the regular-season winning percentages of Series participants dating back to the first series in 1903. It’s here:

Regular Season Winning Percentages of World Series Champions vs. Losers_ 1903-2017 Joel Pozin

The survey in fact contributes one of three small, public-domain Series datasets Pozin makes available on the collaboration-fostering site (you’ll need to sign up for free for access to the other two workbooks; note that the percentage data for 1904 and 1994 aren’t there, because the World Series weren’t contested those years. In addition, I’ve appended win-percentage data for the 2017 season to the sheet here.)

The other two workbooks recount the Series winner and loser percentages in their discrete sheets, but they do something else as well: they bare the formulas that return the team winning percentages, formulas that do a slightly different math from that performed by Major League Baseball’s number crunchers. A winning percentage, after all, emerges from a rather simple act of division: Wins/All games played. But Pozin has taken account of the mini class of games that, for whatever reason, culminated in a tie or draw, games that baseball officialdom simply bars from the win-loss calculation. Pozin, on the other hand, admits them to the All-games-played denominator, and assigns a .5 for each tie to the numerator. Pozin’s outcomes thus don’t invariably comport with the canonical percentages, though the differences of course aren’t game-changing, so to speak. But differences they are.

Those formulas themselves are interesting, however. On the loser sheet, for example, the 1903 Series runner-up Pittsburgh Pirates won 91 games, lost 49, and tied one, those respective accumulations committed to cells C2:E2 in the losers’ worksheet. The formula in F2 then declares:


(Note that the sheet featuring Series winners formulates its denominators this way instead, e.g.: (C2+D2+E2) ). The single tied game recorded in E2 is halved and added to the win figure in C2 to build the formula’s numerator; but in addition, the rounding of the result to three decimals quantifies the value in F2 to exactly what we see – .649, or .6490000.

But one could opine that the cause of exactitude could have been slightly better served with


followed by a formatting of the result to three decimals, thus preserving the quotient’s precision. The ROUND function forces a substantive pullback in precision – because after applying ROUND, the number you see is truly and only .649. But does my nit-pick here matter? Maybe.

And while we’re deliberating about things formatting, the winning percentages expressed in the workbooks in their Excel-default, 0.649 terms could be made to assume the baseball standard .649 deportment per this custom format:


Now back to the winners and losers in the master sheet I’ve offered for download. A simple inaugural inquiry would have us calculate and compare the average winning percentage of the winners and losers. Rounded off to the usual three decimals I get .619 and .614 respectively, a dissimilitude that is neither great nor surprising. World Series competitors, after all, are the champions of their respective leagues, and so could be regarded as more-or-less equivalently skilled. And while of course only one team can win, the best-of-seven-game motif (in fact four series proceeded on a best-of-nine basis) could be ruled as too brief to define the truly superior squad.

But additional comparisons may point to other interesting disparities. If we pivot table and group the winning percentages in say, ten-year tranches:

Rows: Year

Values: ChampWinRatio (Average)

LoserWinRatio (Average)

(Remember that no Series was played in 1904 and 1994, and that the custom format we commended above must be reintroduced to the pivot table if you want it in play here. In addition, of course, the 2013-2022 tranche, forced by our grouping instruction to embody the ten-year amplitude, comprises only five years’ worth of data).

I get:


Note the rather decided scale-down in winning percentages set in motion during the 1973-1982 tranche. Do these smallish but apparently real curtailments hint at a press toward parity among baseball’s teams that dulled the advantage of elite clubs? Perhaps the advent of free agency following the 1975 season, in which teams’ contractual hold on their players was relaxed, played its part in smoothing the distribution of talent.

But another, if distantly related, exposition of the trend could also be proposed. Baseball rolled out a post-regular-season playoff system in 1969, one that now qualifies ten of its 30 teams each season; and that broadened inclusiveness overwrites any guarantee that the clubs with the best regular-season records will find themselves in the fall classic.  The 1973 National League champion New York Mets, to call up the extreme example, beached up in the Series with a regular-season winning percentage of .509. But they won the playoff.

Now let’s return to my quibble about the deployment of the ROUND function, and my counter-suggestion for simply calculating win percentages without it and formatting the numbers to three decimals instead. Working with Joel Pozin’s rounded figures, we can write an array formula that’ll count the number of World Series victors whose regular-season percentage exceeded that of the losers each year:


The formula assigns the value of 1 to each value in the C column – the one containing Series winners’ percentages – that tops the corresponding value in E, the losers’ column, and then adds all the 1’s (note: the formula can surely be written in alternative ways). I get 57, meaning that according to Pozin’s percentages a jot more than half of all the 113 World Series wins went to the team with the higher regular-season percentage.  Again, not a shocker, but worth demonstrating.

Now if we array-calculate the number of Series winners with the lower of the regular-season winning percentages:


I get 53 – but there are 113 Series for which to account here, and 57 plus 53 computes to 110. It turns out, then, that in three Series – the ones played in 1949, 1958, and 2013 – the competing teams appear to have achieved the same regular-season win percentage.

And for two seasons, 1949 and 2013, the winner-loser identity is inarguable – the teams in those years had precisely the same numbers of wins and losses. But if we actuate my formulaic alternative, in which the drawn-game half-win-credit is retained but the ROUND function is shed for the reasons advanced above, we learn that the 1958 winner New York Yankees played to a .596774 percentage (rounded to six decimals), because they played a tie game that year; but the losing Milwaukee Braves steamed into the Series with a .597402. Seen that way, the 1958 Braves are the 54th losing team to best the winner’s regular-season mark.
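The 1958 case rewards a replay in code. Using win-loss-tie lines consistent with the percentages quoted above (Yankees 92-62 with one tie, Braves 92-62 – an inference from the figures, not sourced from the workbook):

```python
# Win-loss-tie lines consistent with the quoted percentages --
# an inference, not figures pulled from Pozin's workbook.
def pct(w, l, t):
    return (w + 0.5 * t) / (w + l + t)

yankees = pct(92, 62, 1)
braves  = pct(92, 62, 0)

print(round(yankees, 3) == round(braves, 3))  # True: ROUND sees a dead heat
print(round(braves - yankees, 6))             # 0.000628 without the rounding
```

Both round to .597, which is exactly why Pozin’s rounded columns score 1958 as a tie; the unrounded quotients settle the matter in the Braves’ favor.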

The difference here of course is hair-splittingly slim. But if your head-of-the-class high school average was .000628 points greater than that of the runner-up, would you be willing to share your valedictorian honors by agreeing to a round-off?

Girls, Boys, and NYC Math Scores: An Addendum

31 Oct

Sloth has its limits. I had, if you must know, entertained a couple of questions about the way in which I managed the New York City math test data in my previous post, but, placing momentum in the service of lassitude, I decided that for the time being I couldn’t be bothered. Now I’m bothered.

What’s bothering me, at last, is a concern we’ve certainly voiced before in other posts – namely, the fact that my score analyses to date have ascribed equal weight to each record in the data set, irrespective of the number of students populating the listed schools. That forced parity thus imparts disproportionate influence to smaller schools. But because the data set is so sizable, I had originally assumed that a proper weighting of the student populations would exert only a small corrective effect on the test averages.

That assumption isn’t terribly untoward; but because the girls’ test averages evince only slight – if tenacious – superiorities over the boys’, a second, suitably weighted look at the scores might be the responsible view to take via a calculated field, in the interests of learning if the results might materially change as a consequence.

To start, then, I visited next-available column S, titled it RealScores (of course the name is your call), and entered, in S2:


The formula simply multiplies every school’s mean score in G by the number of its students posted to F, returning the total number of score points, as it were, to be divided by the total number of students identified by a pivot table breakout (e.g., Borough).

The IF statement in which the formula is couched replaces each “s” entry (see the previous post below for explication) with a textual cipher (“”), and not 0. That latter numeric would have been processed by any average, thus artificially depressing the results. And the absence of a test result assuredly does not strike an equivalent to zero, as any student will tell you.

And those “s” rows release an interesting subtlety into the equation. The calculated field, which I advanced above (I’ve called it actscore), is written thusly:


That is, the field simply divides the RealScores field we instituted earlier by the total number of students tested (the Number Tested field). The “s” records contain no scores – but they continue to cite student numbers in Number Tested. Unaccompanied as they are by scores, those numbers thus add ballast to the denominator in our calculated field, and in turn artificially drag down the averages.

The way out – one way out – is to sort the Mean Scale Score (in G) Smallest to Largest, having clicked on a number in the field. The s entries should fall to the bottom of the field, the first s landing in row 36045. Hang a left to F36045, enter a 0 there, and copy it down the remainder of the column. Now the calculated field will add nothing from the s records – literally – to the Number Tested denominator, obviating the downward pull on the averages to emerge.
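The whole weighted-average construction – score-times-tested in the numerator, tested in the denominator, “s” rows contributing to neither – compresses to a few lines of Python (the records below are stand-ins, not actual schools):

```python
# Stand-in records: (mean scale score or "s", number tested)
records = [(300.0, 40), (310.0, 60), ("s", 3)]

# RealScores logic: "s" rows add nothing to the numerator, and --
# per the zero-fill maneuver above -- nothing to the denominator.
numerator   = sum(score * n for score, n in records if score != "s")
denominator = sum(n for score, n in records if score != "s")
print(round(numerator / denominator, 2))  # 306.0
```

Leave the “s” rows’ student counts in the denominator and the same records would average out below 306 – the downward drag the zero-fill exists to prevent.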

Got all that? In any case, now we can go on to construct the same pivot tables that featured in the preceding post, and compare the new recalculated results here with those in that post.

Averages by Gender and Year:


Averages by Borough and Gender:


One we didn’t directly bring to the previous posts, Averages by Grade and Gender (we had added Year to the mix there):


The signal conclusion here is the preservation, per all permutations, of the girls’ edge. The averages are all a touch higher this time around, however, a concomitant of higher scores that appear to obtain in the larger schools – the ones contributing larger numbers to the averages.

Now I’m not so bothered. But I’m still missing a sock.

Girls, Boys, and NY City Math Scores: No Mean Feat

27 Oct

Are boys better at math than girls? The question is hard-wired for controversy of course, but stammer a first response by trotting out SAT (Scholastic Aptitude Test) scores and the answer is yes. Boys consistently outscore girls by roughly 30 points a year on the nationally-administered exam (national in the US, that is), an extraordinarily dogged interval that bucks the ebbs and flows of SAT score aggregates:


(Screen shot source: the above-linked web site, reproducing the chart in turn from the College Board that conducts the SAT.)

And, as the linked site above continues, the advantage enforces itself even as high school girls outperform boys academically overall. (None of this suggests that conclusions are necessary or unitary; see this demurral, for example.)

Is that the last word on the subject? Of course it isn’t, and for additional testimony we can swing our protractors over to the New York City Open Data site and its compilation of New York State Math Test results by gender for city schools, reported for the years 2013-17 (the test is described here). Click the Data Dictionary link to call up a workbook offering the data and supplementary worksheet tabs clarifying their fields.

Next click the Master tab and 47,000 rows worth of score data are yours for the analyzing, including at least three parameters just begging for your attention: student grade, year of exam, and gender, which the sheet curiously entitles Category. The test scores themselves – scaled to enable cross-grade comparisons – fill column G, a normalization that would thus appear to support some rather global aggregations, e.g. average score for all students by year or gender, or even borough (that parameter is in fact concealed in the DBN field in A and must be dredged from the school code entries, a chore we’ve performed in a previous post).

You’ll note, however, that 485 records, a touch more than 1% of them all, substitute an “s” code in lieu of actual test scores, a replacement impelled by privacy considerations (see the discussion in cell B12 of the DatasetInfo tab). In fact, we’ve encountered this stricture before in a different demographic context; see this post, for example. But in any case, those alpha proxies will simply be ignored by any computation, thus freeing the 99% of the number-bearing entries for any closer, quantifying look.

But there’s another perturbation roiling the data that the formulas won’t automatically overlook: the All Grades posts in the Grade field that, in effect, subtotal schools’ math averages by year and gender, e.g., the scores for all girls at P.S. 015 in 2013 across all grades. There are more than 11,000 such records among the data and I’d allow they could, and probably should, be deleted, because they perpetrate the old double-count snag that must be cleared. Again, any and all aggregating tasks performed upon the score data could be made to emerge from a pivot table, and the promise of eliminating 187,000 cells from the worksheet should prove irresistibly appealing.

But I am mustering a show of resistance, or ambivalence, to any plan for expurgating some of the Level fields in columns H through Q. These count the number of students per school per year and per class whose testing outcomes position them atop one of four levels of accomplishment defined by New York State, as detailed in the ColumnInfo sheet (the higher the number the greater the proficiency). The Level % fields, for example, could have, under other spreadsheet-organizational circumstances, been easily derived from the Level_N data via a standard pivot-table % of Column/Row Total – had all the level counts been assigned their own records beneath a single, circumscribing field. But the data don’t behave that way, and so perhaps the sheet is doing us a favor by folding the % fields into the data set (in fact, had all the Level_N numbers been assembled beneath a single field heading, the Number Tested field could likewise have been bid a farewell, because a pivot table would return numbers tested as a sub or grand total). So in light of the prevailing realities, leaving all those fields alone might be a prudent thing.

But I think the time has come for us to actually do something with the data. For a broad-stroked inaugural look, we could average all test scores by gender and year, via this pivot table:

Rows: Year

Columns: Category

Values: Mean Scale Score (Average, figured to two decimal points)

I get:


No female shortfall here. On the contrary, girls’ average scores top boys’ for each of the available years, by more than a point each year. (On the other hand, we also need to inspect a nuance dotting the large student numbers here: the obvious fact that many of the scores issue from the same students. After all, a third-grade student in school A in 2013 likely reappeared as a fourth-grade student in school A in 2014.)

But is there any evidence of a progressive slimming of the girls’ edge with age? SAT scores, after all, chart a significantly older demographic: the cohort of university-eligible students, the one that affirms that insistent 30-point male differential. Our data, on the other hand, emanate from test-takers hailing from grades three through eight, bidding us to wonder: Does a male margin begin to pull away any time within our data? An answer calls for an introduction of the Grade parameter to the table, but because a three-variable output (Year, Category, and Grade) will begin to clutter the table, I’d opt to cast Grade into a Slicer and successively tick the six grades for which the numbers are available:


Click through the grades and the girl score superiority prevails throughout, indeed broadening unevenly across the older grades and cresting with a 3.07-point average excess in grade 6.

We could also examine borough-wide variation in the numbers, by extracting the respective borough codes from the DBN data in column A. The codes occupy the third position in each code entry, and so if we wend our way to the-next-free-column R and name it Borough, we can enter in R2:


Copy down R, and the borough indicators are isolated:

K – Brooklyn (for Kings County)
M – Manhattan
Q – Queens
R – Richmond (Staten Island)
X – The Bronx
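In Excel the extraction would presumably run along the lines of =MID(A2,3,1); the Python analogue is a simple index into the DBN string (the DBN below is hypothetical, contrived for illustration):

```python
BOROUGHS = {
    "K": "Brooklyn",       # Kings County
    "M": "Manhattan",
    "Q": "Queens",
    "R": "Staten Island",  # Richmond
    "X": "The Bronx",
}

def borough(dbn):
    """The borough letter occupies the third position of a DBN."""
    return BOROUGHS[dbn[2]]

print(borough("01K015"))  # a hypothetical Brooklyn DBN
```

Once the letter is pried loose, the borough-by-gender pivot below proceeds exactly as the year-by-gender one did.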

Then draw up this pivot table:

Rows: Borough

Columns: Category

Values: Mean Scale Score (Average, to two decimals)

I get:


Here both gender and borough disparities mark the data, the former most conspicuous in the Bronx. Among the boroughs Queens rises to the top, its average score separating itself from the fifth-place Bronx by more than 20 points. Indeed, the Queens boys’ average exceeds every other borough’s girls’ mean, save that of Queens, of course.

These findings contribute but a sprinkling of pixels to the larger picture, but contributory they are. Accounting for the relative male math predominance at later phases of the educational process serves up a famous analytical challenge; winners of the near-impossible William Lowell Putnam Mathematical Competition, for example – a 120-point university-level exam on which the median score is often 0 – are nearly all male, but I don’t know what that means, either.

All of which begs the next question: I’m a guy – so why is it that when I divide my number of socks by 2 I always get a remainder of 1?

US Visa Data, Part 2: An Excel Application

16 Oct

Promenade down the wage_rate_of_pay_from field lining the H-1B visa data set (the visas have been rather newsworthy of late; see, for example, this New York Times piece) and you’ll likely saunter away with the impression that the jobs attaching to the H-1Bs are impressively compensated (the partner wage_rate_of_pay_to field, which reports the apparent upper ranges among the proposed salaries, is more spottily filled). These sound like pretty good, if impermanent, gigs, and as such there might be something to learn by, say, breaking out wages by state – once you try your hand at cinching a loose end or two that, bad metaphor notwithstanding, poses a knotty problem demanding your expert attention. (You’ll also learn that a very small proportion of visa applications – around 2% – aren’t seeking H-1Bs at all, but rather variant visas tied to country agreements with Chile, Australia, and Singapore.)

You’ll discover that the wage_unit_of_pay field in fact stocks a number of units, or demarcations of time, by which pay is calculated, e.g., Bi-Weekly, Year, etc. The problem points to a banal but integral data-quality matter. The pay figures in wage_rate_of_pay_from overwhelmingly key themselves to the Year unit – more than 93%, in fact – but more than 32,000 visas stipulate an hourly compensation, and some of these report pay rates that clearly can’t be clocked against 1/24th of a day. Auto-filter wage_unit_of_pay for Hour, and you’ll bare 55 pay figures equalling or exceeding $48,000 – and if those rates are truly hourly, only one question remains to be asked: where can I download an application form? The far smaller Bi-Weekly tranche likewise hands up a little cache of implausible compensations, as do the few Month positions (can we definitively say that the $265,000 slot entertained by the Texas IPS PLLC firm in San Antonio is truly monthly? That application was withdrawn, by the way). The Week pay data seem at least imaginable, however.

It would appear that these sore thumbs really mean to convey yearly salaries, and I know of no graceful means for redressing them, apart from substituting Year in the suspect cells – and that isn’t graceful. Ok; I suppose some field-wide IF statement – e.g., if the figure in wage_rate_of_pay_from exceeds 10,000 AND the unit reads Bi-Weekly OR Hour, then enter Year in lieu of the existing unit, with an accompanying Copy > Paste Values atop wage_unit_of_pay – might placate the purists, but I’m not sure practicality would be best served that way.
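That field-wide IF statement might look like this in Python terms. The $10,000 threshold and the sample rows are my own illustrative choices, not values vetted against the real file:

```python
# Sketch of the field-wide fix: if a figure labeled Hour or Bi-Weekly
# exceeds a plausibility threshold, relabel its unit as Year.
def relabel_unit(rate, unit, threshold=10_000):
    if unit in ("Hour", "Bi-Weekly") and rate > threshold:
        return "Year"
    return unit

# Invented sample applications
applications = [
    {"wage_rate_of_pay_from": 48_000.0, "wage_unit_of_pay": "Hour"},
    {"wage_rate_of_pay_from": 35.50, "wage_unit_of_pay": "Hour"},
]
for app in applications:
    app["wage_unit_of_pay"] = relabel_unit(
        app["wage_rate_of_pay_from"], app["wage_unit_of_pay"])
```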

In reality, of course, we can’t know if even the plausible compensation figures are right, and in view of the enormousness of the data set a press toward simplicity might bid us to merely work with the Year item, with a pivot table that goes something like this:

Row: employer_state

Values: wage_rate_of_pay_from (Average)

wage_rate_of_pay_from (again, this time Count)

Slicer: wage_unit_of_pay (select Year)

Sort the averages largest to smallest, and I get in excerpt:


Nice work, if you can get it. That’s Alaska heading the average salary hierarchy, though its visa-owning workforce is diminutive. Among the heavy hitters, Washington state (WA) – home of Microsoft, for one – checks in with an average emolument of $113,139; and indeed, an auto-filter tells me that over 4,000 of the Washington visas were requested by that very firm (you may happen to note, by the way, that one Microsoft-attesting record spells its headquarters city Redmnond, another data-quality eyebrow-raiser. And there are those 14 state blanks, too, though they sure pay well, if anonymously). The overall Year salary offer – and remember we’re working with wage_rate_of_pay_from, the lower of the two salary tiers – exceeds $87,000.

But inspection of the averages will affirm considerable interstate variation – e.g. major employer Texas, whose $79,823.51 mean thrusts it down to 46th place, though largest hirer California looks good with an average over $102,000. Accounting for the dispersions across the salary band might make for a story worth writing.
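For those keeping score at home, the Year-only pivot above can be paraphrased in Python; the states and salaries below are invented examples, not excerpts from the data:

```python
from collections import defaultdict

# Invented sample rows standing in for the visa records
rows = [
    {"employer_state": "WA", "wage_rate_of_pay_from": 110_000.0, "wage_unit_of_pay": "Year"},
    {"employer_state": "WA", "wage_rate_of_pay_from": 116_000.0, "wage_unit_of_pay": "Year"},
    {"employer_state": "TX", "wage_rate_of_pay_from": 80_000.0, "wage_unit_of_pay": "Year"},
    {"employer_state": "TX", "wage_rate_of_pay_from": 40.0, "wage_unit_of_pay": "Hour"},
]

# Average and count of wage_rate_of_pay_from by state, Year offers only
sums = defaultdict(lambda: [0.0, 0])
for r in rows:
    if r["wage_unit_of_pay"] == "Year":  # the Slicer selection
        sums[r["employer_state"]][0] += r["wage_rate_of_pay_from"]
        sums[r["employer_state"]][1] += 1

by_state = {st: (total / n, n) for st, (total, n) in sums.items()}
# Largest average first, as in the sorted pivot table
ranked = sorted(by_state.items(), key=lambda kv: kv[1][0], reverse=True)
```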
And if you’re interested in principal employers, try this table:

Rows: employer_name (Filter, Top 10)

Values: employer_name (Count). Sort largest to smallest.

I get:


Tech consulting giants Infosys and Tata rank first and second among the visa-sponsoring firms; Microsoft barely makes the top 10. But pad the table with the wage_rate_of_pay_from field, again deploy a Slicer and tick its Year item, and we see:

Big bucks emanating from Redmond, no matter how you spell it.
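For the record, the Top-10 count boils down to a couple of lines of Python; the employer list here is a toy stand-in, not the real half-million records:

```python
from collections import Counter

# Toy employer list; the pivot's Top-10 filter is just Counter.most_common
employers = ["Infosys", "Tata", "Infosys", "Microsoft", "Infosys", "Tata"]
top_ten = Counter(employers).most_common(10)  # (name, count) pairs, largest first
```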

And since I’ve lazed past the discrepant-time-unit problem by confining my pivot tabling to Year figures only, let me at least consider a workaround that would more or less reconcile the Year, Bi-Weekly, etc. units, provided all their salary data were in working order, so to speak. In order to, say, map all the salary proposals to the Year baseline, I’d draw up this lookup table:


The idea here calls for looking up the wage_unit_of_pay for each cell and multiplying the associated pay figure by the appropriate looked-up value. These, of course, are standardized educated guesses – assuming, for example, that a 2000-hour annual work investment is to be multiplied by the hourly rate, and that a bi-weekly term comprises 25 paydays.

Those suppositions could, for any given visa case, be way off, but absent meaningful supplementary information, they’d have to do.
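Rendered in code, that lookup table might look like this. The Hour (2000) and Bi-Weekly (25) multipliers restate the guesses above; the Week (50) and Month (12) factors are comparable assumptions of my own:

```python
# Multipliers mapping each pay unit onto a yearly baseline.
# Hour and Bi-Weekly restate the educated guesses in the text;
# Week and Month are my own comparable assumptions.
ANNUAL_MULTIPLIER = {
    "Year": 1,
    "Month": 12,
    "Bi-Weekly": 25,
    "Week": 50,
    "Hour": 2000,
}

def annualize(rate, unit):
    """Scale a pay figure to its approximate yearly equivalent."""
    return rate * ANNUAL_MULTIPLIER[unit]
```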

But for the reasons detailed above we didn’t go that route anyway. I don’t know about you, but I’m happy enough with 494,672 records.

US Visa Data, Part 1: An Excel Application

4 Oct

If you want to be a legal alien in the United States you’ll need a visa. If you’re technically skilled the visa of choice is the H-1B, described by the Public Enigma open data site as “…a non-immigrant visa that allows U.S. companies and organizations to temporarily employ foreign workers in specialty occupations.”

If you want to learn more about the specialty cadre, join Enigma for free, carry out the requisite click-throughs (the H-1B data are advertised on their home page), and you’ll find yourself here:


Opt for the 2017 edition, click the ensuing Open in Data Viewer button, and then click in turn the Export Dataset button, centered among the trio of links below (though you can’t get this far unless you join Enigma and sign in):


Then break for tea time while the data mows a swath through your RAM and at last overtakes your screen; with 528,000 visa application records amassing 154 MB, that’s a lot of Liptons.

Yes, you’ll have to perform the usual column auto-fit necessities, and you may want to think about which fields could be safely sheared from the data set, in the interests of a healthful file downsizing. (You may also want to rename the character-ridden worksheet tab.) It seems to me we could do quite nicely – at least – without employer_phone, employer_phone_ext, agent_attorney_name, agent_attorney_city, agent_attorney_state, and the wholly desolate pw_wage_source_year. These truncations have the effect of scaling my workbook down to a svelte 86 MB. (You may want to retain employer_country, though; several nations other than the United States are listed therein.)

Once having thinned the file we can go to work, starting with a simple résumé of the number of visa requests by state. My pivot table looks like this:

Rows: employer_state

Values: total_workers (sum; sort largest to smallest)

total_workers (again, here % of Column Total)

You may also want to filter out the very small number (25) of blank records.

Note, by the way, that total_workers doesn’t always register one worker per visa request; a great many of the applications record multiple applicants.

My pivot table looks like this, in excerpt:


High-population, high-tech California submits nearly a quarter of all of the H-1B applications, with the top three states, including Pennsylvania and Texas, handing in 52% of them all. Grand total of H-1Bs: over a million.

(Note that the output comprises 57 rows, because the application data count the District of Columbia (DC, the nation’s capital, not accorded state status) and US territories: Puerto Rico (PR), the Virgin Islands (VI), Guam (GU), Micronesia (FM), the Northern Mariana Islands (MP), and American Samoa (AS).)
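The whole state tally amounts to a few lines of Python, should you prefer script to pivot; the rows below are invented:

```python
from collections import defaultdict

# Invented rows; total_workers can exceed 1 per application
rows = [
    {"employer_state": "CA", "total_workers": 3},
    {"employer_state": "CA", "total_workers": 1},
    {"employer_state": "TX", "total_workers": 4},
]

# Sum total_workers per state, then express each as a share of the
# grand total -- the pivot's "% of Column Total"
workers = defaultdict(int)
for r in rows:
    workers[r["employer_state"]] += r["total_workers"]

grand_total = sum(workers.values())
shares = {st: n / grand_total for st, n in workers.items()}
```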

And what proportion of application requests were approved? The pivot table should furnish the answer:

Rows:  case_status

Values: case_status (count, % of Column Totals).

total_workers (sum, % of Column Totals)

I get:


We find nearly 89% of the applications win governmental approval, with another 7% or so securing the okay even as the submission was withdrawn, for whatever reason. The certifications in turn accredit an even larger set of actual applicants; including the certified withdrawals, over 96% of the individuals seeking the H-1B visa acquired it, at least for the time frame within which we’re working. Thus we see, latter-day controversy and clamor over immigration notwithstanding, that these visa applicants, ones in possession of specialist skill sets, are almost always shown the open door – remembering at the same time, however, that by definition they aren’t long-term immigrants.

We can next quickly recapitulate the average wait time separating an H-1B visa application from its decision date, via a pretty straightforward array formula asking us in effect to simply subtract all submission dates from their decision dates, stored in the case_submitted and decision_date fields respectively. In the interests of keystroke economy I named the fields sub and dec, and entered, somewhere:

{=AVERAGE(dec-sub)}

That pert expression subtracts each submission from its corresponding decision date and averages them all. I get an average wait of 31.21 days per application, though a scan of individual waits discloses very substantial variation.

I then wondered about the comparative handful of application denials. Do they evince notably different wait times, in view of the fact that these entries culminate in rejection? For an answer I named the case_status field stat and entered this array formula, again somewhere:

{=AVERAGE(IF(stat="DENIED",dec-sub))}

The formula nests an IF statement into the mix, figuring the application wait times for only those entries in case_status that read DENIED. In any case, I get a denial average wait of 4.06 days – a far swifter set of judgments. Perhaps the applications triggering denials are so egregiously ill-suited that the decision becomes easy – though that’s merely plausible speculation.
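Both array formulas translate readily into Python; the dates and statuses below are fabricated for illustration:

```python
from datetime import date

# Fabricated applications for illustration
apps = [
    {"case_submitted": date(2017, 1, 1), "decision_date": date(2017, 2, 1), "case_status": "CERTIFIED"},
    {"case_submitted": date(2017, 3, 1), "decision_date": date(2017, 3, 5), "case_status": "DENIED"},
    {"case_submitted": date(2017, 4, 1), "decision_date": date(2017, 4, 3), "case_status": "DENIED"},
]

def mean_wait(records, status=None):
    """Average wait in days, optionally confined to one case_status --
    the optional filter mirrors the nested IF in the array formula."""
    waits = [(r["decision_date"] - r["case_submitted"]).days
             for r in records
             if status is None or r["case_status"] == status]
    return sum(waits) / len(waits)

overall_wait = mean_wait(apps)            # decision minus submission, averaged
denied_wait = mean_wait(apps, "DENIED")   # DENIED cases only
```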

In the meantime, I’m getting excited about filing my own visa request. After all, my specialist skill set is formidable – how many left-handed spreadsheet devotees have so few Twitter followers? I’m stoked.

But I just remembered. I’m ineligible for a visa; I’m a US citizen.