US Visa Data, Part 2: An Excel Application

16 Oct

Promenade down the wage_rate_of_pay_from field lining the H-1B visa data set (the visas have rather newsworthy of late; see, for example this New York Times piece) and you’ll likely saunter away with the impression that the jobs attaching to the H-1Bs are impressively compensated (the partner wage_rate_of_pay_to field that reports the apparent upper ranges among the proposed salaries is more spottily filled). These sound like pretty good, if impermanent gigs, and as such there might be something to learn by say, breaking out wages by state – once you try your hand at cinching a loose end or two that, bad metaphor notwithstanding, poses a knotty problem demanding your expert attention. (You’ll also learn that a very small proportion of visa applications – around 2% – aren’t seeking H-1Bs at all, but rather variant visas tied to country agreements with Chile, Australia, and Singapore.)

You’ll discover that the wage_unit_of_pay field in fact stocks a number of units, or demarcations of time, by which pay is calculated, e.g., Bi-Weekly, Year, etc. The problem points to a banal but integral data-quality matter. The pay figures in wage_rate_of_pay overwhelmingly key themselves to the Year unit – more than 93%, in fact – but more than 32,000 visas stipulate an hourly compensation, and some of these report wage units of pay that clearly can’t be clocked against 1/24th of day. Auto-filter wage_unit_of_pay for Hour, and you’ll bare 55 pay figures equalling or exceeding $48,000 – and if those rates are hourly, only one question remains to be asked: where can I download an application form? And the far smaller Bi-Weekly tranche likewise hands up a little cache of implausible compensations, as do the few Month position (can we definitively say that the $265,000 slot entertained by the Texas IPS PLLC firm in San Antonio is truly monthly? That application was withdrawn, by the way.) The Week pay data seem at least imaginable, however).

It would appear that these sore thumbs really mean to convey yearly salaries, and I know of no graceful means for redressing, apart from substituting Year in the suspect cells, and that isn’t graceful. Ok; I suppose some field-wide IF statement, e.g. if the number in wage_unit_of_pay exceeds 10000 AND the unit reads Bi-Weekly OR Hour, then enter Year in lieu of Hour with an accompanying Copy>Paste Values atop wage_unit_of_pay might placate the purists, but I’m not sure practicality would be best served that way.

In reality, of course, we can’t know if even the plausible compensation figures are right, and in view of the enormousness of the data set a press toward simplicity might bid us to merely work with the Year item, with a pivot table that goes something like this:

Row: employer_state

Values: wage_rate_of_pay_from (Average)

wage_rate_of_pay_from (again, this time Count)

Slicer: wage_unit_of_pay (select Year)

Sort the averages largest to smallest, and I get in excerpt:


Nice work, if you can get it. That’s Alaska heading the average salary hierarchy, though its visa-owning workforce is diminutive. Among the heavy hitters, Washington state (WA) – home of Microsoft, for one thing – checks in with an average emolument of $113,139; and indeed, an auto-filter tells me that over 4,000 of the Washington visas were requested by that very firm (you may happen to note, by the way, that one Microsoft-attesting record spells its headquarters city Redmnond, another data-quality eyebrow raiser. And there’s those 14 state blanks, too, though they sure pay well, if anonymously). The overall Year salary offer – and remember we’re working with the wage_rate_of_pay_from, the lower of the two salary tiers: over $87,000.

But inspection of the averages will affirm considerable interstate variation – e.g. major employer Texas, whose $79,823.51 mean thrusts it down to 46th place, though largest hirer California looks good with an average over $102,000. Accounting for the dispersions across the salary band might make for a story worth writing.
And if you’re interested in principal employers, try this table:

Rows: employer_name (Filter, Top 10)

Values: employer_name (Count). Sort largest to smallest.

I get:


Tech consultant giant Infosys and Tata rank first and second among the visa-sponsoring firms; Microsoft barely makes the top 10. But pad the table with the wage_rate_of_pay_from field, and again deploy a Slicer and tick its Year item, and we see:

v3Big bucks emanating from Redmond, no matter how you spell it.

And since I’ve lazed past the discrepant-time-unit problem by confining my pivot tabling to Year figures only, let me at least consider a workaround that would more-or-less reconcile the Year, Bi-Weekly, etc. units, provided all their salary data were in working order, so to speak. In order to say, map all the salary proposals to the Year baseline, I’d draw up this lookup table:


The idea here calls for looking up the wage_unit_of_pay for all cells, and multiplying the associated pay figure by the appropriate looked-up value. These, of course, are standardized educated guesses, assuming for example that a 2000-hour annual work investment is to be multiplied by the hourly rate and that a bi-weekly term comprises 25 paydays.

Those suppositions could, for any given visa case, be way off, but absent meaningful supplementary information, they’d have to do.

But for the reasons detailed above we didn’t go that route anyway. I don’t know about you, but I’m happy enough with 494,672 records.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: