US Visa Data, Part 2: An Excel Application

16 Oct

Promenade down the wage_rate_of_pay_from field lining the H-1B visa data set and you’ll likely saunter away with the impression that the jobs attaching to the H-1Bs are impressively compensated (the partner wage_rate_of_pay_to field that reports the apparent upper ranges among the proposed salaries is more spottily filled). These sound like pretty good, if impermanent, gigs, and as such there might be something to learn by, say, breaking out wages by state – once you try your hand at cinching a loose end or two that, bad metaphor notwithstanding, poses a knotty problem demanding your expert attention. (You’ll also learn that a very small proportion of visa applications – around 2% – aren’t seeking H-1Bs at all, but rather variant visas tied to country agreements with Chile, Australia, and Singapore.)

You’ll discover that the wage_unit_of_pay field in fact stocks a number of units, or demarcations of time, by which pay is calculated, e.g., Bi-Weekly, Year, etc. The problem points to a banal but integral data-quality matter. The pay figures in wage_rate_of_pay overwhelmingly key themselves to the Year unit – more than 93%, in fact – but more than 32,000 visas stipulate an hourly compensation, and some of these report pay figures that clearly can’t be clocked against 1/24th of a day. Auto-filter wage_unit_of_pay for Hour, and you’ll bare 55 pay figures equalling or exceeding $48,000 – and if those rates are hourly, only one question remains to be asked: where can I download an application form? The far smaller Bi-Weekly tranche likewise hands up a little cache of implausible compensations, as do the few Month positions (can we definitively say that the $265,000 slot entertained by the Texas IPS PLLC firm in San Antonio is truly monthly? That application was withdrawn, by the way). The Week pay data seem at least imaginable, however.

It would appear that these sore thumbs really mean to convey yearly salaries, and I know of no graceful means for redressing them, apart from substituting Year in the suspect cells, and that isn’t graceful. OK; I suppose some field-wide IF statement – e.g., if the number in wage_rate_of_pay_from exceeds 10000 AND the unit reads Bi-Weekly OR Hour, then enter Year in lieu of the existing unit, with an accompanying Copy>Paste Values atop wage_unit_of_pay – might placate the purists, but I’m not sure practicality would be best served that way.
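For the record, such a fix might look something like this – assuming for illustration that wage_rate_of_pay_from holds down column J and wage_unit_of_pay column L (your letters will differ) – entered in a helper column and copied down:

=IF(AND(J2>10000,OR(L2="Hour",L2="Bi-Weekly")),"Year",L2)

Follow with the Copy>Paste Values atop wage_unit_of_pay, and the suspect units are quietly annualized.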

In reality, of course, we can’t know if even the plausible compensation figures are right, and in view of the enormousness of the data set a press toward simplicity might bid us to merely work with the Year item, with a pivot table that goes something like this:

Row: employer_state

Values: wage_rate_of_pay_from (Average)

wage_rate_of_pay_from (again, this time Count)

Slicer: wage_unit_of_pay (select Year)

Sort the averages largest to smallest, and I get in excerpt:

visas1

Nice work, if you can get it. That’s Alaska heading the average salary hierarchy, though its visa-owning workforce is diminutive. Among the heavy hitters, Washington state (WA) – home of Microsoft, for one thing – checks in with an average emolument of $113,139; and indeed, an auto-filter tells me that over 4,000 of the Washington visas were requested by that very firm (you may happen to note, by the way, that one Microsoft-attesting record spells its headquarters city Redmnond, another data-quality eyebrow raiser. And there are those 14 state blanks, too, though they sure pay well, if anonymously). The overall Year salary offer – and remember we’re working with wage_rate_of_pay_from, the lower of the two salary tiers – comes to over $87,000.

But inspection of the averages will affirm considerable interstate variation – e.g. major employer Texas, whose $79,823.51 mean thrusts it down to 46th place, though largest hirer California looks good with an average over $102,000. Accounting for the dispersions across the salary band might make for a story worth writing.
And if you’re interested in principal employers, try this table:

Rows: employer_name (Filter, Top 10)

Values: employer_name (Count). Sort largest to smallest.

I get:

v2

Tech consulting giants Infosys and Tata rank first and second among the visa-sponsoring firms; Microsoft barely makes the top 10. But pad the table with the wage_rate_of_pay_from field, and again deploy a Slicer and tick its Year item, and we see:

v3

Big bucks emanating from Redmond, no matter how you spell it.

And since I’ve lazed past the discrepant-time-unit problem by confining my pivot tabling to Year figures only, let me at least consider a workaround that would more or less reconcile the Year, Bi-Weekly, etc. units, provided all their salary data were in working order, so to speak. In order to, say, map all the salary proposals to the Year baseline, I’d draw up this lookup table:

visas2

The idea here calls for looking up the wage_unit_of_pay for all cells, and multiplying the associated pay figure by the appropriate looked-up value. These, of course, are standardized educated guesses, assuming for example that a 2000-hour annual work investment is to be multiplied by the hourly rate and that a bi-weekly term comprises 25 paydays.

Those suppositions could, for any given visa case, be way off, but absent meaningful supplementary information, they’d have to do.
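In formula terms – again hedging on the column letters – with a pay figure in J2, its unit in L2, and the lookup table stationed in O2:P6 (Hour paired with 2000, Bi-Weekly with 25, and, extending the guesswork, Week with 50, Month with 12, and Year with 1), the annualized figure would emerge from:

=J2*VLOOKUP(L2,$O$2:$P$6,2,FALSE)

Copy down, and every salary proposal speaks the Year idiom.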

But for the reasons detailed above we didn’t go that route anyway. I don’t know about you, but I’m happy enough with 494,672 records.

US Visa Data, Part 1: An Excel Application

4 Oct

If you want to be a legal alien in the United States you’ll need a visa. If you’re technically skilled the visa of choice is the H-1B, described by the Enigma Public open data site as “…a non-immigrant visa that allows U.S. companies and organizations to temporarily employ foreign workers in specialty occupations.”

If you want to learn more about the specialty cadre, join Enigma for free, carry out the requisite click-throughs (the H-1B data are advertised on their home page) and you’ll find yourself here:

visa1

Opt for the 2017 edition, click the ensuing Open in Data Viewer button, and then click in turn the Export Dataset button centered among the trio of links below (though you can’t get this far unless you join Enigma and sign in):

visa2

Then break for tea time while the data mow a swath through your RAM and at last overtake your screen; and with 528,000 visa application records amassing 154 MB, that’s a lot of Liptons.

Yes, you’ll have to perform the usual column auto-fit necessities, and you may want to think about which fields could be safely sheared from the data set, in the interests of a healthful file downsizing. (You may also want to rename the character-ridden worksheet tab.) It seems to me we could do quite nicely – at least – without employer_phone, employer_phone_ext, agent_attorney_name, agent_attorney_city, agent_attorney_state, and the wholly desolate pw_wage_source_year. These truncations have the effect of scaling my workbook down to a svelte 86 MB. (You may want to retain employer_country, though; several nations other than the United States are listed therein.)

Once having thinned the file we can go to work, starting with a simple résumé of the number of visa requests by state. My pivot table looks like this:

Rows: employer_state

Values: total_workers (sum; sort largest to smallest)

total_workers (again, here % of Column Total)

You may also want to filter out the very small number (25) of blank records.

Note, by the way, that total_workers doesn’t always register one worker per visa request; a great many of the applications record multiple applicants.

My pivot table looks like this, in excerpt:

visa3

High-population, hi-tech California submits nearly a quarter of all the H-1B applications, with the top three states, including Pennsylvania and Texas, handing in 52% of them all. Grand total of H-1Bs: over a million.

(Note that the output comprises 57 rows, because the application data counts the District of Columbia (DC, the nation’s capital, not accorded state status) and US territories: Puerto Rico (PR), the Virgin Islands (VI), Guam (GU), Micronesia (FM), the Northern Mariana Islands (MP), and American Samoa (AS).)

And what proportion of application requests were approved? The pivot table should furnish the answer:

Rows:  case_status

Values: case_status (count, % of Column Totals).

total_workers (sum, % of Column Totals)

I get:

visa4

We find nearly 89% of the applications win governmental approval, with another 7% or so securing the okay even as the submission was withdrawn, for whatever reason. The certifications in turn accredit an even larger set of actual applicants; including the certified withdrawals, over 96% of the individuals seeking the H1B visa acquired it, at least for the time frame within which we’re working. Thus we see, latter-day controversy and clamor over immigration notwithstanding, that these visa applicants, ones in possession of specialist skill sets, are almost always shown the open door – remembering at the same time, however, that by definition they aren’t long-term immigrants.

We can next quickly recapitulate the average wait time distancing an H-1B visa application from its decision date, via a pretty straightforward array formula asking us in effect to simply subtract all submission dates from decision dates, stored in the case_submitted and decision_date fields respectively. In the interests of key-stroke economy I named the fields sub and dec, and entered, somewhere:

{=AVERAGE(dec-sub)}

That pert expression subtracts each submission from its corresponding decision date and averages them all (the braces, by the way, aren’t typed; they materialize once the formula is committed with Ctrl+Shift+Enter). I get an average wait of 31.21 days per application, though a scan of individual waits discloses very substantial variation.

I then wondered about the comparative handful of application denials. Do they evince notably different wait times, in view of the fact that these entries culminate in rejection? For an answer I named the case_status field stat and entered this array formula, again somewhere:

{=AVERAGE(IF(stat="DENIED",dec-sub))}

The formula nests an IF statement into the mix, figuring the wait times only for those entries in case_status that read DENIED (AVERAGE obligingly ignores the FALSE results the IF hands back for every other record). In any case, I get a denial average wait of 4.06 days, a far swifter set of judgments. Perhaps those applications triggering denials are so egregiously ill-suited that the decision becomes easy – though that’s merely plausible speculation.

In the meantime, I’m getting excited about filing my own visa request. After all, my specialist skill set is formidable – how many left-handed spreadsheet devotees have so few Twitter followers? I’m stoked.

But I just remembered. I’m ineligible for a visa; I’m a US citizen.

America’s Schools, Part 2: Eyeing the Pupil Data

19 Sep

The question with which we wound up the previous post bears repeating, so let me repeat it: does a cognizance of the data quality issues that compromise the US school dataset inflict a failing grade upon the workbook? That is, are we right to proceed with a look at the data just the same and give them a proper hearing, or should we bail out and take our pivot tables elsewhere?

I’d allow that the answer I submitted in that post bears repeating too – i.e., something like a yes. It seems to me that an analysis should properly go ahead, because the weight of the records – all 97,000 or so of the usable ones (let’s remember that number) – should, by virtue of its very size, synergize a plausible summary of the shape and quantity of America’s schools. Again, of course, the numbers are “wrong”; but they’re invariably wrong for any demographic take built on US census data, which after all date from 2010. Still, some general picture of the country’s institutional landscape should be there for the limning, if we’re persuaded.

We can commence, then, by performing a simple count and breakout of schools by state:

Rows: STATE

Values: STATE (Count, sorted Largest to Smallest)

I get, in excerpt:

school1

You’re not likely to be surprised by the above. America’s most populous state, California, predictably registers the most schools, though a check of Wikipedia’s state population totals, furnished both for 2010 census aggregates and 2016 estimates, points to a relation between population and school numbers that isn’t quite linear. Note in addition that our pivot table makes no place for the 28 schools in American Samoa, and in this case it could have; you’ll recall that we provisionally banished that territory’s records from the data because of the unavailability of its student totals. But here, because we’re counting school names alone, the AS data could have been accommodated – simply because its school names, at least, are indeed listed.

Next, we could just as easily consider the average number of students per school by state, but before pressing ahead entertain a pre-verification surmise: namely, do the more populous states average more students per school? That’s a sensible conjecture, but one that by no means invariably follows; after all, it’s possible that the sparser states simply maintain fewer schools, whose average size compares with the denser states’.

Let’s see. You can leave STATE alone in Values, and add to it ENROLLMENT, ticking Average. Then sort the results by ENROLLMENT. I get, in part:

school2

That’s Guam at the head of the class, its 40 schools averaging 781 students per institution (again, the bundle of 55 “STATEs” owes its hyper-count to the inclusion of American territories). At bottom sits the scantily populated Montana, whose per-school average of 171 makes it easy for teachers to space their students at exam time, if nothing else.

The disparities here, then, affirm a most mixed set of students-per-school averages, recalling the question of whether the numbers of schools by state indeed correlate with average school size. A rough read on that relationship becomes available with CORREL, in my case looking like this:

=CORREL(B4:B58,C4:C58)

And that expression evaluates to .3117, mustering a loose but meaningful association between the two variables – that is, as the number of schools across the states ascends, so does average school size. At least slightly.

And what of teacher-student ratios across the states? Again, the incipient question, one we’ve submitted several times in the past, needs to be asked and answered: does one calculate the ratios by according equal weight to every school irrespective of absolute school size, or does one realize a grand average of all of a state’s students divided by all its teachers (apparently defined here as Full Time, and so perhaps deeming two half-time instructors as one full-timer) instead?

In fact, both alternatives are credible. There may be good analytical reason to treat schools as equally-sized entities in the averages, in the same way, for example, that we could compare the standards of living of residents in different countries. And weighting 100 and 1000-student schools equivalently can’t be worse than granting one vote to every country in the United Nations General Assembly or apportioning two US Senators to each state, can it?

But before we tangle with the numbers we need to broaden our look at the FT_TEACHER field, because a measurable batch of records there reports faculty complements at zero or in the negative numbers. Sort the field Largest to Smallest, and lower a blank row atop the first zero. But you know that routine.

Now if we play the above option, by in effect averaging each student-faculty ratio, we need to title the next available column (where that is depends on how many fields you deleted in the last post) StuStaffRatio or something like it, and simply enter a formula in row 2 dividing ENROLLMENT by FT_TEACHER and copy down (again, column references may vary; a sketch follows below).
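If, say, ENROLLMENT wound up in column R and FT_TEACHER in column S after the deletions (your letters may well differ), the row 2 entry would read simply:

=R2/S2

Back in the pivot table, you need to rewrite the range coordinates of the source data set (in PivotTable Tools > Analyze > Change Data Source), and follow up by recruiting StuStaffRatio into Values, averaging the results to two decimal points. I get, in excerpt, after sorting the field Largest to Smallest: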

school3

The inter-state variation is considerable, and it’s most noteworthy that, large school sizes notwithstanding, Guam has managed to ensure a pleasingly low (13.75) student-to-teacher proportion – so low you can’t see it in the screen shot – with California faring worst, at 23.76.

Now a suitable null hypothesis might foretell a perfect correlation between average enrollments and faculty sizes, irrespective of the absolute state numbers involved. Thus alerted to that prospect, we could write

=CORREL(C4:C58,D4:D58)

I get .4076, pretty serviceable by social-scientific standards, but surely not a 1. In other words, some positive association obtains between school student sizes and their teaching teams.

And speaking of school numbers, if you sort the ENROLLMENT field Largest to Smallest you’ll discover the honors going to the Ohio Virtual Academy, an online institution that serves Kindergartners through 12th graders and 11,640 students – somewhere (its governmental web page counts 11,278 as of the 2014-15 school year).

That means there are some crossing-guard monitors in Ohio with a whole lot of time on their hands.

America’s Schools, Part 1: Some Truants Among the Data

8 Sep

The segue is inadvertent, but calling up a census of America’s schools right after our look at New York school attendance data makes for a deft transition, if I may say so myself, and I think I’ve just granted permission to do so.

This nationwide listing – 105,000 institutions strong – is curiously archived on a site putatively devoted to information about Hurricane Harvey. I’m not sure about the placement, but that’s where it is.

And it’s pretty big, as you’d expect, counting 105,087 schools in its fold and pitching 24 MB at your hard drive, after a download and a save as an Excel workbook. (Note: the data unroll stats for the country’s public – that is, government-operated – schools only. The very large number of private and sectarian institutions diffused across the US are thus excluded from the inventory.) And if you run a simple SUM at the base of column AC, the field storing student enrollment numbers, you’ll wind up with 50,038,887, and that’s pretty big, too.
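That summation, by the way – assuming the field headers occupy row 1, so that the 105,087 records run through row 105088 – asks for nothing fancier than:

=SUM(AC2:AC105088)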

But of course, that number can’t get it exactly right. For one thing, the overview introducing the workbook tells us that the data feature “…all Public elementary and secondary education facilities in the United States as defined by…National Center for Education Statistics…for the 2012-2013 year”. And since then a few hundred thousand little Justins and Caitlins will have moved on to other venues, to be replaced by littler Treys and Emmas – and the turnover just can’t be equivalent. Moreover the (apparent) Source Dates recorded in column X track back to 2009 in many cases, though I don’t completely know how those dates are to be squared with the 2012-2013 reporting period.

Now apart from the as-usual column autofits with which the dataset obliges you, you may also want to shear those fields likely not to figure in any analysis, though that of course is something of a judgement call. In view of the virtual equivalence of the X and Y data in A and B with those in the LATITUDE and LONGITUDE parameters in S and T, I’d do away with the former pair. I’d also mothball ADDRESS2 (and maybe ADDRESS, too – will you need their contents?). I’d surely dispense with the NAICS_CODE entries, as each and every cell among them declaims the same 611110. And I think VAL_METHOD, VAL_DATE, SOURCE (storing basic information about the school committed to web sites), and probably SHELTER_ID could be asked to leave as well, lightening my workbook by about 5.3 MB all told. On the other hand, WEBSITE appears to have done nothing but clone the contents of SOURCE and as such could assumedly be dispatched as well, but I’ve since learned that the sites offer up some useful corroborating information about the schools, and so I’d retain it. But a field I would assuredly not delete, in spite of my early determination to do so, is COUNTRY. I had misled myself into believing the field comprised nothing but the USA legend, but in fact it entertains a smattering of other geopolitical references, e.g. GU for Guam, PR for Puerto Rico, and ASM for what I take to be American Samoa.

I’m also not sure all the Manhattan schools (the ones in New York county, that is) display their correct zip codes for what it’s worth, and it might be worth something. The Beacon High School on West 61st Street is zip-coded 10022, even as it belongs, or belonged, to 10023 (though that wrong zip code informs a review of the school by US News and World Report); but the error may be excused by an updated reality: the Beacon School moved to West 44th Street in 2015, calling the timeliness of our data into a reiterated question. I’m equally uncertain why the Growing Up Green Charter School in Long Island City, Queens is mapped into New York county.

More pause-giving, perhaps, are the 1773 schools discovered inside New York City’s five counties – New York, Queens, the Bronx, Brooklyn (Kings County), and Richmond (or Staten Island; note that a Richmond county appears in several states in addition to New York). You’ll recall that our posts on New York’s attendance data, drawn from the city’s open data site, numbered about 1590 institutions. Thus any story-monger would need to research the discrepancy, but in any case it is clear that the dataset before us errs on the side of inclusiveness.

But a lengthier pause punctuates a Largest-to-Smallest sort of the ENROLLMENT field. Drop down to the lowest reaches of the sort and you’ll find 1186 schools registering a population of 0, another 1462 reporting -1, 4493 sighting -2 persons on their premises, and 91 more submitting a contingent of -9. Moreover, you’ll have to think about the 5399 schools counting a POPULATION (a composite of the ENROLLMENT and FT_TEACHER fields) of -999. It’s not too adventurous to suggest that these have been appointed stand-ins for NA.

In addition, we need to think about the schools declaring only 1 or 2 students on their rolls. Consider for example the Marion Technical Institute in Ocala, Florida and its 1 student and 34 full-time teachers. Visit its web site, however, and we encounter a more current student enrollment of 3 and an FTE (full-time equivalent) instructional complement of 37 (as of the 2015-16 school year), not very far from what our database maintains. But at the same time many of the 1-student schools are accompanied by FT_TEACHER values of 1 or 0 as well, and these microscopic demographics demand scrutiny. The web site for the Bald Rock Community Day school in Berry Creek, California, for example, reveals no enrollment/teacher information at all.

What to do, then? It seems to me that any school disclosing a negative or zero enrollment – and now sorting the ENROLLMENT field highest-to-lowest will jam all of these to the bottom of the data set – should be disowned from the data set via our standard interpolation of a blank row atop row 97407, where the first zero figure sits. We’ve thus preserved these curious entries for subsequent use should their other fields prove material.

And all that begs the larger question tramping, or trampling, through the data: how much time, effort, and money should properly be outlaid in order to support the vetting of 100,000 records? Multiple answers could be proposed, but there’s a follow-on question, too: in light of the issues encountered above, should the data in the public schools workbook be analysed at all?

Well, if we’ve come this far, why not?

NYC School Attendance Data, Part 2: What I’ve Learned

23 Aug

Once we’ve decided we’re pleased with the New York City school attendance data in their current, emended state (per last week’s post), we can move on to ask some obvious but edifying questions about what we’ve found.

First, a breakout of attendance percentages by day of the week is something we – and certainly Board of Education officials – will want to see. In that connection, we again need to decide if we want to break out the attendance percentages arrayed in the %_OF_ATTD_TAKEN field, and/or the numbers we derived with our ActualTotals calculated field, the latter according numerical parity to each and every student; as such, it seems to me that ActualTotals is fitter for purpose here (of course we could deploy both fields, but let me err on the side of presentational tidiness).

But in the course of tooling through and sorting the data by the above %_OF_ATTD_TAKEN, I met up with a few additional complications. Sort that field Smallest to Largest, and you’ll have gathered a large number of records reporting days on which absolutely no students attended their respective schools – 7,245 to be exact; and while an accounting for these nullities can’t be developed directly from the dataset, we could be facing an instance of mass, errant data entry, and/or evidence of a requirement to furnish a daily record for a day on which classes weren’t held. And in fact, over 14,000 records attest to attendance levels beneath 50% on their days, and I don’t know what that means either. It all justifies a concerted look.

But in the interests of drawing a line somewhere, let’s sort %_OF_ATTD_TAKEN Largest to Smallest and split the data above row 513796 – the first to bear a 0 attendance percentage – with a blank row, thus preserving an operative, remaining dataset of 513,794 records. But still, I’d submit that more thinking needs to be done about the low-attendance data.

Returning now to our day-of-the-week concerns, the pivot table that follows is rather straightforward:

Rows: Day

Values: ActualTotals

I get:

attend1

(Note that you’d likely want to rename that default Sum of ActualTotals header, because the calculated field formula itself comprises an average, in effect – NumberStudents/REGISTER*100. You’ll also want to know that calculated fields gray out the Summarize Values option, and thus invariably and only sum their data. Remember also that 2 signifies Monday.)

I for one was surprised by the near-constancy of the above figures. I would have assumed that the centrifugal pull of the fringes of the week – Monday and Friday – would have induced a cohort of no-shows larger than the ones we see, though attendance indeed slinks back a bit on those two days. But near-constancy does feature strikingly in the middle of the week.

And what of comparative attendance rates by borough? Remember we manufactured a Borough field last week, and so:

Rows: Borough

Values: ActualTotals

I get:

attend2

By way of reorientation, those initials point to these boroughs:

K – Brooklyn

M – Manhattan

Q – Queens

R – Richmond (Staten Island)

X – The Bronx

The disparities here are instructive. Queens students are literally the most attentive, with Bronx students the least. Of course, these outcomes call for a close drilldown into the contributory variables – e.g., economic class, ethnicity, and more – that can’t be performed here.

We can next try to learn something about attendance rates by month, understanding that the data encompass two school years. Try

Rows: Date (grouped by Months only)

Values: ActualTotals

I get:

attend3

The school year of course commences in September, with those early days perhaps instilling a nascent, if impermanent, ardor for heading to class. We see that attendance peaks in October, and begins to incline toward the overall average in December.

The question needs to be asked about June, or Junes, in which the attendance aggregate crashes to 85.21%, deteriorating 4.69% from the preceding May(s). While explanations do not volunteer themselves from the data, an obvious surmise rises to the surface – namely, that students beholding the year’s finish line, and perhaps having completed all material schoolwork and exams, may have decided to grab a few discretionary absences here and there. It’s been known to happen.

But let’s get back to those zero-attendance days and their polar opposite, days in which every student on a school’s roster appeared, or at least was there at 4 pm. The data show 1641 records in which each and every enrollee in the referenced institution was there, a count that includes 31 days’ worth of school code 02M475, i.e. Manhattan’s renowned Stuyvesant High School; a pretty extraordinary feat, in view of the school’s complement of around 3,300. And while we’re distributing kudos, mark down September 21, 2016, the day on which all 3,965 of Staten Island’s Tottenville High School registrants showed up, and June 24 of that year – a Friday, no less – on which the entire, 3,806-strong enrollment of Queens’ Forest Hills High School settled into their home rooms. But ok; you could insist that these laudable numbers should likewise be subjected to a round or two of scrutiny, and you’d probably be right.

Now for a bit of granularity, we could simply calculate the average daily attendance rates for each school and sort the results, and at the same time get some sense whether attendance correlates with school size as well. It could look something like this:

Rows: SCHOOL_DBN

Values: REGISTER (Average, to two decimals)

%_OF_ATTD_TAKEN (Average, to two decimals)

Remember first of all that a given school’s enrollment is by no means constant, swinging slightly both within and across school years. Remember as well that because you can’t average calculated field totals, I’ve reverted to the %_OF_ATTD_TAKEN field that’s native to the data set.

Sort by %_OF_ATTD_TAKEN from Largest to Smallest and I get, in excerpt:

attend4

That’s the Christa McAuliffe School (named after the astronaut who died in the Challenger explosion) in Bensonhurst, Brooklyn on the valedictorian’s podium, followed very closely by the Vincent D. Grippo school (physically close to the McAuliffe school, too). And if you’re wondering, I find Stuyvesant High School in the 907th position, with its daily attendance average put at 90.87. Tottenville High, interestingly enough, pulls in at 1155 out of the 1590 schools, its average figuring to a below-average 87.43. At the low end, the Research and Service High School in Brooklyn’s Crown Heights reports a disturbing 41.84% reading, topped slightly by the Bronx’s Crotona Academy High School (remember that you can learn more about each school by entering its code in Google). These readings merit a more determined look, too.

And for that correlation between school size and attendance: because my pivot data set out on row 4, I could enter, somewhere:

=CORREL(B4:B1593,C4:C1593)

I get .170, a small positive association between the parameters. That is, with increased school size comes a trifling increment in attendance rates.

But if you want to learn more about correlations you’ll have to come to class.

NYC Attendance Data, Part 1: Some Necessary Homework

14 Aug

This open data thing is cool; they’ve even drummed up a report, available to every inhabitant on planet earth give or take a few censors, on my very own elementary school, P.S. 165 in Queens, New York City.

No; you won’t find my name large-typed anywhere in the record of 165’s distinguished alumni, but this is an academic status report, after all, and I’m prepared to forgive the omission. The larger point – I think – is that data – attendance data – about P.S. 165, and all of its 1600 or so companion institutions comprising New York’s public school system, are present and accounted for here. Click the Download link, proceed to the CSV for Excel option (there’s a Europe regional-formatted version there, too), and wait for its 524,000 records to find their place in your virtual lecture hall.

The worksheet roll-calls attendance numbers for the days filling the September-June school years 2015-16 and 2016-17, and don’t be fooled by its understated five fields; at least three additional parameters can be easily synthesized from those five, and to productive effect. (Note: Apart from the need to autofit the field columns, keep in mind that the entries in the REGISTER field tabulate the school’s official enrollment as of the given day, and not the number of students who actually made their way to class.)

But the data do require a touch of vetting before the analysis proceeds. Row 455543, if you’re keeping score, is in effect something of a blank, save a cryptic notation in the A column that won’t further our reporting interests. As such, the row should be, er…expelled. But another, more troublesome set of entries needs to be confronted and ultimately unseated from the dataset.

Start by running this pivot table at the records:

Row: DATE (ungroup if you’re on Excel 2016)

Values: REGISTER (sum)

Sort Largest to Smallest. I get, in excerpt:

att1

The two uppermost dates – January 13, 2016 and February 2 of that year – exhibit suspiciously large register totals, just about doubling the other sums. It appears as if the records for that pair of dates have simply, if unaccountably, been duplicated – and as such one-half of these have to be sent on their way.

And that sounds to me like a call for Remove Duplicates. Click its trusty button on the Data ribbon, and tick:

att2

Follow through, and you’ll be told that 3172 duplicate values – which sounds precisely duplicative – have been found and shed from the dataset.

Then inspect the lower reaches of that original sort and you’ll come upon more than 20 days, the great bulk of which are dated somewhere in June (i.e. the last month of a school year), with aggregate totals less than 800,000 – substantially beneath the totals above them. I suspect some reporting shortfall is at work, and indeed the legend attaching to the data on the Open Data site states the numbers may not be final “…as schools continue to submit data after preliminary report is generated”. We’re also left to explain April 18 of this year, for example, a date on which we’re told 819 students actually appeared, on what was the last day of the spring recess. Presumably they were there for some good educational or administrative reason – or the record is there by mistake (those 819 come from two schools, the Bard High School Early College institution in Queens and the School of Integrated Learning in Brooklyn).

Now we can go about deriving those three fields I advertised above. I’d first head into column F, title it Day, and enter in F2:

=WEEKDAY(B2)

And copy down. We’ve certainly seen this before; WEEKDAY will tease out the day of the week from a date-bearing cell (with 1 denoting Sunday by default), and so presages any look at attendance figures by day.

Next, we could entertain an interest in breaking out attendance data by each of New York’s five boroughs – these the object of a neat, single-lettered code in the SCHOOL_DBN field, squirreled in the third position of each code:

K – Brooklyn (Brooklyn is officially named Kings County)

M – Manhattan

Q – Queens

R – Richmond (aka Staten Island)

X – Bronx

Thus we could name G1 Borough and enter in G2:

=MID(C2,3,1)

That is, the above expression collects one character from C2, starting (and finishing) at the third character in the cell – in the case of C2, X, or Bronx. (By the way, if you enter a school’s code in Google you’ll be brought to its web page, including P.S. 165 and 25Q425, or John Bowne High School, another of my academic ports of call). Then copy down G.

The final of our three supplementary fields would simply return the actual number of students who appeared in a given school on a given day, or at least the ones who were there to be counted at 4 pm; and while of course the dataset already reports schools’ daily attendance percentages, information that might tell us what we want to know, we’re again made to revisit an old but material problem: by manipulating percentages alone we in effect weight each school identically, irrespective of student body size. And in fact that treatment might come to serve certain analytical purposes rather well, because there may be good reason to regard schools as equivalently important units. But for the same money we may want to have it both ways, by compiling stats that give each school its proportionate due. Thus we can title H NumberStudents and enter, in H2:

=ROUND(E2*D2/100,0)

If that expression seems needlessly ornate, keep in mind that the values holding down the %_OF_ATTD_TAKEN field in D aren’t percentages, but rather percentages multiplied by 100; as such they need to be cut down to size by a like decrement. And the ROUND addendum means to quash any decimals that don’t quite do the results justice. After all, if 84.2% of a school’s 1,042 charges show up, should we let the Board of Education know that 905.15 of them have been sighted in the building? I’d bet that .15 has at least one foot out the door.

But do these definitional niceties matter anyway? We can find out by first setting a pivot table in motion and churning out a calculated field, which I’ll call ActualTotals:

att3

(And no, you won’t need parentheses around the two fields.) As for the 100 multiplier, it will put the results in alignment with those %_OF_ATTD_TAKEN numbers, which again exhibit a similar order of magnitude – that is, the latter comprise percentage data times 100.
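In other words, the calculated field simply divides the one field by the other and scales the quotient:

=NumberStudents/REGISTER*100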

The pivot table looks then like this:

Rows: SCHOOL_YEAR

Values: %_OF_ATTD_TAKEN (average, to two decimals)

ActualTotals (average, two decimals)

I get:

att2

The differences are small but real.

But ok, class; let’s take recess. Just remember to be back by…4 pm.

 

Hit or Miss: Baseball’s Home Run and Strikeout Profusions

31 Jul

See the ball, hit the ball, advised the aphorist Pete Rose, with his winsome epigrammatic flair. And in fact, in the course of his day job as a major league baseball player, Mr. Rose took his reductive prescription to heart: the record shows he hit the ball 12,910 times, 4,256 of which resulted in that outcome, curiously and redundantly, called a… hit.

To be sure, Rose missed the ball – or struck out – 1,143 times, but as a percentage of his 14,053 times at bat, his 8.13% worth of misses is impressively low. And Rose would be interested to know that major league batters are missing the ball more often than at any time in the history of his chosen profession.

Batters in 2016 struck out in 23.5% of their at-bats, or .235 per the game’s three-digit formatting heritage (so don’t ask me why earned-run averages are squeezed to merely two decimals, though I have a theory); and the strikeout proportion thus far for this year (courtesy of Baseball-Reference.com) has ticked up to .241.

But that narrative of futility has been thickened by a sub-plot: major-league batters are also hitting more home runs than ever, and the dialectic of hit-or-miss isn’t quite the contradiction it seems. After all, you’ll agree it’s easy to miss a 98-mile-per-hour fastball – all the more so if your neurons are forced to contend with a 78-mile-per-hour curve ball instead. Surprise. But manage to make contact with that fastball and the results could go a long way (see this consideration of the home-run/strikeout relationship in the New York Times, with particular attention paid to the home-run half of the equation).

To learn a bit more about the home-run/strikeout antinomy it would be a good idea to consult Sean Lahman’s free, go-to Baseball Database, as we have in earlier posts (look here, for example. Remember that Mr. Lahman will happily entertain contributions.) Click the 2016 – comma-delimited version link, and another click upon the resulting zip file will empty its spreadsheet files into its folder. Then call up the Teams workbook. (Note Lahman advises that the files work best in a relational setting, but our analytic interests here can be sufficiently served by Teams all by itself.) Once we’ve gotten here we can throw a few simple calculated fields into a pivot table to productive effect, and join them up with a couple of applications of the CORREL function for our information.

To start, we could drum up yearly strikeout percentages – that is, strikeouts divided by at bats – and then for presentation purposes proceed to group the outcomes by, say, bins of five years:

Rows: yearID

I’d group the years thusly:

hr1

I’ve earmarked the above span because some of the pre-1910 years have no strikeout data, and because the 1912-2016 interval comprises 105 years, yielding 21 equally-sized tranches.

Next I could devise this calculated field, which I’ve called sopct:

 

hr2
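In formula terms – SO and AB being the Teams workbook’s strikeout and at-bat fields – sopct divides the one by the other:

=SO/AB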

And garb the resulting values with this custom format:

hr3

And filter out the <1912 data.

When the dust settles I get:

hr4

The ascendancy of strikeouts is clear. Note the distinctly impermanent fallback in the strikeout percentages in the 1972-1981 era, a predictable and intended consequence of the lowering of the pitching mound in 1969 and its literal levelling of the playing field for hitters. But nevertheless the trend soon recovered its arc.

We could next calculate a similar field for home runs, calling it hrpct:

hr5
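That field turns on Lahman’s HR (home run) column:

=HR/AB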

Applying a similar grouping and formatting to the field, I get:

hr6

The trendline is considerably flatter here, and indeed seems to have peaked during the 1997-2001 tranche – not surprisingly, because it was during that fin de siècle epoch that steroid use among ballplayers apparently peaked as well, thus fomenting, among other mighty anomalies, Barry Bonds’ surreal 73 home runs in 2001, and in 476 at bats. Ungroup the yearID field momentarily and treat the numbers to a four-decimal format, and you’ll discover a home-run high of .0340 in 2000 – but note the .0339 for last year as well, succeeded in turn by this year’s all-time, relatively drug-free .0368 (again, check Baseball Reference for the to-the-minute totals).

Now what then about the association between strikeouts and home runs? Baseball common sense would predict an appreciably positive correlation between the two; the harder the swings at those fastballs, one could suppose, the more misses – but along with these flows the expectation of more long balls, too, once and if bat actually meets ball.

To palliate our curiosity, we can trot out both sopct and hrpct into Values, and continue to leave yearID ungrouped. With the strikeout and home run values stationing themselves in columns B and C and the year 1910 making itself at home on row 43 – a positioning that by extension locks the 2016 data into row 149 – I can enter, somewhere:

=CORREL(B43:B149,C43:C149)

That expression returns a most impressive, almost unnerving .845, shouting the view that, as strikeouts go, so do home runs – and virtually in lockstep.

We can then go on to ask about the relationship between strikeouts and batting average. Might it follow that, with the increasing, overall failure to hit the ball, averages should falter correspondingly – simply because fewer balls will have been put into play? You can’t get a hit if you don’t hit the ball, it seems to me. Or do we allow that, because hard swingers hit the ball harder (when they do), a hard-hit ball is harder to catch?

We can check that one out. We can mint still another calculated field, which I’ll call baagg (for batting average aggregate):

hr7
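That one divides the Teams workbook’s H (hits) field by at bats:

=H/AB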

Replace hrpct in Values with baagg, leave sopct in place, and because the operative ranges don’t change, the CORREL we authored above should rewrite itself, yielding -.477 – a notable inverse association (that’s a minus sign preceding the value, not a dash). That is, as strikeouts mount, batting averages do tend to respond downwards. And while it’s true that -.477 isn’t .845, most social scientists would be thrilled to even have that number crunched upon their desk.

And I suspect Pete Rose is smiling, smugly, too.