Class Attendance, Part 2: British Elites and Their Schools

12 Sep

Now that we’ve righted those faux schooling percentages – that is, the ranges of school-attendance-reporting, whole numbers posing as fractions of 1 – of the achievers enrolled in the Elite Britain study’s spreadsheet – a few additional nuts and bolts need to be tightened in turn (and it goes without saying that the rectifications described in last week’s post work for the Profession by school type sheet in the workbook, too).

First, since we’ve moved past the percent misrepresentations, the field-heading (%) indicators dotting columns E through H need to go. Second, I’m not crazy about the Percentage known field; its informational contribution is trifling, because if necessary those percentages could be derived (Population known/Population size, after all), and a little design parsimony won’t impoverish the narrative.

One noteworthy stat suggests itself: the proportion of university attendees who walked through the ivied doors of Russell Group institutions. These “leading” schools – 24 in number (out of a couple of hundred in the country all told; how the global count tallies depends on designations of institutes and the like) – seemed to have commanded the lion’s share of elites’ tuitions; by simply dividing the summed Russell Group by Any University fields we arrive at 67.55% of all those in the survey who plied some manner of higher education, correlative evidence of an “elite” clustering. On the other hand, of course, correlations don’t prove a relation of necessity between A and B.  In theory, at least, a modestly-backgrounded young person could rightfully earn a place at an estimable school, an attainment which by itself would not serve to clinch the “elite” thesis. Speaking in the abstract, a duly deserved seat somewhere in the Russell Group would simply and only validate the achiever’s just desert. An equitably framed reward system that succeeds in promoting the best and brightest and assembling these in better schools (however understood) could be said to be doing nothing but its job; but of course, our data can’t by themselves point to that ideal.  Sociological robustness would call for research traveling back to the square ones – the economic class inceptions of the 4,200 biographies numbered here, before untrammelled merit could be properly ushered into the equation.

Of course, a purely methodological question jabs at all of the above: the disproportions of individual profession counts yanking that 67.55% in…proportion to their numbers. Residents of the House of Lords and Parliament contribute 31% of the survey complement, but in fact their Russell Group/Any University figure of 70% doesn’t badly skew the overall ratio. And if instead we divide all Russell Groupers by the entire Population known, the aggregate 55.85% (that’s sans the House numbers) compares most comparably with the Houses’ 57.73%; not much difference there at all.
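To see why the bigger professions tug the aggregate toward themselves, here is a minimal Python sketch. The counts are invented for illustration and are not taken from the workbook; the only point is that dividing the summed Russell Group field by the summed Any University field amounts to a size-weighted average of the per-profession ratios.

```python
# Hypothetical counts, NOT from the Elitist Britain workbook:
# profession -> (attended any university, attended a Russell Group school)
professions = {
    "Profession A": (1000, 700),
    "Profession B": (200, 100),
    "Profession C": (50, 40),
}

any_total = sum(any_uni for any_uni, _ in professions.values())
russell_total = sum(russell for _, russell in professions.values())

# The approach described above: summed Russell Group / summed Any University.
aggregate_ratio = russell_total / any_total

# The same number, seen as per-profession ratios weighted by profession size.
weighted_ratio = sum(
    (russell / any_uni) * (any_uni / any_total)
    for any_uni, russell in professions.values()
)

# For contrast: an unweighted mean lets every profession count equally.
simple_mean = sum(r / a for a, r in professions.values()) / len(professions)

print(aggregate_ratio, weighted_ratio, simple_mean)
```

With these made-up figures the aggregate and the weighted ratio coincide, while the simple mean drifts away from both; how far it drifts in the real data depends on the real counts.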

And a final consideration might qualify as something of a non-finding, notable precisely in virtue of its apparent uneventfulness. As we have them, the data don’t seem, to fall back upon a perfectly dreadful modifier, particularly actionable. If anything, they resemble a pivot table, fait accompli – broken out, kind of, into their fields and items (and that’s a nasty equivocation to be sure) in a kind of done-deal finality. As such one could wonder where the data could be taken next, and I’m not quite prepared to signal a direction. Is there such a thing as a spreadsheet that has nothing, or nothing else, to teach us? Maybe, but only maybe, and the matter isn’t wholly captive to a smallness in the numbers that might stunt the analysis. One of the issues here asks after the prospects for aggregation that the data hold out, as they presently stand. One could, for example, perhaps dispatch a pivot table to group the professions and the associated numbers by their respective sizes, but I’m not sure what profit would accumulate thereby. Do university percentages respond to sector largeness or smallness? You’d have to look long and hard before you’d put your investigative standing behind anything other than the null hypothesis here.

What might work is a coding scheme that would align sectors by some shared nature, and let the school numbers fan out accordingly; but that plan too would have to be preceded by some deep thinking.

So here, then, is one of the larger questions: take a pivot table, the product of some concerted, reasonably well-devised intention, and treat it as an initial data set. Where do you take it next?

I think that’s a fair question – at least a reasonably well-devised one.

Class Attendance: British Elites and Their Schools, Part 1

5 Sep

Who put the higher in hierarchy? I don’t know, but the reality of class and its near-ubiquitous sidekick, the ruling elite(s), have been giving sociologists something to write about for the past 150 years or so. And if the definition-mongers can’t decide what class really means, the very tenacity of the debate has entrenched the term deeply in the discourse.

And if the stereotypes have it even half-right, nowhere do the beguilements of class charm the masses so seductively as in England; and a recent Guardian spreadsheet, its data ported from a study by the Social Mobility and Child Poverty Commission, advances education as a proxy attestation for the imponderable class. Here it is:

 Background of different professions

The workbook’s two sheets plant their feet on two educational rungs – (pre-university) school and university types – and see how their distributions break across a set of moving-and-shaking professions (which aren’t identical in the sheets, by the way). The samples are small, but the universes from which they’re drawn are transparently described in the Elitist Britain methodology addendum, likewise available at the above link (as is the Methodology – Annex spreadsheet, which actually names the individuals surveyed). And the samples are largely just that – fractions of the smallish universe pools, for example, BBC Executives, clarified by the addendum as “Senior BBC Executives on the BBC website for transparency purposes”. The spreadsheet numbers 187 of the functionaries, but the Profession by school type tab finds actual information for but 93 of them (and 125 on the Profession by higher education sheet).

At least it’s all up-front, and the workbook means to put an evidentiary face on British elitism by tracking the higher achievers on its radar and their schools. And once you get past the perennial column auto-fits and confront the Profession by higher education figures, the point is more-or-less made. 59% of the Cabinet members for whom the data obtain, for example, were Oxbridge attendees, with 77% having found their way into a Russell Group school (the Group is a consortium of “better” universities). Still, you’ll find interesting if expectable variation, e.g. the 0% Oxbridge representation among pop stars (defined as “UK artists who had one of the top 40 selling albums of 2010, 2011, 2012 or 2013”), although 27% of these did misspend their youth at a Russell Group school (although it also appears that attendance at, rather than graduation from, is counted here).

Again note the Percentage known data are exactly that, proportions of members of the sample for whom educational data have made themselves available (with the rather curious exception – in both sheets – of Select Committee Chairs, entered as 85, and not .85. Simplest tack, here: replace with .85 and commit to the Percentage format). And the attending fields purport to convey attendance data in percentage terms, too – but they don’t. They’re rather delivered in whole-number terms, and the 91 in cell E2 in the Profession by higher education sheet, for instance, won’t register as the putative .91 but as a number 100 times greater instead; and if you want to work with numbers in lieu of the percentages – and see, say, a 114 in E2 (a rounded 91% of the 125 BBC execs for whom information availed) – well, then, you can’t multiply E2 by C2, because you’ll get 11,375. Now the remedial spreadsheet math isn’t difficult by itself, but it has to be done somewhere; and here’s what I’d do.

In a blank cell – I’m working in J2, for a reason – enter =$C2/100 and copy it (yes, I do plan on explaining). Then select E2:H28, the range of to-be-percentaged numbers, and click Paste > Paste Special > Multiply:

 back1

 And OK.

And here’s what’s happened in view, and in review, of our intention: Remember we want the 91 – that ostensible percentage – in E2 to culminate in a 114 – which, after all, is 91% of its denominator, the 125 BBC executives for whom university-attendance data are on hand. Thus we want in effect to be able to multiply 125 by .91, and the strategy we’ve blueprinted here makes its first move by dividing the number in C2 – that 125 – by 100; the resulting 1.25 is then to be multiplied by the 91 in E2, realizing that 113.75 for starters. What the Paste Special orchestration does then, is copy and paste $C2/100 atop E2:H28 – but instead of the standard-issue paste, it multiplies every value in the range by its respective C-columned value. Thus the Pop star, Russell-Grouped 27% – again, really the number 27 as it first appears – is multiplied by the 72 Stars in C14 for whom the educational information is available, and 72/100*27 gets you 19.44 actual Pop stars – and if you’re wondering about that .44 of a Star, hang on.

Now it’s true – an odd, unsettlingly circular cast tinctures the Paste Special Multiply routine, pasting as it does a formula (and it is a formula – don’t look for pasted values this time) that names the selfsame cell into which it’s pasted, but that’s how it works (as do some formulas imparting conditional formats, by the way), and because it works you’re spared the conventional way around the whole problem:  making yourself at home in I2 (in the first unoccupied column at hand) and entering

 =E2/100*$C2

thereby recalling the aforementioned 113.75, copying the above across I2:L28, and finally aiming a Copy > Paste Special > Values back onto E2:H28, and deleting the now-duplicative I2:L28.
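For anyone who would rather check the arithmetic outside the workbook, here is a minimal Python sketch of the same conversion. The function name is my own invention; the figures are the ones quoted above.

```python
def percent_to_count(whole_number_percent, population_known):
    """Mirror =E2/100*$C2: turn a whole-number 'percentage' into a head count."""
    return whole_number_percent / 100 * population_known

# BBC executives: the 91 in E2 against the 125 in C2
bbc_any_university = percent_to_count(91, 125)

# Pop stars: the Russell Group 27 against the 72 in C14
pop_star_russell = percent_to_count(27, 72)

# The trap: multiplying the whole number directly by the population
naive_product = 91 * 125

print(round(bbc_any_university, 2), round(pop_star_russell, 2), naive_product)
```

The rounding in the print call is only there to keep floating-point noise off the screen; the decimals themselves are dealt with next.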

And now for that .44 and kindred decimals. Fact-checkers tend to discourage sentences reporting that “113.75 BBC executives attended some university”, although I can think of quite a few pop stars whose talent quotients fail to rise to integer level. What we need here is a numerically real, as opposed to a format-dappled round-off, and if we want to get there without incurring still more formulaic overhead there’s a venerable, but faintly recognized, option in there waiting for our call.

 

First, select the now-heavily-decimaled E2:H28 and format the data to zero decimals. But again, that refinement merely streamlines the numbers’ display, and doesn’t streamline their values. But watch this: click File > Options (remember – I’m in the 2010 release) > Advanced > and scroll here:

 back2

There it is – Set precision as displayed – one of the only, if not the only, format-motored command that actually re-quantifies the value of a number as per its present onscreen appearance. Format the decimals in 113.75 away until you see 114, turn Set precision as displayed on, and what’s left is a real 114. Multiply it by 2, for example, and you’ll get an authentic, usable, 228, not 227.5.
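The distinction between a displayed round-off and a real one can be sketched in Python too. This is an analogy under my own assumptions, not a description of Excel's internals: I use the decimal module with half-up rounding to stand in for the zero-decimal format, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

def precision_as_displayed(value, decimals=0):
    """Permanently round a value to the number of decimals shown on screen."""
    quantum = Decimal(10) ** -decimals  # 1, 0.1, 0.01, ...
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)

stored = Decimal("113.75")                 # what the cell really holds
rounded = precision_as_displayed(stored)   # what it holds after the switch

print(stored * 2)    # doubling the unrounded value
print(rounded * 2)   # doubling the really-rounded value
```

As in the workbook, the change is one-way: once the value has been rounded, the original decimals are gone.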

A last spreadsheet-grammatical point here: as parenthetically observed above, the Paste Special > Multiply script pastes formulas, not hard-coded values, into its targeted cells, e.g.

back3

But this seems to work, too:

=91*$C2/100

And indeed, this very matter happened to have found its way into last week’s post. Any thoughts about those Excel-spawned parentheses?

 

English-Wales Name Data, Part 2: Sum Interesting Formulas

28 Aug

My bookmark tucked its ribbon here:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Last week’s post, you recall, paused at the above teaser, a solitary, unexplained but efficacious formula for coaxing a weighted-average length for 2013’s top 100 boys’ and girls’ birth names from the data. Now I need to reward your patience by explaining.

A weighted average, after all, insists that a proper, proportioned due be paid every name’s incidence across the birth cohort, and the formula in our inbox appears to do just that. It enlists the SUMPRODUCT function to the task, a predominantly simple mechanism that in the first instance works this way:

Suppose I’m presented with this range of transactions in A1:C5:

(It’s clear I have no idea what these comestibles really cost.) What I want to know, of course, is what the bill comes to in toto, and the standard means toward that end is pretty evident: enter a bank of formulas, each multiplying every Unit Sold by its allied Cost per unit, and add the four purchases. But SUMPRODUCT lets us unify the process:

=SUMPRODUCT(B2:B5,C2:C5)

That is: simply enter the paired ranges, or arrays, as they’re known in these parts, and SUMPRODUCT does the rest – it multiplies each Units Sold by its collinear Cost per unit – B2 by C2, B3 by C3, etc. and then sums them all to boot.
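A rough Python equivalent may make the pairing-up plainer. The function below is my own sketch of what SUMPRODUCT does with plain numeric ranges, and the grocery figures are invented stand-ins for B2:B5 and C2:C5.

```python
import math

def sumproduct(*arrays):
    """Multiply same-position items across equal-length arrays, then sum."""
    if len({len(a) for a in arrays}) != 1:
        raise ValueError("all arrays must have the same length")
    return sum(math.prod(row) for row in zip(*arrays))

units_sold = [3, 5, 2, 4]                  # hypothetical B2:B5
cost_per_unit = [0.50, 0.40, 1.25, 2.00]   # hypothetical C2:C5

bill = sumproduct(units_sold, cost_per_unit)
print(bill)
```

The length check echoes the requirement that the ranges line up row for row; mismatched ranges have no sensible pairing.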

But what are we hoping to do? Again, we’re aspiring to learn the weighted average length of all the birth names in our workbook, and SUMPRODUCT will consummate that hope – provided we do a bit of a syntactical rethink.

Observe that nowhere among our five extant England-Wales-name workbook fields will you find name length data, and we even politely declined the offer of the suitable, supplementary Length field that was tendered in the previous post. Again, we want to see if we can figure average name length without a length field’s well-meaning intercession.

So get back to our bookmarked SUMPRODUCT, which is, and is not, essentially coterminous with the unassuming apples and oranges computation. You’ll note first of all that both SUMPRODUCTS reference two arrays, but our average-length-finder couches the first of these with a LEN, one that stuffs a 402-row range inside its bulging parentheses, even as our meet-ups with LEN in previous posts sized the contents of but one cell per formula. And the very fact that we can grant LEN this superimposition atop an array reference at all is, to me at least, imperfectly obvious. But you can. But we also need to understand in turn that SUMPRODUCT is an off-the-shelf embodiment of an array formula, a class of expressions that power through multiple calculations in but one go, a capability that in fact was made clear by the more rudimentary fruit-and-veg example, when you think about it – and you should (for an instructive introduction to array formulas look here).

Thus the LEN(D2:D403) excerpt measures the length of each name in the D column – and then multiplies each length by the corresponding number of children bearing that name (reported in the E column) – and then adds all those results…and divides that grand total by the sum of all the births. And that’s a weighted average. Here’s the formula again:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Thus the calculated length of D2 is multiplied by the associated number of births in E2, as are the like values in D3 and E3, and so on, all the way down to D403 and E403. These are then all added, turning out an aggregate length in characters. That number – which happens to be 2,238,510 – next submits to the divisor comprising the number of all births – 343,022; and 2,238,510/343,022 confers a weighted name-length average of 6.52585. Got that?
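Here is the same sequence of steps as a short Python sketch. The function name is mine, and the three names carry invented birth counts chosen so the arithmetic is easy to follow; only the two grand totals come from the workbook as reported above.

```python
def weighted_name_length(names, births):
    """Python analogue of SUMPRODUCT(LEN(names), births) / SUM(births)."""
    total_characters = sum(len(name) * count for name, count in zip(names, births))
    return total_characters / sum(births)

# Invented counts, for illustration only
toy_average = weighted_name_length(["Amelia", "Oliver", "Jack"], [10, 20, 30])

# The aggregate length and birth total reported above
reported_average = 2_238_510 / 343_022

print(toy_average, round(reported_average, 5))
```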

Of course it makes sense – it has to, or else let me be the first signatory to your class action suit against Microsoft. But array formulas do require a think, because the kind of syntactical sense they espouse doesn’t always tally with standard user expectations.

But I for one was perplexed by a more generic thought-provoker, one that brought me back in touch with that ancient mathematical fundament, the order of operations. It turns out, for example, that the SUMPRODUCT under our microscope could have been alternatively written this way:

=SUMPRODUCT(LEN(D2:D403),E2:E403)/SUM(E2:E403)

Look closely, because that variant hasn’t slithered out of a photocopier. It is different from our earlier take, in virtue of the parenthesis (or bracket, depending on your IP address) right-flanking the E2:E403 segment before the divisor symbol. Here, then

 

(LEN(D2:D403),E2:E403)

 

are regarded of a piece, and indeed, the entire SUMPRODUCT formula ends there:

 

=SUMPRODUCT(LEN(D2:D403),E2:E403)

 

and the /SUM(E2:E403) simply tacks itself discretely to the above taking the SUMPRODUCT result further, but from the outside, as it were.

 

But everything you see in our original SUMPRODUCT belongs to it; nothing there is external to the function as it is written. Here, SUM(E2:E403) appears to button itself to E2:E403 alone, and then divide it; but apart from the fact that that fraction doesn’t seem to make much sense, it isn’t what’s happening anyway. Consider the difference between

 

=6*(4/2) and =(6*4)/2

 

Guess what – there is none. Either way you get 12.
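The same holds for the two SUMPRODUCT spellings, and a small Python sketch can confirm it exactly. The lengths and birth counts are invented stand-ins for the D and E columns; I use fractions rather than floats so that the comparison isn't muddied by rounding.

```python
from fractions import Fraction

lengths = [6, 6, 4]      # stand-ins for LEN(D2:D403)
births = [10, 20, 30]    # stand-ins for E2:E403
total_births = sum(births)

# Divisor inside: each birth count becomes a share of the total first.
inside = sum(n * Fraction(b, total_births) for n, b in zip(lengths, births))

# Divisor outside: sum the products, then divide once.
outside = Fraction(sum(n * b for n, b in zip(lengths, births)), total_births)

print(inside, outside, inside == outside)
```

Dividing every term by the same total and then adding is the same as adding first and dividing once, which is all the two formulas disagree about.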

 

P.S. If you’re a name devotee, the US Social Security site Popular Baby Names page issues a free pass to an enormous holding of birth name data dating back to 1883 in easy-to-convert text file mode, apparently listing every first name given to at least five babies each year. And they have state-by-state breakouts, too.

England-Wales Birth Names, 2013, Part 1: A Length-wise look

21 Aug

There are two kinds of spreadsheets, incarnating two states of readiness – the first, all dressed up with nowhere to go but right before your got-to-know gaze – and the second, an unprocessed, unpretty, whole-fibered object whose monotonic, pedal-pointed rows ask you to hold on tight to your mouse and do something with them.

State number 2 conceals a dare: if you want to really understand what’s going on with the data, you need to bring something to the show.  Stare at a sheet rattling 200,000 rows in your face and try to make some sense out of it without busting a move at the data; there is a difference, after all, between a reader and a user.

Whew – I feel better for having said that, and while I’m stoked let’s click at the brace of workbooks migrated to the public domain by the UK Office for National Statistics, counting the gender-parsed top 100 birth names in 2013 for England and Wales.

The lists made some news in England last week, but even though we’ve worked with baby name data before (e.g., my April 10 and October 24, 2013 posts) there are additional considerations to be broached. Look, for example, at the Table 1 – Top 100 girls, E&W tab in the girl-specific book (in excerpt):

 ons1

Pretty? Maybe, but look again. Why, I feel professionally impelled to ask, are the 100 names split down the middle, thus forging what are in effect two parameters comprising precisely the same kind of information, along with a doubling-over of every other field? Even a sit-back reader might ask the same question; and for those who want to actually do some business here, the sheet as it stands just can’t and won’t do, and that’s quite apart from the impertinent blank row atop row 8, which should be made impermanent. And while I’m at it be advised that the Change Since… numbers attending the names are textual to a fare thee well, with authentically texted number signs in lieu of the numerically formatted kind, the occasional asterisk, and all kinds of superfluous spaces in there too. (I’m leaving the Change data aside for the purposes of the discussion, but for the record one could reclaim their arithmetic standing through a regimen of finds and replaces and TRIM functions.)
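For the record, that clean-up regimen might look something like this in Python. It is a sketch only: the sample strings are my guesses at the kinds of entries described, not values copied from the ONS sheet, and the function name is hypothetical.

```python
def clean_change(text):
    """Turn a text entry such as ' +3 ' or '-12*' into an int; None if no number remains."""
    stripped = text.replace("*", "").replace(" ", "").replace("\u00a0", "")
    stripped = stripped.replace("\u2212", "-")  # typographic minus sign, if present
    try:
        return int(stripped)
    except ValueError:
        return None

# Assumed examples of the textual entries, not taken from the workbook
samples = [" +3 ", "-12*", "\u22125", " * "]
cleaned = [clean_change(s) for s in samples]
print(cleaned)
```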

And given the presentational morass above here’s what I do, in the interests of accessibility and maximized analytical puissance:  Copy each set of Rank, Name, and Count columns from the Top 100 England and Wales tabs on both the boys and girls workbooks all into a new sheet, and then open two columns to which gender and country identifiers are to be coded, e.g.

ons2

(I’ve also invested the copied data with the workbook default Calibri 11-point font for consistency’s sake.) You should wind up with 402 names (and not 400, because of some same-ranked names).

Now you can get back in touch with your inner user. Remembering of course that the data enroll only the top 100 baby names into their global denominator, we could look for example at the inter-country relation between names, understanding that overall, Welsh babies account for about 5.3% of all births (the ratios are almost identical between boys and girls). Try something like this:

Row Labels: Name

Column Labels: Country

Values: Number (by % of Row Total)

Filter: Gender

In excerpt and filtering for Girls I get:

ons3

You should be looking here for Welsh proportions departing in either direction from around 5.3%; thus Brooke features relatively more often among Wales births, but then look at Alexis, its 33 babies all born in Wales.
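The same screening can be sketched in Python. The counts below are invented, apart from Alexis, whose 33 Wales-only babies are as reported above; "Name X" and "Name Y" are placeholders, not real rows.

```python
WELSH_BASELINE = 0.053  # Wales's approximate share of all births, per the post

# name -> (England count, Wales count); invented except for Alexis
girls = {
    "Alexis": (0, 33),
    "Name X": (950, 50),
    "Name Y": (450, 50),
}

def welsh_share(england, wales):
    """Wales's fraction of a name's combined count (the % of Row Total view)."""
    return wales / (england + wales)

shares = {name: welsh_share(*counts) for name, counts in girls.items()}
over_represented = sorted(n for n, s in shares.items() if s > WELSH_BASELINE)
print(over_represented)
```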

Now substitute Rank for Number in the Values area (by Sum):

ons4

 

Amelia’s trans-national appeal is clear, but you’ll note significant rank disparities up and down the results.

Now think about this: might a notable difference between average name lengths obtain between gender and/or country? It might, but you’ll be quick to call the absence of a name-length metric among the source data to my attention, and you’ll be right to do so. How, then, could name-length differentials be appraised?

And that request is a multitudinous one; and the most elemental reply to it would have us assign a LEN formula alongside each name, total the accumulation of them all, and divide it by the number of names, presumably under the steam of a COUNTA. But that won’t work, even if you’re happy with the idea. It won’t work because by having copied the England and Wales names from their associated sheets, many names thus appear twice, and as such enforce a small but measurable skew upon the calculation. The alternative here, then: bang out a pivot table, and assuming you’ve brought that Length field into the data fold:

Row Labels: Name

Values: Length (Average)

The table naturally reports each name once inside the Row Label column, because that’s what Row Labels do – return every item in the Labelled field uniquely. The necessary reversion to the Average operation above owes to the fact that, even as Amelia appears but once among the Row Labels, Values defaults to a sum of her two name lengths in the original data set – the one for England, the other for Wales, or a summed length of 12. And now you can play through the formula proposed in the previous paragraph (and if you do, I’d keep away from those GETPIVOTDATA references that’ll beset the expression if you click on the cell references; I’d just type the refs in standard mode).
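The double-counting problem, and the pivot table's cure for it, can be shown in a few lines of Python. The four rows are a made-up miniature of the stacked data set, with Amelia entered once per country.

```python
# Hypothetical miniature of the stacked sheet: (name, country)
rows = [
    ("Amelia", "England"),
    ("Amelia", "Wales"),
    ("Alys", "Wales"),
    ("Abigail", "England"),
]

# Naive route: a LEN beside every row, averaged over all rows.
average_stacked = sum(len(name) for name, _ in rows) / len(rows)

# Pivot-table route: each name counted once, as Row Labels would list it.
unique_names = {name for name, _ in rows}
average_unique = sum(len(name) for name in unique_names) / len(unique_names)

print(average_stacked, round(average_unique, 4))
```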

That should work, at least mathematically. But you may be happier with a weighted assay of the name lengths that commensurately honors the variable numeric contribution of the names. After all, average the lengths of Abigail – the choice for 1191 girls – and Alys, the name adopted by 59 sets of parents, and it comes to 5.5. But is that how you want the average to be understood? And if you don’t, and you’ve been won over by the prospect of a weighted average, can that end be pursued with one formula, and without bothering to institute a Length field in the data set?

The answer is yes, and the formula – at least one such formula – looks something like this:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

(assuming the Names occupy D, with the number of births moving in next door to E). And so it seems to me then that we need a part two, so think about what you see above. But remember you can always speak with me during office hours. I think they’re Wednesdays 11-12:30 and by appointment.
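While you're thinking, here is the Abigail and Alys comparison worked through in Python, using the two counts quoted above. It is only a two-name illustration of the gap between the plain and weighted averages, not a run against the full sheet.

```python
names = ["Abigail", "Alys"]
births = [1191, 59]

# Plain average: each name counts once, whatever its popularity.
simple_average = sum(len(n) for n in names) / len(names)

# Weighted average: each name counts once per baby given it.
weighted_average = sum(len(n) * b for n, b in zip(names, births)) / sum(births)

print(simple_average, round(weighted_average, 4))
```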

 

 

Listen Up: New York Noise Complaint Data, Part 2

14 Aug

Now that we’ve implicated Manhattan as the noise complaint capital of New York City – if not the world – it’s time to add some counts to the arraignment. We could for example drill down complaints by the zip (or postal) codes of the city’s five counties, or boroughs, for the sake of intra-borough precision – something like

Row Labels: Incident Zip

Borough (shown in Tabular format, with the subtotals suppressed)

Values:  Borough

Proceed to filter out the blank zip codes, and note the smattering of codes, e.g., 10463 and 11237, drawing their outlines around parts of two boroughs (and remember that by clicking anywhere in Incident Zip and following with PivotTable Tools > Options > Field Settings (in the Active Field button group) > Layout & Print > Repeat Item Labels, you can fill the Incident Zip blank cells with their codes). That 83 code, in receipt of 126 complaints, doesn’t square with the U.S. five-digit protocol and seems to reference points in and around Central Park.

Then right-click in Incident Zip, click Filter > Top 10, and ask for, say, the 15 prime code offenders – and you’ll have to sort these, because Top 10 won’t do so on its own.

And for a few words about that sort, as I (presently) understand it. Because we’re in effect conducting a two-field (Incident Zip and Borough) breakout of the Values, a direct click in and sort of those Values won’t happen here, as it would in a conventional data set. And that’s apparently because the two fields lock their Value up, such that the Value can’t be independently sorted among its peers. The value (4450) for Incident Zip/Borough 10025/Manhattan, for example, is only sorted against itself – that very, singular concatenation of 10025 and Manhattan, and not the remainder of the values, which sort against their own unique Incident Zip and Borough. You thus need rather to click in the superordinate Row Label – in our case Incident Zip – and right-click on Sort and More Sort Options:

 n1

 

We do want to sort Count of Borough descendingly (highest to lowest, even though the dialog box insists on Z to A). But we’re doing so only after having clicked somewhere among the controlling Incident Zip Row Labels. And no, it’s not obvious. And to thicken the matter, were only one field to populate Row Labels, you would be able to click directly into the Value area and sort it. And no, it’s not obvious.
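Outside the pivot table, the count-filter-sort sequence is a good deal less fussy. Here is a minimal Python sketch; the handful of records is invented, and the zip codes are borrowed from the discussion above only as labels.

```python
from collections import Counter

# Hypothetical (incident zip, borough) records; a blank zip stands for a missing code
complaints = (
    [("10025", "Manhattan")] * 3
    + [("11211", "Brooklyn")] * 2
    + [("10463", "Bronx"), ("", "Queens")]
)

# Count complaints per zip code, skipping the blanks.
counts = Counter(zip_code for zip_code, _ in complaints if zip_code)

# most_common() both picks the top N and returns them highest-first.
top_two = counts.most_common(2)
print(top_two)
```

For the real data you would ask for 15 rather than 2; unlike the Top 10 filter, the result arrives already sorted.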

Anyway, you’ll go on to note the not-unpredictable Manhattan predominance among the complaint bearers, but here’s where it helps to know a bit about New York neighborhoods – because most of the areas keyed to those clamorous zip codes are residential, and not the commercial beehives you’d expect to find ruling the noise hierarchy. Zip code leader 10032, for example, maps to Washington Heights in the borough’s upper latitudes, right where the island begins to literally thin; 10002, on the other hand, or other side, pinpoints the fabled Lower East Side, no more renowned for its commuter and business-driven demographic than 10032. Indeed – only zips 10016 and 10019 could be properly sited in Manhattan’s teeming center, and the Brooklyn-interloping 11211 plots itself in only-in-New-York Williamsburg, that incomprehensibly polyglot Chassidic-Hispanic-uber hipster turf on the other side of the river. More hipster heavy partying there than heavy industry.

If we’re dealing then with an anomaly of sorts – residential noise chart-toppers – we’d do well to cast for a wider breakout, by complaint types. Those data here are doubly categorized, both in Complaint Type and Descriptor terms, with the former grouping the more particularized citations in the latter field. But Complaint Type encrypts a great many of its entries as just plain Noise, not affording much aid and comfort to the analysis, and as such we’ll have to go with the more closely-itemized Descriptor field. And because the Descriptors are so numerous and will shove many of their columns off screen, let’s swing Incident Zip and Borough to Column Labels and lower Descriptor into Row Labels, e.g.

n2

 

Note those large Loud Music/Party numbers. For substantiation, jostle the data by % of Column Total and take particular note of 10016’s Loud Music/Party count, amounting to 18.68% of all its complaints, the smallest such fraction among the 15 zip codes. Again, 10016’s perimeters take us just beneath the Empire State Building and the United Nations, that is, very near a relatively (and it is all relative) daytime-top-heavy, tourist and worker-intensive territory. On the other hand, 10016’s proportion of all its complaints filed against Construction Before/After Hours – 32.77% – takes it to the head of that class, again attesting to its commercial locus.

But if you’re looking to party, you’ll find the buzz in Manhattan’s 10034, way up in Inwood, a ‘hood pushing even farther north than Washington Heights. The Loud Music/Party count screams out 60.62% of all the complaints leveled there.
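The % of Column Total reading is simply each descriptor's count divided by its zip code's total. A Python sketch of that calculation follows; the records are invented and tiny, so the shares it produces bear no relation to the real percentages quoted above.

```python
from collections import Counter, defaultdict

# Hypothetical (incident zip, descriptor) records
records = (
    [("10016", "Loud Music/Party")] * 2
    + [("10016", "Construction Before/After Hours")] * 3
    + [("10034", "Loud Music/Party")] * 3
    + [("10034", "Banging/Pounding")] * 1
)

# One Counter of descriptors per zip code: the pivot table's columns.
by_zip = defaultdict(Counter)
for zip_code, descriptor in records:
    by_zip[zip_code][descriptor] += 1

def column_share(zip_code, descriptor):
    """A descriptor's count as a fraction of all complaints in that zip code."""
    return by_zip[zip_code][descriptor] / sum(by_zip[zip_code].values())

print(column_share("10016", "Loud Music/Party"))
print(column_share("10034", "Loud Music/Party"))
```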

Now for another look, one that pitches Descriptors to day of the week. Pile into the next available free column, call it Weekday, and enter in row 2:

=WEEKDAY(B2)

(And while you’re at it, you’d probably want to Paste the Values atop the formula results. And remember, 1 means Sunday.)
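If you ever need the same numbering outside Excel, here is a small Python sketch. It assumes WEEKDAY's default scheme, where 1 means Sunday and 7 means Saturday; the function name is mine.

```python
from datetime import date

def excel_weekday(d):
    """Match Excel's default WEEKDAY numbering: 1 = Sunday ... 7 = Saturday."""
    # isoweekday() runs Monday=1 ... Sunday=7, so shift Sunday round to the front.
    return d.isoweekday() % 7 + 1

print(excel_weekday(date(2014, 8, 10)))  # a Sunday
print(excel_weekday(date(2014, 8, 14)))  # a Thursday
```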

Then:

Row Labels: Weekday (Count)

Weekday (again, this time by % of Column Total)

Slicer – Descriptor

Click through the Slicer’s uproarious inventory, e.g., Banging/Pounding, Loud Television, News Gathering (you’ve got me on that one), Jack Hammering, and the like, and observe both the absolute and relative distribution of each descriptor across the week. And the variation can be intriguing, though the 16,000 barking dog complaints track even-handedly, as do all the Bangings and Poundings, e.g.:

n3

n4

But Car and Truck Music protestations exhibit weekend sensitivity:

n5

Well that’s New York – its folks seem to be either partying or complaining. And maybe the complainers are complaining because they weren’t invited to the party.

Listen Up: New York Noise Complaint Data, Part 1

8 Aug

They call it the city that never sleeps for a reason – New Yorkers can’t, not with that immutable racket banging a collective tattoo on the eardrums of the Apple’s yawning hordes. New York’s Open Data noise complaint spreadsheet makes for interesting downloading and reading, but try to grab a quiet seat in the library first before you give it a look. You can get it here (the workbook, not the seat), after clicking the laconically-named Noise link (listing 116, the one captioned NYC Noise complaints for 2012).

Download it as a CSV (perhaps the speedier mode), but save it as a duly certified Excel workbook; and sized at 62MB, make sure that seat is well padded during your wait, while your hard drive reels from the impact.

Interesting, not a little amusing, but yet occasionally reproachable, the data indeed require a look, and a well-considered one. For example – given their peculiar informational relation to time sensitivity, complaint records would serve investigators smashingly with date and time stamps, and the Created Date field in B comprises the former only; and while the Due Date field entries are in fact properly time-laden, the meaning  of a due date is unclear (to me at least), and incomplete besides.

And don’t confuse the City and Borough fields, the latter strictly admitting just the names of New York’s five counties (boroughs). The City field, however, granularizes its Queens references, drilling only these down to their neighbourhood names (a common New York postal practice, by the way) and hence subverting any inter-borough comparability.

Moreover, columns AA through AW don’t seem to belong here, having either been abandoned, headed by extraneous field names, and/or stocked with the Unspecified code. As such I’d delete the whole lot, and show the door as well to that very last Location field, which textualizes latitude and longitude information that’s already there, and in far more usable form in the immediately preceding columns.  Overall byte saving as a result: about 18MB. And I don’t suppose you need reminding about the ubiquitous column autofit.

Now you can replace your hard hat with a thinking cap; there are data in there, after all. Start basic: how were complaints distributed across 2012’s months? The pivot table prescription is clear:

Row Labels: Created Date (grouped both by Months and Years, that latter qualification a probe for misfit records dated a year other than 2012).

I get:

noise1

Hold on. We’ve been told that our original 62 million bytes devoted themselves to 2012 data, and we’re seeing an awful lot of months joining in from the following two years as well. Let’s see how many seditious records we’re dealing with, by dragging Created Date (perhaps kicking and screaming) into Values. I get:

noise2

In other words, we’ve been fooled, kind of. The data in fact begin in May 2012, but roll on nevertheless all the way to this very week. And that’s fine with me, but let’s just understand the time span we’re really dealing with.

One distributional fact calls attention to itself right away. Note the apparent step-up in complaints for 2014; and for a more fluent realization of the same, drag the Years button into Column Labels (yes; Excel lets you break the Months and Years strata of the same grouped field into its two sub-parameters, although you’ll note that what is in effect Months remains labelled as Created Date, a bit opaquely):

noise3
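The same months-by-years layout can be roughed out in pandas. This is only a sketch: the handful of dates below are made up to stand in for the real complaint rows, and Created Date is the only field name taken from the workbook.

```python
import pandas as pd

# Made-up stand-in records; the real file holds one row per complaint.
complaints = pd.DataFrame({
    "Created Date": pd.to_datetime([
        "2012-05-03", "2012-05-20", "2013-05-11",
        "2013-06-02", "2014-05-09", "2014-05-30", "2014-06-15",
    ])
})

created = complaints["Created Date"]

# Months down the rows, years across the columns, counts in the cells.
by_month_year = pd.crosstab(
    created.dt.month, created.dt.year,
    rownames=["Month"], colnames=["Year"],
)
print(by_month_year)
```

Any column heading other than 2012 flags the misfit years straight away, which is the same probe the Years grouping performs in the pivot table.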

 

We see that this year’s May, June, and July complaint numbers push well ahead of their prior monthly counterparts (and August looks to be similarly paced), setting off the usual methodological caution: is the upsurge a creature of a sharpened acuity in New York’s recording tools and/or a heightened litigiousness among New Yorkers, or a “real” pumping up of the volume by New York’s noisemakers?

Now for another standard take, try associating complaint numbers by borough:

Row Labels: Borough

Values: Borough (Count, as it must)

Borough (again, by Show Values As > % of Column Total):

noise4
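A pandas approximation of that two-column borough table, assuming nothing more than a Borough field; the eight rows here are invented for illustration.

```python
import pandas as pd

# Invented sample rows; only the field name Borough comes from the workbook.
complaints = pd.DataFrame({
    "Borough": ["MANHATTAN", "BROOKLYN", "MANHATTAN", "QUEENS",
                "BRONX", "MANHATTAN", "BROOKLYN", "STATEN ISLAND"]
})

counts = complaints["Borough"].value_counts()
shares = complaints["Borough"].value_counts(normalize=True) * 100

# The two Series align on the borough names, mirroring Count and % of Column Total.
summary = pd.DataFrame({"Count": counts, "% of Column Total": shares.round(1)})
print(summary)
```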

Remember that Manhattan’s residential population of 1.6 million accounts for less than 20% of New York City’s total, but its effective daytime midweek population trampolines to around 3.6 million. Thus again a reliable reading of complaint times would do wonders for the analysis. For a first pass at the problem, while conceding the uncertainty assailing the Due Date field, try

Row Labels: Due Date (remember, the field contains many blanks. Group by Hour)

Column Labels: Borough

Values: Count of Due Date

I get (in excerpt):

noise5

 

Note the Manhattan totals, particularly between 5 and 7 AM. Boisterous clubbers in, or out there? Late-shifters tuning into their boom boxes at quitting time?  I don’t know, but next switch these numbers to % of Row Total mode, filtering out the blanks and turning off Grand Totals (which necessarily will add to 100%):

noise6
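For comparison, a rough pandas version of the hour-by-borough breakout in % of Row Total mode. The six records are made up, with one blank Due Date thrown in to show the filtering; the field names are the workbook's.

```python
import pandas as pd

# Made-up records, including one with no Due Date.
complaints = pd.DataFrame({
    "Due Date": pd.to_datetime([
        "2014-07-01 05:10", "2014-07-01 05:45", "2014-07-02 06:30",
        None, "2014-07-03 05:05", "2014-07-03 23:15",
    ]),
    "Borough": ["MANHATTAN", "BROOKLYN", "MANHATTAN",
                "QUEENS", "MANHATTAN", "BROOKLYN"],
})

# Filter out the blanks before grouping by hour.
dated = complaints.dropna(subset=["Due Date"])

# normalize="index" makes each hour's row sum to 100%.
by_hour = pd.crosstab(
    dated["Due Date"].dt.hour, dated["Borough"],
    rownames=["Hour"], colnames=["Borough"],
    normalize="index",
) * 100
print(by_hour.round(1))
```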

 

With 43% of all the complaints, the early-morning proportions don’t seem to incriminate Manhattan.  But on the other hand, would anyone want to allow that Manhattan’s 3 AM population stands at 3.6 million?

Getting There: UK Worker-Commute Data

1 Aug

You say you can’t get there from here? You better find a way, my friend, because your job is there and staying here won’t pay the rent; and the UK’s Office for National Statistics’ data on work commutes tell us much about the daily to-ing and fro-ing that takes its labor force to its appointed rounds. Download all of that here, by clicking on the Commuting Patterns link.

http://www.ons.gov.uk/ons/datasets-and-tables/index.html?pageSize=50&sortBy=none&sortDirection=none&newquery=commuting+flows

(If necessary, you’ll want to save the file in xlsx mode. Note there are other official looks at commutation data out there, e.g. here, and the numbers elsewhere may differ. As usual, the methodological fine print is key.)

The workbook discretely devotes its several sheets to incoming and outgoing commuting flows from and to about 380 local education authorities (that’s a UK-specific entity, in London’s case to its 33 boroughs), and for the years 2010 and 2011. The numbers before us were extrapolated from the country’s Annual Population Survey data, and as such are beholden to the confidence intervals the book describes in its Metadata sheet. Moreover (and thanks to the ONS’s Alexa Bradley on this), that sheet brings some significant estimation issues to our attention.

The Inward commuting sheets (Table 6, for example) drape workplace destinations down column A, with the localities from which workers head there named in B. The Outward data flip the parameters, commending Place of Residence to A instead. But it occurred to me that the Inward/Outward numbers could be brought together in a single sheet, preparing us to study each locality comparatively for its I/O movement as a result. With those marching orders in mind I tried this move for the 2011 data: I copied and pasted the Outward flow data in Table 8 to and beneath the Inward rows (Table 6), deleting the newly expendable header row in 9190. I then pushed the A column aside via an Insert, entering an o for outward in A9190 (now home to a usable record, after all) and copying from there down the remainder of the column. Back in A5, the row upholding the first Inward record, I entered i and copied down (double-clicking the fill handle will take the copy only as far as the empty cells, and won’t touch all those o’s below), naming the late-coming field Inward/Outward, or something like that. The plan then was to pivot-table the data by locality, proceeding to break these out by the workers emanating from, and steaming toward, each one.
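The copy-paste-and-flag routine just described can be sketched in pandas, too. What follows is an approximation under loud assumptions: the place names and commuter numbers are invented, and the two little frames merely imitate the column order of Tables 6 and 8.

```python
import pandas as pd

# Invented rows standing in for Table 6 (Inward): destination first, origin second.
inward = pd.DataFrame({
    "Workplace": ["Alphaton", "Alphaton"],
    "Place of Residence": ["Betaville", "Gammaford"],
    "Number of Commuters": [500, 300],
})

# Invented rows standing in for Table 8 (Outward): origin first, destination second.
outward = pd.DataFrame({
    "Place of Residence": ["Alphaton", "Betaville"],
    "Workplace": ["Betaville", "Alphaton"],
    "Number of Commuters": [50, 500],
})

# A spreadsheet paste goes by position, not by header, so the Outward rows
# simply inherit the Inward headings.
outward_pasted = outward.copy()
outward_pasted.columns = inward.columns

# Insert the new first column, i for the Inward rows and o for the Outward ones.
inward_flagged = inward.copy()
inward_flagged.insert(0, "Inward/Outward", "i")
outward_pasted.insert(0, "Inward/Outward", "o")

combined = pd.concat([inward_flagged, outward_pasted], ignore_index=True)
print(combined)
```

Note that the o rows now carry residence names under the Workplace heading, which is precisely the hybridization at issue.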

Now we need to think about that data graft. After all – the Inward/Outward sheets aren’t structural equivalents. Columns A and B in the one sheet have been mirror-imaged as B and A in the other; and so my copy-and-paste appears to have admixed residence data with destinations, and vice versa.

The allegation is true, and at the very least seems to expose itself to the charge of bad form. But in fact the ostensible hybridization of the Workplace field is nothing more than, for example, the multiple listing of a student’s name in a data set strung to a nearby column naming various academic subjects and a third field setting forth respective test scores. Don’t be disoriented, then, by the Ins and Outs and their literal semantic opposition. Understood in spreadsheet field-item terms, In and Out are equivalent to Sociology and Physics – different items, and that’s all.

But if you’re not mollified yet humor me and try this pivot table:

Row Labels: Workplace

Column Labels: Inward/Outward

Values: Number of Commuters (Sum)

(I’d turn off Grand Totals here.)

Yes, the Workplace field is rife with Places of Residence data, but again, the Is and Os keep each directional type at arm’s length.
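If you'd like to see the same arm's-length separation outside of Excel, here is a small pandas sketch of that pivot table. The localities and numbers are invented; the three field names are the ones used above.

```python
import pandas as pd

# Invented stacked records: i rows are inflows to the named place, o rows outflows from it.
flows = pd.DataFrame({
    "Workplace": ["Alphaton", "Alphaton", "Betaville",
                  "Alphaton", "Betaville", "Betaville"],
    "Inward/Outward": ["i", "i", "i", "o", "o", "o"],
    "Number of Commuters": [500, 300, 100, 50, 400, 200],
})

# Row Labels: Workplace; Column Labels: Inward/Outward; Values: Sum of commuters.
pivot = flows.pivot_table(
    index="Workplace",
    columns="Inward/Outward",
    values="Number of Commuters",
    aggfunc="sum",
)
print(pivot)
```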

Once put into play, the data draw some most notable, and in at least one case colossal, asymmetries. Look at the numbers for the City of London and its minute residential population, for example: an inflow of 616,719, an outflow of 1,303. Broadland, on the other hand, a district in Norfolk, north of London on the UK’s east coast, “loses” far more commuters than it gains, mustering an I/O split of 32,000 to 60,000. My home authority of Barnet is likewise top-heavy with the outs. Unsurprisingly, larger cities receive more commuters than they give.

And as such, precision might be better served and captured by a simple inward-to-outward ratio, in which a place’s Inward number, for example, could be divided by its Outward. But while the math is simple, making it happen in this pivot table requires a bit of a step back.

What’s needed – though as always I’m prepared to be persuaded by possible alternatives, too – is a Calculated Item, not to be confused with a Calculated Field, an instrument we’ve dusted off on several occasions including here. Because we want to divide a locality’s inward numbers by its outward – that is, fractionate a particular item in a field with another item in the same field – we’re in effect hatching a brand-new item, and one that appears not at all in the source data set.

To start hatching, click in the Inward/Outward column label strip:

io1

Next, click PivotTable Tools > Options > Fields, Items, & Sets > Calculated Item:

 io2

Substitute a name of your choosing, if you wish, for Formula 1 (e.g., In-Out Ratio; you can introduce a legal space here).

Double-click the i, thereby lodging that item name in the Formula space; then enter a / in there, double-click the o, and click OK, seeing to it that each locality’s Inward total gets divided by its Outward companion. You should see, in excerpt:

io3

 

 

Note that I’ve formatted the numbers to two decimal places – that is, all the numbers, whether I’ve wanted to or not, even the Inward and Outward numbers, for which the decimals are utterly superfluous. That’s because the formatting here is field-wide, and since i, o, and In-Out Ratio are items sprouting in the same field, the decimals are indiscriminately applied.

And note the City of London’s In-Out ratio of 473 to 1. A nice place to work, but folks just don’t seem to want to live there.
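The arithmetic behind the ratio is easy enough to check by hand or in a few lines of pandas. The sketch below uses the City of London and (rounded) Broadland figures quoted earlier; the frame itself is my own construction, not an extract of the workbook.

```python
import pandas as pd

# Inward (i) and outward (o) totals as quoted in the post; Broadland's are rounded.
totals = pd.DataFrame(
    {"i": [616719, 32000], "o": [1303, 60000]},
    index=["City of London", "Broadland"],
)

# The calculated item: each locality's inward total divided by its outward one.
totals["In-Out Ratio"] = (totals["i"] / totals["o"]).round(2)
print(totals)
```

Unlike the pivot table's field-wide formatting, the rounding here touches only the ratio column, leaving the i and o counts as whole numbers.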

 
