English-Wales Name Data, Part 2: Sum Interesting Formulas

28 Aug

My bookmark tucked its ribbon here:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Last week’s post, you recall, paused at the above teaser, a solitary, unexplained but efficacious formula for coaxing a weighted-average length for 2013’s top 100 boys’ and girls’ birth names from the data. Now I need to reward your patience by explaining.

A weighted average, after all, insists that a proper, proportioned due be paid every name’s incidence across the birth cohort, and the formula in our inbox appears to due just that. It enlists the SUMPRODUCT function to the task, a predominantly simple mechanism that in the first instance works this way:

Suppose I’m presented with this range of transactions in A1:C5:

(It’s clear I have no idea what these comestibles really cost.) What I want to know, of course, is what the bill comes to in toto, and the standard means toward that end for is pretty evident: enter a bank of formulas, each multiplying every Unit Sold by its allied Cost per unit, and add the four purchases.  But SUMPRODUCT lets us unify the process:

=SUMPRODUCT(B2:B5,C2:C5)

That is: simply enter the paired ranges, or arrays, as they’re known in these parts, and SUMPRODUCT does the rest – it multiplies each Units Sold by its collinear Cost per unit – B2 by C2, B3 by C3, etc. and then sums them all to boot.

But what are we hoping to do? Again, we’re aspiring to learn the weighted average length of all the birth names in our workbook, and SUMPRODUCT will consummate that hope – provided we do a bit of a syntactical rethink.

Observe that nowhere among our five extant England-Wales-name workbook fields will you find name length data, and we even politely declined the offer of the suitable, supplementary Length field that was tendered in the previous post. Again, we want to see if we can figure average name length without a length field’s well-meaning intercession.

So get back to our bookmarked SUMPRODUCT, which is, and is not, essentially coterminous with the unassuming apples and oranges computation. You’ll note first of all that both SUMPRODUCTS reference two arrays, but our average-length-finder couches the first of these with a LEN, one that stuffs a 402-row range inside its bulging parentheses, even as our previous meet-ups with LEN in previous posts sized the contents of but one cell per formula. And the very fact that we can grant LEN this superimposition atop an array reference at all is, to me at least, imperfectly obvious. But you can.  But we also need to understand in turn that SUMPRODUCT is an off-the-shelf embodiment of an array formula, a class of expressions that power through multiple calculations in but one go, a capability that in fact was made clear by the more rudimentary fruit-and-veg example, when you think about it – and you should (for an instructive introduction to array formulas look here).

Thus the LEN(D2:D403) excerpt measures the length of each name in the D column – and then multiplies each length by the corresponding number of children bearing that name (reported in the E column) – and then adds all those results…and divides that grand total by the sum of all the births. And that’s a weighted average. Here’s the formula again:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Thus the calculated length of D2 is multiplied by the associated number of births in E2, as are the like values in D3 and E3, and so on, all the way down to D403 and E403. These are then all added, turning out an aggregate length in characters. That number – which happens to be 2,238,510 – next submits to the divisor comprising the number of all births – 343,022; and 2,238,510/343,022 confers a weighted name-length average of 6.25285. Got that?

Of course it makes sense – it has to, or else let me be the first signatory to your class action suit against Microsoft. But array formulas do require a think, because the kind of syntactical sense they espouse doesn’t always tally with standard user expectations.

But I for one was perplexed by a more generic thought-provoker, one that brought me back in touch with that ancient mathematical fundament, the order of operations. It turns out, for example, that the SUMPRODUCT under our microscope could have been alternatively written this way:

=SUMPRODUCT(LEN(D2:D403),E2:E403)/SUM(E2:E403)

Look closely, because that variant hasn’t slithered out of a photocopier. It is different from our earlier take, in virtue of the parenthesis (or bracket, depending on your IP address) right-flanking the E2:E403 segment before the divisor symbol. Here, then

 

(LEN(D2:D403,E2:E403)

 

are regarded of a piece, and indeed, the entire SUMPRODUCT formula ends there:

 

=SUMPRODUCT(LEN(D2:D403),E2:403)

 

and the /SUM(E2:E403) simply tacks itself discretely to the above taking the SUMPRODUCT result further, but from the outside, as it were.

 

But everything you see in our original SUMPRODUCT belongs to it; nothing there is external to the function as it is written. Here, SUM(E2:E403) appears to button itself to E2:E403 alone, and then divide it; but apart from the fact that that fraction doesn’t seem to make much sense, it isn’t what’s happening anyway. Consider the difference between

 

=6*(4/2) and =(6*4)/2

 

Guess what – there is none. Either way you get 12.

 

P.S. If you’re a name devotee, the US Social Security site Popular Baby Names page issues a free pass to an enormous holding of birth name data dating back to 1883 in easy-to-convert text file mode, apparently listing every first name given to at least five babies each year. And they have state-by-state breakouts, too.

England-Wales Birth Names, 2013, Part 1: A Length-wise look

21 Aug

There are two kinds of spreadsheets, incarnating two states of readiness – the first, all dressed up with nowhere to go but right before your got-to-know gaze – and the second, an unprocessed, unpretty, whole-fibered object whose monotonic, pedal-pointed rows ask you to hold on tight to your mouse and do something with them.

State number 2 conceals a dare: if you want to really understand what’s going on with the data, you need to bring something to the show.  Stare at a sheet rattling 200,000 rows in your face and try to make some sense out of it without busting a move at the data; there is a difference, after all, between a reader and a user.

Whew – I feel better for having said that, and while I’m stoked let’s click at the brace of workbooks migrated to the public domain by the UK Office for National Statistics, counting the gender-parsed top 100 birth names in 2013 for England and Wales.

The lists made some news in England last week, but even though we’ve worked with baby name data before (e.g., my April 10 and October 24, 2013 posts) there are additional considerations to be broached. Look, for example, at the Table 1 – Top 100 girls, E&W tab in the girl-specific book (in excerpt):

 ons1

Pretty? Maybe, but look again. Why, I feel professionally impelled to ask, are the 100 names split down the middle, thus forging what are in effect two parameters comprising precisely the same kind of information, along with a doubling-over every other field? Even a sit-back reader might ask the same question; and for those who want to actually do some business here, the sheet as it stands just can’t and won’t do, and that’s quite apart from the impertinent blank row atop row 8, which should be made impermanent. And while I’m at it be advised that the Change Since… numbers attending the names are textual to a fare thee well, with authentically texted number signs in lieu of the numerically formatted kind, the occasional asterisk, and all kinds of superfluous spaces in there too. (But I’m leaving the Change data aside for the purposes of the discussion, but for the record one could reclaim their arithmetic standing through a regimen of finds and replaces and TRIM functions.)

And given the presentational morass above here’s what I do, in the interests of accessibility and maximized analytical puissance:  Copy each set of Rank, Name, and Count columns from the Top 100 England and Wales tabs on both the boys and girls workbooks all into a new sheet, and then open two columns to which gender and country identifiers are to be coded, e.g.

ons2

(I’ve also invested the copied data with the workbook default Calibri 11-point font for consistency’s sake.) You should wind up with 402 names (and not 400, because of some same-ranked names).

Now you can get back in touch with your inner user.  Remembering of course that the data enroll only the top 100 baby names into its global denominator, we could look for example at the inter-country relation between names, understanding that overall ,Welsh babies account for about 5.3% of all births (the ratios are almost identical between boys and girls). Try something like this:

Row Labels: Name

Colum Labels: Country

Values: Number (by % of Row Total)

Filter: Gender

In excerpt and filtering for Girls I get:

ons3

 

 

You should be looking here for Welsh proportions departing in either direction from around 5.3%; thus Brooke features relatively more often among Wales births, but then look at Alexis, its 33 babies all born in Wales.

Now substitute Rank for Number in the Values area (by Sum):

ons4

 

Amelia’s trans-national appeal is clear, but you’ll note significant rank disparities up and down the results.

Now think about this: might a notable difference between average name lengths obtain between gender and/or country? It might, but you’ll be quick to call the absence of a name-length metric among the source data to my attention, and you’ll be right to do so. How, then, could name-length differentials be appraised?

And that request is a multitudinous one; and the most elemental reply to it would have us assign a LEN formula alongside each name, total the accumulation of them all, and divide it by the number of names, presumably under the steam of a COUNTA. But that won’t work, even if you’re happy with the idea. It won’t work because by having copied the England and Wales names from their associated sheets, many names thus appear twice, and as such enforce a small but measurable skew upon the calculation. The alternative here, then: bang out a pivot table, and assuming you’ve brought that Length field into the data fold:

Row Labels: Name

Values: Length (Average)

The table naturally reports each name once inside the Row Label column, because that’s what Row Labels do – return every item in the Labelled field uniquely. The necessary reversion to the Average operation above owes to the fact that, even as Amelia appears but once among the Row Labels, Values defaults to a sum of her two name lengths in the original data set – the one for England, the other for Wales, or a summed length of 12. And now you can play through the formula proposed in the previous paragraph (and if you do, I’d keep away from those GETPIVOTDATA references that’ll beset the expression if you click on the cell references; I’d just type the refs in standard mode).

That should work, at least mathematically. But you may be happier with a weighted assay of the name lengths that commensurately honor the variable numeric contribution of the names. After all, average the lengths of Abigail – the choice for 1191 girls – and Alys, the name adopted by 59 sets of parents, and it comes to 5.5. But is that how you want the average to be understood? And if you don’t, and you’ve been won over by the prospect of a weighted average, can that end be pursued with one formula, and without bothering to institute a Length field in the data set?

The answer is yes, and the formula – at least one such formula – looks something like this:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

 (assuming the Names occupy D, with the number of births moving in next door to E).  And so it seems to me then that we need a part two, so think about what you see above. But remember you can always speak with me during office hours. I think they’re Wednesdays 11:12:30 and by appointment.

 

 

Listen Up: New York Noise Complaint Data, Part 2

14 Aug

Now that we’ve implicated Manhattan as the noise complaint capital of New York City – if not the world – it’s time to add some counts to the arraignment. We could for example drill down complaints by the zip (or postal) codes of city’s five counties, or boroughs, for the sake of intra-borough precision – something like

Row Labels: incident Zip

Borough (shown in Tabular format, with the subtotals suppressed)

Values:  Borough

Proceed to filter out the blank zip codes, and note the smattering of codes, e.g., 10463 and 11237, drawing their outlines around parts of two boroughs (and remember that by clicking anywhere in Incident Zip and following with PivotTable Tools > Options > Field Settings (in the Active Field button group) > Layout & Print > Repeat Item Labels, you can fill the Incident Zip blank cells with their codes).  That 83 code in receipt of 126 complaints, doesn’t square with the U.S. five-digit protocol and seems to reference points in and around Central Park.

Then right-click in Incident Zip, click Filter > Top 10, and ask for say, the 15 prime code offenders -and you’ll have to sort these; because Top 10 won’t do so on its own.

And for a few words about that sort, as I (presently) understand it. Because we’re effect conducting a two-field (Incident Zip and Borough) breakout of the Values, a direct click in and sort of those Values won’t happen here, as it would in a conventional data set. And that’s apparently because the two fields lock their Value up, such that the Value can’t be independently sorted among its peers. The value (4450) for Incident Zip/Borough 10025/Manhattan, for example, is only sorted against itself – that very, singular concatenation of 10025 and Manhattan, and not the remainder of the values, which sort against their own unique Incident Zip and Borough. You thus need rather to click in the superordinate Row Label– in our case Incident Zip – and right-click on Sort and More Sort Options:

 n1

 

We do want to sort Count of Borough descendingly (highest to lowest, even though the dialog box insists on Z to A). But we’re doing so only after having clicked somewhere among the controlling Incident Zip Row Labels. And no, it’s not obvious. And to thicken the matter, were only one field to populate Row Labels, you would be able to click directly into the Value area and sort it. And no, it’s not obvious.

Anyway, you’ll go on to note the not-unpredictable Manhattan predominance among the complaint bearers, but here’s where it helps to know a bit about New York neighborhoods – because most of the areas keyed to those clamorous zip codes are residential, and not the commercial beehives you’d expect to find ruling the noise hierarchy. Zip code leader 10032, for example, maps to Washington Heights in the borough’s upper latitudes, right where the island begins to literally thin; 10002, on the other hand, or other side, pinpoints the fabled Lower East Side, no more renowned for its commuter and business-driven demographic than 10032. Indeed – only zips 10016 and 10019 could be properly sited in Manhattan’s teeming center, and the Brooklyn-interloping 11211 plots itself in only-in-New-York Williamsburg, that incomprehensibly polygot Chassidic-Hispanic-uber hipster turf on the other side of the river. More hipster heavy partying there than heavy industry.

If we’re dealing then with an anomaly of sorts – residential noise chart-toppers – we’d do well to cast for a wider breakout, by complaint types. Those data here are doubly categorized, both in Complaint Type and Descriptor terms, with the former grouping the more particularized citations in the latter field. But Complaint Type encrypts a great many of its entries as just plain Noise, not affording much aid and comfort to the analysis, and as such we’ll have to go with the more closely-itemized Descriptor field.  And because the Descriptors are so numerous and will shove many of its columns off screen, let’s swing Incident Zip and Borough to Column Labels and lower Descriptor into Row Labels, e.g.

n2

 

Note those large Loud Music/Party numbers. For substantiation, jostle the data by % of Column Total and take particular note of 10016’s Loud Music/Party count, amounting to 18.68% of all its complaints, the smallest such fraction among the 15 zip codes. Again, 10016’s perimeters take us just beneath the Empire State Building and the United Nations, that is, very near a relatively (and it is all relative) daytime-top-heavy, tourist and worker-intensive territory. On the other hand, 10016’s proportion of all its complaints filed against Construction Before/After Hours – 32.77% – takes it to the head of that class, again attesting to its commercial locus.

But if you’re looking to party, you’ll find the buzz in Manhattan’s 10034, way up in Inwood, a ‘hood pushing even farther north than Washington Heights. The Loud Music/Party count screams out 60.62% of all the complaints leveled there.

Now for another look, one that pitches Descriptors to day of the week. Pile into the next available free column, call it Weekday, and enter in row 2:

=WEEKDAY(B2)

(And while you’re at it, you’d probably want to Paste the Values atop the formula results. And remember, 1 means Sunday.)

Then:

Row Labels: Weekday (Count)

Weekday (again, this time by % of Column Total)

Slicer – Descriptor

Click through the Slicer’s uproarious inventory, e.g., Banging/Pounding, Loud Television, News Gathering (you’ve got me on that one), Jack Hammering, and the like, and observe both the absolute and relative distribution of each descriptor across the week. And the variation can be intriguing, though the 16,000 barking dog complaints track even-handedly, as do all the Bangings and Poundings, e.g.:

n3

n4

 

 

 

 

But Car and Truck Music protestations exhibit weekend sensitivity:

n5

 

 

 

 

 

Well that’s New York – its folks seem to be either partying or complaining. And maybe the complainers are complaining because they weren’t invited to the party.

Listen Up: New York Noise Complaint Data, Part 1

8 Aug

They call it the city that never sleeps for a reason – New Yorkers can’t, not with that immutable racket banging a collective tattoo on the eardrums of the Apple’s yawning hordes. New York’s Open Data noise complaint spreadsheet makes for interesting downloading and reading, but try to grab a quiet seat in the library first before you give it a look. You can get it here (the workbook, not the seat), after clicking the laconically-named Noise link (listing 116, the one captioned NYC Noise complaints for 2012).

Download it as a CSV (perhaps the speedier mode), but save it as a duly certified Excel workbook; and sized at 62MB, make sure that seat is well padded during your wait, while your hard drive reels from the impact.

Interesting, not a little amusing, but yet occasionally reproachable, the data indeed require a look, and a well-considered one. For example – given their peculiar informational relation to time sensitivity, complaint records would serve investigators smashingly with date and time stamps, and the Created Date field in B comprises the former only; and while the Due Date field entries are in fact properly time-laden, the meaning  of a due date is unclear (to me at least), and incomplete besides.

And don’t confuse the City and Borough fields, the latter strictly admitting just the names of New York’s five counties (boroughs). The City field, however, granularizes its Queens references, drilling only these down to their neighbourhood names (a common New York postal practice, by the way) and hence subverting any inter-borough comparability.

Moreover, columns AA through AW don’t seem to belong here, having either been abandoned, headed by extraneous field names, and/or stocked with the Unspecified code. As such I’d delete the whole lot, and show the door as well to that very last Location field, which textualizes latitude and longitude information that’s already there, and in far more usable form in the immediately preceding columns.  Overall byte saving as a result: about 18MB. And I don’t suppose you need reminding about the ubiquitous column autofit.

Now you can replace your hard hat with a thinking cap; there are data in there, after all. Start basic: how were complaints distributed across 2012’s months? The pivot table prescription is clear:

Row Labels: Created Date (grouped both by Months and Years, that latter qualification a probe for misfit records dated a year other than 2012).

I get:

noise1

Hold on. We’ve been told that our original 62 million bytes devoted themselves to 2012 data, and we’re seeing an awful lot of months joining in from following two years as well. Let’s see how many seditious records we’re dealing with, by dragging Created Date (perhaps kicking and screaming) into Values. I get:

noise2

In other words, we’ve been fooled, kind of. The data in fact begin in May 2012, but roll on nevertheless all the way to this very week. And that’s fine with me, but let’s just understand the time span we’re really dealing with.

One distributional fact calls attention to itself right away. Note the apparent step-up in complaints for 2014; and for a more fluent realization of the same, drag the Years button into Column Labels (yes; Excel lets you break the Months and Years strata of the same grouped field into its two sub-parameters, although you’ll note that what is effect Months remains labelled as Created Date, a bit opaquely):

noise3

 

We see that this year’s May, June, and July complaint numbers push well ahead of their prior monthly counterparts (and August looks to be similarly paced), setting off the usual methodological caution: is the upsurge a creature of a sharpened acuity in New York’s recording tools and/or a heightened litigiousness among New Yorkers, or a “real” pumping up of the volume by New York’s noisemakers?

Now for another standard take, try associating complaint numbers by borough:

Row Labels: Borough

Values: Borough (Count, as it must)

Borough (again, by Show Values As > % of Column Total):

noise4

Remember that Manhattan’s residential population of 1.6 million accounts for less than 20% of New York City’s total, but its effective daytime midweek population trampolines to around 3.6 million. Thus again a reliable reading of complaint times would do wonders for the analysis. For a first pass at the problem, while conceding the uncertainty assailing the Due Date field, try

Row Labels: Due Date (remember, the field contains many blanks. Group by Hour)

Column Labels: Borough

Values: Count of Due Date

I get (in excerpt):

noise5

 

Note the Manhattan totals, particularly between 5 and 7 AM. Boisterous clubbers in, or out there? Late-shifters tuning into their boom boxes at quitting time?  I don’t know, but next switch these numbers to % of Row Total mode, filtering out the blanks and turning off Grand Totals (which necessarily will add to 100%):

noise6

 

With 43% of all the complaints, the early-morning proportions don’t seem to incriminate Manhattan.  But on the other hand, would anyone want to allow that Manhattan’s 3 AM population stands at 3.6 million?

Getting There: UK Worker-Commute Data

1 Aug

You say you can’t get there from here? You better find a way, my friend, because your job is there and staying here won’t pay the rent; and the UK’s Office for National Statistic’s data on work commutes tell us much about the daily to-ing and fro-ing that takes its labor force to its appointed rounds. Download all of that here, by clicking on the Commuting Patterns link.

http://www.ons.gov.uk/ons/datasets-and-tables/index.html?pageSize=50&sortBy=none&sortDirection=none&newquery=commuting+flows

(If necessary, you’ll want to save the file in xlsx mode. Note there are other official looks at commutation data out there, e.g. here, and the numbers elsewhere may differ. As usual, the methodological fine print is key.)

The workbook discretely devotes its several sheets to incoming and outgoing commuting flows from and to about 380 local education authorities (that’s a UK-specific entity, in London’s case to its 33 boroughs), and for the years 2010 and 2011. The numbers before us were extrapolated from the country’s Annual Population Survey data, and as such are beholden to the confidence intervals the book describes in its Metadata sheet. Moreover (and thanks to the ONS’s Alexa Bradley on this), that sheet brings some significant estimates issues to our attention.

The Inward commuting sheets (Table 6, for example), drape workplace destinations down column A, with the localities from which workers head there named in B. The Outward data flip the parameters, commending Place of Residence to A instead. But it occurred to me that the Inward/Outward numbers could be brought together in a single sheet, preparing us to study each locality comparatively for its I/O movement as a result. With those marching orders in mind I tried this move for the 2011 data: I copied and pasted the Outward flow data in Table 8 to and beneath the Inward rows (Table 6), deleting the newly expendable header row in 9190. I then pushed the A column aside via an Insert, entering an o for outward in A9190 (now home to a usable record, after all) and copying from there down the remainder of the column. Back in A5, the row upholding first Inward record, I entered i and copied down (double-clicking the fill handle will take the copy only as far as the empty cells, and won’t touch all those Os below.), naming the late-coming field Inward/Outward, or something like that. The plan then was to pivot-table the data by locality, proceeding to break these out by the workers emanating from, and steaming toward, each one.

Now we need to think about that data graft. After all – the Inward/Outward sheets aren’t structural equivalents. Columns A and B in the one sheet have been mirror-imaged as B and A in the other; and so my copy-and-paste appears to have admixed residence data with destinations, and vice versa.

The allegation is true, and at the very least seems to expose itself to the charge of bad form. But in fact the ostensible hybridization of the Workplace field is nothing more than, for example, the multiple listing of a student’s name in a data set strung to a nearby column naming various academic subjects and a third field setting forth respective test scores. Don’t be disoriented, then, by the Ins and Outs and their literal semantic opposition. Understood in spreadsheet field-item terms, In and Out are equivalent to Sociology and Physics – different items, and that’s all.

But if you’re not mollified yet humor me and try this pivot table:

Row Labels: Workplace

Column Labels: Inward/Outward

Values: Number of Commuters (Sum)

(I’d turn off Grand Totals here.)

Yes, the Workplace field is rife with Places of Residence data, but again, the Is and Os keep each directional type at arm’s length.

Once put into play, the data draw some most notable, and in at least one case colossal, symmetries. Look at the numbers for the City of London and its minute residential population, for example: An inflow of 616,719, an outflow of 1,303. Broadland, on the other hand, a community in Norwich north of London on the UK’s east coast, “loses” far more commuters than it gains, mustering an I/O split of 32,000 to 60,000. My home authority of Barnet is likewise top-heavy with the outs. Unsurprisingly, larger cities receive more commuters than they give.

And as such, precision might be better served and captured by a simple inward-to-outward ratio, in which a place’s Inward number, for example, could be divided by its Outward. But while the math is simple, making it happen in this pivot table requires a bit of a step back.

What’s needed – though as always I’m prepared to be persuaded by possible alternatives, too – is a Calculated Item, not to be confused with a Calculated Field, an instrument we’ve dusted off on several occasions including here. Because we want to divide a locality’s inward numbers by its outward – that is, fractionate a particular item in a field with another item in the same field – we’re in effect hatching a brand-new item, and one that appears not at all in the source data set.

To start hatching, click in the Inward/Outward column label strip:

io1

Next, click PivotTable Tools > Options > Fields, Items, & Sets > Calculated Item:

 io2

Substitute a field name if you wish for Formula 1 (i.e., In-Out Ratio; you can introduce a legal space here)

Double-click the i, thereby lodging that item name in the Formula space; then enter a / in there, double-click the o, and click OK, seeing to it that each locality’s Inward total gets divided by its Outward companion. You should see, in excerpt:

io3

 

 

Note that I’ve formatted the numbers to two decimal places – that is all the numbers, whether I’ve wanted to or not, and even the Inward and Outward numbers for which the places are utterly dilatory. That’s because the formatting here is field-wide, and since i, o, and In-Out Ratio are items sprouting in the same field, the decimals are indiscriminately applied.

And note the City of London’s In-Out ratio of 473 to 1. A nice place to work, but folks just don’t seem to want to live there.

 

Time (and Record) Management: OECD Data

25 Jul

The OECD clearly has a lot of time on its hands – after all, think how long it takes to say The Organisation for Economic Co-operation and Development. But time likewise figures in its research remit, here keying a survey of the time management habits of denizens of its member nations (at least most of them, along with representations from China, India, and South Africa, designated OECD partners), and recapitulated here:

OECD_1564_TUSupdatePortal

 

The data – and it’s the Total sheet I’m reviewing – spread out some neat cross-national takes on how respondents apportion their daily round, normalized for each nation’s activities to the day’s 1440 minutes (and I assume normalization means reconciling the inter-national variance in activity definitions; see the Activity category tab). You’ll be interested to know, for example, that South Africans appear to be able to devote 562 minutes a day or night – nine hours and 22 minutes – to sleeping, and that while earmarking but 463 minutes to that circadian necessity we’re left to conclude that the Japanese and Norwegians seem to be drinking the most coffee. It is also noteworthy that respondents from Turkey – the only preponderantly Muslim country in the national roster – spend the most time in religious activity (row 32), at 19 minutes (South Africans come in one minute behind). It’s all interesting, but I can’t resist the geekish aside to the effect that the row 32 numbers are variously positioned – with some middle-aligned, (along the vertical axis), and others right (and you might very well also want to pump up the 8-point text prevailing throughout).

But at the same time the numbers urge a familiar question upon the plenary: For whom is the spreadsheet intended – a public of passive, if interested, readers, or that A-Team of caffeinated deep thinkers who want nothing better than to do something with the data? If you’re siding with the latter insurgents, and want to act upon what it is you’re viewing here with your canteen of slicers (and Slicers) and dicers, then the data – and we’ve seen this before – have to be reshaped. That’s a question we’ve asked in so many words before, and it requires asking here as well.

Let’s see. For starters, we need to do something about row 3, harboring as it does survey date identifiers impersonating as a row of data. And a similar reprisal need be practiced upon those purple subtotal rows, which should be deleted. Leaving them in place and proceeding to drop the data as they stand into a pivot table will do nothing less than double any summed results, and yes, we’ve seen this before (e.g., my January 24, 2013 post).

On the other hand stripping those purple rows will also deprive the sheet of their associated headings, e.g., Paid Work or Study, Unpaid Work. But again, rows of data should be of a piece; insinuating rows of titles into the average daily minutes is tantamount to playing checkers, when in fact we’re playing chess. Moreover, the Men & Wom field in the A column sufficiently identifies the activities subject to the minutes-per-day estimates. But there’s more to be said about this.

But before we expound the larger issue betokened above, swing over to column AC and its swath of computations averaging the numbers lining up to their immediate left. Try cell AC4:

ti1

And try explaining it. Note all the references in the expression are alphabetically contiguous; and given its cellular continuity how do we defend the formulation above, when we have

=AVERAGE(C4:AB4)

at our disposal? I can’t answer my own question.

And indeed – those averages ignore the three partner nations on the other side of AC, thus plunking the wrong kind of data amid the time-estimate numbers. And the blank AD column has to go.

But there’s that larger issue, reprising an earlier question, the one I asked and attempted to resolve in my August 22, 2013 post. The Total sheet confers field status upon tufts of data that should more rightly conceive of themselves as items relenting to a larger, governing field. Thus, for example, the field-defined member countries in the sheet should be made to submit to the controlling aegis of a Country field. It seems to me, then, that a reconstructed data set would wheel out records looking something like this:

ti2

Again, umbrella fields on the order of Gender or Country are far mightier enablers of the kinds of grouping and ordering feats that pivot tables perform, and well justify the necessary surcharge the additional data entry would exact (again, turn back to August 22).

For example, the data reforms I’m (again) proposing would free a pivot tabler to rank national time outlays by any activity, by grabbing the Rank Largest to Smallest marble from the Show Values As bag of alternatives. The point is that Rank Largest to Smallest can do what it does only among items planted in the same field – and not between data settled in different ones.

So if you’re wearing the A-Team colors, why not reorganize the data here and shop it to the OECD – if you have the time?

Hire Education: U.S. Tuition Stats, Part 2

17 Jul

Now that last week’s spade work on the U.S. university tuition data has softened the ground for the next set of field trips (don’t worry; the puns and the metaphors stop here) we can proceed.

What, then, about some pivot-tabled, statewide breakouts of the tuition numbers? Start straightforwardly:

Row Labels:  State

Values: 2012-13 Tuition and fees (count)

2012013 Tuition and fees (again, this time by Average, restyled by currency format). Then sort this column highest to lowest. In excerpt I get

tui1

(Don’t rub your eyes at the math, or at the smallness of the text, for that matter. The 59 “States” grants a visitor’s permit here to various U.S. territories, e.g. the Virgin Islands and Puerto Rico, with its 98 institutions).

Rhode Island, Vermont, and Massachusetts, three states jigsawed into the New England region in the U.S. northeast, head the sort. But because tuitions ebb and flow by sector type that parameter needs a hearing. Drag away 2012-13 Tuition and fees (count) lift Sector name into Column Labels, and sort it all by the 4-year, private not-for-profit item. I get (again, in part):

tui2

This time four New England states – Rhode Island, Massachusetts, Connecticut, and Vermont – spearhead the list (trust me, if you can’t make them out up there). Old states (they’re all part of the original 13), old money, or something like that.

Then recalculate the data by Count, restoring the numbers to their General format. Note the refurbished results continue to hold to the 4-year, private not-for-profit sort, only this time hoisting New York to the top. Now show the values as % of Row Total. Don’t move away from the 4-year column and scroll down to Rhode Island and its 66.67% proportion, Vermont’s 68.42% and Massachusetts’s 60.00%. (Note: you can’t here sort the percent data by largest to smallest, because the item “really” continues to comprise the university count. % of Row Total merely masks the actual data as it was brought to the table.) Compare those states’ shares to those of other states, and the overall tuition averages can be more revealingly understood as a creature of the distributions of institution types across states.

But our workbook contains other data, too. The NetPrice worksheet rolls out real-world cost figures, performing subtractions from the stated tuitions as wrought by various modes of student assistance (again, review the Introduction tab for definitional enlightenment). You’ll need to note, though, that NetPrice carries 2011-12 outlays, and as such can’t truly be compared to the Tuitions sheet data, circa 2012-13. Note in addition the 130+ plus schools for which no net price data avail, and a smaller number of truant institutions – about 70 – showing blanks in their Percent receiving grant aid cells.  (And don’t overlook those four schools flouting negative net prices, meaning they’re paying you to go there.) And I can’t quite tell if the net prices we see here typify only those students who have actually received aid, or all the attendees at a school – even the ones who’ve footed their entire bill.

Nevertheless, there’s a lot to work with, starting, for example, with a breakout of the percent of students receiving aid by sector, along with average net sector price:

Row Labels:  Sector name

Values: Percent receiving grant aid 2011-12 (average, with appropriate formatting)

2011-12 Net price (Average, and suitable formatting)

tui3

An impressively large proportion of all students, then, receive some manner of outright aid; (the Net price description in the workbook’s Introduction sheet suggests the numbers above denote granted, as opposed to loaned, monies), with 4-year public institution enrollees the least likely to win a subsidy, perhaps because their tuitions are relatively low-ended to begin with. Still, students at the less than 2-year private not-for-profits and public school receive aid more often, and yet pay less tuition. And remember again that if you’re keen to compare net with initial tuitions, you’ll have to somehow ford the reporting-year divide distancing the Tuition and Netprice sheets.

Then simply sort the net prices from largest to smallest, a move that levitates the Aviator College of Aeronautical Science and Technology in Florida to the top floor with its net price of $71,492. You’ll remember the list-topping Landmark College from last week’s post, but that school asserts to its superlative standing only among its 4-year not-for-profit peers. The 2-year, for-profit Aviator College is the dearest of the dear; and what’s odd is that its 2012-13 tuition – presumably the figure assessed before any financial relief comes to the rescue – comes to $63,073, curiously lower. Of course, the numbers emanate from different school years to be sure, but a deeper look seems in order here, and while you’re at it, check out California’s American University of Health Sciences, the net price runner-up at $66,743. Its tuition for 2012-13: a nickel-and-diming $30,700. Explanations, anyone?

And for another set of thought-provokers, be mindful of the NetPriceChange sheet’s two comparison years, 2009-10 and 2011-12. It’s not unreasonable to wonder about that missing year in between, along with the sheet’s roster of 6843 schools, far outpacing the 4269 with which we worked in the earlier sheets. The respective sector populations by sheet:

tuii4

 tui6

 

It’s sector 9 – the Less than 2-year, private for-profit institutions niche – that pulls far away from the same-numbered Tuition/Net Price entry . Someone needs to account for the vastly different counts here. And one of those 2-year schools, by the way, is Florida’s Eternity Cosmetology School,

But with a name like that, are you sure they didn’t mean Cosmology?

 

 

 

Follow

Get every new post delivered to your Inbox.

Join 154 other followers