# English-Wales Name Data, Part 2: Sum Interesting Formulas

28 Aug

My bookmark tucked its ribbon here:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Last week’s post, you recall, paused at the above teaser, a solitary, unexplained but efficacious formula for coaxing a weighted-average length for 2013’s top 100 boys’ and girls’ birth names from the data. Now I need to reward your patience by explaining.

A weighted average, after all, insists that a proper, proportioned due be paid every name’s incidence across the birth cohort, and the formula in our inbox appears to due just that. It enlists the SUMPRODUCT function to the task, a predominantly simple mechanism that in the first instance works this way:

Suppose I’m presented with this range of transactions in A1:C5:

(It’s clear I have no idea what these comestibles really cost.) What I want to know, of course, is what the bill comes to in toto, and the standard means toward that end for is pretty evident: enter a bank of formulas, each multiplying every Unit Sold by its allied Cost per unit, and add the four purchases.  But SUMPRODUCT lets us unify the process:

=SUMPRODUCT(B2:B5,C2:C5)

That is: simply enter the paired ranges, or arrays, as they’re known in these parts, and SUMPRODUCT does the rest – it multiplies each Units Sold by its collinear Cost per unit – B2 by C2, B3 by C3, etc. and then sums them all to boot.

But what are we hoping to do? Again, we’re aspiring to learn the weighted average length of all the birth names in our workbook, and SUMPRODUCT will consummate that hope – provided we do a bit of a syntactical rethink.

Observe that nowhere among our five extant England-Wales-name workbook fields will you find name length data, and we even politely declined the offer of the suitable, supplementary Length field that was tendered in the previous post. Again, we want to see if we can figure average name length without a length field’s well-meaning intercession.

So get back to our bookmarked SUMPRODUCT, which is, and is not, essentially coterminous with the unassuming apples and oranges computation. You’ll note first of all that both SUMPRODUCTS reference two arrays, but our average-length-finder couches the first of these with a LEN, one that stuffs a 402-row range inside its bulging parentheses, even as our previous meet-ups with LEN in previous posts sized the contents of but one cell per formula. And the very fact that we can grant LEN this superimposition atop an array reference at all is, to me at least, imperfectly obvious. But you can.  But we also need to understand in turn that SUMPRODUCT is an off-the-shelf embodiment of an array formula, a class of expressions that power through multiple calculations in but one go, a capability that in fact was made clear by the more rudimentary fruit-and-veg example, when you think about it – and you should (for an instructive introduction to array formulas look here).

Thus the LEN(D2:D403) excerpt measures the length of each name in the D column – and then multiplies each length by the corresponding number of children bearing that name (reported in the E column) – and then adds all those results…and divides that grand total by the sum of all the births. And that’s a weighted average. Here’s the formula again:

=SUMPRODUCT(LEN(D2:D403),E2:E403/SUM(E2:E403))

Thus the calculated length of D2 is multiplied by the associated number of births in E2, as are the like values in D3 and E3, and so on, all the way down to D403 and E403. These are then all added, turning out an aggregate length in characters. That number – which happens to be 2,238,510 – next submits to the divisor comprising the number of all births – 343,022; and 2,238,510/343,022 confers a weighted name-length average of 6.25285. Got that?

Of course it makes sense – it has to, or else let me be the first signatory to your class action suit against Microsoft. But array formulas do require a think, because the kind of syntactical sense they espouse doesn’t always tally with standard user expectations.

But I for one was perplexed by a more generic thought-provoker, one that brought me back in touch with that ancient mathematical fundament, the order of operations. It turns out, for example, that the SUMPRODUCT under our microscope could have been alternatively written this way:

=SUMPRODUCT(LEN(D2:D403),E2:E403)/SUM(E2:E403)

Look closely, because that variant hasn’t slithered out of a photocopier. It is different from our earlier take, in virtue of the parenthesis (or bracket, depending on your IP address) right-flanking the E2:E403 segment before the divisor symbol. Here, then

(LEN(D2:D403,E2:E403)

are regarded of a piece, and indeed, the entire SUMPRODUCT formula ends there:

=SUMPRODUCT(LEN(D2:D403),E2:403)

and the /SUM(E2:E403) simply tacks itself discretely to the above taking the SUMPRODUCT result further, but from the outside, as it were.

But everything you see in our original SUMPRODUCT belongs to it; nothing there is external to the function as it is written. Here, SUM(E2:E403) appears to button itself to E2:E403 alone, and then divide it; but apart from the fact that that fraction doesn’t seem to make much sense, it isn’t what’s happening anyway. Consider the difference between

=6*(4/2) and =(6*4)/2

Guess what – there is none. Either way you get 12.

P.S. If you’re a name devotee, the US Social Security site Popular Baby Names page issues a free pass to an enormous holding of birth name data dating back to 1883 in easy-to-convert text file mode, apparently listing every first name given to at least five babies each year. And they have state-by-state breakouts, too.