You can’t get blood from a stone, but can you get a story from a list? Presumably the answer verges encouragingly near the affirmative, else I’d never have posed the question and risked the worldwide derision sure to accompany a “no”. But still, the answer depends in large measure on what the list is about, and a recent spreadsheet posting by the Guardian enumerating the 100 all-time best-selling books in the UK might be harboring a story or two beneath its descending sort order, provided we’re prepared to pull back the palimpsest and sharpen the lenses. You can download the list here in Excel form:
The data as we see them are rather straightforward, and perhaps already have their stories to tell, albeit unsurprising ones. You’re not astonished by Harry Potter having imperialized six of the top eight positions; and neither are you thunderstruck by E. L. James’ latter-day incursion into the list (I count 150 shades so far, but check my math).
Still we could do well to ask more of the data. I am reminded of the theme sounded by devotees of digital humanities (I am thinking of Stanley Fish’s Times Opinionator piece posted earlier this year), to the effect that if you spin the data through sufficient permutations something interesting is bound to emerge sooner or later; and while that premise recalls the monkey-in-front-of-the-typewriter secretarial pool, there’s probably something to the notion. So what sort of permutations are available here for the spinning?
Let’s see. For one thing you could break out the data by publishers and tabulate their respective title counts, along with the aggregate sales for each. Turn the ignition on a pivot table and
Drag Publisher to the Row Labels area
Drag Publisher to the Values area
Drag Volume Sales to Values. You should see:
We see Random House tops the hierarchy with 19 titles (that in addition to the solitary Childrens Books entry under another imprint), followed by Penguin, Transworld and Bloomsbury in their respective silver and bronze positions. Yet Bloomsbury – owner of the Harry Potter franchise – rules the volume parameter.
Next click anywhere in the Sum of Volume Sales field (fields can be renamed, by the way; just click in the field title cell and type something else, an emendation which will not rename the data source field, however) and click PivotTable Tools >Options>Summarize Values By >Average, and you’ll get:
If you’re not happy with the prevailing formatting – and you probably aren’t – you can right-click anywhere in the Volume Sales field, click Value Field Settings, click the Number Format button and choose your refinements. You can then right click anywhere in Count of Publisher, click Sort > Largest to Smallest (yes – as usual, there are other ways of making this happen), bringing you here, depending on your formatting decisions:
Note Bloomsbury’s superior average sales-per-title, additional evidence of the Potter effect. Now how about list appearance by year of publication?
Assuming you’re working with the same pivot table as above, drag Volume Sales away from the Values area, but leave the Publisher field in place. Drag Publication Date to the Row Labels area. Click PivotTable Tools>Options>Group Selection. In the ensuing dialog box click Months off if it’s been selected, and click Years. You should see:
We see 2004 and 2005 accounting for 24% of all titles, at least suggesting that a sustained stretch of book availability need be paved before a book acquires the legs to command top-100 status (note that all three of the 2012 entrants issue from the excitable hand of E.L. James).
But we’re not finished yet. It occurred to me that one could correlate a book’s sales with its interval of availability, namely the number of days having elapsed between its date of publication and August 9, the date in which the Guardian report beached up onto its Datablog. One could reasonably surmise that, all other things being equal, the longer a book’s availability the greater its sales (beholden of course to the understanding that our data are severely skewed – comprising the universe of best-sellers). Verification of my little conjecture requires that we fashion a days-elapsed calculation, which draws us in turn into a slightly more searching look at how spreadsheets regard dates.
The first – and pre-eminent – thing you have to understand about dates is that they are numbers, however garbed and prettified in their cells. A date is a number that registers the number of days separating it from January 1, 1900. Thus
is in actuality 41140 (keep in mind that any date entered in the current calendar year is understood by default to reference that year. Thus one could have entered 8/19 above, to the same quantitative effect). By clicking on any number-bearing cell along with the Number drop-down menu in the Home tab, a diverse (but not comprehensive) array of formatting possibilities descends for your inspection and selection:
The publication dates exhibited in our best-seller spreadsheet embody what Excel terms its Long Date format, which garnishes the cells with the day-of-the-week information, though I fail to see the expository benefit attaching to the data. But that cavil is in a sense irrelevant; all we really need to remember is that the varied number formats in no way impair or adulterate the values of the numbers so formatted. Thus
Sunday, August 19, 2012=8/19/2012=41140
Now back to the correlation question. If we want to associate book sales with their durability – that is, their extent of availability across time – we want then to correlate the data in Publication Date with those in Volume Sales, something we can do pretty easily with the CORREL function.
Before actually going ahead, I’ll reiterate my hypothesis: that sales should be inversely correlated with publication date. In operational terms that means that the earlier the date – i.e., the smaller the date value (as per the above discussion) – the larger the volume sales.
Next select a blank cell and enter
CORREL asks you to enter two ranges (here called arrays, but we’ll look past that technicality here. The ranges can be entered in either order) of identical lengths, whereupon it compares each pair of same-row values (e.g., H2 with D2, H3 with D3, etc.) and synthesizes the correlation. The higher the correlation in either direction, negative or positive, the greater the association between the values in the columns (though none of this would demonstrate causal association). Recall that I’m predicting a negative correlation here – earlier (smaller) dates, higher sales. Again, bear in mind that the ornate formatting imposed upon the Publication Dates doesn’t matter because the dates really are numbers, and will be treated as such.
Complete the expression above and you should realize a value of
And that, my friends, is a decisively low correlation, avowing a minimal association between publication date and sales; and for many analytical purposes a correlation in the .078 range would end the tale then and there. No correlation, no story. But here, given a pairing of variables that could have reasonably been supposed to augur a significant association as a kind of null hypothesis, I would submit that absence of association qualifies as a bit of a pause-giver. Why shouldn’t older titles – older best-sellers – line up with more voluminous sales? Good question, I think.
In any case, what we could really use here are week-by-week sales data, the sort that would empower us to build trajectories and sight trends, the better to bulk up the analysis. Doubtless Nielsen has such data, and just as doubtless they’ll ask you for big bucks in order to get it. So whaddya say we all chip in? You can put me down for a tenner; and maybe we can get J.K. Rowling and E.L. James to drop a few pounds in the kitty, too.