AIDS Data: Spreadsheet Epidemiology, Part 1

22 Nov

In a relatively cheery anticipation of World Aids Day on December 1 a United Nations report extols a range of heartening developments in the struggle against the epidemic, including a 43% plummet in the number of HIV infections incurred by children between 2003 and 2011. In this regard you can call up an accompanying workbook carried by the Guardian, recast into Excel form here:

UNAIDS data 2012

The workbook, in particular its Global and regional trend data sheet, comprises annual data compilations across the years 1999 and 2011 both for particular regions and worldwide, thus affording the analyst/journalist a means for corroborating, questioning, and refining the UN findings. So what is there to be done with the data?

We’ll start simply. You’ll note the data present themselves beneath a number of superordinate headings in row 4 (e.g., Number of people living with HIV) in a tripartite, estimate/low estimate/high estimate breakout – and the estimate does not comprise a simple average of the associated high and low.  Exactly how the estimate was figured, then, is a small but material discrepancy that could be placed before the report writers, particularly in view of the missing methodological discussion from which the report suffers.

In addition, the report finds “There were more than 700,000 fewer new HIV infections globally in 2011 than in 2001” (p. 2), the pertinent data for which appear to hold down cells F17 and F27 in the Global worksheet. Doing the math, I get 683,149, but a corresponding bit of subtraction for the Low estimate column (in cells G17 and G27) does yield 722,800, perhaps the number the report wants you to remember here, but one that doesn’t elect the mid-way estimate (and it should be added that the report doesn’t divulge the means by which the estimates were projected).

You have a few things to think about, then, but let’s get back to those headings in Row 4, which in fact top yet another collection of headings in row 5 (estimate, etc.). That two-header-row redundancy qualifies as a must-avoid, in part because the second of the rows will necessarily be regarded as a first row of data instead. Moreover, those upper header titles pull across merged cells, and that’s way too problematic for conventional data organization. The simple way out? Just insert a row between rows 4 and 5, and content yourself with the headings in what is now row 6.

Next problem. Our headings now appear to suffer from a different kind of redundancy – the recurring estimate/Low estimate/High estimate troika that replays itself across the worksheet. In fact, however, any pivot table cobbled from these data will sensitize itself to the problem, and will append distinguishing numbers to same-named fields:

But neither you nor any prospective readers will be satisfied with those identifiers – they are, after all, laconic in the extreme. What you may need to do is simply, and unglamorously, rewrite the headers, e.g., Low estimate, people living with HIV, High estimate, people living with HIV, etc. Not big fun, but it won’t take that long.

Then you can begin to think about pivot tabling. Note that the data already feature summatory, global totals by year in additional to the regional figures; and while these may (apparently) preempt your need to develop aggregated, worldwide yearly totals, you also need to exclude or filter these from many pivot table renderings in order to avoid totaling the same numbers twice. That is, if you put something like the following table into place:

Row Labels: Region

Column Labels: Year

Values:  Estimate, people living with HIV (my retitling)

your Grand Totals will be duplicative, adding the Global data to that of all the regions.

But don’t touch that dial. Execute the above pivot table. The grand total for 1990, for example, comes to 18,536,551.95. But look at the Global figure – 8,945,107.32. That total isn’t quite precisely one-half the Grand Total, and if you perform the same assay on the other years – that is, multiply each Global total by 2 – the products will always fall short of the respective Grand Totals. I can’t explain the disparities, and don’t find any reconciliation in the UN report. In any event, if you construct the sort of table I describe above, you may want to filter the Global item out of the mix, by clicking the Row Labels down arrow and deselecting Global:

(note: the data have been reformatted to exhibit two decimal points and commas when necessitated. In addition, the Row Label data here initially appeared in Wrap Text mode, which I deactivated by selecting the A column and clicking Wrap Text off, via Home tab > Wrap Text in the Alignment button group.)

The above table neatly right-angles regions and years for any particular column of data (try this for two columns simultaneously, e.g., low and high estimates for people living with HIV, and the table gets forbiddingly dense), but there are more views on the data to be opened here.

For example:  the UN report celebrates that 43% yearly diminution in new HIV infections in children from 2003 to 2011 (p. 27). By aiming Excel’s % Difference From command, squirreled behind the Show Values As button at the data (Pivot Table Tools > Show Values As in the Calculations button group; remember I’m working with the 2010 release), you can baseline the numbers for each region at 2003, and see how the subsequent HIV rates plot against that particular year.

To see what I mean, frame a pivot table and put these fields into play:

Row Labels: Region

(Here you may want to readmit the Global region by clicking the Select All. This inclusionary step will let you juxtapose and view the overall HIV reduction totals relative to particular regions. You may also want to deselect the Western and Central Europe and North America item because its data are missing for most years.)

Column Labels: Year

Values: What I’ve called Estimate New Child. HIV (that is, the field detailing the yearly estimate of new HIV infections in children)

Then click anywhere among the Values data and turn to Pivot Table Tools > Show Values As > % Difference From. In the resulting dialog box click Year in the Base Field field and 1993 as your Base Item. You should see (in excerpted form):

Note the vacant 1993 column, attesting to that year’s baseline status. Because the infection rates for all the other years – on either side of the baseline – are computed against the 1993 numbers, that year can’t be compared to itself.

What this means in turn is that the % Difference From tool can be applied to any year baseline.  Decide on a different year and all the comparative numbers above will change (pre-2010 users can right click anywhere among the values, select Value Field Settings and the Show Values tab, % Difference From, and the appropriate Base Field and Base Item).

That’s unnervingly efficient, and informative. The next bit of enlightenment is presentational – learning to how mini-chart those 22 years’ worth of data via Excel’s Sparklines – and yeah, sorry – you need Excel 2010 for this (but if you don’t, remember that free download beta of the 2013 release). Stay tuned.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: