Having looked some technical issues in the eye in last week’s post and not having blinked, we can begin to consider exactly what it is that Mr. Trump is tweeting about. We’ve already explored the means for running key-word searches across tweets, by calling upon the COUNTIF function and playing a few wild card indicators along the way. Thus if we’re looking for instances of the word “president” down, say, A1:A100, we’d enter
And that expression should realize that count. Note that what Excel reads as *president* is conveyed in the cell by textual elements that are concatenated (though of course you could enter a cell reference to the word “president”).
Once understood, you could enter a collection of key words down a column with accompanying COUNTIFs (and here I don’t mean COUNTIFS, a separate function). For starters I entered these in a blank sheet:
I then named the range of tweets tweets, and copied the inaugural COUNTIF formulas (pun not intended), each of which divided the results by COUNTA(tweets) so as to facilitate percentage readings, and formatted appropriately. In other words, the formulas look something like this:
I got these results:
Mr. Trump likes to tweet his name, apparently, though a comparative check of self-references by the other candidates would have to be conducted as well in order to determine if Trump’s self-preoccupation is singular.
But there’s a problem in those data, nesting in the two cells counting tweets themed around Hillary Clinton. A COUNTIF searching for Hillary will pick that name up in tweets in which that surname appears in isolation, as well as cells that display Hillary Clinton. But if I likewise trigger a COUNTIF for instances of Clinton alone, that tally will likewise count the tweets bearing Hillary Clinton. If, then, I’m scanning three tweets for the names Hillary and Clinton, my combined COUNTIFs will total…four, even as I’m looking at but three tweets all told.
The problem, then, is a double-count, a species of numeric redundancy perfectly capable of derricking mounds of discredit upon your data. There are in fact 164 discrete Trump tweets in my data set that report the name(s) Hillary and/or Clinton; but throw a COUNTIF at Hillary and another one at Clinton, and the total burgeons to 195.
The double-count issue is known, and is taken up by Mike Girvin in his “Ctrl-Shift-Enter: Mastering Array Formulas”, one of the few books dedicating itself to that fearsome feature, in pages 158-159, for you exegetes. But the data and formulaic circumstances he describes here don’t seem to map themselves fitly to ours, and I’ve yet to arrive at what I would extol as an ideal, elegant take on the problem. What I do have – and this works, given the particulars of our count search – is this, assuming I’ve entered the terms Hillary, Clinton, and Hillary Clinton in cells: A1:A3:
In fact the expression is relatively simple; it counts all tweets bearing the term Hillary, adding those to all featuring Clinton, and proceeds to subtract all the instances of Hillary Clinton – because these have been already recognized by the first two COUNTIFs. And as a result the 6.28% Hillary/Clinton tweet aggregate registered above reduces to 5.28%.
Now don’t be fooled, as I was, at least momentarily. You might suppose that because those 31 Hillary Clintons contributed their citations to both the Hillary and Clinton counts you might have to subtract that confounding influence twice. But you don’t, because again we’re inventorying every Trump tweet in which some definitive reference to Hillary Clinton appears, and we want to count each such tweet once. Now since the COUNTIF for Hillary indeed tapped all the Hillary Clinton tweets as well, that latter entry’s numeric contribution has duly noted. But because the COUNTIF for Clinton conscripts all the Hillary Clintons as well, we have to counter this second counting with the subtraction that informs the above formula.
The reason I’m not thrilled with my patchwork solution – which again, after all, works – is because if, for example, other nominal references to Hillary Clinton – e.g., Rodham (her family name), HRod or Gert – walked their way into the tweets in multiple tandems, say Hillary Rodham or HRod Clinton, these two would likewise have to be mustered into any COUNTIF formulation, so as to swerve around the double-count bugaboo crouching in the bushes. An ideal formula, it seems to me, would entertain a series of OR statements that would progressively rule out a second counting of any cell that had already fulfilled a prior OR stipulation. My expression, on the other hand, counts the tweets range three times. But I haven’t attained that stated ideal yet – though maybe you will.
Another point. The previous post wondered if the times recorded on my dataset comported with those in which Trump actually dispatched his tweets – presumably on the East Coast of the States – or mine, in London, the GMT starting point. I put this question to TWDocs last week – unavailingly thus far – but it now seems clear that the times in my set reflect my clock’s position, not that of the tweeter. One tweet in excerpt describes the irrepressible candidate cavorting “Backstage with @jimmyfallon before opening skit – great fun!” Time stamp: 3:57:47 AM, surely by Trump’s reckoning five hours earlier. Given this thoroughgoing discrepancy, then, it might be a good idea to restore the times at which the tweets were actually composed to the data.
Since all the times in my data (and remember the we had drawn up an independent Time field in the previous post) are clocked exactly five hours ahead of Trump’s one would think that each formula in the Date field could be treated to a decrement of five hours or 5/24, the fraction of a day spanned by five hours, e.g. =VALUE(MID(A19,12,8)-5/24. That codicil usually – but doesn’t always – work, because subtracting 5/24 from times falling between 12:00 and 5:00 AM takes us – here goes – to the other side of midnight, and while those ensuing negative numbers might work for an Einsteinian, they won’t compute here. What we need, then, is something like this, understanding that various routes could be steered here. Neatness might be served by this approach: open a new column bestride Time and enter in what is now D4:
I know – everything looks the same in there, but the expression is asking if the time in the partner C (Date) field is sometime earlier than 5:00 AM. If so, the formula adds 19 hours, via a kind of back door to the proper time. After all, add 19 hours to 3:00 AM and you get 10:00 PM, which also happens to be five hours before 3:00 AM, of course (and here we’re interested in times exclusively, not the date on which it was registered). All other times receive the five-hour reduction, after which a copy-down the column followed by a Copy > Paste > Values to the Date field would be in order; now you can delete the D column.
That all seems to work. But of course the Trump campaign is now unfurling its tents in Nevada for the next primary go. Do I need to start thinking about 7/24?