Search results for 'the comeback'

TWDOCS: The Comeback

3 Jul

I still don’t know what Google wants to happen when it asks, or tells, me that I’m Feeling Lucky, but either way I’m not sure if it was luck or an inertially-driven pair of hands that fell upon my crumb-dappled keyboard and tapped twdocs.com into my URL bar – again. I’d been engaged in the Sisyphean pursuit of the incommunicado site for some time, hopefully expounding its error-messaging links for some sign that something else was going to happen next.

And so on and so on until this week, when force of habit yet auto-piloted me back to twdocs’s still-listed address, whereupon the next click… made something actually happen. A spreadsheet, bearing the results of a Twitter keyword search, the very object I had requested vainly from twdocs lo these many months, offered itself to me in the form of a long-lost, happily-attended prompt. Talk about wow factor; viva inertia, I say.

I’m not asking any questions. Twdocs is back, at least for now, streaming near-real time, key-worded tweets your way in their countable, measurable, eminently usable rows.  We’ve been to the site and done that, in October and November 2012, before twdocs went on its unexplained hiatus. That’s when I posted a couple of looks at the site, endeavoring, among other things, to monitor US election-day tweets for mentions of Obama and Romney.

The site is indeed back, and while the keyword search instructions set down in my 2012 posts seem unchanged, a few adjustments have worked their way into the user’s marching orders, starting with the maximum tweet-requisition limit of 800, down from the earlier 1200. Does that pullback matter? I suspect it doesn’t, because in any case you’re not going to net every Tweet out there on a trending theme – even with a 1200-tweet max – and so again, it’s tweet velocity instead – the numbers of tweets per some or other unit of time – that’s going to matter.

And something else. Twdocs now reports its date/time-field data this way:

Thu Jul 03 10:27:05 +0000 2014

But in 2012 that date would have looked like this:

Thu, 03 Jul 2014 10:27:05 +0000

The problem isn’t with the Euro to American date format shift, but rather with the +0000, having usurped the 2014’s previous date-hugging pride of placement. Because in both cases above the dates dial into the text format, I had originally called for a Text-to-Columns maneuver that would salvage real, quantified dates from the expressions; and that measure should continue to work, provided you engage the Do not import column (skip) option with all due savvy; but it’s occurred to me that a simpler, perhaps fleeter move could do much the same. Simply select the time data in their column (that should be B) and boot up two rapid-fire find and replaces, the first searching for the three-character day abbreviation in force, and the blank space following those three characters, e.g.

Thu[blank space]

And replace it with nothing.

The blank space is integral here, as it too stands as a bona fide character that, if ignored, would fail to renovate the entries’ text format. Then find the +0000s, and again replace them with nothing. What’s left in the cells now should submit themselves as quantifiable dates and times to your inspection. (Note that if a twdocs keyword search returns results tweeted through several different days, as might typify a keyword that appears but occasionally across the twittersphere and hence takes its time to accumulate, you may have to run the three-character find-and-replace more than once, in order to replace each day.)

Next, my early twdocs posts promoted a means for timing a given keyword’s velocity (again, October 18) that prevailed upon the SEARCH function and a copy-down-the-column device . Here I’ll advertise what I take to be a more parsimonious take on that chore, using as example a current search for the word Wimbledon, a timely reference, I think .

First, you need to calculate the duration across which the tweets were dashed off. Click in a blank cell – say N1, and name it d for duration, or whatever. Then name the newly-quantified date/time field in B t for time (I’m all for parsimony in range naming, too). Range-name the data in Column A – the one storing the actual tweets – text. Then back in N1 enter:

=(MAX(t)-MIN(t))*24

That figures the fraction of an hour in effect within which the tweets were received.  Then enter Wimbledon in L4 (the cells have of course been chosen in the interests of choosing something, somewhere). But why am I doing that, you’re sure to wonder. After all, the original search put to twdocs was for that very word; why look for Wimbledon again inside the spreadsheet? Good question – but we do need the additional Wimbledon search, in order to tighten the imprecision of twdocs’ own initial foray for the word. As I observed in the earlier posts, not every record retrieved by the site will necessarily contain the desired term (I told you I wasn’t asking any questions), and so we need to press another search for Wimbledon, this one under our own steam. Thus enter in M4:

=COUNTIF(text,”*”&L4&”*”)/d

Note the syntactical demands COUNTIF makes of us here. We’re looking for a word nested in a larger collocation of words – the battery of all our tweets – and so COUNTIF wants us to string-concatenate the word, in effect here:

*Wimbledon*

(and COUNTIF is case-insensitve; you could lower-case the W).

Once COUNTIF is understood its outcome is divided by d, rounded off say to two decimals, and the instances of Wimbledon per hour are duly cited. I got 7334.40, by the way. Look for Wimbledon now and your number should of course be different.

Once that’s working, you can enter as many supplementary search terms as you wish beneath Wimbledon in the L column, copying the associated formula for each term to develop a per-hour read for each.

Of course this latter strategy draws a subset from a subset. Once we’ve asked twdocs to find every instance of Wimbledon and then go on to interrogate those data for say, Murray or Federer, we’re really in effect hunting for tweets that feature both the words Wimbledon and Murray. But it’s impossible to know how many Murray-laden tweets are out there – the ones, that is, that don’t have Wimbledon as well. Thus if we’re trolling for all Murrays or Federers in the first instance, we’d need to put that question back to twdocs.

And if you want to itemize all the tweets containing both Wimbledon and Murray, after you’ve already called up Wimbledon from twdocs, you can count on COUNTIFS, the multi-criteria extension on COUNTIF. Assuming the search terms Wimbledon and Murray claim cells L4 and L5, for example, you’d enter:

=COUNTIFS(text,”*”&L4&”*”,text,”*”&L5&”*”)/d

 That should work. Now will the ballboy please get me my towel?

Hacks of Stacks of Wax: Billboard 100 Data, Part 1

2 Jul

You don’t read big data, you analyze it. No one unrolls themselves into their hammock, reaches for their mint julep, and thrills to that page-turner of a 300,000-row data set they’ve been craving to get at all winter. Big data is meant to revel in its bigness, favoring the curious with its possibilities for aggregated, patterned and macro-leveled largesse, and largeness.

But sometimes the revel is in the details. Now and then a big data set comprises a gigantic compound of molecular bits whose very protons might be of sufficient interest to make you put your julep on hold – and I’m thinking about the 59-years of variously memorable hits filling 309,000 rows of the Billboard top 100 workbook, playing its enormous medley here on the data.world site.

As indicated, the Billboard set recollects its chart-toppers all the way back to August, 1958, and if you’re just bursting to know for exactly how many weeks “She Loves You” oooed its way into the listings – and you probably are – or precisely when the epochal “Rapper’s Delight” first hip-hopped onto the rankings and your consciousness (15, and the week of November 10, 1979, respectively; but remember that the Beatles’ own German cover version “Sie Liebt Dich” also checked in for a week at 97 in June, 1964), you’ve assuredly come to the right place.

I don’t know about you, but I think the Billboard data – all 21.6 megabytes of it (i.e., you’ll have to download it yourself) – makes for a cracking good read – but it’s a spreadsheet, after all, and so some intriguing global findings should be in there, too. But as usual, the data need some inspecting before the work gets underway.

Note, for example, that the Peak Position and Weeks on Chart fields installed in columns I and J are, at least in theory, dispensable; one could derive both findings from a pivot tabling of the songs, subjecting Peak Position to a Min in Values, and then applying the song titles themselves to Values, realizing a count that would deliver a Weeks on Chart equivalent. That sparer approach would relieve the data of a slew of redundant entries, e.g. a song’s peak position appears identically for each week in which it appears.

If you’re wondering about the Instance field and what it means, you’re not alone. I originally supposed that it counts the number of times the same chart-bound song was performed by different artists (I use the term loosely), but that conjecture proved a false take. Rather, Instance seems to number a given version’s non-contiguous revisits to the charts. For example, Nicky Jam’s El Amante – a performer and song whose identities draw a pair of blanks in my uncomprehending mind – exhibits six instances; its debut at position 99 in the week of February 18, 2017 was succeeded by its disappearance the following week, only for the tenacious ditty to stage a three-week comeback dating from the week of March 4. Continuing to loll in the high 90s, El Amante submerged once again, before clambering back into 98 on April 4, etc. It last held its place in the rankings until the week of September 2, 2017, concluding its sixth instance – before it fell back into the oblivion it likely deserved.

Note in addition the SongID field, a unique identifier crafted by a concatenation of the entries in Song and Performer. Slightly curious is the retention of the formulas in their cells; their work has been completed, and could be paved over with a Paste > Values routine, an austerity move that serves to reduce the file’s size to 19.5 MB.

And if you’re wondering what purpose a song id might fulfill – that is, what analytical need would spur the assignment of an id to each song – I can think of at least one, one that returns us to an exigency with which I’ve contended before, and not optimally, as it turns out.

If we want to learn how many discrete songs clambered into the top 100 for any particular year we need – again – to do something about the recurring weekly appearances of the same songs, songs we want to count exactly once. I had expressed a similar wish, for example, in my posts on the Supreme Court Voting data, in which I wanted to count unique cases heard by the Court per year. I developed the count by embedding case data into the Rows area, where of course they’re enumerated but one time each. I then moved to analyze that satellite table instead.

But I’ve since learned that the above exertion is unnecessary, thanks to Excel frontliner Chandoo. He inclined my attention to an unassailably more elegant maneuver, that works like this:

But before I demonstrate, recall what I’m aiming to do: I want to pivot table a tabulation of the number of unique songs crashing the charts by year, and as such a prior step need be enacted upon the data before I set the table – I need to release year information from the WeekID field in B. That intention can be effected in several ways, but in the interest of simplicity I’ll scamper to next-available column K, call it Year, and enter in K2:

=LEFT(B2,4)

And copy down the column. That simple device releases the first four characters from each week id, which in every case offers up the year of the song’s chart entry (WeekId is text-formatted, by the way).

When the venerable Create Pivot Table dialog box opens, tick the Add this data to the Data Model box at its lower left (and someone tell the folks in Redmond it should be these data):

bill1

That tick activates Excel’s Data Model (which first made itself freely available in the 2013 release), an add-in that enables a number of data-querying enhancements, including the potential for building relational pivot tables. But our interest here is in those unique song titles, and so once you’ve executed the tick and the Data Model loads, advance to the pivot table (notice the slightly modified field list drawn up by the Data Model) and earmark Year for the Rows area. Next show SongID into Values, right-click into Summarize Values by, click More Options… scroll down and…

bill2

Wow – Distinct Count; what a concept. Click it, click OK, and I get (in excerpt):

bill3

(Note that the 1958 data are partial, encompassing only the last five months of that year. The 2017 listings extend to the end of October.) Subjected to a rudimentary line chart, the slope looks like this:

bill4

I’m not sure what sociological conclusions beg our attention, but the peak in song numbers in the 60s is marked, as is the decided slump in the ensuing years.

Put it this way: There’s something happening here/What it is ain’t exactly clear.

“For What It’s Worth”; hit the charts the week of January 28, 1967, and stayed there for 15 weeks. Peak position: 7.

The Trump Tweets Part 2: 140 Characters in Search of a Candidate

22 Feb

 

Having looked some technical issues in the eye in last week’s post and not having blinked, we can begin to consider exactly what it is that Mr. Trump is tweeting about. We’ve already explored the means for running key-word searches across tweets, by calling upon the COUNTIF function and playing a few wild card indicators along the way. Thus if we’re looking for instances of the word “president” down, say, A1:A100, we’d enter

=COUNTIF(A1:A100,”*”&”president”&”*”)

And that expression should realize that count. Note that what Excel reads as *president* is conveyed in the cell by textual elements that are concatenated (though of course you could enter a cell reference to the word “president”).

Once understood, you could enter a collection of key words down a column with accompanying COUNTIFs (and here I don’t mean COUNTIFS, a separate function). For starters I entered these in a blank sheet:

tr1

I then named the range of tweets tweets, and copied the inaugural COUNTIF formulas (pun not intended), each of which divided the results by COUNTA(tweets) so as to facilitate percentage readings, and formatted appropriately. In other words, the formulas look something like this:

=COUNTIF(tweets,”*”&A3&”*”)/COUNTA(tweets)

I got these results:

tr2

Mr. Trump likes to tweet his name, apparently, though a comparative check of self-references by the other candidates would have to be conducted as well in order to determine if Trump’s self-preoccupation is singular.

But there’s a problem in those data, nesting in the two cells counting tweets themed around Hillary Clinton. A COUNTIF searching for Hillary will pick that name up in tweets in which that surname appears in isolation, as well as cells that display Hillary Clinton. But if I likewise trigger a COUNTIF for instances of Clinton alone, that tally will likewise count the tweets bearing Hillary Clinton. If, then, I’m scanning three tweets for the names Hillary and Clinton, my combined COUNTIFs will total…four, even as I’m looking at but three tweets all told.

The problem, then, is a double-count, a species of numeric redundancy perfectly capable of derricking mounds of discredit upon your data. There are in fact 164 discrete Trump tweets in my data set that report the name(s) Hillary and/or Clinton; but throw a COUNTIF at Hillary and another one at Clinton, and the total burgeons to 195.

The double-count issue is known, and is taken up by Mike Girvin in his “Ctrl-Shift-Enter: Mastering Array Formulas”, one of the few books dedicating itself to that fearsome feature, in pages 158-159, for you exegetes. But the data and formulaic circumstances he describes here don’t seem to map themselves fitly to ours, and I’ve yet to arrive at what I would extol as an ideal, elegant take on the problem. What I do have – and this works, given the particulars of our count search – is this, assuming I’ve entered the terms Hillary, Clinton, and Hillary Clinton in cells: A1:A3:

=COUNTIF(tweets,”*”&A1&”*”)+COUNTIF(tweets,”*”&A2&”*”)-COUNTIF(tweets(“*”&A3&”*”)*2

In fact the expression is relatively simple; it counts all tweets bearing the term Hillary, adding those to all featuring Clinton, and proceeds to subtract all the instances of Hillary Clinton – because these have been already recognized by the first two COUNTIFs. And as a result the 6.28% Hillary/Clinton tweet aggregate registered above reduces to 5.28%.

Now don’t be fooled, as I was, at least momentarily. You might suppose that because those 31 Hillary Clintons contributed their citations to both the Hillary and Clinton counts you might have to subtract that confounding influence twice. But you don’t, because again we’re inventorying every Trump tweet in which some definitive reference to Hillary Clinton appears, and we want to count each such tweet once. Now since the COUNTIF for Hillary indeed tapped all the Hillary Clinton tweets as well, that latter entry’s numeric contribution has duly noted. But because the COUNTIF for Clinton conscripts all the Hillary Clintons as well, we have to counter this second counting with the subtraction that informs the above formula.

The reason I’m not thrilled with my patchwork solution – which again, after all, works – is because if, for example, other nominal references to Hillary Clinton – e.g., Rodham (her family name), HRod or Gert – walked their way into the tweets in multiple tandems, say Hillary Rodham or HRod Clinton, these two would likewise have to be mustered into any COUNTIF formulation, so as to swerve around the double-count bugaboo crouching in the bushes. An ideal formula, it seems to me, would entertain a series of OR statements that would progressively rule out a second counting of any cell that had already fulfilled a prior OR stipulation. My expression, on the other hand, counts the tweets range three times. But I haven’t attained that stated ideal yet – though maybe you will.

Another point. The previous post wondered if the times recorded on my dataset comported with those in which Trump actually dispatched his tweets – presumably on the East Coast of the States – or mine, in London, the GMT starting point. I put this question to TWDocs last week – unavailingly thus far – but it now seems clear that the times in my set reflect my clock’s position, not that of the tweeter. One tweet in excerpt describes the irrepressible candidate cavorting “Backstage with @jimmyfallon before opening skit – great fun!” Time stamp: 3:57:47 AM, surely by Trump’s reckoning five hours earlier. Given this thoroughgoing discrepancy, then, it might be a good idea to restore the times at which the tweets were actually composed to the data.

Since all the times in my data (and remember the we had drawn up an independent Time field in the previous post) are clocked exactly five hours ahead of Trump’s one would think that each formula in the Date field could be treated to a decrement of five hours or 5/24, the fraction of a day spanned by five hours, e.g. =VALUE(MID(A19,12,8)-5/24. That codicil usually – but doesn’t always – work, because subtracting 5/24 from times falling between 12:00 and 5:00 AM takes us – here goes – to the other side of midnight, and while those ensuing negative numbers might work for an Einsteinian, they won’t compute here. What we need, then, is something like this, understanding that various routes could be steered here. Neatness might be served by this approach: open a new column bestride Time and enter in what is now D4:

=IF(C4<5/24,C4+19/24,C4-5/24)

I know – everything looks the same in there, but the expression is asking if the time in the partner C (Date) field is sometime earlier than 5:00 AM. If so, the formula adds 19 hours, via a kind of back door to the proper time. After all, add 19 hours to 3:00 AM and you get 10:00 PM, which also happens to be five hours before 3:00 AM, of course (and here we’re interested in times exclusively, not the date on which it was registered). All other times receive the five-hour reduction, after which a copy-down the column followed by a Copy > Paste > Values to the Date field would be in order; now you can delete the D column.

That all seems to work. But of course the Trump campaign is now unfurling its tents in Nevada for the next primary go. Do I need to start thinking about 7/24?

The Trump Tweets, Part 1: 140 Characters in Search of a Candidate

11 Feb

On this at least we can agree: that Donald Trump’s conversational style is not of a piece with say, that of Noel Coward or de la Rochefoucauld, not even the printable utterances. On the other hand, those latter epigrammarians were not able to run for the Presidency of the United States, and Mr. Trump is; and as he presses that Constitutional entitlement beyond what some might deem the breaking point it occurred to me the electorate might be productively served by a look at his Tweets, those on-the-record pronouncements issuing from the mind and the smart-phoning hand of the man who would be commander-in-chief (gulp).

And toward that end I returned to an old companion, the twdocs.com site and its Tweet-dredging apparatus, even as it now charges for its excavations. And so after shelling out $7.80 for the privilege, I was presented with a download of the candidate’s last 3106 tweets, those at least through February 11, two subsequent to his win in the New Hampshire primary. (I’m also not sure if I can present you in turn with my finished product for download, as I’ve paid for it. So until my legal department rules otherwise I suppose you’ll have to spend $7.80, too, understanding of course that, depending on the time and day on which you order your tweet collection, it’ll depart slightly from mine )

The first analytical order of business is to do something about the dates holding down the created_at field in the A column:

tweet1

Surprise – they’re not dates yet, at least not by Excel’s formatting lights. All those +0000 2016s get in the way of proper date standing, as do the Wed, Tue etc. day-of-week qualifications that likewise can’t comport with the application’s date-reading defaults. But the task of repacking these now-textual data into usable dates is considerably trickier than I first thought, unless I’m missing something terribly elementary and decisive.
First break open a new column between A and B and call it Date. In what is now B4 enter:

=VALUE(MID(A4,5,15))

The MID puts the tweezer to the text label in A4, culling the 15 characters proceeding from position 5 in the cell. (We’re happy to take advantage of the fact that all the A-column data comprises 30 characters, freeing us to copy the above expression as it stands all the way down B.) And because that MID result remains textual, the VALUE supplement completes A4’s changeover to a duly accredited quantitative entry: 42410.56765. Reformat that handsome value to Short Date mode and the whole process culminates in an eminently usable 2/10/2016. So what’s tricky about that?

Not much, to be sure. The tricky bit doesn’t start playing up until cell B608, whose partner in the A column records the first of Trump’s 2015 tweets. B608 evaluates to 12/31/2016 – and that’s because considered alone, apart from some qualifying year reference, Dec 31 23:21:49 defaults to the current year (and it’s the two-year spread of the data here that works against approach to this problem I offered in this post). Enter 12/31 in any cell, after all, and Excel gives itself no choice but to assume that the implicit year in there is the self-same one in which the datum was entered. The 2015 in A608 won’t help, either, because the label Dec 31 2015 can’t be properly read by VALUE – but Dec 31, 2015 is automatically read as a date.

Thus if upon returning to B4 we try

=VALUE(MID(4,5,6)&”, “&RIGHT(A4,4)

That does seem to work.

That rewrite in effect string-concatenates “Dec 31, ” and “2015”, the latter having been made available through the RIGHT addendum, into Dec 31, 2015, and again Excel can re-evaluate that version into a bona fide date. (Note the “, ” interpolation, complete with the space that follows the comma.) You’ll observe, of course, that we’ve barred the time data from the expression, but we can still pull times out separately if we need them.

While you wait for the dust settles superimpose a Paste > Value from the B data back onto itself in B, and format the field in Short Date terms; now we can group the tweets by chronological units, as per this pivot table, for example:

Row Labels: Date (Group by Months and Years)

Values: Date (Count)

I get

tweet2

The spreadsheet picks up on August 25, thus accounting for that month’s small accumulation, and again the February 2016 numbers halt at the 11th. The large October complement appears to feature many tweets Trump has quoted from other tweeters (these are apparently not categorical retweets), but even so doesn’t that possibility doesn’t explain the extent of quoting activity that month. Note that, even interpreted pro-rata, Trump’s February tweets fall behind January’s velocity, a slowing perhaps a creature of his stepped-up primary campaigning this month.

Now if tweet times-of-day do interest you push through a new column after Date, call it Time, and in what is now cell C4 enter

=VALUE(MID(A4,12,8)

That is, we’re reprising VALUE and MID here, this time recalling the eight time characters starting at position 12 in each cell. Again copy down the column, and Paste > Value atop itself. Then this pivot table should tell you want you want to know:

Row Labels: Time (Group by Hours only; turn off the default Months selection by simply clicking it).

Values: Time (Count)

I get

tweet3

What I think is happening here – and I have to check with TWDocs on this – is that the tweet times have been transposed to my GMT zone, five hours ahead of the East Coast in the States. If I’m right, then the curiously high tweet total for 3AM really harks to 10PM Trump time.

But I’m not sure – let me tweet Mr. Trump and ask him.