Word Counts Refined: Part 1

23 Apr

Last week’s keyword search of the Conservative Manifesto was happy to profit from a quirk of that quest: my apparently inadvertent success at finding a place for each of the words of the Manifesto’s copy-and-pasted text in their very own cells, e.g.

[screenshot: text1]

And once having got that far, we easily go about the task of piling keyword counts of Manifesto themes, by directing a series of COUNTIFs at the word isolates. But almost by definition quirks do not predominate, and the post allowed that most copies-and-pastes of text will, rather, assume this alignment:

[screenshot: text5]

In which each line remands its text to a single cell; and if the keyword search is to find its quarry here, a Plan B, such as the one I screen-shot last week:

{=SUM(LEN(SUBSTITUTE(TRIM(A2:A100)," ","  "))-LEN(TRIM(A2:A100)))+COUNTA(A2:A100)}

might have to answer the call. But could there be a more economical, user-friendlier Plan C out there, one that would be happy to look no further than the COUNTIF that’s served us so reliably till now? There is, I think, though you’ll need to start the process with another reliable friend – none other than Word.
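For readers who find the Plan B array formula above opaque, here is a minimal Python sketch of what it appears to compute: the number of spaces in each trimmed cell, summed, plus one for every non-empty cell, which together give a total word count. The function names are hypothetical, and the stand-in for Excel's TRIM is an approximation that handles ordinary spaces only.

```python
def excel_trim(text):
    # Approximates Excel's TRIM: drop leading/trailing spaces and
    # collapse internal runs of spaces to a single space.
    return " ".join(part for part in text.split(" ") if part)


def plan_b_word_count(cells):
    # LEN(SUBSTITUTE(TRIM(x)," ","  ")) - LEN(TRIM(x)) is the number of spaces in x.
    spaces = sum(excel_trim(cell).count(" ") for cell in cells)
    # COUNTA counts the non-empty cells; each contributes its first word.
    non_empty = sum(1 for cell in cells if cell != "")
    return spaces + non_empty


lines = ["We will cut  taxes", "", " and balance the budget "]
print(plan_b_word_count(lines))  # 8
```

Note that this yields a count of all words, not of any one keyword, which is part of why a COUNTIF-friendly layout is worth pursuing.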

But, you may want to ask, why would we want to resort to what is in effect an add-in, however familiar? After all, and as intimated last week, we could bring about the dicing of words we see in the first screen shot above by pumping the data through Excel’s Text to Columns parser, lodging Space as our delimiter. That question is rightly posed, but let us see how the text breakout according to Word serves us better.

First, copy and paste the relevant text into Word. Note, by way of additional preliminary, that if the text is PDF-sourced, as is the Conservative Manifesto, you can instead Save As the doc as a bare text file, a healthful austerity measure that should streamline the editing process – because if you’re working with an ornately-designed PDF you’ll probably want to vet the content for pull quotes, e.g.

[screenshot: text2]

and recurring, redundant headers (The Conservative Manifesto 2015) that pull at and skew the aggregate text totals.

Next, select all the text and click (remember, we’re in Word now) Insert > Table > Convert Text to Table. Once you check in, enter 1 in the Number of columns field of the ensuing dialog box:

[screenshot: text3]

But why? We want a nice, minimalist 1 up there because we’ll soon want to cast the data into a pivot table, and a single Word table column translates into what is, after all, a single pivot table field. And if you want to know exactly why we’d want to pivot table the words, when we spent all of last week’s post doing nothing of the sort, I hope to establish relevance, as the lawyers say.

And before you take leave of Convert Text to Table, click Other beneath the Separate text at heading and type a space in Other’s rectangular field – because the spaces in the document are to act as delimiters of the words on either side of them, chipping each into a table cell. Click OK and let the table pull the text into its unitary column. (Note: my laptop can’t distill the 30,000-word Conservative Manifesto into a one-column table. For one thing, my Number of rows reports a negative number when I earmark but one column for the table, and that can’t happen. If you were determined to word-count the Manifesto or any other very large document via this means you might have to table the text in installments.)
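As a rough analogy for what the space-delimited, one-column conversion does, here is a small Python sketch. The function name is hypothetical, and the handling of line breaks (each paragraph starts a new row, so an empty line leaves an empty row) is my assumption about how the conversion behaves, not something the dialog spells out.

```python
def text_to_one_column(text):
    """Split text at single spaces, one word per 'row'; empty lines become empty rows."""
    rows = []
    for line in text.split("\n"):
        # A space acts as the delimiter; punctuation stays attached to its word.
        rows.extend(line.split(" "))
    return rows


sample = "Lower taxes, now.\n\nA strong economy"
for row in text_to_one_column(sample):
    print(repr(row))
```

The punctuation that rides along with "taxes," and "now." and the empty row left by the blank line are the two loose ends that still need tidying before any counting can be trusted.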

Once the column drills down its pages, we then want to do something about the non-alpha accoutrements attaching like burrs to the words – that is, the sundry punctuations that deal a joker to COUNTIF wild-card searches. By way of review, if we’re counting instances of the word “tax” and we’re confronted with “tax,” a COUNTIF on the order of

=COUNTIF(A1:A100,"tax")

won’t add “tax,” to its tally. And if you interject the perfectly legal "tax"&"*" instead, you’ll net “tax,”, but also “taxation” and “taxi”; and that’s probably not what you had in mind either. Rid the text of commas, quotes, and the like, and a clean search for “tax” gets cleaner, and it’s here where Word’s far mightier Find and Replace feature justifies our recourse to that application (for an eye-watering treatment of its herculean capabilities see Tibor Környei’s piece here).
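The exact-versus-wildcard trade-off can be mimicked in Python. The sketch below is only an emulation: `countif` is a hypothetical helper that leans on the standard library's `fnmatch` for the `*` wildcard and lower-cases both sides to imitate COUNTIF's indifference to case. It does not reproduce every COUNTIF rule (the `~` escape, for one, is ignored).

```python
from fnmatch import fnmatchcase


def countif(cells, criterion):
    """Count cells matching a COUNTIF-style text criterion, ignoring case."""
    pattern = criterion.lower()
    return sum(1 for cell in cells if fnmatchcase(cell.lower(), pattern))


cells = ["tax", "tax,", "Tax", "taxation", "taxi", "budget"]
print(countif(cells, "tax"))   # 2: misses "tax,"
print(countif(cells, "tax*"))  # 5: catches "tax," but also "taxation" and "taxi"
```

Neither answer is the count of the word "tax" we were after, which is why stripping the punctuation first is the cleaner route.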

Start up Find and Replace and enter:

[screenshot: text4]

Note the Use wildcards ticked box; you need to click the More button (the one that presently shows <<Less, above) in order to find it.

And what’s this bit of cryptography doing? It’s Find and Replace’s way of searching for all non-alpha characters (the [!] signifies not; the [A-z] span looks for every alpha irrespective of case) and replacing them here with nothing; and what’s left are alphas only – the stuff of words. And it should work. Depending on the length of the excerpt with which you’re working, you may want to downsize its operative font substantially, thus manufacturing a smaller complement of pages (remember, you’re not interested in reading the words here anyway; they’re to be pasted to Excel). Note also that, for whatever reason, some non-alphas, e.g. [ and an underscore, don’t seem to capitulate to the Find and Replace routine. Because, as we’ve seen, [ is a bit of Find and Replace code, it may be interpreted as such and left alone; it may also be that the A-z span, running as it does through the character codes that sit between Z and a, takes in [ and the underscore along with the letters. You can still purge these recalcitrants, though, via a standard Find and Replace – that is, Find [ and replace it with nothing.
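A Python regular expression can illustrate both the clean-up and one plausible reason the bracket and underscore survive it. This is regex syntax, not Word's wildcard syntax, and the assumption that Word's A-z range follows the same character-code order is mine; the function names are hypothetical.

```python
import re


def strip_non_alpha(word):
    # Two explicit ranges: only the 52 unaccented letters are kept.
    return re.sub(r"[^A-Za-z]", "", word)


def strip_with_wide_range(word):
    # A-z spans character codes 65 to 122, which also covers [ \ ] ^ _ `
    # (codes 91 to 96), so those characters are treated as "alpha" and kept.
    return re.sub(r"[^A-z]", "", word)


print(strip_non_alpha("tax,"))              # tax
print(strip_non_alpha("[tax_rate]"))        # taxrate
print(strip_with_wide_range("[tax_rate]"))  # [tax_rate]
```

If the same ordering holds in Word, spelling the span out as two ranges (A-Z and a-z) inside the brackets might spare you the follow-up pass, though I have not verified that against every Word version.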

And once your column of words has completed its run down the pages, copy and paste the whole thing to a spreadsheet, say to B2 (reserve B1 for a field heading, say Words). Next, make way for an old trick: enter 1 in A2, 2 in A3, jot a field header in A1 (e.g. Number), and run an autofill down the column (though you’ll probably have to drag the fill handle in lieu of the standard double-click of the handle, because the latter fill mode will come to rest alongside any blank cell in B, and blanks there should be). We’re numbering here because we want to dispose of the empty cells in B (ones that have likely received the source text’s Enter/line break codes, transported to the spreadsheet as blanks). Sort B by A to Z and let the blanks drop to the floor of the column. Insert a blank row between the bottom word and first blank, and then sort by the numbers in A, in order to restore the primeval word sequence of the text.
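The number, sort, cut, and re-sort trick can be traced step by step in Python. This is only a sketch of the logic, with a hypothetical function name; in code one could simply filter out the blanks, but the longer route below mirrors what the spreadsheet steps are doing.

```python
def drop_blanks_keep_order(words):
    """Remove empty entries while preserving the original word sequence."""
    # Column A: sequence numbers; column B: the words (blanks included).
    numbered = list(enumerate(words, start=1))
    # Sort column B A to Z; blanks fall to the bottom of the column.
    by_word = sorted(numbered, key=lambda row: (row[1] == "", row[1].lower()))
    # The inserted blank row cuts the blanks off from the data set.
    kept = [row for row in by_word if row[1] != ""]
    # Sort by the numbers in column A to restore the original order.
    restored = sorted(kept, key=lambda row: row[0])
    return [word for _, word in restored]


print(drop_blanks_keep_order(["we", "", "will", "cut", "", "taxes"]))
# ['we', 'will', 'cut', 'taxes']
```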

Now you’re ready to word count via the COUNTIF deployments we described last week. But again, apart from the non-alpha/punctuation sticking point we’ve unstuck here, COUNTIF had already begun to do this work last post. So what’s new? This: a pivot table, an elementary one:

Row Labels: Words

Values: Words

And that rudimentary construction delivers a global word frequency count of each and every word in the document, and as such affords an encompassing look at every potential keyword. Of course you’ll have to look past the irrepressibly populous “the”, “and”, “or”, “of”, and kindred gear teeth before you look any substantive keyword in the eye, but that’s part of the cost of doing business.
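For comparison, the same global frequency count can be sketched in Python with the standard library's `Counter`. The function name and the tiny stop-word set are hypothetical illustrations, and lower-casing the words is my assumption, made so that "Tax" and "tax" land in the same row as I would expect them to in a pivot table.

```python
from collections import Counter

# A deliberately tiny, illustrative stop-word set; a real one would be longer.
STOPWORDS = {"the", "and", "or", "of"}


def word_frequencies(words, skip=frozenset()):
    """Count each non-blank word, ignoring case, and drop any words in skip."""
    counts = Counter(word.lower() for word in words if word)
    for word in skip:
        counts.pop(word, None)
    return counts


words = ["the", "budget", "the", "Tax", "tax", "", "of"]
print(word_frequencies(words).most_common(2))             # [('the', 2), ('tax', 2)]
print(word_frequencies(words, STOPWORDS).most_common(2))  # [('tax', 2), ('budget', 1)]
```

Sorting the pivot table's count column in descending order gives the spreadsheet equivalent of `most_common`.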

But there’s something else to consider. We’ve done a pretty good job of readying individual words for the counting. But what if we want to count a pair of them, conjoined in a phrase? We can count “budget” and count “deficit” now, but what about “budget deficit”? And no; it won’t do to count each of the two and simply sum the results, because that totalling won’t speak to instances of their adjacency. So what to do?

I think I have a workaround, so hold on. I know you love a teaser.
