Archive | Uncategorized RSS feed for this section

The World’s Airports: Checking In With the Data

9 Apr

Looking for a sure-fire ice-breaker to rustle that awkward silence? Try asking this one: how many airports roll their runways out for flyers worldwide?

I mean, if that question doesn’t electrify your partner into a conversational frenzy then…well I do go on.

Anyway, the answer as of this writing, is 50,863. That’s a pretty big number, gushing over and above a Quora’s respondent’s count of 43,983, tallied in 2016. My source, David Megginson’s Ottowa-based ourairports site, reports the larger figure, and had actually added 20 airports to the count by the time I accessed the data anew (the airports.csv workbook), and after my having performed the analyses below.  By the time you get to the site, who knows?.

The data for the most part seem ready to go, and as such are answerable to a number of standard but useful questions, starting with a pivot-tabled distribution of the types of airports:

Rows: Type

Values: Type

Type (again, % of Column Totals)

I get:

air1

We see that nearly 65% of all airports qualify as small, the sorts at which you’ve likely not disembarked (I’ve discovered a definition of “small” minted by the United States’ Federal Aviation Administration, denoting a facility receiving .05% and .25% of the nation’s boardings, but of course that understanding is US-specific). You also probably haven’t touched down at one of the 20 world’s balloonports, 17 of which get airplay in the US. And I for one am slightly surprised by the substantial representation of heliports in the census. But copters, remember, started hovering long before your mouse did.

You’ve also doubtless duly recorded the count of closed airports, begging the analytical/presentational question whether that not-insubstantial cohort should be tolerated by the dataset, or otherwise grounded. By running a standard data filter on the set, filtering both for “closed” and “CA” in ISO_country, it can be seen that 629, or more than a quarter, of shuttered airports are based in Canada. Whether that figure perpetrates a recording bias by the workbook’s Canadian authorship I don’t know, though I suspect something else is at work here. I suppose that’s where you come in. (ISO country codes are issued by the International Organization for Standardization, and are set forth here.) My inclination is to remove these dormant locales, but only impermanently, via the storied fall-back: run a Find and Replace at “closed” in the C column, replacing every such instance with ZZ, sorting C by A to Z, and threading a blank row between 48541 and 48542, thus ruling the erstwhile closed entries out of the data, but keeping then on call for retrieval if you need them.

We could then break out airport totals by country, for starters:

Rows: ISO_Country

Values: ISO_Country (sorted largest to smallest)

ISO_Country (again, % of Column Totals)

I get:

air2

The United States’ remarkable numerical dominance of the airport count (and that’s Brazil at number 2, a perhaps no less remarkable standing as well) requires another look, or set of looks. If we introduce a Slicer to the table and filter there for medium and large airports the picture changes:

air3

The American proportion slims greatly, reflecting the weeding of small airports from the set. In fact according to Wikipedia only 16 of the world’s 50 busiest airports are US-sited, and if we slice for small_airport instead we get:

air4

America’s rule is restored, trailed again by Brazil.

Now if you wanted to break out American airports only, and by state, you’ll have to direct your attention – and formula-writing acumen – the iso_region field in J. We see that two-character state codes sidle the right side of the region-keying hyphen, e.g. US-NJ. To extract state codes, then, we could try this expression in S2, the first pertinent cell in the nearest free column (you’ll also want to name this budding field, say US State):

=IF(I2=”US”,RIGHT(J2,2),”ZZ”)

The formula inspects the country-bearing column J for US; if it finds it, it tugs two character from the right of the kindred cell in J. Once put in place try, for starters:

Rows: US State

Columns: US State (Count, of necessity)

I get, in excerpt:

air5

Two loose ends among the data need to be tied. First, the -A item that’s sorted itself to the head of the labels clearly promulgates an error, or given its count, a pair of errors, likely of the data-entry kind. By double-clicking the count cell of 2 an insurgent sheet adduces the two wayward records, including the airports’ latitudes and longitudes. By copying and pasting those coordinates into this site, I was told that the airports in fact operate in Indiana and Georgia.

That realization in turns calls for a find-and-replace of the errant data, but it seems that feature just won’t work with formula results, presumably because in our case you’d wind up replacing a formula with a hard-coded textual substitute and Excel wants to protect you from the inconsistency. You thus could run a copy > paste values atop S and then replace -A with IN for the record listing a 39-degree latitude, and exchanging GA for the -A lined up with the 33-degree measure (you could also find the airports’ respective Idents in column B and replace the state names in the records themselves).

The second restorative is a simple one, a filtering out of the ZZ, non-American airports. You can then sort the airport totals, largest to smallest:

air6

Ten percent of all US airports, then, serve flyers in Texas, with runner-up California counting half as many venues. Appoint type to the Columns area and you’ll see, in part:

air7

Note the proportion of heliports to all airports in California, though in fact a scan down the numbers puts the ratios, if not the absolute numbers, in Massachusetts, Connecticut, and New Hampshire even higher, for example.

Guess those flyers are in it for the short haul.

NHS Prescription Costs: Big Money, Startling Consistency

30 Mar

Huge organizations, huge budgets, huge pressures, huge controversy. That chain of indissoluble links clenches around the latest pronouncement of the UK’s National Health Service (NHS), heralding a cutback on “low priority” items that are otherwise available, and affordable, at local pharmacies – e.g. gluten-free products and hay-fever drugs.

Of course that is wherein the controversy lies, and if you’re interested in deeper contextual information about NHS prescription practices you’ll find it among a set of NHS data sources, including this one. That site provides what it terms Commissioning Group Prescribing Data across four quarterly spreadsheets for 2016, and for England (but not Northern Ireland, Scotland, and Wales); and by opening each of the four and copying and pasting it into a solitary workbook, the year’s prescription activity is consolidated, in 154,000 rows worth of data.

A large data set, but perhaps not optimally so. Each record is summative, after a fashion, totalling the number of prescriptions – or really, the number of products or items prescribed – for a medical speciality (Chapter Name), by a particular UK area team, e.g. (note the Items field):

pre1

Thus the first record counts 36 product-item prescriptions for gastro-intestinal treatments; and as such one might want to see that entry drilled down into 36 rows of information, resolving to prescription/product-level granularity. The analytical advantage there: a date-stamping of each and every item and a corollary enabling of useful time breakouts, e.g., items by week and/or month. (In fact some of that data are here, but for 2015, apparently; its billion or so records would overwhelm a spreadsheet, of course. and while these could be routed into Power Pivot – an add-in we haven’t explored here – the process might also exact a prohibitive download time, and so perhaps could be considered at another point.)

Note as a consequence that the 154,000 rows don’t total the numbers of items; that figure emerges only after a simple summing data in the Items field, returning 1,091,518,463 items (or again, prescribed products) for the year. Note in addition that because the quarterly demarcations in the A column are geared to the UK fiscal year – April to March – and not to calendar 2016 – what positions itself here as 4th Quarter data in fact recalls prescription activity for the actual first quarter of 2016, with 3rd Quarter numbers keyed to October through December of the year. In the interests of clarification, then, I’d run a series of Find and Replaces at the field data, something like this:

pre2h

And so on, so as to map the data fitly to calendar 2016.

Once that bit of administration is concluded a number of substantive breakouts suggest themselves. You may want to inaugurate the analysis, for example, by breaking out item totals by the broad Chapter Names in G:

Rows:  Chapter Names

Values: Items

Items (again, % of Column Totals)

I get:

pre3

You may want to comma-format the values, but the percentages convey the message in any case. We see that Cardiovascular and Central Nervous System account for nearly 42% of all product items. By means of follow-up we could refer Chapter Names to a Slicer and roll Section Names – categorical subsets of each Chapter Name – into Rows. If I tick the Slicer for Cardiovascular System I get:

pre4

Not surprisingly Hypertension and Heart Failure product head the list, though only very slightly ahead of Lipid-Regulating Drugs, which include statins.

Slicing for Central Nervous System products I get:

pre5

Analgesics (pain killers) and antidepressants contribute about 65% of the product total; the paucity of Obesity-treatment items is perhaps striking, however, given the incidence of the condition across the UK.

For another instructive measure, albeit a crude one, we could distribute item totals by England Area Teams. First, I’d tick the Slicer’s Clear Filter button:

And then delete the Slicer. Because I may want to return Chapter Name to the pivot table, clearing the Slicer’s filter selection sees to it that all the Chapter Name items will reappear in the table when requested. Otherwise, the Slicer’s filter will remain in force.

Then we can move assemble this pivot table:

Rows: Area Team Name

Values: Items (Sum, % of Column Total, sorted Highest to Lowest)

I get:

pre15

The percentages are clear but crude, in view of missing population data that would serve as a necessary corrective. Drag Chapter Name into Columns and, in part, you’ll see:

pre8

You’d doubtless need to take a series of long hard looks at the percentages, which don’t always reflect the overall proportions throughout.

And as for a simple, slightly coarse assessment of item totals by quarter:

Rows: Quarter

Values: Items

Items (again, by % of column totals)

I get:

pre9

The numbers are notably constant, and hence just a little provocative. Are these comparable aggregates an expectable consequence of the outlier-smoothing might of very large numbers, or are doctors somehow phasing prescriptions evenly across the year? I suspect the former, but there’s an investigative tack for you.

And yes, the Row Labels haven’t filed themselves into chronological order, have they? They’ve rather assumed a purely alphabetical progression, simply because they’re labels. Excel can’t construe Apr-Jun as a unit of time, and so resorts instead to the standard sort protocol. If you want the entries to adopt a faux chronology you’ll need to right-click the Jan-Mar label (and take pains to click only that cell) and click Move > “Jan-Mar” Up.

And what of the item costs? That, after all, is what’s impelling the NHS to rethink its prescription allowances. As long as we’ve installed the quarterly labels we can substitute Actual Cost for Items in the Values area, formatting the results into British pound currency:

pre10

(Note the distinction in the dataset between Actual Cost and NIC, or Net Ingredient Cost. For a definitional clarification look here. Actual Costs nearly always tend to fall a bit below NICs.)

Again, the totals evince a remarkable sameness, no less provocatively. In any event, that’s 8 billions pounds worth of prescribed items.

Next we could substitute Chapter Name for Quarter, and drag Actual Cost into Values a second time and recondition via % of Column Total:

pre12

The totals are lucidly apparent, but of course don’t break down into the kind of items the NHS wants to exclude from their reimbursements. If you again cast Chapter Name into a Slicer and filter for the largest Central Nervous System category, we’ll get:

pre11

More detail needed, perhaps. But for cost accountants, it’s a start.

New York City Film Permits – Coming to a Neighborhood Near You

19 Mar

Want to call a bit of New York real estate all your own? Just request a film permit from the city and, once granted, you’ll be able to arrogate, as the city puts it, “the exclusive use of city property, like a sidewalk, a street, or a park” – at least for a day of two.

Quite a bit of this sort of short-terming goes on in Manhattan these days, and in the other four boroughs too; and you can arrogate the dataset devoted to all that activity, and for free, on the New York city Open Data site via this link:

https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p/data

Once you’ve gotten there, just click the blue Export button holding down the center-right of the screen, and select CSV for Excel once the Download As menu appears.

per1

The data commit to cell form (as opposed to celluloid), nearly 40,000 permits secured between January 1, 2012 and December 16, 2016 (as of this writing; the data are ongoingly updated) somewhere around the city, and not just of the cinematic oeuvre. Television, commercial shootings and what are termed theatrical Load in and Load Outs – presumably scenery and equipment installation work for on-stage productions – are likewise inventoried.

As per so many open data renderings, a quick autofit of the sheet’s columns need be clicked, and you’d also probably want to let the EventAgency and Country fields in F and M respectively fall to the cutting room floor; their cell entries are unvarying and as such, unneeded.

A first, obvious plan for the data would have us break out the number of permits by New York’s five boroughs. This straightforward pivot table should serve that end:

Rows: Borough

Values: Borough

Borough (again, show values as % of Column Total)

I get:

per2

Ok – the field headers need retitling, but the data speak clearly to us. What’s slightly surprising, at least to me, is the smallish majority accruing to Manhattan. This native New Yorker would have thought that the storied island’s mystique would have magnetized still more filmic attention to itself; but perhaps Brooklyn’s latter-day repute for hipness packs a measure of ballast into the weightings. A very quick read of the Brooklyn zip (or postal) codes (column N) attracting the cameras turns up neighborhoods in the borough that border the East River, e.g., the perennially photogenic Brooklyn Heights (i.e. lower Manhattan skyline looming large in the background) and trending Williamsburg, for example.

We could then proceed with a natural follow-on, this a distribution of permits by category:

Rows: Category

Values: Category (Count)

Category (again, by % of Column Total)

I get:

per3

(Note: for information on New York’s permit categories, look here.) Television permits rule decisively; presumably the repeated filming regimen demanded by weekly series explains the disparity.

Now let’s break out category by borough, something like this:

Row: Category

Column: Borough

Values: Category (% of Row Total)

And we should turn off grand totals for rows; they all add to 100%.

I get:

per4

Remember that the percentages read across. I’m particularly surprised by Manhattan’s relatively small plurality of movie and television shoots, and by extension Brooklyn’s relative appeal. What’s needed here, however, are comparative data from previous years; for all I know, nothing’s changed among the borough distributions. (Remember as well that the above table plots percentages, not absolute figures. Exactly one Red Carpet/Premiere was shot during the five years recorded here.) Note at the same time Manhattan’s huge differential among Theater permits, a predictable concomitant of its concentration of Broadway and off-Broadway venues.

And what of seasonality? We can answer that question easily enough by installing StartDateTime into Rows – grouping by Months and Years – and pitching the same field (or really any field, provided all of its cells are populated with some sort of data) into values via Count.

I get (the screen shots have been segmented in the interests of concision):

per5

Among other gleanings, October seems to present a particularly attractive month for filmmakers, though the reasons why would and probably could be searched. Note in addition the spike in permit activity in 2015, and the fairly pronounced retraction in permits last year, at least through December 16.

But permit counts don’t tell us about the duration of the shoots, which naturally vary. But those data are here, and are welcomingly precise. To calculate permit lengths, all we need do is fashion a new field in the next free column (its position depends on whether you’ve deleted the superfluous fields I identified above), call it Duration, and enter in row 2:

=D2-C2

That paragon of simplicity yields a decimal result, quantifying the proportion of the day awarded the permit holder. Copy it down the Duration column and then try this, for starters:

Row: Category

Values: Duration (Average, rounded to two decimals).

I get:

per6

Remember we’re working with fractions of days; if you wanted results expressed in hourly terms you’d need to supplement the formulas in Duration with a *24 multiple.

We see a notably consistent range of average permit time allotments across categories with the obvious exception of Theater, whose set-up needs appear to require substantial permit times. Remember that a duration of .63, for example, signifies about 15 hours.

And if you simply add the durations (at least through December 16), the aggregate permit day count evaluates to 32,843.97. Divide the result by an average 365.25-day year, and 89.92 years worth of permit time report themselves, across a jot less than five years. That’s a lot of New York that’s been declared temporarily off-limits to the public.

Now you may also want to distribute permit prevalence by New York’s zip codes, but here a pothole obstructs the process. Because the areas requisitioned by the permits often straddle multiple codes, that plurality is duly recorded by the ZipCode(s) field, e.g. 11101,11222,11378.

But a workaround is at hand, though not the one I’d first assumed would have done the deal. What does seem to work is this:  First, range-name the ZipCode(s) field Zip, and in an empty cell – say R1 (or somewhere on a blank sheet), enter a sample zip code, say 10001. Then, say in R2, enter

={COUNT(IF(FIND(R1,Zip)>0,1))}

That’s an array formula, its entry requiring Ctrl-Shift-Enter, programming into its result those telltale, curly braces. The formula conducts a FIND of every cell in Zip for evidence of 10001; and when it finds it – somewhere in the cell – it posts a 1 to the formula, after which all the 1’s are counted. In this case I get 1950 permits having freed up a smidgen of acreage somewhere in the 10001 code, in the neighbourhood of the Empire State Building.

You can next copy and paste all of New York’s zip code listings into a blank spreadsheet area from here, entering New York City in the search field. Enter the above formula with the appropriate cell reference in the first of the listed codes and copy all the way down. If you sort the results largest to smallest, these codes return more than 1000 hits (note I’ve retained the web site’s neighbourhood field):

per7

That generic “Brooklyn” 11222 code topping the list points to the borough’s Greenpoint neighborhood, one of those districts hard by the East River, as is Queens’ Long Island City.

The formula that doesn’t work, contrary to my original surmise is

=COUNTIF(Zip, “*”&R1&”*”)

That’s in effect the one that had served us so well in my various key-word assays of Donald Trump’s tweets. It doesn’t work here because, for example, the zip code 11101 entered singly is vested with a numeric format; 11101, 11222,11378, on the other hand, stands as a textual datum, and COUNTIF likewise regards the two entries as different; and because it does, it won’t find both instances of the 11101s in the same formula. But FIND does.

Got that? I think that’s a wrap.

Chicago Homicide Data: Two Months, and Sixteen Years

8 Mar

Even given crime’s perennial, dark-sided newsworthiness, the burgeoning toll of homicides in Chicago has come in for recent, concentrated scrutiny. The New York Times has subject the recent spate of killings in the city to recurring coverage, and even the president has informed his followers that if the bloodshed isn’t staunched there he’ll send the “feds” in to enforce the peace.

While it’s not clear who the feds are, less uncertain data on the totals are available, and with them some perspective. A recent piece on the fivethirtyeight site reminds us that while Chicago’s homicide rates have indeed bolted upward in the past two years, the figures don’t approach Chicago’s deadly accumulations of the 90s – not exactly good news to be sure, but a measure of context, at least, that grounds the discussion.

And as could be expected these days Chicago’s Open Data portal makes the homicide data available to us, along with the records of other reported crimes. You can access the crime data by traveling to the portal and clicking this link on its home page:

chi1

(note that the different Crimes – 2001 to Present – Dashboard link on the page will take the user to a series of charts founded upon the data.)

I then filtered the crime data set for records beginning January 1, 2016:

chi2

 

If you’re doing the same, and have downloaded the filtered 300,000-plus records (again, these recall all reported crimes) you may want to disgorge some of their fields, ones you’re not likely to apply towards any analysis (e.g., the X and Y coordinates that in effect duplicate latitudes and longitudes, the Year field, whose information can be derived from Date, and the curious Location, whose records bundle crime-location latitudes and longitudes that already appear singly and more usefully in Latitude and Longitude).

And because our download request gathers records for 2017 as well we can begin to develop some inter-year comparisons between homicide rates.

It should be noted by way of additional preamble that Donald Trump’s tweeted augury about federal action dates back to January 25, at which point the Chicago homicide total for 2017 exceeded that as of the comparable date last year by 23.5%. But as we’ve since advanced more deeply into the year (my download runs through February 27), let us extend the comparison through this pivot table:

Rows: Date (Grouped by Month and Year)

Values: Primary Type

Slicer: Primary Type, and tick Homicide

(Note the deployment of the Primary Type field in both the Values and Slicer position – an allowable tack).

I get:

chi3

You’ll note that the 2017-2016 January/February differential has disappeared. Homicide totals are now nearly identical for the two months; again, the Trump tweet referenced a homicide total of 42 through January 25, and likened it to last year’s figure of 34 through the same day. Those numbers of course are relatively (and thankfully) small, and raise a corollary sample-size question. And of course it remains to be seen if the May-August surge in homicides last year will be duplicated across the same interval in 2017.

Of course, once the Slicer’s Primary Type field selection is in place, any and all of its items remain available for the analyssis; and by clicking some of the other Primary Types the January/February inter-year comparisons don’t trend uniformly. For example, tick Narcotics in the Slicer (and I’ve turned off the pivot table subtotals) and I get:

chi4

Here the 2017 January/February totals fall far beneath the 2016 aggregates for those two months – by about 40%, and I don’t know why. The temptation is to ascribe the retraction to some rethink of the reporting protocol, but I doubt that’s the case, though my surmise is easily researched. Click Weapons Violation, on the other hand and

Here the 2017 figures far outpace those of the preceding year, even again as the homicide totals for the January/February intervals stand as near-equivalent. Perhaps the publicity about weapon-inflicted murders in Chicago has spurred the city’s police to identify weapon wielders more concertedly, with that aggressive enforcement pre-empting still more homicides. But that is speculation.

It also occurred to me that, because the larger Chicago crime data set reaches back to 2001, we could download the homicide data for all the available years, and analyze these in a data set dedicated to that lethal offense. I thus returned to the Chicago set and filtered:

chi6

The records of 8,334 homicides, again dating from 2001, populate the rows. A simple first pivot table could confirm the yearly totals:

Rows: Date (Grouped by year; remember that the 2016 version of Excel will perform that grouping automatically)

Values:  Date (Count)

I get:

chi7

The precipitous ascent of the city’s murder rate, up 85% from 2013 to 2016, is confirmed (of course some control for population increase need be factored). But some deeper looks at the data also avail. (Note that columns E through G contain uniformly identical data down their rows, and as such could be deleted. Note as well that Chicago’s population may have actually declined slightly over the reported period above.) For example – have the distributions of the crime across the 24 hours of the day varied over the 2001-2016 span? By itself, that question presents what is normally a straightforward task for a pivot table, but in this case we need to call for a workaround. That’s because if we want to simultaneously break out the data by year and hour of day we’d have to derive those data from the same field –  i.e., Date in two different various Grouping modes – and install these in the Row and Column areas; but you can’t assign the one and the same field both to Row and Column.

And because you can’t, I’d claim the next available column, call it Hour, and enter in row 2:

=HOUR(C2)

And copy down the column. Now we have a second, independent field that reports a time reading, enabling us to proceed:

Rows: Date (by Years)

Columns: Hour

Values: Hour (Count, % of Row Total).

Filtering out the incomplete 2017 data I get:

chi9

The table is dense but readable and worth reading, at least selectively. Keep in mind that the percentages read horizontally across the years, returning the proportions of homicides for any year that were perpetrated by hour (for example – a percentage beneath the number 7 records the percentage of all the year’s homicides committed between 7:00 and 7:59 am ). Some numerous and notable variations are there to be considered, e.g., homicides during 2004 accounted for 3.96% of the year’s total during the 7:00 am time band, but in the following year the figure for that hour fell to .44%. In absolute terms the numbers were 18 and 2. The percentages at 21:00 pm for 2006 and 2012 come to 8.81% and 3.18% respectively; the actual totals stood at 42 and 16.

Are these fluctuations predictably “chance”-driven, or rather, statistically and sociologically significant?

For that question, I’m not confident about my confidence-level skills.

POTUS is an Anagram for POUTS: the President’s Tweets

27 Feb

Mr. President Trump comes at you now from two Twitter handles – his stalwart @realDonaldTrump identity, which thus appears to have gained security clearance, and the unimpeachably irreproachable POTUS id, or President of the United States, for acronym watchers worldwide.

And that very plurality – and the President knows something about pluralities, excepting perhaps the one by which he lost the popular vote – begs an obvious question for Trumpologists everywhere: namely, when does the incumbent decide to tweet from this account or that one?

It sounds downright sociological if you ask me, and even if you don’t.  The presidency is what those sociologists call a master status, with its relentless gravitas seeming to bear down upon its owner just about all the time, whether he wants it to or not; and as such, can the chief executive be said, or be allowed, to check his status at the door along with his shoes when he  bowls a few frames down at Trump Lanes (I’m remaining mum about those gutter balls), or spills his popcorn again even as he thrills yet one more time to that Chuck Norris epic?

I don’t know those answers, but we can go some ways toward resolving my Twitter-authorial question at least, courtesy of the go-to twdocs site, which granted me nine bucks worth of recent presidential downloads (these as of February 25), both from the POTUS account and the last 3050 epistolary gleanings (excluding retweets) from the @realDonaldtrump alter ego. (I’m perpetrating the legal fiction that you’ve downloaded these data as well, as I’m not sure what liberties I can take with my paid-for copies.)

What’s most noteworthy about the POTUS download is the confining of its output to tweets post-dating the president’s January 20 inauguration, even though I was prepared to pay for the twdoc max of 3200. It’s clear the messages of the erstwhile president have been retweeted to some vast clandestine archive, or the Smithsonian, or both. (In fact the download reports that the current POTUS account was initiated during the evening on January 19, while Barack Obama was technically still in charge.) As a result the now-fledgling POTUS account divulges a mere 7 tweets bearing the inimitable authorship of the POTUS himself; that count is tipped by the president’s signature DJT capping each self-written tweet, and which I captured and counted in Excel formulaic terms thusly (after I named the text-bearing field in column B Text, if you really are downloading along with me):

=COUNTIF(text,”*”&”DJT”&”*”)

(We’ve seen this formula before.)

Yet as of February 25 Mr. Trump has fired off 210 tweets from his @realDonaldTrump id since his instatement, or about six a day; it is clear, then, that master statuses notwithstanding, that the username signifies the….real Donald Trump.

So what is the president wanting to tell us these days? After having coined the range name incumbent for those 211 tweet-rows, I put this formula to the data:

=COUNTIF(incumbent,”*”&”fake”&”*”)

Thus putting in a search order for one of Mr. Trump’s current adjectives of choice. 22 of the tweets, or 10.48% of all his presidential messages, return the term, and that result appears to be unassailably real, along with Mr. Trump’s fetching penchant for emblazoning the label FAKE NEWS (appearing 19 times in the range, and 28 times among all 3050 tweets) in all-caps. Indeed, Mr. Trump’s peculiar alacrity of expression is affirmed by his recurring fondness for the exclamation mark; an extraordinary 125, or 59.8% of all his post -January 20 tweets, sport the punctuation. But in view of the larger fact that 61.5% of all his 3050 tweets are so embellished, we can’t be too surprised by his enthusiasms, though I am not sure what his tweet of February 4, reprinted here in full: MAKE AMERICA GREAT AGAIN! Means to exclaim.  I think we’ve heard it before.

For some other search-term frequency counts:

pres1

One could allow oneself to be struck by the paucity of references to “Islamic” or “Putin”, and the relatively prominent and decorous “thank you” and “Congratulations”. No one said Mr. Trump isn’t a nice guy. He does a have thing about “media”, though.

I was additionally interested in how Mr. Trump’s tweets spread themselves across time. That is a question we had explored in an earlier post, but there with a qualification: because twdocs’ Created At field details tweet times per French standard time, and because Trump could have been here or there in mid-campaign calibrating his time of transmission, as opposed to the time recorded by twdocs’ French server, stood as something of challenge. Trump the office seeker could have, after all, been in New York or Los Angeles, or somewhere in between. But now that he has entered the office we may be safe in assuming that the great preponderance – if not all – of his presidential tweets to date have been east coast-timed. And if so, then a constant six-hour decrement – the time difference between France and Washington – could be applied to each of the Created At times reported in column A. And again, that workaround can be realized by opening a new column to the immediate right of A, titling it EastCoast or something kindred and entering, in what is now B7 (the first row of data):

=A7-.25

The decimal of course represents one-quarter of an elapsed day, or six hours. Thus the date/time in A7 – 2/25/2017 23:02 – retracts to 2/25/2017 17:02 in B7, once the -.25 is applied. Copy down the B column and then try out this pivot table:

Rows: EastCoast (Grouped by Hours only)

Values: EastCoast (Count)

EastCoast (again, by % of Column total)

I get:

pres2

We see that the President is an early-morning tweeter, squeezing 40% of his output into the 6 to 9AM stretch (note that the 8AM grouping registers tweets through 8:59), and not a chirp to be heard between 1 and 5AM.

But doesn’t that measure of silence, at least, qualify as good news?

NHS Patient Waiting Times, Part 3: Weighting the Formulas

12 Feb

Welcome back to the next installment of spreadsheet cubism, in which the same data task is imaged from multiple formulaic vantages.

We’ve already submitted the NHS patient waiting data to two such looks, and here comes the third, an arrestingly different one; arresting, both because of its startling simplicity, and because its workings have been hidden in plain sight from Excel users for some time.

Again, our interest is in learning the number of patients waiting for an identified number of weeks for treatment in one of 19 different medical specialties (and we’re continuing to work with the IncompProv tab). Our vantage begins to come into view when we select the F3:BG22 range that rectangulates (it’s a word; I checked) the week-waiting numbers, bordered by medical specialty.

Then perform a pair of identical finds and replaces, first on F4:F22 and then on G3:BG3 – or the ranges that carry medical speciality and week-wait headers respectively:

waiting1

That is, replace every instance of a space among the earmarked cells with an underscore – yes, very much a to-be-explained step.

Re-select F3:B22 and turn next to an old Excel capability to which I’ve probably given rather short shrift, but have grown to appreciate of late – the Create Names from Selection feature, brought to you via the Defined Names button group in the Formula tab:

waiting2

The default decisions above ascribe range names to every row and column in the selected range, the names coterminous with the labels in the range’s top row and left column – but you’ve probably figured that one out.

Our medical speciality dropdown-list menu remains in place in E1, and we’ll proceed to slot another dropdown in D1, this one comprising the week-wait labels in G3:BG3 (in fact the Treatment Functions Description in F3 likewise names its range, i.e. the medical speciality rubrics in F4:F22, but that name isn’t contributory to the process here). Then enter yet another dropdown in D2, referencing precisely the same range assigned to the menu in D1; and that apparent redundancy needs to be understood, of course (in fact you can simply copy D1 to D2; the menu will be duplicated).

And now, by the way – and this aside is far from incidental – we’ve learned why we needed to substitute the underscore for all those spaces a few paragraphs ago. Excel named ranges must comprise labels of contiguous text, and so when the spreadsheet meets up with a multi-worded header. it insists on exchanging a word-linking underscore for every space when it forges the range names. We thus had no choice but to do the same with the labels in F4:F22 and G3:B3, in order to anticipate and Excel’s range-naming mandate.

Now back to our budding formula. Say we want to learn the number of patients who have waited up to six weeks for treatment in oral surgery. Select that speciality in E1, and select Gt_00_To_01_Weeks_SUM in D1:

waiting3

Tick Gt_05_To_06_Weeks_SUM in E2, because we’re counting patient numbers through week 6 for the oral surgery specialty.

Then, in a free cell, enter:

=SUM(INDIRECT(E1) (INDIRECT(D1):INDIRECT(D2)))

Which evaluates to 68,045, the number of oral surgery patients waiting up to six weeks for treatment.

That expression doubtless merits a to-be-explained as well. First, that is indeed a space insinuating itself between the first and second INDIRECT, and not a typo, begging in turn a rather pressing if rhetorical question: where does one find a space pushing its way inside an Excel formula?

You find it here. The space – and again, its functionality isn’t new to Excel – qualifies as an actual mathematical operator, of an operational piece with the traditional go-tos  +, -, /, *. The space performs an act of identification: that is, in its base mode pinpoints a value standing at the intersection of two ranges. (And thanks to Jordan Goldmeier’s and Purnachandra Duggirala’s Dashboards for Excel, which promotes and explains the space operator approach.)

By way of a more straightforward introductory example, consider this assortment of test scores lined by student names and subjects, say in A1:G11:

waiting4

(Note the collapsed space in the polisci entry, in anticipation of the Create from Selection’s name-underscoring practice.) Dubbing ranges again via the Create from Selection protocol, this unnervingly spare formula:

=Maureen[space]Art

returns 72, the value that stands at the confluence of ranges Maureen and Art.

The solution starkly pares the standard INDEX/MATCH solution to what is in effect this lookup task; indeed, so lean is the space-operator prescription that one is bidden to ask why it doesn’t predominate among users and commentators. I’m asking, but I don’t have the answer.

Now of course our denser formula departs from the above demo expression. For one thing it packs several instances of INDIRECT into the mix (see our discussion of that function here), because our three contributory dropdown-menu selections have returned merely textual references to the ranges with which we’re working, and these thus need to be synergized into actual, working range citations.

And the

(INDIRECT(D1):INDIRECT(D2))

half of the formula empowers it to find all the values dotting the intersections of the oral surgery specialty and all the columned ranges between and including the two we’ve actually identified in the formula; the space operator can do that, too (and note the placement of the parentheses, by the way).

And because the formula has sighted the multiple data points crossing their multiple intersections, the SUM function must gather these into a single total – our answer. Note that the standard INDEX/MATCH lookup can’t do carry out this additive step – at least I don’t think it can.

It’s probably worth your while, then, to learn more about the streamlining efficacy of the space operator. It’s been worth my while – and I’m lazier than you.

NHS Patient Waiting Times, Part 2: Weighting the Formulas

30 Jan

A certain type of spreadsheet bliss attaches to relative ignorance. Know just one formulaic way around a task, and your decision rule requires no deciding: write the formula.

But know at least a couple of ways to get where you want to go, and you’ll need to break out the map and plot your best-course scenario, or at least try to. Last week’s post described one means, driven in part by a teaming of the OFFSET and CELL functions, for totalling the number of patients having to wait up to a specified number of weeks for treatment in an identified medical speciality, those data emanating from the National Health Service spreadsheet upon which we drew in the post. But alternative means toward that end are available; and why you’d make use of this as opposed to that one stands as a good question, the answer to which has a lot to do with the confidence you can marshal toward the approach. Find one formula among the options the easiest to write and you’ll likely be magnetized in that direction – even if in some absolute, textbook-ish sense, some other formula comes best in show for elegance.

In any case we can, in the interests of informed choice-making, review two other formula possibilities, braced by the corollary concession that still others may be camouflaged in the brush. Remember we’re interested in learning the number of patients in a medical speciality who waited a maximum, stated number of weeks for treatment (we’re continuing to address the workbook introduced last week). We can begin to put our second option into play by retaining the dropdown menu of medical specialities in E1 (founded on cells F4:F22) we forged last week (look here for a precis of dropdown menu construction, if you’re new to the idea. You don’t really need to name the range, though, in spite of the linked discussion’s advisory – at least not in our case, as its contents won’t be augmented). Then enter a week-waited number in D2, say 12, and for illustration’s sake select ENT from the speciality dropdown in E1:

wait2

Thus we’ve declared in interest in discovering the number of individuals who had to wait up to 12 before receiving ENT treatment. Then, enter this expression, say in H1:

=SUMPRODUCT((F4:F22=E1)*(COLUMN(G3:BG3)<=D2+6),G4:BG22)

This formula likewise calls for an exposition, needless to say. SUMPRODUCT, Excel’s quintessential off-the-shelf array formula, is perhaps the deepest of the application’s functions, its diffident, user-familiar tip (i.e., it multiplies pairs of values and proceeds to add them all) transmitting the weakest of signals about the iceberg immured beneath.

Here, SUMPRODUCT combs F4:F22 for the medical speciality – ENT – we ticked in E1.  And you’ll observe that the search and find for ENT is pressed without any syntactical resort to the standard, conditional IF; that word is nowhere to be found in the formula. When it finds ENT – in F7 – the formula moves to examine row 7 for the values running across its contiguous, number-bearing cells, in columns G through BG (note too that the F4:F22=E1 phrase is couched in parentheses).

And that sweep through the columns takes us to the *(COLUMN(G3:B3)<=D2+6) piece of the formula. The star/asterisk reminds us that SUMPRODUCT continues to do what it’s been programmed to do –  assign a value of 1 to its successful sighting of the requested medical speciality ENT (a sighting that first imputes the name TRUE to the finding in F7 – and in the Boolean language of array formulas, TRUE is next quantified into a 1). In turn, the other non-complying entries in F4:F22 are deemed FALSE, and incur a 0 as a result. COLUMN identifies the absolute column number of any cell reference; thus =COLUMN(X34) returns 24, for instance.

Befitting its multi-calculation, array formula character, COLUMN(G3:BG3) flags the column number of each of the G3:BG3 entries, asking if any of these equal or fall beneath the value 18, that number a resultant of the 12 we entered in D2 – plus 6, a necessary additive that squares our formula with the fact that the first column we’re inspecting – G – has a column value of 7. Adding the 6, then, to the 12 in D2 – the week wait figure – transports us to the 18th column of the worksheet, R. And R contains the Gt 11 to 12 Weeks data that marks the outer bound of our search. And any column that satisfies our criterion – any week equal to or less than 12 – likewise receives a TRUE/1 evaluation.

Remember again that, convolutions notwithstanding, SUMPRODUCT is about multiplication – and here the 1 assigned F7 is multiplied (remember the asterisk) by all columns in receipt of a 1. All other cells – that is, all the other medical specialties in the F column, and all the weeks in excess of 12 – receive a zero, and their multiplications yield zero and drop out of the formula. The remaining 1’s, so to speak, are multiplied by the ENT values for week 12 and before in cells G4:BG22, and ultimately added – because that’s what SUMPRODUCT does.

Again, we’re counting the number of ENT patients needing to wait up to 12 weeks for treatment, and in this case I get 190,480. Type a different week number in D2 and/or select a different speciality from the dropdown menu in E1, and the sum should respond accordingly. And because SUMPRODUCT is a homegrown Excel function, it stores itself into its cell via a simple strike of the Enter key – and not the storied Ctrl-Shift-Enter in which customized array formulas are obliged.

Hope that’s halfway clear – though halfway probably won’t help you write the formula when and if you need to write something like it some other time. This application of SUMPRODUCT is a good deal more thought-provocative that its simpler implementations, but if it makes you feel any better I had to think about it, too.

The point again is that SUMPRODUCT has delivered us down an alternate formulaic route to our answer, and whether it’s to be preferred to the OFFSET-mobilized variant we explored last post remains a good question.

But there are still other possibilities.