Archive | Uncategorized RSS feed for this section

Art and Gender at MoMA, Part 2: Drawing Some Conclusions

14 Dec

If there’s a Part 1, can a Part 2 be far behind? And while I’m answering the question only because you’ve asked it, the answer is yes.

There was a Part 1 to this distended duology of looks at gender at the Museum of Modern Art compiled by Anna Jacobson, the party of the first part having been issued a president or so ago. Your patience is appreciated.

Part 1 counted the number of MoMA installations plied by women artists across the 61 years between 1929 and 1989, a sum greased by a couple of Excel’s new and cool dynamic array functions. OK; they’re not as new now as they were then, but that doesn’t mean they’ve overstepped their sell-by. Hardly; and in fact, curmudgeon that I am, I want to use them again. And the file, again:


How about, for example, a proportioning of gender by exhibition year – that is, an envisioning of exhibition numbers for each year along the male/female axis? Question 2: does that sound like a job for a pivot table? It sure does. After all, I want to see something like a column of years teamed with a perpendicular row of gender identifiers, and some numbers shipped into all those year-gender cell intersections – and that prospect has pivot table written all over it. But I want to apply the dynamic array strategy to the task for two reasons: one, to demonstrate exactly how it works, and two, to hint at a larger, looming intent, one possibly strategized by Microsoft: to position dynamic array functions as a formula-driven alternative to pivot tables, at least in part.

Let us begin, then, by minting a new sheet and raising that column of years, a step which in turn raises a question: because the date fields (e.g. ExhibitionBeginDate) contain precisely that – full-bodied dates – our formula needs to ferret out just the years from those data, and pour these into the column. Given that necessity, this formula in, say, C5, seems to work:


We’ve again mustered the dynamic array UNIQUE function (we called upon it in Part 1, if memory serves me correctly), which here braces the nested YEAR function that shakes out the years from the exhibition start dates, i.e., in excerpt:
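If it helps to see the logic outside Excel, here’s a stdlib-Python sketch of what the UNIQUE/YEAR pairing accomplishes – the dates below are invented stand-ins, not the MoMA records:

```python
from datetime import date

# Invented stand-ins for ExhibitionBeginDate entries
begin_dates = [date(1929, 11, 7), date(1929, 12, 13),
               date(1930, 1, 10), date(1930, 3, 2)]

# YEAR() shakes a year out of each date; UNIQUE() then keeps
# one first-seen instance of each
years = list(dict.fromkeys(d.year for d in begin_dates))
print(years)  # [1929, 1930]
```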

Now for the gender field and its peculiar requirements. We’ll return to the UNIQUE function in view of our interest in isolating a one-time instance of each discrete entry in the field. Move to D4 – one row up and one column to the right of C5, preparatory to roofing our matrix with its horizontal edge. Thus =UNIQUE(Gender) delivers:

I didn’t say it would be pretty. We haven’t much use for the 8,700 empty Gender cells that have collected themselves up there with that 0; nor will we be able to press ahead with the solitary entry for the Brazilian artist Roberto Marcelo, whose gender has somehow been reported as NULL. We want to offload these ciphers from our matrix-in-progress, and so we’ll submit the data to this close shave:


We’ve here recalled the dynamic array’s mighty FILTER function, introduced some time during last spring training, or thereabouts. In our case, FILTER is asking after records in the Gender field that exhibit either Male or (expressed by the plus sign) Female in its range, with the UNIQUE remaining in place.
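Rendered in Python against some invented Gender entries (not the worksheet’s), the OR-by-plus-sign trick looks something like this:

```python
# Invented Gender entries, blanks and a NULL included
gender = ["Male", "Female", "", "NULL", "Male", "Female", ""]

# (Gender="Male")+(Gender="Female"): the + sums two booleans,
# so any nonzero result means "keep the record"
kept = [g for g in gender if (g == "Male") + (g == "Female")]

# UNIQUE then pares the survivors to one instance apiece
unique = list(dict.fromkeys(kept))
print(unique)  # ['Male', 'Female']
```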

But if you’ve been following along you’ll understand that we’re not there yet, because our formula continues to return its output in default vertical orientation; and we want Male and Female to swing into horizontal place. We can make that happen thusly:


TRANSPOSE is an old Excel array function, greatly streamlined by the dynamic array regime. Befitting our task, TRANSPOSE flips a range 90 degrees either way – from vertical to horizontal or vice versa (and it works with multiple rows or columns, too). Next, clamp the dynamic array SORT around the expression – because, neat freaks that we are, we want Female to precede Male in its alphabetic scheme of things. Just make sure to get all those parentheses right:


(Note that SORT follows TRANSPOSE in the above syntactical flow. Try =SORT(TRANSPOSE( instead and Male will continue to precede Female. We see that TRANSPOSE does its work after all the other chores have been attended to.)
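A small Python analogue – the two-entry column is hypothetical – makes the order-of-operations point concrete:

```python
# A one-column dynamic array, as a list of single-item rows
column = [["Male"], ["Female"]]

def sort_rows(arr):
    # SORT orders rows by their first column
    return sorted(arr, key=lambda row: row[0])

def transpose(arr):
    # TRANSPOSE flips rows and columns
    return [list(t) for t in zip(*arr)]

# TRANSPOSE(SORT(...)): sort first, then flip - Female leads
print(transpose(sort_rows(column)))  # [['Female', 'Male']]

# SORT(TRANSPOSE(...)): flip first; the single resulting row
# has nothing to sort against, so Male keeps its place
print(sort_rows(transpose(column)))  # [['Male', 'Female']]
```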

Now all we need to do is fill all those blanks with the counts of female and male exhibitors by the years, and a count sounds like a job for COUNTIF – or COUNTIFS in this case, but in dynamic array style. That is, I want to develop a count of the number of female and male artists exhibiting at MOMA, and by year – with precisely one formula.

To start, drop into cell D5, the coordinate that sits to the immediate right and directly beneath the formulas that gave rise to the years and gender identities:

What we want, of course, is for COUNTIFS to cobble every permutation of gender count by year, naturally invoking year and gender criteria along with a third parameter – ExhibitionRole, which comprises six different identifiers (in addition to the pertinacious 0), only one of which is Artist.

First, while our matrix’s year column was able to scrape its years from the exhibition dates via =UNIQUE(YEAR(ExhibitionBeginDate)), it seems that COUNTIFS won’t honor that sort of second-order criterion. Rather, we’re bidden instead to nail a derived field of years to the next-available AB column in the source data sheet. Call it Years, and enter in AB2 the singular


and nothing more. Again, the energetic dynamic array formula drives itself down AB, grabbing the year from each and every Exhibition Date.  Then post in D5 back in our matrix


We’d entered something like this in the previous post, but not exactly like this. What’s different here, among other things, is the surfacing of those pound signs (i.e. hashmarks if you live in the UK), important indicators that require explaining. The sign attaches to a cell reference, signaling that the cell houses a dynamic array formula. Since, for example, our pillar of years springs from the dynamic formula we installed in C5, C5# identifies it as such, and serves here as a criterion in our COUNTIFS statement; the moma!AB2# reference likewise points to the dynamic array formula serving as a criterion range.

In any case, that COUNTIFS should make this happen (in excerpt):

Indeed – note what’s happened, and why. We’ve fired a grand total of three formulas into the analytical fray and brought out 122 cells’ worth of results – two each for the 61 years separating 1929 from 1989; but had we been charged with tabulating 500 years’ worth of data we’d still be up to the task with just those three formulas. That’s what dynamic array formulas do. Note, for example, that the COUNTIFS picks up the numbers for both genders in both columns, thus casting its net horizontally and vertically.
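Sketched in Python against a handful of invented records, that single COUNTIFS-style pass amounts to this:

```python
# Invented source rows: (year, gender, role) - not the MoMA data
rows = [
    (1929, "Female", "Artist"), (1929, "Male", "Artist"),
    (1929, "Male", "Artist"),   (1930, "Female", "Artist"),
    (1930, "Male", "Curator"),  # non-Artist roles don't count
]

years = sorted({y for y, g, r in rows})
genders = ["Female", "Male"]

# Every year-gender intersection gets its COUNTIFS-style tally,
# the Artist criterion applied throughout
matrix = {y: {g: sum(1 for yy, gg, rr in rows
                     if yy == y and gg == g and rr == "Artist")
              for g in genders}
          for y in years}
print(matrix[1929]["Male"], matrix[1930]["Female"])  # 2 1
```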

And if you want to learn the yearly proportion of female exhibitors, one formula will do here, too. Try, in F5:


The entire range of women contributors furnishes the numerator, draped atop a denominator of all the female and male artists. Get that done and you’ll find an appreciable, if fitful, uptick in women’s installations, though the 10.44% figure for 1989 is in actuality the smallest since 1980.
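The proportion arithmetic, with made-up counts standing in for the real tallies, reduces to a one-liner:

```python
# Invented female/male artist counts for two years
counts = {1988: (9, 80), 1989: (7, 60)}

# The women's column supplies the numerator, all artists the denominator
share = {year: f / (f + m) for year, (f, m) in counts.items()}
print(round(share[1989] * 100, 2))  # 10.45
```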

And if it at least looks like a pivot table, you’re right. Can Microsoft sue itself for plagiarism?

Better than Netflix: Excel 365 Dynamic Array Function Training – Online

26 Oct

The choice is clear: Fandom or FILTER? Spotify or SPILL? YouTube or UNIQUE?

It’s a no-brainer as far as I’m concerned; I’ve never even heard of Fandom, and I’m offering something you can actually claim for your CV without seeing your nose grow – online, real-time instruction in Excel 365’s dynamic array formulas, Google Meet-sited (it’s free and unlimited until March), with duration(s) and dates negotiable.

You’ll never get this offer from Rihanna, and Kim is on the campaign trail, so what are you waiting for?

As for the fine print, writ large, it reads like this: one-on-one encounters ring up at $30 per hour, with multi-learner sessions checking in at $20 per. Session times, numbers, and lengths can be hammered out during the negotiation phase (Sundays are possible, too, but not Saturdays. Remember too, that I’m in London); I’m just fine, for example, with a mano a mano of an hour and 47 minutes, as long as you’re fine with an invoice of $53.50.

You should be pretty comfortable with formula writing and MUST have the Office 365 iteration of Excel in tow, because the new functions don’t appear anywhere else. Time and inclination permitting, the intention is to say something about each of these worth-knowing, game-changing functions:







And perhaps XLOOKUP and XMATCH. I have lots of practice/exercise files, too, ideal for stoking the fun to fever pitch.

If you’re interested, just get back to me at

Just one request: if you insist on wearing those jeans inlaid with the faux rhinestones, please – turn your camera off. Or at least remain seated.

Corona Virus Spreadsheet 2 – for Excel 365

16 Jun

Here’s a follow-on to the previously-posted Corona virus workbook (again requiring the 365 iteration), the data again gleaned from a daily download from the European Centre for Disease Prevention and Control:


This time, a measure of user interactivity has been shaken into the formulas: the two Rolling Intervals sheets ask the user to set the number of time intervals across which the Corona-case data are to be reported, along with the number of days per interval, e.g. the first sheet compiling region-wide data:


The above selections, then, could be recast, e.g.:


Thus user decisions can widen or contract the periodicities, and redistribute the interval counts accordingly (which all date back from the endpoint marked by the last day for which data have been reported). And the same options avail in the Rolling intervals country sheet:


Remember the workbook requires Excel 365, and you may need to click Data> Refresh All if the download doesn’t execute.

Corona Virus Spreadsheet – Requires Excel 365

22 Apr

Corona worldwide and country spreadsheet

The workbook automatically downloads current data from the European Centre for Disease Prevention and Control and presents

  • Rolling, last-10-day case totals worldwide (the Worldwide totals sheet)
  • Rolling, last-10-day totals by continent (the By continent sheet)
  • Last-10-day data by country with accompanying sparklines (Last 10 days by country)
  • A fourth sheet that enables the user to enter any country name in an auto-complete text box in the sheet’s upper left corner and any two dates, to produce case data for the country across that span (Selected time span and country).

Art and Gender at MoMA, Part 1: Drawing Some Conclusions

8 Mar

The data are incomplete, but if you want to learn something about gender skew in the art world while cutting your teeth on Excel’s new dynamic array functions at the same time, then book a viewing here:

(Click the green “Clone or download” button and call up MoMADirectorsDepartmentHeads.csv from the resulting zip file).

You’ll be brought face-to-monitor to a catalog of installations curated by New York’s Museum of Modern Art, and chronologized from the museum’s inception in 1929 through 1989. The data have been expounded here by Anna Jacobson, of the curiously-named UC Berkeley School of Information (one would like to think all schools disseminate information, but I do go on). Jacobson details the markedly pinched proportion of women artists who’ve threaded their way among the exhibitors, via a series of tabular and charted corroborations.

And while I’m not sure what application helped guide Jacobson through her trek across the data I can say with uncharacteristic certainty that it wasn’t the dynamically-arrayed Excel; that’s because, September 2018 curtain lift notwithstanding, the general release of the upgrade was vamped-till-ready until this January – and then only for Office 365 users.

But it’s here, at last, and the MoMA data cultivates a fertile staging ground for a first look at the dynamic-array complement, in part because the records bear numerous redundancies among the entries, e.g. repeated Exhibition titles; and compacting redundant data into a uniques-only set has stuck in the craw of Excel users like an Excalibur jammed into its stone.

Of course ways through the problem have already been charted, which we earlier encountered here, for example. We can now, for instance, apply Pivot Tables’ Distinct Count to the challenge; but the dynamic array alternative has it beat.

By way of demonstration, I first streamed the dataset through the ancient Create from Selection command, thus field-naming each column per its header. I next inaugurated a new worksheet and entered somewhere, say in I6:


I got, in excerpt:


We’re viewing the game-changer. Here, and for the first time, Excel enumerates the elements of an array that would have heretofore been trapped inside an array formula:


We now see, in their respective cells, the singular disclosure of each and every exhibition – all propagated by a solitary formula, the one in I6.

And we could proceed:



Cool, simple, and trenchant. And for a simple unique count, try:


I’ve thus learned that 1,664 different exhibitions held down a space at MoMA – this from a dataset comprising 34,559 records.

And now about those gender ratios: the worksheet’s Gender field lends itself to a number of informing looks, e.g., a simple, traditional COUNTIF-driven look at the women/men divide:



The respective aggregates: 2,527 and 23,269, a disparity that must be read against the nearly 25% of the Gender cells that are unfilled (note that Jacobson’s dataset discloses a blanks count of around 15%). In any case, we learn that more than 9.2 times as many men are represented among the exhibitors as women – maybe. I’m equivocating for two reasons: first, the ExhibitionRole field names seven statuses by which an individual could be described, only one of which is Artist (in addition to the Blanks residual category); and while the great majority of entries are so identified, an exclusion of the non-Artist entries need be imposed, via the following refinement:



(Jacobson points out that Constituent Type need be lowered into the formulas, as a handful of Institutions entries in the field are associated with the Male item in Gender.)

Now I total 2,301 and 22,507, returning an even higher male/female disparity of 9.8.

Telling and instructive to be sure, but alternative takes on the data are there for the taking. We haven’t, for example, distilled the data into unique compendia of men and women artists, in which each exhibitor is accounted for but once; and once realized, those outcomes could likewise be submitted for comparison.

If that’s what you want, this formula would appear to deliver:


That dilated expression needs to be explained, of course. Here we’ve rolled out the powerful dynamic-array FILTER function, enwrapped both by UNIQUE and the old favorite COUNTA.

FILTER asks two essential questions, embodied in its two required arguments: what field (fields) is to be filtered, and by what criteria? Thus


will succeed in returning – that is, unroll down a column – all the names in the DisplayName field associated with the “Female” entry in the Gender field. But that’s an example. Our expression here will hinge its yield on three criteria – the appearance of Female in Gender along with the incidence of the Artist entry in ExhibitionRole and Individual in ConstituentType; and to cut a longish story short, the syntax above means to assign the value of 1 to each instance of Female, Artist, and Individual. When all conditions are met in a given record/row the three 1’s are multiplied, and the resulting 1 instructs the formula to display the associated DisplayName. But if even one of the conditions disqualifies itself – say if an entry of Male populates a Gender cell – that expression returns a 0. Multiply 1 by 0, for example – and the asterisk actually does multiply the three values – and the ensuing 0 knocks out that DisplayName from the filter.
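Here’s the multiply-the-booleans mechanic in Python miniature, with invented records in place of the MoMA rows:

```python
# Invented records: (DisplayName, Gender, ExhibitionRole, ConstituentType)
records = [
    ("Georgia O'Keeffe", "Female", "Artist",  "Individual"),
    ("Ansel Adams",      "Male",   "Artist",  "Individual"),
    ("Jane Doe",         "Female", "Curator", "Individual"),
    ("Georgia O'Keeffe", "Female", "Artist",  "Individual"),
]

# Each test yields 1 or 0; the * multiplies them, and only a
# product of 1 lets a DisplayName through the filter
names = [n for n, g, r, c in records
         if (g == "Female") * (r == "Artist") * (c == "Individual")]
print(names)  # ["Georgia O'Keeffe", "Georgia O'Keeffe"]
```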

And all that means that


will develop this list, in excerpt:


But the list is overpopulated. Note the five iterations of Georgia O’Keeffe, for example; they’re there because we’ve filtered for all instances of the women artists who’ve exhibited at MoMA. But we want to see each name once, an intention that calls for this expression:


That works, but we’re not done – because we want to count that total, via:


That lengthy but intelligible construction yields 671, the number of distinct women whose work featured at MoMA between 1929 and 1989. Substitute “Male” in the Gender argument, and the sum identifies 4317 different men who exhibited during that span. The male-female quotient – 6.4 – overwhelmingly favors the former, of course, but at a notable pull-back from the earlier ratios we cited – the ones comparing all installations for all artists.
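And the UNIQUE-then-COUNTA finale, again with stand-in names rather than the real roster:

```python
# Invented filtered output, repeat exhibitors included
names = ["Georgia O'Keeffe", "Lee Krasner", "Georgia O'Keeffe",
         "Helen Frankenthaler", "Lee Krasner"]

# UNIQUE collapses the repeats; COUNTA tallies what remains
distinct = list(dict.fromkeys(names))
print(len(distinct))  # 3
```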

And how do these findings compare with Jacobson’s measures? I’m bidden to ask the question at all because her numbers are different. First, as observed above, Jacobson’s set presents a richer Gender field, in which only 15% of its cells are unfilled. She also went on to literally fill in some of the blanks via internet searches and first-name inferences. She reports, then:


And so the comparison reads:


Jacobson’s reckonings are actually the more “favorable” to women, recalling our discrepant data counts.

In any event all this art talk has gotten me inspired. I’m thinking about crowd funding an exploratory budget for a Museum of Spreadsheets, a place where practitioners of the number-crunching arts can hang their workbooks in a back-lit setting befitting their craft, and with wings devoted to portraits and landscapes, and those just-discovered VisiCalc antiquities.

The first show: 21st Century Format Painters.

Political Protest Data, Part 2: A Few Demonstrations

7 Jan

Our compare-and-contrast tête-à-tête with a sampling of Alyssa Fowers’ protest-demonstration charts is ready to convene, now that we’ve put the sandpaper to the rough edges on the protest source data.

For our first look we can consider this chart, a timelining of protests per week. (A play/pause button whitens the data successively across each week in the timeline. Thus, the March 20, 2017 date below keys itself to the last whitened bar.)


For the pivot table rendition:

Columns: Date, grouped by days in 7-day intervals:

Values: Date (count)

I get, in excerpt:


(A Wrap Text instruction stacks the dates as we see them.) While the table is not visually stirring, it sports the virtue of a heightened precision, and that could matter. We can now identify the modal week in the above chart – March 11-17, 2018, the one in which 1447 demonstrations were conducted; and while in theory all that chronology could likewise have been tacked onto the chart, the resulting textual flurry loosed upon the columns would have run roughshod over the viz. And the pivot table data could of course also be subjected to a round of conditional formats – a visual enhancement, to be sure, but not of the charting variety.

Another Fowers chart breaks out demonstrations by time and theme in stacked waves:


You’ll note that the Y axis is value-free, as it were, leaving us to guess at the demonstration numbers and the axis minimum, which may or may not bottom out at zero. Moreover, the identifying labels pinned to the data points appear selectively, and in addition, I’m not sure what the legend posted in the chart’s far left – “Height indicates number of protests per month” – means to suggest, given the two-dimensionality of the data points. What then do their depths signify?

In any event, how would a pivot table capture the data? We could try this:

Rows: Theme (per the previous post)

Columns: Date (grouped by Months and Years)

Values: Theme (Count, necessarily)

I get:


Not a thing of beauty, to be sure, but again I would submit the actual numbers deepen the larger story. Now we know that the conspicuous blue wave at March, 2018 – the one marking a spate of gun-control advocacies – reports 2031 protests for the month on the theme. And the table in effect identifies every theme’s totals each month, without the selective clarifications issued by the chart. Of course, the themes could likewise be inlaid into a Slicer:


Indeed – Fowers’ “When Americans protested about…”, a chart that asks readers to tick a theme from its own drop-down:


does something comparable, and most legibly. Again, note the occasional fastening of data labels to the columns, along with the variability among the Y-axis maximums by theme, a standard charting caveat. However, here the minimum steadfastly and unambiguously holds to zero across the themes.


Another drop-down menu features in the above chart, asking the user to tick a state, although interpreting what happens next is slightly tricky. Nominate a state, and its demonstration themes are calculated as a percentage of all demonstrations for that state – but ranked at the same time among the other states’ percentages. Got that? We thus see that Florida, represented by the enlarged circles, is heavily sited by protests against guns and the Executive Branch (that sounds like Donald Trump to me). But relative to such activity in other states, its Supreme Court-themed demonstrators are few.

But the ratios call for both an intra- and inter-state understanding. If, for example, gun control demonstrations were the modal protest in Florida – say accounting for 30% of all demonstrations in the state – that share could nevertheless position Florida in the rear of the state gun-control count, if other states were to experience still higher gun-control totals. Thus, it seems to me that a number-driven exposition of the intra/inter relation could add a few watts of illumination here, something like:

Rows: State

Columns: Theme

Values: Theme (Count, Rank Largest to Smallest, % of Row Total)

In excerpt, the table presents itself thusly:


Here we want to learn the extent of each theme’s contribution to the demonstration total of each state. The percentages, then, could be compared to the proportions among other states down the same theme column. We see, for example, that Florida’s percentage of gun-themed and executive-branch demonstrations does exceed that of most, but by no means all, states. But a proper substantive emulation of Fowers’ viz would have us order the state percentages by each theme, and I’m not sure how that can be carried off here.

We can’t, for example, apply the Rank Largest to Smallest tweak to the above percentages, because Largest to Smallest only ranks the raw numeric values from which the percentages derive; and because Florida’s actual population is among America’s largest its Guns-demonstration rank may skew toward the far side of the curve. In fact, only California has accommodated more actual Guns demonstrations.

Moreover, even if the percentages we see were amenable to sorting, one can’t arrange to sort every theme highest to lowest simultaneously in the pivot table; that project would require a discrete dataset set aside for each theme – and perhaps that is how Fowers poised her chart.

But Fowers’ charts aside, another, assuredly character-based question put itself before me. Wikipedia recalls a Washington Post finding that “since Trump was inaugurated there has been a protest every day somewhere in the United States”, presumably shouting out at our fearless leader. That Post conclusion tracks back to January, 2018, but the Count Love protest data (through January 6) should facilitate a follow-up corroboration now, two years later.

First, we can calculate the number of days that the data span. Since the protests are sorted chronologically and my data drop down to row 22318, this simple formula:


yields 1087 (the +1 adds a day that would otherwise be ignored; 4-1=3, but you want all four days to be counted). We then need to count the dates on which at least one demonstration was conducted, and that sounds like a job for the pivot table’s nifty Distinct Count operation (available from Excel 2013 on).
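The endpoint arithmetic is easy to confirm in Python, where subtracting dates likewise counts the gaps rather than the days:

```python
from datetime import date

# 4 - 1 = 3, but Jan 1 through Jan 4 spans four days - hence the +1
assert (date(2020, 1, 4) - date(2020, 1, 1)).days + 1 == 4

first, last = date(2017, 1, 15), date(2020, 1, 6)
span = (last - first).days + 1
print(span)  # 1087
```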

Once you’ve saved the download as an Excel workbook (upgrading its native CSV character) and ticked the Add this Data to the Data Model box on the Create PivotTable window, you need only try

Values: Date (Summarize Values As > Show More Options… > Distinct Count)

I realize a distinct count of 1076. That is, eleven days in the January 15, 2017 – January 6, 2020 interval went demonstration-free.

And that’s only right, after all; if the President can go golfing all the time, the least he could do is give his loyal opposition a few days off.

Political Protest Data, Part 1: A Few Demonstrations

23 Dec

If a spreadsheet is here, can a chart be far behind?

Apart from giving new meaning to the phrase “poetic license”, it’s a question nevertheless worth asking, one which piqued me anew when I came across Alyssa Fowers’ Points of Light: Protest in America site, a chart-driven catalogue of the mighty spate of political demonstrations flooding the country across the 2017-19 span, dating more specifically from January 16, 2017, or four days prior to the inauguration of the current president, whose name escapes me. But in addition to the columns, bars, and bubbles she’s deployed, the native data in their own, primordial columns offer themselves for download here (click the “download a full export…” link beneath the Can I use your protest data for…? paragraph head).

More than 22,000 protests are there for the counting, or about 20 a day if you’re averaging, and the data are remarkably current. My download records demonstrations mounted as recently as December 21, including last week’s pro-impeachment rallies speaking their collective piece about the aforementioned chief executive. Fowers has broken out and represented the data by protest theme, state of incidence, month of staging, and the like, and touches off my question in turn: are these data best studied via the chart/data viz medium, relegating alternative spreadsheet scrutiny to the hopelessly character-based?

A fair question, though one that perhaps slopes the argument toward a reductio ad absurdum: given the choice between communicating data by chart or spreadsheet, is the former modality always to be preferred? Absurd indeed, or disconcertingly plausible?

After all, it seems people tend to think about data visually, and the charted portrayal of quantitative information speaks a language of proportion – e.g. this bar protrudes farther than that one, or a plotted line negotiates time in its leaps and troughs. What, then, can a spreadsheet bring to the picture – so to speak? If the answer is nothing, we will have flung the question beyond the vanishing point, and that sounds a little…absurd.

Perhaps I should have addressed so elementary – and unnervingly pertinent – a question in my very first blog post, but it probably hadn’t occurred to me then – an elephant in the room, perhaps, that must’ve been crouching behind my laptop.

So here are about three answers. First, spreadsheets afford the virtue of precision. The value signified by a jutting bar, or column, or flitting line point, or crescent of a pie slice (the latter is particularly disesteemed, even by data-viz wizzes), may not tell you exactly and unambiguously what you want to know. Indeed – the very fact that so many charts garnish their outputs with data labels tells you as much. Reportage is not well served by revelations that Donald Trump received about 300 electoral votes in 2016. You want 304, not a good guess of a column’s height.

Second, a multi-parameter look at the data might – might – profit from the hard-edged clarity of a spreadsheet. For example, a distribution of demonstration themes by state committed to a more-or-less conventional chart must reserve one axis for the demonstration counts and the other to, say, the themes. The state identities, then, are necessarily assigned to the legend, where things can get a touch messy. But a pivot table can align demonstration themes with its row labels and states along the columnar orientation, nestling the counts in the interior values terrain. Dense, but clear.

And of course charts in effect need spreadsheets; they’re hinged upon them. Yet-to-be-aggregated data resist charting; thus if you want to learn about Civil Rights demonstrations in California you need to bring them all together first before they pull across a bar.

In any case, we could contrast the data-capturing virtues of charts and pivot tables by culling screen shots of a few of Fowers’ charts, companioning them with a pivot-tabled take of the same data, and letting you decide.

But before those chart/table tandems set forth the data need to be sharpened, to wit: the Location field bears the state names that figure in a number of the charts – but these are joined in their cells to the cities in which the demonstrations were held. Because the two-character state names (probably their postal codes) appear at each cell’s endpoint, by marking out a new State field in column I and entering in the first data cell


we can isolate the respective state codes – with three exceptions, as it turns out. In 19 instances the formula culled the two-character “am”, delivering not a state abbreviation but rather the last two letters of the US territory Guam, which appears in its cells in full. In fact Guam has been assigned its own two-character postal code, GU; and so by running this Find and Replace:


We could refit Guam with the prevailing nomenclature.
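Here’s a Python approximation of that extraction and its Guam wrinkle – the locations (Hagatna included) are invented samples, and the comma test merely mimics the formula’s reported behavior:

```python
# Invented Location entries; the state code rides at each cell's tail
def state_code(loc):
    # Key on the delimiting comma, then grab the entry's last two
    # characters - comma-less records error out, as #VALUE! did
    if "," not in loc:
        raise ValueError("#VALUE!")
    return loc[-2:]

print(state_code("Miami, FL"))      # FL
print(state_code("Hagatna, Guam"))  # am  <- Guam's tail, not a code
```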

Another two formula outcomes yielded a #VALUE error, because neither sported a delimiting comma and corresponding state reference. One location, Fredon Township, apparently maps somewhere in New Jersey (so saith Google; the entry Fredon, NJ appears in other records), and so I merely added a restorative “, NJ” to the offending cell. The second discrepancy, recalling a demonstration held in Space, denotes the April 12, 2017 anti-Trump weather balloon floated by the Autonomous Space Agency Network, a decidedly non-governmental confederation of hackers, scientists, and artists. My cell rewrite, then: “Space, SP”.

And another field likewise begged our attention before the pivot tables could be commissioned: the Tags parameter in E classifies demonstrations by theme, but fine-tunes each with a sub-theme of sorts set off by a semi-colon, e.g. Civil Rights; For racial justice; Martin Luther King, Jr. Because the Civil Rights rubric has in particular constellated so many of these modifiers, any pivot table invoking the Tags field would teem with granularity. I thus extracted the superordinate demonstration motifs by heading column J Theme and entering this formula in J2 and copying down:


(Note that the Civil Rights example above features two semi-colons, but FIND will pinpoint only the first instance of the entry to be found).
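In Python terms – the tags below are illustrative – the first-semicolon harvest looks like this:

```python
# Sample Tags entries, after the pattern the post describes
tags = [
    "Civil Rights; For racial justice; Martin Luther King, Jr.",
    "Guns; For greater gun control",
    "Environment",
]

def theme(tag):
    # LEFT up to the first semicolon - FIND, likewise, reports only
    # the first instance; tags without one pass through whole
    return tag.split(";", 1)[0]

print([theme(t) for t in tags])
# ['Civil Rights', 'Guns', 'Environment']
```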

Now we can think about those chart/pivot table comparisons. Just don’t ask me to map that anti-Trump balloon. I think in 2-D.



Best U. S. Cities: Letting the Data Decide

11 Nov

Thinking of a change of scenery? No; I’m not talking about that fetching screen saver you just downloaded. I’m talking a real move, of the bricks and mortar kind – apocryphal floor plans, burly guys manhandling your Wedgwood and your pet, change of address forms, having two keys for three locks – in short, big fun.

But if you’re still planning on going ahead, take a look at and a download of Ben Jones’ data.world-sited ranking of what he terms 125 of the Best US Cities, a reworking of the listing assembled by US News and World Report.

It’s all a judgement call, of course, but US News has marshalled a set of parameters, which when judiciously factored, yield their rankings. City number 1, and for the third year in a row: Austin, Texas, the state’s capital and a college town, leading the pack for its “value for the money, strong job market, high quality of life and being a desirable place to live.”

OK – that last plaudit seems a touch tautological, but Austin it is. And if you're wondering, New York City pulls in at 90, Los Angeles at 107, and San Juan, Puerto Rico – the latter situated in a not-quite-a-state – floors the list at 125. But the larger point, of course, is to make some aggregate sense of the ranking criteria dropped across the sheet's columns. I see no master formula into which the criteria contribute their evidence and output the rankings, so we'll have to make do with some homemade reads on the data.

To start with, we could draw up a series of correlations pairing the rankings with the numerically-driven parameters in columns D-Q, e.g., correlate ranking with Median Month Rent. Enough of those might deliver a crude but sharpened sense of city stature.

But you’ll respectfully demur. You’ll submit that the ranking data in column A is ordinal, that is, expressive of a position in a sequence without delineating the relative space between any two positions. Thus, for example, the grade average of a school valedictorian isn’t twice as high as the student holding down the second position, even as their respective class standings are quantified 1 and 2. On the other hand, a median monthly rent comprises interval data, whereby a rent of $2,000 is indeed twice as high as a charge of $1,000.

You’re right, and as such the correlation between city ranking and rents, for example, is assailable, but not, in my view meaningless. In spite of the gainsaying, I’d allow that higher-ranking cities should in the main associate more favorably with the variables the sheet reports. But let’s see.
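If the ordinal objection nags, one check is to run a rank-based (Spearman) correlation alongside the ordinary (Pearson) figure, since the former asks only whether the two orderings agree. A pure-Python sketch with invented rents, not the sheet's actual column:

```python
def pearson(xs, ys):
    """Ordinary product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    # rank 1 = smallest; ties ignored for this sketch
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman is just Pearson computed on the ranks
    return pearson(ranks(xs), ranks(ys))

city_rank = [1, 2, 3, 4, 5]                  # ordinal standings
median_rent = [1500, 1100, 1300, 900, 1000]  # hypothetical interval data
print(round(pearson(city_rank, median_rent), 3))   # -0.788
print(round(spearman(city_rank, median_rent), 3))  # -0.8
```

When the two figures roughly agree, as they do here, the interval-vs-ordinal quibble costs us little.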

We can figure the correlations easily enough by aiming Excel’s CORREL function at the fields. For example, to correlate city ranking and Median Home Price in H:


I get -.191, or a small negative association between ranking and home prices. With higher city rankings (denoted by lower numbers), then, comes a small but not overwhelming uptick in home prices, befitting those cities' desirability. But again, the connection is small. (Note that if one of the paired values in a row features an NA, CORREL simply ignores the row.)
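That NA-skipping behavior is easy to mimic if you ever need to reproduce these figures outside Excel – a minimal Python analogue of CORREL with pairwise deletion (the numbers below are invented, not the sheet's):

```python
def correl(xs, ys):
    """Pearson correlation with CORREL-style handling:
    any pair containing a non-numeric entry is dropped."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if isinstance(x, (int, float)) and isinstance(y, (int, float))]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    return cov / (sx * sy)

ranking = [1, 2, 3, 4]
home_price = [350000, "NA", 310000, 285000]  # the "NA" row is ignored
print(round(correl(ranking, home_price), 3))  # -0.998
```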

We can then apply the correlations to some of the other parameters:

With Metro Population: .426

Average Annual Salary: -.382

Median Age: -.105

Unemployment Rate: .679

Median Monthly Rent: -.211

Percent Single: .515

Violent Crime: .329 (By the way – I’m assuming the crime rates are superimposed atop a per-100,000 denominator, which appears to serve as the standard.)

Property Crime: .251

You'll note the emphatic, not terribly surprising correlation between ranking and unemployment rate (here both dimensions reward a lower score). A city's plenteous job market should naturally magnetize people through its perimeters; and note in turn the trenchant, but less decisive, association between ranking and average salary. The fairly high correlation with Metro Population suggests that a city's appeal heightens somewhat with its relative smallness.

More curious and decidedly less obvious is the impressive correlation of ranking and percent single. Here loftiness of a city’s standing comports with a smaller rate of unmarried residents (but no, I don’t know if live-in partners are identified as single). (Again, don’t get confused; a “smaller” ranking of course denotes the more desirable city.) That finding could stand a round of analysis or two; after all, city quality is stereotypically paired with young, hip, unattached urban types. And by the way – though there’s nothing revelatory here – the correlations between annual salary and median home price, and median monthly rent: .658 and .756.

But of course, there's a trio of fields – Avg High/Low Temps, AVG Annual Rainfall, and Avg Commute Time – that we haven't correlated, and that begs the next question: namely, why not? The answer – and we've encountered this complication before – is that the values staking this collection of averages have been consigned to textual status. Austin's average commute time of 26.8 minutes, expressed precisely in those terms, is hard to quantify, because it isn't a quantity. But couldn't the field have been headed Avg Commute Time in Minutes instead, with authentic values posted down the column?

But the field is what it is, and so in order to restore the commute times to numerical good standing we could select the Avg Commute Time cells and point this Find and Replace at the range:


That’s a space fronting the word “minutes”, a necessary detail that obviates results of the 26.8(space) kind, which would remain textual.

Now with the textual contaminants purged the correlation can proceed, evaluating to a most indifferent .012.

But while we’re at it, this array formula can correlate average commute times and city ranking without removing the minutes term:


The formula extracts the four leftmost characters from each Avg Commute Time cell (all of whose first four characters are numeric), converts them into values, and gets on with the correlation. (But note, on the other hand, that the array option doesn't appear to comport with the AVG Annual Rainfall data in I, because the NAs obstruct the array calculation. The Find and Replace alternative will work, however.)
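For the record, the array formula's logic translates to a couple of lines of Python – a sketch assuming commute entries shaped like "26.8 minutes", whose first four characters are indeed numeric:

```python
commute = ["26.8 minutes", "31.2 minutes", "24.5 minutes"]

# the VALUE(LEFT(cell,4)) step: slice the first four characters, make them numbers
times = [float(s[:4]) for s in commute]
print(times)  # [26.8, 31.2, 24.5]
```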

And by the same token, analytical ease would have been served by allotting a field each to the average high and low temperatures, and without the degree symbol. Here you might have to make room for a couple of blank columns and enter something like



For the average high temperature, and copy down. You need all that because a straight-ahead VALUE(LEFT) will turn up a good many #VALUE! errors, which will stymie the correlation. If you go ahead you'll realize an association between city ranking and high average temperature of .1975 – something, but not much.

And for low temperatures, don’t ask. But if you do, try:


Correlation, if you’re still with me: .2159.
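If you'd rather sidestep the helper columns entirely, a Python aside: the high/low pair can be split and de-degreed in one pass – a hedged sketch, since I'm assuming entries shaped like "82.3° / 62.5°":

```python
def high_low(temps: str):
    """Split a 'high° / low°' entry into two floats,
    shedding the degree symbols and stray whitespace."""
    high, low = temps.split("/")
    return float(high.strip().rstrip("°")), float(low.strip().rstrip("°"))

print(high_low("82.3° / 62.5°"))  # (82.3, 62.5)
```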

Maybe not worth the effort.






Bicycle Race Data: Signs of the Times

5 Sep

The thing about formatting a cell value is that it's never wrong – or at least, rather, never "wrong". Those quotes point to an irony of sorts, of course, meaning to reaffirm the powerless effect of formatting touches upon the data they re-present. That is, enter 23712 in cell A6, retouch it into $23,712.00, 2.37E+04, or Tuesday, December 01, 1964, and you're still left with nothing but 23712. Write =A6*2 somewhere and you get 47424 – no matter how that value is guised onscreen.

But a format that fails to comport with its data setting can seem like Donald Trump delivering a keynote address to a Mensa conference – that is, unthinkably inappropriate. Baseball aficionados don’t want to be told that Babe Ruth hit Saturday, December 14, 1901 home runs, after all, when they’re expecting to see 714 – and not even 714.00.

Thus data appearances matter, of course, even as they don't change anything, as it were. And for a case in point in how they do and don't matter, pedal over to the Velopace site, an address devoted to a showcasing of that company's adroitness at timing sporting events with all due, competition-demanding precision. By way of exemplification Velopace has archived the results of a raft of races it's been called upon to time, including the 2018 LVRC National Championships Race A, a contest bearing the aegis of the UK's League of Veteran Cyclists. Those results come right at you in spreadsheet form here:

Velopace results

Apart from the usual obeisance to column auto-fitting, the data make a few interesting claims on our scrutiny. Consider, for example, the timings for the first five finishers, lined up in the spreadsheet:


Then turn to the same quintet in the source web-site iteration:


First, note that the four Lap readings (the Finish parameter incorporates the times for the fourth lap) are cumulative; that is, Lap 2 joins its time to that of Lap 1, and so on. Note in addition that the Total Race Time field seems to merely reiterate the Finish time result, and as such could be deemed organizationally redundant, and perhaps a touch confusing.

But it's the spreadsheet version's formatting that may force you to pull off the road for that jolt of Red Bull. Here, for starters, the timings have been rounded off to tenths of a second, in contradistinction to the web-versioned drill-down to thousandths – if nothing else, supporting testimony to Velopace's skill at clocking exactitude. Now while that fine touch makes sense, Lap 2's time for race victor Simon Laws in cell I2 reads 10:07.2. A click on that cell bares its formula-bar content of 1:10:07 AM – that is, Laws' aggregated two-lap time, expressed in the web version as 01:10.07.180. We need to ask first of all about the missing hour reference in the spreadsheet time in I2, which appears to you and me as 10-plus minutes. Remain in I2, right-click Format Cells, and you'll be brought here:


That customized format excludes the hour parameter, and so should properly read something like:


Getting there asks you to click the bar immediately beneath the Type: caption and add hh: to the expression:


The hour is thereby returned to view (note the sample above, returning the newly-formatted, actual time value in I2), and a mass application of the Format Painter will transmit the tweak to all the times recorded across the spreadsheet, including the sub-hour entries for lap 1, which will be fronted by two leading zeros. The 0 following the decimal point above instates a code that regulates the number of in-view decimals visiting the cell; thus hh:mm:ss.000 will replicate Laws' 01:10:07.180.
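Underlying all of this: Excel stores a time as a fraction of a 24-hour day, and the format codes merely govern how much of that stored value is displayed. A short Python check of Laws' two-lap time:

```python
seconds = 1 * 3600 + 10 * 60 + 7.180   # Laws' cumulative two-lap time
fraction_of_day = seconds / 86400      # what the cell actually stores: ≈ 0.048694

# the custom format only decides how much of that value is shown
h, rem = divmod(seconds, 3600)
m, s = divmod(rem, 60)
mm_view = f"{int(m):02d}:{s:04.1f}"               # mm:ss.0 – the hour vanishes
hh_view = f"{int(h):02d}:{int(m):02d}:{s:04.1f}"  # hh:mm:ss.0 – hour restored
print(mm_view, hh_view)  # 10:07.2 01:10:07.2
```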

But the first question that needs to be directed at the data is why the above repairs had to be performed at all. Indeed, and by way of curious corroboration, two other race results I downloaded from the Velopace site in which cyclist times pushed past the hour threshold were likewise truncated, but it would be a reach of no small arm's length to surmise that the spreadsheet architects had built the shortcoming into their designs. Could it be, then, that the peculiarly clipped formats facing us owe something to some shard of HTML code that went wrong? I don't know, but after downloading the LVRC file in both CSV and Excel modes (the latter dragging along with it some additional formatting curiosities), I found the hours missing either way.

Now for one more formatting peccadillo, this one far more generic: enter any duration, say 2:23, and the cell in which you’ve posted it will report 2:23 AM, as if you’ve decided to record a time of day, e.g. 23 minutes after 2 in the morning (yes; type 15:36 anywhere and you’ll trigger a PM). I do not know how to eradicate the AM, though Excel is smart enough not to default it into view, consigning it to formula-bar visibility only. Indeed, if you want to see the AM in-cell, you’ll need to tick a custom format in order to make that happen.

But the quirks keep coming. If, for example, you enter 52:14 – that is, a time that bursts through the 24-hour threshold – Excel will faithfully replicate what you’ve typed in its cell (in actuality 52:14:00), but will at the same time deliver

1/2/1900 4:14 AM

to the formula bar. That is, once a time entry exceeds a day’s worth of duration, Excel begins to implement day-of-the-year data as well, commencing with the spreadsheet-standard January 1, 1900 baseline. But as you’ve likely observed, that inception point doesn’t quite start there. After all, once the dates are triggered by postings in excess of 24 hours, one might offer that 52:14 should take us to January 3, 1900 – the first 48 hours therein pacing off January 1 and 2, with the 4:14 remainder tipping us into the morning of the 3rd.

But we see that the expression doesn’t behave that way. It seems as if the first 24 of the above 52 hours devote themselves to an hourly reading alone, only after which the days begin to count off as well. Thus it seems that Excel parses 52:14 into an inaugural, day-less 24 hours – and only then does the 28:14 remainder kick off from January 1, 1900.

But still, format 52:14 as a number instead and the expression returns 2.18 – that is, the passage of 2.18 days – or 4:14 on January 2, 1900.
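Python's timedelta arithmetic makes that parsing easy to verify – a sketch that pegs the zero point at December 31, 1899 (safely below March 1900, so Excel's phantom February 29, 1900 doesn't intrude):

```python
from datetime import datetime, timedelta

td = timedelta(hours=52, minutes=14)
print(td.days)                       # 2 full days consumed
print(td - timedelta(days=td.days))  # the 4:14:00 remainder

# the serial number Excel reports when 52:14 is formatted as a number...
print(round(td / timedelta(days=1), 2))  # 2.18

# ...which lands on the formula bar's 1/2/1900 4:14 AM
print(datetime(1899, 12, 31) + td)  # 1900-01-02 04:14:00
```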

Because even when formatting looks wrong, it’s always right. Now why don’t they say that about my plaid tie and polka dot shirt?


The Grey Lady Learns Pivot Tables: NY Times J-Course Data, Part 2

15 Jul

The Intermediate tine of the three-pronged New York Times data journalism syllabus casts its analytic lot with pivot tables, and the kinds of questions properly placed before that mighty aggregating tool. Among its several data offerings awaits a 2900-record gathering of New York (i.e. Manhattan) real estate data, naming property owners, their holdings and their assessed value, and a trove of additional metrics. Per the course's pedagogical remit, the learners are prompted to pose a few questions of their own of the data – surely a useful heuristic – before they turn to the assignments at hand, which among other things ask "What are the things you notice about this dataset that shows it isn't perfect?"

A good, important, and generic question, one that nevertheless isn't advanced by the Census data sheet we reviewed in the previous post. In fact, worksheet data imperfections can assume at least two forms: a discernible scruffiness among the records, and/or design impediments that could constrain analysis, and I'm conjecturing the Times wants its staffers to concern themselves with flaws of the former stripe.

For example, if this qualifies as a blemish: once downloaded to Excel, both the start and end_date year entries present themselves in text form, thus obstructing any unmediated attempt to group those data. Yet the year_built data remain indubitably numeric, and I can't account for the discrepancy. At the same time, however, these data in their native Google sheet mode appear appropriately numeric, and when I copied and pasted some of the start dates to a Google sheet of my own they behaved like good little values; moreover, the left orientation imparted to the numbers in the end_date field suddenly righted themselves (pun intended) via my paste. Another Google-Microsoft lost-in-translation flashpoint, perhaps, and not a species of data imperfection, if one remains in Sheets. (Another note: I've been heretofore unable to work with the Times sheets in their Google trappings, as access to them defaults to View Only, and editing seems to require permission from the sheet owners. My requests for permission have gone unrequited to date, but in fact you can copy and paste the data to a blank Google sheet and go ahead. The data are open-sourced, aren't they?)

Far more problematic however, and presumably one of the data failings over which the Times hoped its learners would puzzle, are the disparate spellings in the owner_name field of what appears to be the one and the same New York City Department of Housing:


(Note the last entry above is simply misspelled. The data were drawn from the coredata site, by the way, a project of New York University’s Furman Center.) And, while we’re at it:


But the Times’ marching orders don’t oblige its learners to proceed and do something about the inconsistencies. Absent that determination, no accurate answer to the Times’ question (number 6) – “Which owner has the most buildings?” – can be enabled. Remember that the Intermediate unit is pivot-table-driven, and any table invoking the owner_name field is going to loose the untidy spate of spellings reported above.

Yet one more imperfection besetting the selfsame owner_name field is the formidable complement of cells – 381 of them, to be exact, or about 13% of all the records – that contain no owner name at all, a lacuna that likewise compromises the analysis. The Times asks its learners "Who are the biggest owners in each neighborhood based on the number of units? Limit your table to owners who have more than 1,000 units", an exercise which would appear to call for a pivot table that looks something like this:

Rows:  Neighborhood


Rows: owner_name

Values: res_units (filtered in the Rows area for sum of res_units equal to or greater than 1000)

And that alignment of parts kicks out a set of results that, in excerpt, embody the problem:


Indeed, both data shortcomings – the blanks and the variant spellings – degrade the findings prohibitively.

The Times also wants its learners to “Compare the average value per unit for different neighborhoods. Which is the most expensive and which is the cheapest?” That chore seems to call for a calculated field, e.g. in Excel:


I'm just wondering if the Times cohort learned the equivalent feature for Google Sheets; perhaps it did, after all. Its Data Training Skills List merely records the Pivot Tables rubric without elaboration. (Note in addition that the housing data sheet harbors an Income sheet from which the Neighborhood population, income, and diversity fields on the Housing sheet have presumably been drawn, probably through a series of VLOOKUPs whose yields have been subject to a mass Copy > Paste Special routine directed to the Housing sheet.)
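Whatever the platform, the calculated field reduces to a grouped ratio – summed assessed value over summed units per neighborhood, not an average of per-row ratios. A pure-Python sketch with invented figures (field names approximate the Times sheet's; the values are mine):

```python
from collections import defaultdict

rows = [
    {"neighborhood": "Central Harlem", "assessed_value": 4_000_000, "res_units": 80},
    {"neighborhood": "Central Harlem", "assessed_value": 2_000_000, "res_units": 20},
    {"neighborhood": "Chelsea",        "assessed_value": 9_000_000, "res_units": 60},
]

value = defaultdict(float)
units = defaultdict(int)
for r in rows:
    value[r["neighborhood"]] += r["assessed_value"]
    units[r["neighborhood"]] += r["res_units"]

# a calculated field divides the summed fields, as a pivot table would
per_unit = {n: value[n] / units[n] for n in value}
print(per_unit)  # {'Central Harlem': 60000.0, 'Chelsea': 150000.0}
```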

Of course, that surmise points to a larger question: the breadth of spreadsheet capabilities spanned by the Times training. How, for example, were learners expected to apply themselves to this assignment: “Which neighborhoods will be most affected (in terms of number of units) in each of the next 10 years by expiring subsidies and which one is the most secure?” I’d try this:

Rows: Neighborhood

Columns: end_date (filtered for years 2019-2028)

Values: program_name (Count, necessarily; the data are textual)

And my table looks like this:


Thus Central Harlem is the neighborhood most vulnerable to short-term expirations of program subsidies – by far – with the Stuyvesant Town/Turtle Bay district, really a mélange of sections on Manhattan's East Side, the least exposed. But does my pivot table approximate the strategy the Times was seeking? Again, I don't know, but a conversation with the paper's syllabus architects about their intentions for the exercises would prove instructive – at least for me.
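For what it's worth, that Rows/Columns/Values arrangement can be mimicked in miniature with a counter keyed on the two fields – a sketch over fabricated records, not the sheet's actual data:

```python
from collections import Counter

# (neighborhood, subsidy end year) pairs, invented for illustration
records = [
    ("Central Harlem", 2019), ("Central Harlem", 2020),
    ("Central Harlem", 2020), ("Turtle Bay", 2027),
]

# the pivot table's Count of program_name, keyed by row and column fields
# and filtered for the 2019-2028 window
crosstab = Counter((n, y) for n, y in records if 2019 <= y <= 2028)
print(crosstab[("Central Harlem", 2020)])  # 2
```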

And that conduces toward the inexorable follow-on, then: I’m happy to offer my services to the Times, in the edifying form of a weekly column on spreadsheets, and for a magnanimously modest emolument; and I’d make myself available to help with the in-house training, too.

Just one question: will my press pass get me into all the Yankees games for free?

Addendum to the above: My previous post recounted my inability to access and edit the Times' files in native Google Sheet mode. The paper's Elaine Chen did get back to me yesterday (July 16), pointing to the File > Download alternative. One assumes, after all, that the Times doesn't want to approve shared file access for the multitudes, and probably for good reason. I should add that if one downloads the data in CSV instead of Excel mode, the formatting discrepancies I described in Part 1 seem to disappear.