
New York Data Mine, Part 2: The Graffiti Files

25 Apr

Do you believe the words of the prophets are written on the subway walls? That’s what the song said; but the Talmud says nowadays prophecy is given to fools and children, so the authorial judgement is yours.

In any case, we’re talking graffiti, the in-your-face, on-your-property genre that urban incumbents either adore or abhor. For devotees, it’s nothing less than a beauteous cultural template, Lascaux in spray paint; for antagonists, nothing more than vandalism under cover of night, daubed in attitude-laden hues of faux hipness.

And it’s the partisans from the latter camp who vent their pique at all that unwanted art to New York’s designated pique-collector, its Department of Sanitation, who’ve passed the grumbling over in turn to the New York Open Data site, via this spreadsheet:

 grafitti

The sheet bundles graffiti complaints for the year bounded by September 2010 and August 2011, 13,000-plus strong, and touches off one obvious question for the research-minded, for starters: do the data describe any significant variation when broken out by the city’s boroughs (or counties)?

For example: after completing the usual sheet housework (i.e., auto-fitting the columns) we could pivot table the boroughs’ ratios of closed to open complaints:

Row Labels: Borough

Column Labels: Status

Values:  Status (Count), shown as % of Row Total (and you’d probably want to display grand totals for columns only; each borough’s status breakout must come to 100%).

I get:

    gra1

Interesting, I think. The data disclose considerable headroom between Manhattan’s close rate and those of the other boroughs. Surprising? Maybe not; I would allow that a belief in a Manhattan service-delivery bias has circulated among other New Yorkers for some time, perhaps most memorably during the throes of the 1969 snowfalls, when Queens residents excoriated then-Mayor John Lindsay for his slowness to clear that borough’s streets. The obvious remit here is to explain the disparity – or attempt to learn if the Manhattan margin is perhaps better laid to some artifactual, measurement-driven issue instead. (By the way, I don’t have a handle on the few Pending complaints, particularly their closing dates. If they’re pending, how were they closed?)

What does that mean? It could mean, for example, that Manhattan’s superior close rate might have more to do with a relative dearth of graffiti reports there than with any favoured-treatment status the borough might enjoy. No; that doesn’t sound right, but the data should stand in nicely for our guesswork. Just yank Status from its Column Label setting, then click in the Values area, return to the Show Values As button and select No Calculation (that is, we’re here simply breaking out, or counting, the number of complaints by borough, irrespective of complaint status):

gra2

So much for that conjecture. Given its massive daytime population, Manhattan unsurprisingly leads the complaint parade (though fewer than 20% of all New York residents actually live there).

Ok – so if Manhattan is indeed New York’s pet, would it stand to reason that its complaints are closed more swiftly than the ones logged in the other boroughs? It might, and in order to make that determination we need a new field and a formula, one that simply subtracts Created Dates from Closed Dates (after we’ve satisfied ourselves that the data lining those columns are in fact all dates, and not the text-formatted imposters we’ve encountered in previous posts). Running a couple of COUNT functions beneath the entries should validate the data types, bearing in mind that a good many of the Closed Dates cells are vacant; sorting by that column at one point will enable you to perform a COUNT that gathers all its occupied cells into the expression. And remember that dates are really sequenced numbers, baselined to January 1, 1900.
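By way of illustration – and assuming the Closed Dates run down column I from I2 through, say, I13001 (the exact extent is a guess on my part) – a pair of checks along these lines would do:

=COUNT(I2:I13001)

=COUNTA(I2:I13001)

If the two figures agree, every occupied cell in the range is genuinely numeric, i.e. a bona fide date; a COUNT that falls short of its COUNTA flags text-formatted imposters lurking among the entries.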

Let’s see…where was I? Call column L something like Day Span and enter, in L2:

=I2-F2

(You’ll note the large, unsightly spate of negative numbers streaming down the column, flowing necessarily from the absence of close dates for the open complaints. These don’t matter for our current purposes, but under other analytical circumstances they might, and might have to be dealt with differently.) Then haul those data into a pivot table (and if you’re working with an existing table remember to bring the L column into the operative range via PivotTable Tools > Options > Change Data Source):

Row Labels: Borough

Values: Day Span (Average; format decimals as you wish)

Report Filter: Status (select Closed)

gra3

That’s interesting too. The Manhattan average closing day span falls very much in the middle of the pack – again, even as its overall close rate pulls away from the other boroughs. And that qualifies as a bit of a puzzlement. If, after all, Manhattan complaints were somehow easier to resolve, it should reasonably follow that these would be brought to closure more swiftly. But in fact Bronx and Brooklyn complaints take less time to close out, and only the Queens numbers careen into outlier status.

The answer – or at least what I take as the answer – lies in the points across the September 2010-August 2011 year at which the complaints were registered, and from which borough. If it could be demonstrated that Manhattan complaints tended toward a calendrical front-loading – that is, if more of them were issued earlier in the year rather than later – it might be sensibly inferred that, all other things being equal, more of these would be closed as a result.

Remember that our data extend only so far as August, 2011, and tell us nothing about the disposition of calls remaining open at that cut-off. Presumably by now the great preponderance of these will have been closed – but we can’t know that from the information in our hands. Thus many of the later complaints among our data appear as open – even as, for all we know, they were eventually closed with comparable timeliness. The bottom line, then, is that the earlier the complaint, the more likely it is to be closed; and if Manhattan complaints could be shown to predominate in the early going (for whatever reason), they’d evince the higher close rate.

To check this all out, try this pivot table:

Row Labels: Borough

Column Labels: Created Date (which I’ve then grouped via 180 Days, thus splitting the chronology here more or less in two)

Values: Status (Count)

 gra4

That looks confirmatory. Again, and for reasons that call for a round of deeper looks, Manhattan complaints were lodged disproportionately in the first half of the reporting period – and that seems to be one of those measurement artifacts about which I spoke earlier. If I’m right, then no Manhattan bias – rather, a superior close rate born of a set of disparately-timed borough complaints.

Hope you’ve been writing all this down. If not, there’s a nice big wall over there…

I’m in a New York Data Mine: Excerpts from the Open Data site, Part 1

18 Apr

Big Apple, big data. That sentence fragment comes to you straight from New York’s Open Data site at

https://nycopendata.socrata.com/

an eccentrically, wondrously miscellaneous trove of information whose contents have little else in common apart from their source city. A good many of Open Data’s (continually updated) holdings make themselves available in de facto spreadsheet form; click any of the dataset icons dotting your search results (the search field beckons in the column stationed at the left of the screen shot), proceed to click the resulting Export button, and finally fire away at the XLSX format option in the Download area (though you can get away with CSV, too).

nyc1a

There’s a lot to look at here, but in the interests of starting somewhere I’ve dusted off the SAT (College Board) 2010 School Level Results dataset and stamped it into the spreadsheet here:

SAT__College_Board__2010_School_Level_Results

SATs, for the worldwide readership, are optional preparatory exams taken by college-bound American high schoolers, the results directed to applicants’ university choices. Tripartite in construction, the exam presently comprises Mathematics, Critical Reading, and Writing units, each ranging in score from 200 to 800. The SAT has been impugned by critics who charge that cultural biases lace the exams, and some institutions have struck it from their student-acceptance formulae.
Still, the SAT is out there and continues to impact hundreds of thousands of life chances; as a result, right or wrong, the 2010 results have something reportable to say.

Once you’ve called up the worksheet you’ll want to put the wrench to a few of the nuts and bolts in there that need tightening, e.g., auto-fitting the B column the better to bare its school names, and sorting the data by, say, the C column (Smallest to Largest) in order to consign score-less schools to the lowest reaches of the data, then expatriating them from the usable numbers by interposing a blank row between rows 387 and 388 (and I can’t yet explain why 74, or 16%, of the schools report no scores). You’ll next likely want to introduce a Totals column in G, simply summing the three school SAT scores populating columns D through F (and ignore any “Formula Omits Adjacent Cells” advisory if you see it).
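That Totals column asks for nothing fancier than, in G2 (assuming the first school occupies row 2):

=SUM(D2:F2)

copied down through the last score-bearing row.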

And once those deeds are done you’ll then need to understand the selection bias of sorts that prefigures these data. Our spreadsheet considers New York’s public schools only (that term to be understood in the American sense), spread within the domain of the New York City Board of Education; the data look not at all at the substantial fraction of the city’s children enrolled across its broad panoply of private schools. It is fair to assume that students in that latter cohort come home at night to higher-income families (at least overall), with the attendant educational advantages (see for example, http://admissionsusa.com/messageboard/2011/02/23/do-private-schools-guarantee-higher-sat-scores/ ).

Put all those ducks in their rows and the obvious next step suggests itself – a descending sort of the test data in our TOTALS field. Go ahead and the list tops out with these schools:

nyc5

And here’s where it pays to be a New Yorker. Natives will recognize familiar, storied names among the high fliers, including Stuyvesant, the Bronx High School of Science, Brooklyn Tech, and Townsend Harris, special-entry schools that proffer what Wikipedia calls accelerated academics to talented young city dwellers. The 735 average Mathematics score boasted by Stuyvesant’s crew is no mean feat, after all.

But now what? Get past that sort, and the data don’t seem particularly inspiring. But – and again, a New York provenance helps here – that six-character DBN field secretes a hidden nugget of data, in position three, to be exact. That alpha codes the borough, or county, in which the schools find themselves, to wit:

K Kings
M Manhattan
Q Queens
R Richmond
X The Bronx

(Kings is the official name of Brooklyn; Richmond is the by-the-book listing for Staten Island.)
And once we associate each school with its borough we should be able to aggregate SAT scores by their geographical identifier.

To start, enter the above data in a range of your choosing (I’ve gone with N1:O5) and name it, say, Borough; enter the field name Borough in H1, and slip down to H2 and write:

=VLOOKUP(MID(A2,3,1),Borough,2)

(I’ve described MID in my September 13 post; a first cousin of LEFT or RIGHT, it plucks characters from the interior of a text expression.)
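(One small hedge of my own: because the Borough range happens to be alphabetically ordered, VLOOKUP’s default approximate match behaves itself here, but tacking on the exact-match fourth argument is the safer habit:

=VLOOKUP(MID(A2,3,1),Borough,2,FALSE)

the FALSE insisting on a dead-on match.)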

Then copy that result down the H column.

Now you can light up a pivot table, for starters:

Row Labels: Borough
Values: Totals (Average, formatted to, say, two decimal points)

nyc3

County honors go to Richmond, or Staten Island – with a population of around 470,000 by far New York’s smallest, but the borough piling up the city’s highest median (though not mean) income. The hierarchy of results stacked above correlates pretty strongly with borough income averages, but Manhattan jabs a sore thumb at the data; with a surpassing mean household income of about $120,000, common sense points us to the expectation that Manhattan schools’ SATs would rule – and perhaps they do. Remember – we’re confining ourselves to public school scores, and as such haven’t factored in all those rich kids haunting Manhattan’s Upper East and West sides, the ones whose parents pay the big bucks to those posh, well-resourced schools. That issue is a matter for an additional, different set of data.

One more data morsel: as the SAT is optional, one might – might – project a thumpingly positive correlation between the number of test-taking students in a school and its average SAT score. That is, greater student interest in higher education, bigger SAT scores. The correlation between number of SAT-takers and scores:

=CORREL(C2:C387,G2:G387)

evaluates to .495 – palpable by social-scientific standards, but one still leaving much of the variance to be explained. A truer measure would begin by first learning the proportion of all school enrolees opting for the SATs, and only then dropping the numbers into the correlation.

In any case, that’s just a taste from the Open Data menu. Hope to serve up something else next time.

@abbottkatz

The Grey Lady Learns Pivot Tables: NY Times J-Course Data, Part 1

28 Jun

This just in: The newspaper of record is rebranding itself into the newspaper of records. The Times – the one from New York, that is – has moved to evangelize the data-journalistic thing among its staff, and towards that admirable end has crafted an extended in-house workshop, its syllabus and practice files/exercises made available to all the rest of us in Google Sheets here and here, respectively (ok, ok, call me the Luddite; I’m downloading the files into Excel).

The latter link above points to a sheaf of workbooks classed Advanced, Intermediate and Beginner (these rubrics sorted alphabetically, thus interpolating Beginner between the two ostensibly more challenging data collections; note too the Times’ caution that even as the data sets have been mined from real-world repositories they’ve been alloyed, the better to serve their instructional purposes), and it occurred to me that a look at some of the course contents might prove instructive in its own right.

We can begin with the Beginner Census_Characteristics of Older Americans (2016, 2019) workbook, whose associated Census: Worksheet exercise document asks us to unhide all its sequestered columns (about 65 of them in fact, most of which are massed at the far end of the data, something I missed repeatedly). Remember that I’m downloading the data to Excel, an improvised recourse that bunches the field headers into ill-fitting Wrap Text mode. Directing the download in Open Document mode instead, however, returns the book to Excel with the headers properly, visibly wrapped, though the columns could do with a bit of resizing (I don’t know if these little disjunctions bear traces of Google-Microsoft woofing).

The exercise text proceeds to let us know “We roughly color the group of categories. For example, the race and Hispanic stats are in light orange, and green columns are about marital status”. But no; these tinted ranges aren’t conditionally formatted, and to be fair can’t really lend themselves to those cellular ornamentations. What shared textual/numeric datum, for example, could encourage all the ethnic data cells in columns K through V to turn orange? On the other hand, the columns brandish their colors all the way down to row 999, Google Sheets’ default row allotment maximum, though the data come to a halt at row 52.

Next, among other questions the exercise puts to us, we’re asked to “Take the average of the state mean earnings [presumably for Americans age 60 and over] and then look up the mean average for the US. Why do these numbers differ?” Again, devoting ourselves to the 60-and-older data in the “states, 2016” sheet, and more particularly the 60+Pop; mean earnings field in column BB, that average is realized easily enough. But what mean average for the US does the Times want us to look up, and how? Of course, that very requisition may contribute to the exercise; and so after a bracing scroll-through across the 419 fields bulking up the “2016, US” sheet I stepped atop its cell KL2, the one reporting mean household earnings for the 60+ demographic of $65,289 (sans currency format). But my lookup was eyeball-driven, and certainly not under the steam of any maneuver typically entrusted to the redoubtable VLOOKUP or HLOOKUP functions. Those instruments, after all, assume we know the identity of the pertinent lookup value – and we can’t know that the value reads “60 years and over; Estimate; INCOME IN THE PAST 12 MONTHS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Households – With earnings – Mean earnings (dollars)”, the header entry in cell KL1:

Times1

And so by “look up” I’m constrained to assume that the Times is asking of us a simple, unmediated, visual hunt for that information. In other words: look up, not LOOKUP.

And with the respective means – the national average recorded unitarily in the “2016, US” sheet and the state-by-state average figured in “states, 2016” – in hand, we can propose an answer to the question the exercise puts to us: namely why the two averages differ. The answer, I hope: that the state average accords equivalent weight to each of the 51 (Washington DC appears in the list) income figures irrespective of population size, while the single national figure in effect tabulates the earnings of every American, thus flattening out any skew.
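A toy illustration (mine, not the Times’): imagine a country of just two states, one with 1 million earners averaging $50,000 and another with 10 million earners averaging $80,000. The state-by-state average comes to $65,000, but the true national mean – total earnings divided by total earners – works out to about $77,273, because the larger state rightly dominates the figure.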

And speaking as a mere auditor of the Times workshop, I’d pose the same strategic conjecture about the exercise question “Which 3 states have had the largest percentage increase in their residents who are above 60 over that time?” That is, I’d wonder if the Times expects its tutees to simply do the basic math and literally look for the three most prominent state increases – or rather, filter out a top three, a la Excel’s filter option.

But the filtering alternatives in Google Sheets can pull users in two very different directions. One pathway transports them to a filter view resembling the standard Excel dropdown-menu mechanism – but I can’t find a Top 10 (or 3) possibility registered among its capabilities here. The other byway to a top 3 drops us off at the most cool FILTER function, a de facto progenitor of the yet-to-be-released Excel dynamic array function of the same name; but its workings demand an intricacy not likely to be broached in a beginner class. Thus, I suspect that the Times again wants its learners to resort to simple visual inspection in order for them to glean that top 3.

As for the actual line-item math here, should you or a Times staffer choose to proceed with the exercise, I’d hammer in a new column somewhere in the “states, 2016” sheet and slot this formula in for the District of Columbia, the first “state” in row 2:

=D2/C2-J2/I2

The column references – D, C, J, and I – offer up the base and 60+ population data for 2016 and 2009. (And yes, the formula can make do without parentheses: the order of operations takes care of itself.)

Copy down the ad hoc column, and the trio of states divulging the largest increments in the 60+ cohort will be in there, somewhere.
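For the curious, though, the FILTER route might look something like this in Sheets – assuming the state names sit in column A and your ad hoc increase column has landed in, say, column AA (both placements are guesses on my part):

=FILTER(A2:A52, AA2:AA52 >= LARGE(AA2:AA52, 3))

which spills the states whose increases rank among the three largest (ties, should they occur, would spill along with them).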

And if the filter isn’t working for you, why not sort the numbers largest to smallest, and check out rows 2 through 4?

 

New York City Restaurant Inspection Data: Tips Included

9 Jan

Expressing an interest in getting a bite to eat in New York calls for a bit of narrowing down. You will need to get a little more specific about your preferences, in light of the 26,000 or so restaurants in the city happy to fill your mouth, and evacuate your wallet.

Indeed – advising your New York crew “Let’s go to a restaurant” reminds me of the woman who stood in front of me in a Starbucks and requested a cup of coffee, the kind of order that drives baristas into a hand-crafted frenzy.

But once you’ve finally sat yourselves down you may want to learn a little more about what exactly it is you’ve gotten yourself into – literally – and the restaurant inspection data reorganized by the Enigma public data site may go a ways towards telling you more than you wanted to know (the data are free to you, but they want you to sign up first. Remember that the lower the inspection score, the more salubrious.)

I say “reorganized” – although Enigma will tell you they’ve “curated” the data – because the inspection outcomes have presumably been culled from New York’s remarkably near-real-time and far larger official data set, available on the city’s open data site (and Enigma’s too, though their version is three months old). The revision opens an interesting point of entry, then, to an understanding of how someone else’s data have been re-presented by someone else.

In what, then, does Enigma’s remake of the original data consist? For one thing, they’ve proposed to distill the source data set down to a unique entry for each restaurant (keep that stratagem in mind), each of which, after all, has been subjected to several inspections. By way of verification I aimed a Remove Duplicates check at the camis field comprising restaurant ids, and came away with but six redundancies – not too bad for a compendium of nearly 25,000 records.

And once having completed that chore we can run a simple but revealing pivot-tabled census of New York’s eateries by borough:

Rows: boro

Values: boro (count)

boro (again, by % of Column Total)

I get:

resto1

No one will be surprised by Manhattan’s restaurant plurality, though it should be added that the residential populations of both Brooklyn and Queens far exceed that of the storied island. In addition, keep in mind that the endless turnover of restaurants (the Quora article linked above declares an annual restaurant closure rate of 26%, though that assertion should probably be researched), turns the count into an implacably moving target.

And for another thing, the Enigma set has padded the progenitor data with each restaurant’s geo-coordinates (latitude-longitude), thus priming a mapping capability. But they’ve also, befitting one of Enigma’s enigmatic apparent first principles, reformatted the inspection dates into text mode.

And Enigma’s alternate take has also put the scissors to some of the set’s original fields. The Critical Flag field – naming restaurants that incurred what the Department of Health and Mental Hygiene terms critical violations, “…those most likely to contribute to food-borne illness” – is gone, and I’m not sure why. Those data sound like something you’d want to know about, and analyze.

But there’s a pointedly more serious issue besetting the data that I haven’t quite figured out. Because Enigma determined to squeeze the data into a one-record-per-restaurant yield, it had to decide exactly which record would be earmarked for retention; and common analytical sense would commend the latest such record, conveying the current inspection standing for each restaurant. But it appears that Enigma hasn’t always nominated the latest record. A spot comparison of the records across the two datasets turned up some Enigma selections that predate more current inspections for the same restaurant in the official New York workbook. And if those kinds of discrepancies riddle the Enigma data, then we need to wonder about the decision rule that authorized their inclusion – and I don’t know what it is. What would an aggregate averaging of inspection scores purport to say, if some of the scores have been superseded by newer ones? (My emailed query to Enigma about the matter remains unanswered as of this writing.)

Moreover, because the one-record stipulation is in force, Enigma was impelled to collapse disparate violation codes in that eponymous field. The very first record, for example, for the Morris Park Bake Shop, reports two violations coded 10F and 8C, both filed on May 11, 2018. But New York’s precedent dataset has assigned a distinct record to each of the two, easing a pivot table breakout by code.

And those code consolidations – an ineluctable follow-on of the one-record-per-restaurant decision – probably explains Enigma’s omission in turn of the original Violation Description field. Boxing multiple violations in the space of one cell might confound legibility for both researchers and readers, and so Enigma likely concluded the whole field was best expurgated – at a price of course, because now we don’t know what the violation codes mean.

Now to be fair, Enigma also furnishes a worksheet-housed directory of those codes, which makes for a most serviceable lookup array; but the multiple-code cell structure of its inspection data makes for an exceedingly messy prospect for 24,000-plus lookup values, which must be individuated somehow.
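If you were determined to try anyway, one stopgap – assuming the violation codes are comma-delimited, that they sit in, say, column H, and that the two-column code directory has been assigned the range name Codes (all three assumptions mine) – would pull the first code out of each cell and look that one up:

=VLOOKUP(TRIM(LEFT(H2,FIND(",",H2&",")-1)),Codes,2,FALSE)

That of course decodes only the first violation in each cell; the remainder would still need to be individuated by other means.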

But all these cogitations have given me the munchies. Where do you want to eat? You want Chinese? Fine – that pares the choices to around 2,400. Your treat.

Walking Across the Brooklyn Bridge: Stepping Through the Data, Part 1

11 Dec

Ask New Yorkers what they were doing at 4 o’clock in the morning on the night of October 12, 2017 and they’ll tell you they a) don’t remember or b) won’t answer on the advice of counsel. But no matter – we know exactly what one intrepid iconoclast was doing at that hour – walking towards Manhattan on the Brooklyn Bridge.

That solitary – and courageous – wayfarer will be happy to know we have no more information about him, or her; all we know is that he – or she – was sighted and duly recorded by the Brooklyn Bridge Automated Pedestrian Demonstration Project, an initiative of New York’s Department of Transportation.

Squirreling an electronic counter somewhere on the Manhattan-side approach to the storied bridge the project gathered footfall data for both directions through the October 1 2017-July 31 2018 span (I’ll own up to the pun) and walked them over to New York’s open data site here (click the Export button on the screen’s far right and tick the relevant CSV option).

Our solo nightwalker presages a dataset that is nothing if not intriguing, even as it exhibits a few organizational curiosities. Apart from column A’s all-but-standard need for a widening via auto-fit (its dates/times are authentically numeric, though) you’ll join me in declaring the contents of column B – comprising 7,296 citations of the phrase Brooklyn Bridge – slightly superfluous, and so eminently dispensable.

And much the same could be offered about the lat and long fields, each of whose cells deliver the same coordinate, presumably positioning the Brooklyn Bridge in its earthly locus. So too, the Location1 field restates its data all the way down, and in textual terms, no less. We’ve seen this sort of thing in any number of open datasets, and it’s proper to wonder why. One assumes some industry-wide download routine batches out these relentless uniformities but whatever the accounting, the data they produce aren’t needed and could be either deleted or ignored.

And there’s another corrective that merits a claim on our attentions. The numbers in the Pedestrian field for November 9 at 7 and 8 PM read 411 and 344 respectively, but the companion data in the Towards Manhattan and Towards Brooklyn cells – which when summed should equal the Pedestrian figures – report nothing but zeroes. And in view of the fact that the pedestrian numbers for November 9 at 6 and at 9 PM read 455 and 300, it seems clear that the counts for 7 and 8 could not have amounted to nothing at all. I broached the discrepancy via emails to both the Department of Transportation and the New York open data site, but have yet to hear from either. For the moment, we have to proceed with four empty cells.

And there’s something else, a data-organizational failing that could, and should, be righted, and one we’ve encountered in previous posts. Because the Towards Manhattan and Towards Brooklyn fields host data that are really of a piece and should be treated unitarily (the easier to calculate the percent of pedestrians by direction and date/time, for example) they should migrate their holdings to a single parameter (which I’ll call Direction), via the Get & Transform Data routine with which we’ve recombined  other datasets. Start by selecting both Towards Manhattan and Towards Brooklyn and continue per these instructions.
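(In barest outline, and from memory rather than from those instructions, so treat this as a sketch: having gotten the data into the Power Query editor via Data > From Table/Range, select the Towards Manhattan and Towards Brooklyn columns there, click Transform > Unpivot Columns, and then Close & Load the result back into the workbook. The exact menu wording may vary a bit with your Excel version.)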

bridge1

The Get & Transform alternate take then also frees us to delete the Pedestrian field, because the new amalgamated Direction brings along with it a Value (I’ll rename it Sum) column that totals the Brooklyn/Manhattan numeric data by each date/time entry, thus superseding Pedestrian.  Of course the new dataset (presenting itself in table form) comprises twice as many records as its progenitor, because each Towards Manhattan and Brooklyn notation now populates the new Direction field with an individual record instead of the previous parallel fields populating the same record; but that trade-off is well worth the price.

Once that extended preliminary chore completes its work we can try to make some sense of all that fancy footwork on the bridge. We could start by pivot tabling pedestrian totals by hour of the day:

Rows: hour_beginning (group by Hours only)

Columns: Direction

Values: Sum

I get:

bridge2

The two obvious attention-getters here are the sizeable, but largely predictable, fluctuations in foot traffic by hour, and the direction of that movement. Note the grand totals as well: over 5 million unidirectional trips launched across the ten-month project period – about 16,500 a day, or approximately 8,200 discrete, carless (don’t read careless) New Yorkers opting for the scenic route over the East River, assuming of course they walked the walk both ways. And if you’re wondering about 4AM, we find around 5.6 average crossings ventured at that hour – even though that hardy coterie probably could have gotten a seat on the subway, then, too.

And consider the literal back and forth of the walkers’ directional proportions. 3PM (that is, the hour running through 3:59) attracted the most pedestrians, closely flanked by 2 and 4PM; and though I doubt the reasons for the hours’ appeal are particularly mysterious, I don’t know what they are. An optimum time for a mid-afternoon stroll? A cadre of workers perambulating home from their early shift, or setting out on their way to a late one? I doubt it, but research on the matter awaits.

And what of the 8AM hour, in which walks toward Brooklyn far outnumber trips to Manhattan? I would have thought – wrongly – that the press of rush hour would have drawn Brooklynites toward Manhattan in predominating numbers, but we’ll have to look elsewhere for an explanation. But by 9AM the flow reverses, rather pronouncedly and curiously.

Now the above table might read more crisply by removing the grand totals for rows (while retaining them for columns) and spinning the numbers through the Show Values As > % of Row Total grinder (remember that this prospect has been enabled by the Get & Transform protocol):

bridge3

Thus we see 58% of the 8AM traffic heading toward Brooklyn, with a pull-back to 44.71% by the next hour – a dramatic, and analytically provocative, reversal.

And what about pedestrian accumulations by day of the week? Common sense allows that weekend totals should be the greater, and that conjecture dovetails with the truth. But extolling common sense won’t do much for your byline; you’ll need to substantiate the finding with a wee bit more precision. Start by introducing a new column to the immediate right of hour_beginning, call it Weekday and enter in what should be B2:

=WEEKDAY(A2)

(If you click on A2 in order to emplace that cell reference in the formula, Excel will respond with a table-structured reference; but the result will be identical.)
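(And should you prefer a Monday-through-Sunday ordering, WEEKDAY’s optional second argument obliges:

=WEEKDAY(A2,2)

returns 1 for Monday through 7 for Sunday.)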

And because the dataset has assumed table form, the formula will instantly copy itself down the B column.

Then try:

Rows: Weekday

Values: Sum

Sum (again, this time % of Column Total)

I get:

bridge4

(You’ll want to rename the headers. Remember as well that 1 signifies Sunday, 7 Saturday.)

Common sense prevails, though I hadn’t foretold Saturday’s substantial edge over Sunday.

But nor did I expect to find someone on the Brooklyn Bridge at 4 in the morning. I want to see the selfie.

 

 

Airbnb Data, Part 1: A Tale of Three Cities

27 Apr

Would you rent your apartment to me? I have references from real people, don’t smoke, clean up after myself (is my nose growing?), and probably can be counted on not to trash your living room and adjoining vicinities.

Still don’t want to take my scratch for your pad? See if I care; there are plenty of other flats out there where yours came from.

Too many, in fact, according to Murray Cox, the self-identified “data activist” whose researches into Airbnb’s rental listings opened the door on a warehouse of dodgy practices, in numerous localities, e.g. property owners who market multiple dwellings, a clear and present violation of New York state law. Cox maintains that, among other things, the outsized scale of Airbnb offerings can worrisomely constrict a city’s available rental stock, and has published a city-by-city inventory (brought to my attention by a student) of Airbnb listings that you and I can download in most convenient spreadsheet form (look for the Summary Information and metrics for listings in… link attaching to each city).

It occurred to me that, among other potential takeaways, an intercity comparison of Airbnb activity might advance the journalistic cause a mite. I thus downloaded the data for New York, London, and Paris, all nicely exhibiting the same fields. With the intention of copying and pasting the data to a single workbook I cleared out a new column to the left of A, called it City, and entered and copied down the respective names of the three locations, properly lined up with their data once pasted, culminating in 162,701 rows of data, its 20 megabytes just itching to tell me what Airbnb has been up to.

Of course, the three-city amalgamation means to prime the data for a range of comparisons, but some comparisons don’t avail. I’m thinking in the first instance about the price field in what is now column K. These entries presumably cite daily rental rates, but express themselves in disparate currencies – dollars, pounds, and euros. One supposes an exceedingly determined investigator could mobilize and apply a round of currency equivalences to the records, a prospect that would require a vast compiling of date-specific rate fixes – in short, a challenge likely to daunt a real-world, deadline-mindful journo. I’ve thus elected to leave the numbers as they stand, and if that touch of laissez-faire works against the analysis I have no one to blame but myself. The buck stops here – and maybe the euro, too.

In any case, before we get fancy, we can think about this self-effacing pivot table:

Rows: City

Values: City (Count, by definition for a textual field)

I get:

air1

We see that Paris – by far the smallest of the three cities – nevertheless advertises the largest number of Airbnbs. An accounting for that disjuncture would probably justify a deeper look. Might tourist cachet or friendlier legalities spur the Paris margin? I don’t know. But consider that, juxtaposed to Paris’ population of around 2.25 million and its average household size of approximately 2.3 persons, the city’s Airbnb stock could house around 6% of its residents – with the point, of course, that the inventory is apparently being withheld from the permanent-residence rental market.

Other incomparables have their place among the data, too. There’s little comparing to do as such among the three cities’ neighborhoods, and indeed – the neighbourhood group (UK spelling) field for Paris and London is utterly empty (the field for New York comprises the city’s five boroughs).

But of course other workable comparisons are available. What, for example, about average minimum stay requirements by city and type of rental? We could try this:

Rows: City

Columns: room_type

Values: minimum_nights (Average, formatted to two decimals)

I get:

air2

We see that diffident London Airbnbers expect notably briefer stays at their places on average, with those uppity Parisians insisting that you agree to set down your knapsack – and debit card – more than a day-and-a-half longer before they let you in. At the same time, New York’s shared-room minimum is disruptively anomalous.

And for more evidence of cross-cultural heterogeneity – if that’s what it is – flip the values into Count mode and hand them over to the Show Values As > % of Row Total, ratcheting the decimals down to zero and switching the Grand Totals off (because the rows must invariably figure to 100%). I get:

air3

The overwhelming Paris proportion devoted to the Entire home/apt offering is not, I would submit, proof positive of yet one more Gallic quirk, but rather a simple function of the Paris housing stock, in which apartments predominate.
For additional, if glancing, corroboration, try this pivot table:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (tick New York)

Values: neighbourhood_group (Count, % of Row Total)

I get:

air4

Recall that New York is the only city among our trio whose neighborhood group field is actually occupied with data – the names of its five boroughs. Note the relative Manhattan tilt towards Entire home/apt, even as the other boroughs, whose housing range features far more private homes, incline towards Private room – that is, presumably one private room among the several bedrooms in a home.

And what of daily price by city, keyed as it doubtless is to room type? It looks something like this:

Rows: City

Columns: room_type

Values: price (Average, to two decimals)

I get:

air5

Again, that imperative qualification – namely, that the prices reflect evaluations per their indigenous currencies – need be kept in mind. As such, the New York tariffs verge much closer to the London figures when the appropriate, albeit variable, pound-to-dollar conversion is applied. With those understandings in place, the Paris Entire home/apt average seems strikingly low – because the Euro consistently exhibits a “weaker” relation to the pound, the former as of today equaling .88 of the latter. Yet at the same time, Paris’ private room charge would appear to be effectively higher.

Now again, because the data are there, we could compare average prices for New York’s boroughs:

Rows: neighbourhood_group

Columns: room_type

Slicer: City (New York)

Values: price (Average)

I get:

air6

No astonishments there, not if you’re a New Yorker. Manhattan expectably heads the rate table, though Staten Island’s second-place Entire home/apt standing may issue a momentary pause-giver, along with its basement-dwelling (pun intended) shared room rate.

That’s $7,300 a month for an entire place in Manhattan. And wait until you see the interest on the mortgage.

NY Regents Exam Data, Part 2: Multiple Choices

12 Apr

Numbered among the additional conclusions we can draw from the New York Regents data is a natural next question following from the aggregate test averages we reckoned last week, answered by a kind of reciprocal finding: namely, the aggregate fail rates. Guided by the concerns about weighting we sounded in the previous post, I’d allow that a calculated field need be applied to the task here too, a field I’ve called PctFail.

But before we proceed we again need to contend with the not insignificant number of records that, for apparent reasons of confidentiality, won’t count their fewer-than-five students, replacing the totals with an “s”. Thus I tapped into column N, called it NFail, and entered in N2:

=IF(J2="s",0,K2)

The formula assays the relevant cell in J for an “s”; if it’s there, a 0 is supplied. Otherwise the value in K – the Number Scoring Below 65 – is returned.

Again, we’ll copy that formula down N and proceed to calculate the PctFail field:

reg1
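(If you’re reconstructing that calculated field on your own, it amounts to a simple quotient of the new NFail numbers by the tested-student counts – something on the order of =NFail/'Total Tested' – though the precise name of the denominator field is my paraphrase; the screenshot above records the real thing.)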

Once effected, this simple pivot table, abetted by the All Students Slicer selection we ticked last post, opens the story:

Rows:  Year

Values: PctFail (formatted here in Percentage mode to three decimals)

I get:

reg2

The failure rates are substantial, a near-ineluctable follow-on from the overall score averages settling in the 68 range (remember that 65 passes a Regents exam).

But you’ll want to know about failures by Regents subjects, too. Sight unseen, you’d expect appreciable variation among the test areas, and that drill down can tool its way into the data via the Regents Exam field, e.g. something like this:

Rows: Regents Exam

Columns: Year

Values: PctFail (formatted similarly to the table above)

I get:

reg3

And variation there is, some more provocative than others. You’ll note the massive leap in failure rates for English and Geometry from 2015 to 2016, a determined, ascending slope of failures for Algebra2/Trigonometry, and a restitutive, noteworthy shrinkage in failures for Common Core Algebra. (The Common Core tests are controversial, in part because of their redesign; see this report, for example).

You’ll also want to do something about those error messages. In some cases, the #DIV/0! outcomes simply key the absence of data for the exam, owing to an exam’s discontinuation or later introduction, while the (blank) label appears in virtue of the ten rows that bear no exam name. Should you want to pave over the errors, and you probably do, click anywhere in the pivot table and proceed to PivotTable Tools > Analyze > Options > Options > Layout & Format tab > and tick For error values show:. Enter some appropriate stand-in for #DIV/0!, e.g. — , and click OK. Because the dashes in the 2015 column push far left and look almost as unseemly as the original error message, you may want to select all the Values and align them right. (You could also filter out the blanks.)

Now if you want to crunch failure rates by ethnicity, for example, you’ll again have to reconcile the double-counting character of the fields we described last post. The ethnicities – Asian, Black, Hispanic, Multiple Race Categories Not Represented, and White – have been quartered in the Demographic Variable field, but so have a potpourri of other, disconnected items bound to other Variables, e.g. Female to Gender, English Proficient to ELL Status.

We’ve stubbed our toe against this odd problem in the previous post, in which Excel’s million-record limit has forced records otherwise deserving of their own field into one, messy cosmopolitan column – the one called Demographic Category, itself dispersed into no-less-heterogeneous items in Demographic Variable. It’s a confusing issue, but I think we need to tick Ethnicity in the Slicer now and slide Demographic Category –confined by the Slicer to its ethnic item entries – into Rows. Sweep Year into Columns and you get:

reg4

The disparities here are dramatic, and rather self-explanatory – the results, that is, not the accountings of them.

Next opt for Gender in the Slicer:

reg5

Women outdo men, a finding that more-or-less jibes with current understandings of gender performance differentials. The female margin, nearly equivalent across 2015 and 2016, pulls away slightly in the following year.

And what of gender outcomes by exam? Slip Regents Exam atop Demographic Category (which has been sliced to Gender) in Rows, and (in excerpt):

reg6

And (in second, remaining excerpt):

reg7

You’re looking for palpable divergences, of course, but palpable congruences mean something here, too.  The decisive female advantages in the Common Core English scores are perhaps notable but not arresting; but their edge in Common Core Algebra does a fair share of stereotype busting, even as males emerge the stronger in Common Core Algebra2. (Important note, the Grand Total pass rates vary by Demographic Category even as the Total Tested remains neatly constant across all Demographic Variables. That’s because the distribution of “s” entries across the Categories isn’t constant.)

There are plenty of other permutations in there, but let’s try one more. Column Q quantifies the number of students in the record whose score achieves what the Regents calls College Readiness (CR), i.e., a 75 in the English Language Regents or an 80 on any Math exam in the system.

And here’s where I have to own up to a bit of spreadsheeting excess. In the previous post I implemented what I termed an NFail field, embodied by its foundational formula – an expression that replaced “s” entries with a 0, the better to factor these into a calculated field. I now understand that those exertions were unnecessary, because Excel will completely ignore an “s” or any other label in any case. Thus here (and last week, too) we can work directly with the Number Scoring CR field in Q. But because we do need to acknowledge the large number of “s” and “NA” entries in Q (NA, because only some Regents exams qualify as CR exams) that will impact any denominator, we also need what I call here a CRCounted field, put in place in the next available column and punctuated by this formula that gets copied down:

=IF(ISNUMBER(Q2),I2,0)

We then need to compose that calculated field, which I’m calling CRPass:

reg11
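(Again, for those rebuilding the field themselves: CRPass simply divides the column Q counts by the CRCounted totals – something like ='Number Scoring CR'/CRCounted – with the usual caveat that the precise field name here is my transcription; the screenshot has the definitive version.)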

Remember here, and for the first time, we’re computing pass rates. This pivot table, among others, awaits under the aegis of the Demographic Category Slicer – Gender:

Rows: Demographic Variable

Columns: Year

Value: CRPass

I get:

reg9

Provided I’ve done my homework correctly, the results point to a striking spike in CR attainments, a set of findings that calls for some journalistic deep-backgrounding. (Note that the absolute CR numbers are far smaller than the global Total Tested figures, because as indicated above only certain exams march under the CR banner.) We see a small-scaled but real stretching of the female advantage in the pass rates between 2015 and 2017, one that also needs to be sniffed by some nose for news.

Now let me take a break while I double-check my homework. I hear this blogster is one nasty grader.

America’s Schools, Part 1: Some Truants Among the Data

8 Sep

The segue is inadvertent, but calling up a census of America’s schools right after our look at New York school attendance data makes for a deft transition, if I may say so myself, and I think I’ve just granted permission to do so.

This nationwide listing – 105,000 institutions strong – is curiously archived in a site putatively devoted to information about Hurricane Harvey. I’m not sure about the placement, but that’s where it is.

And it’s pretty big, as you’d expect, counting 105,087 schools in its fold and pitching 24 MB at your hard drive, after a download and a save as an Excel workbook. (Note: the data unroll stats for the country’s public – that is, government operated – schools only. The very large number of private and sectarian institutions diffused across the US are thus excluded from the inventory.) And if you run a simple SUM at the base of column AC, the field storing student enrollment numbers, you’ll wind up with 50,038,887, and that’s pretty big, too.
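(That check, if you want it spelled out: with the headers in row 1 and the 105,087 records beneath them, something like

=SUM(AC2:AC105088)

stationed beneath the column will do it – the terminal row number being my arithmetic, so adjust if your download differs.)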

But of course, that number can’t get it exactly right. For one thing, the overview introducing the workbook tells us that the data feature “…all Public elementary and secondary education facilities in the United States as defined by…National Center for Education Statistics…for the 2012-2013 year”. And since then a few hundred thousand little Justins and Caitlins will have moved on to other venues, to be replaced by littler Treys and Emmas – and the turnover just can’t be equivalent. Moreover the (apparent) Source Dates recorded in X track back to 2009 in many cases, though I don’t completely know how those dates are to be squared with the 2012-2013 reporting period.

Now apart from the as-usual column autofits the dataset obliges you to perform, you may also want to shear those fields likely not to figure in any analysis, though that of course is something of a judgement call. In view of the virtual equivalence of the X and Y data in A and B with those in the LATITUDE and LONGITUDE parameters in S and T, I’d do away with the former pair. I’d also mothball ADDRESS2 (and maybe ADDRESS, too – will you need their contents?) I’d surely dispense with the NAICS_CODE entries, as each and every cell among them declaims the same 611110. And I think VAL_METHOD, VAL_DATE, SOURCE (storing basic information about the school committed to web sites), and probably SHELTER_ID could be asked to leave as well, lightening my workbook by about 5.3 MB all told. On the other hand, WEBSITE appears to have done nothing but clone the contents of SOURCE and as such could assumedly be dispatched as well, but I’ve since learned that the sites offer up some useful corroborating information about the schools, and so I’d retain it. But a field I would assuredly not delete, in spite of my early determination to do so, is COUNTRY. I had misled myself into believing the field comprised nothing but the USA legend, but in fact it entertains a smattering of other geopolitical references, e.g. GU for Guam, PR for Puerto Rico, and ASM for what I take to be American Samoa.

I’m also not sure all the Manhattan schools (the ones in New York county, that is) display their correct zip codes for what it’s worth, and it might be worth something. The Beacon High School on West 61st Street is zip-coded 10022, even as it belongs, or belonged, to 10023 (though that wrong zip code informs a review of the school by US News and World Report); but the error may be excused by an updated reality: the Beacon School moved to West 44th Street in 2015, calling the timeliness of our data into a reiterated question. I’m equally uncertain why the Growing Up Green Charter School in Long Island City, Queens is mapped into New York county.

More pause-giving, perhaps, are the 1773 schools discovered inside New York City’s five counties – New York, Queens, the Bronx, Brooklyn (Kings County), and Richmond (or Staten Island; note that a Richmond county appears in several states in addition to New York). You’ll recall that our posts on New York’s attendance data, drawn from the city’s open data site, numbered about 1590 institutions. Thus any story-monger would need to research the discrepancy, but in any case it is clear that the dataset before us errs on the side of inclusiveness.

But a lengthier pause punctuates a Largest-to-Smallest sort of the ENROLLMENT field. Drop down to the lowest reaches of the sort and you’ll find 1186 schools registering a population of 0, another 1462 reporting -1, 4493 sighting -2 persons on their premises, and 91 more submitting a contingent of -9. Moreover, you’ll have to think about the 5399 schools counting a POPULATION (a composite of the ENROLLMENT and FT_TEACHER fields) of -999. It’s not too adventurous to suggest that these have been appointed stand-ins for NA.

In addition, we need to think about the schools declaring only 1 or 2 students on their rolls. Consider for example the Marion Technical Institute in Ocala, Florida and its 1 student and 34 full-time teachers. Visit its web site, however, and we encounter a more current student enrollment of 3 and a FTE (full-time equivalent) instructional complement of 37 (as of the 2015-16 school year), not very far from what our database maintains. But at the same time many of the 1-student schools are accompanied by FT_TEACHER values of 1 or 0 as well, and these microscopic demographics demand scrutiny. The web site for Bald Rock Community Day school in Berry Creek, California, for example, reveals no enrolment/teacher information at all.

What to do, then? It seems to me that any school disclosing a negative or zero enrollment – and sorting the ENROLLMENT field highest-to-lowest will jam all of these to the bottom of the data set – should be disowned from the data set via our standard interpolation of a blank row atop row 97407, where the first zero figure sits. We’ve thus preserved these curious entries for subsequent use should their other fields prove material.

And all that begs the larger question tramping, or trampling, through the data: How much time, effort, and money should properly be outlaid in order to support the vetting of 100,000 records? Multiple answers could be proposed, but there’s a follow-on question, too: In light of the issues encountered above, should the data in the public schools workbook be analysed at all?

Well, if we’ve come this far, why not?

NYC School Attendance Data, Part 2: What I’ve Learned

23 Aug

Once we’ve decided we’re pleased with the New York City school attendance data in their current, emended state (per last week’s post), we can move on to ask some obvious but edifying questions about what we’ve found.

First, a breakout of attendance percentages by day of the week is something we – and certainly Board of Education officials – will want to see. In that connection, we again need to decide if we want to break out the attendance percentages arrayed in the %_OF_ATTD_TAKEN field, and/or the numbers we derived with our ActualTotals calculated field, the latter according numerical parity to each and every student; and as such, it seems to me that ActualTotals is fitter for purpose here (of course we could deploy both fields, but let me err on the side of presentational tidiness here).

But in the course of tooling through and sorting the data by the above %_OF_ATTD_TAKEN, I met up with a few additional complications. Sort that field Smallest to Largest, and you’ll have gathered a large number of records reporting days on which absolutely no students attended their respective schools – 7,245 to be exact; and while an accounting for these nullities can’t be developed directly from the dataset, we could be facing an instance of mass, errant data entry, and/or evidence of a requirement to furnish a daily record for a day on which classes weren’t held. And in fact, over 14,000 records attest to attendance levels beneath 50% on their days, and I don’t know what that means either. It all justifies a concerted look.
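(That zero-day census is a one-liner, incidentally – assuming %_OF_ATTD_TAKEN occupies, say, column E, a guess on my part:

=COUNTIF(E:E,0)

should return the 7,245.)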

But in the interests of drawing a line somewhere, let’s sort %_OF_ATTD_TAKEN Largest to Smallest and split the data above row 513796 – the first to bear a 0 attendance percentage – with a blank row, thus preserving an operative, remaining dataset of 513974 records. But still, I’d submit that more thinking needs to be done about the low-attendance data.

Returning now to our day-of-the-week concerns, the pivot table that follows is rather straightforward:

Rows: Day

Values: ActualTotals

I get:

attend1

(Note that you’d likely want to rename that default Sum of ActualTotals header, because the calculated field formula itself comprises an average, in effect – NumberStudents/REGISTER*100. You’ll also want to know that calculated fields gray out the Summarize Values option, and thus invariably and only sum their data. Remember also that 2 signifies Monday.)

I for one was surprised by the near-constancy of the above figures. I would have assumed that the centripetal pull of the fringes of the week – Monday and Friday – would have induced a cohort of no-shows larger than the ones we see, though attendance indeed slinks back a bit on those two days. But near-constancy does feature strikingly in the middle of the week.

And what of comparative attendance rates by borough? Remember we manufactured a Borough field last week, and so:

Rows: Borough

Values: ActualTotals

I get:

attend2

By way of reorientation, those initials point to these boroughs:

K – Brooklyn

M – Manhattan

Q – Queens

R – Richmond (Staten Island)

X – The Bronx

The disparities here are instructive. Queens students are literally the most attentive, with Bronx students the least. Of course, these outcomes call for a close drilldown into the contributory values – e.g., economic class, ethnicity, and more – that can’t be performed here.

We can next try to learn something about attendance rates by month, understanding that the data encompass two school years. Try

Rows: Date (grouped by Months only)

Values: ActualTotals

I get:

attend3

The school year of course commences in September, with those early days perhaps instilling a nascent, if impermanent, ardor for heading to class. We see that attendance peaks in October, and begins to incline toward the overall average in December.

The question needs to be asked about June, or Junes, in which the attendance aggregate crashes to 85.21%, deteriorating 4.69% from the preceding May(s). While explanations do not volunteer themselves from the data, an obvious surmise rises to the surface – namely, that students beholding the year’s finish line, and perhaps having completed all material schoolwork and exams, may have decided to grab a few discretionary absences here and there. It’s been known to happen.

But let’s get back to those zero-attendance days and their polar opposite, days on which every student on a school’s roster appeared, or at least was there at 4 pm. The data show 1,641 records in which each and every enrollee in the referenced institution was there, a count that includes 31 days’ worth of school code 02M475, i.e. Manhattan’s renowned Stuyvesant High School – a pretty extraordinary feat, in view of the school’s complement of around 3,300. And while we’re distributing kudos, mark down September 21, 2016, the day on which all 3,965 of Staten Island’s Tottenville High School registrants showed up, and June 24 of that year – a Friday, no less – on which the entire, 3,806-strong enrollment of Queens’ Forest Hills High School settled into their home rooms. But ok; you could insist that these laudable numbers should likewise be subjected to a round or two of scrutiny, and you’d probably be right.
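
(And for the skeptics: a COUNTIFS can corroborate the Stuyvesant count – assuming, as before, hypothetical column positions of B for SCHOOL_DBN and F for %_OF_ATTD_TAKEN, =COUNTIFS(B:B,"02M475",F:F,100) should return those 31 perfect days.)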

Now for a bit of granularity, we could simply calculate the average daily attendance rate for each school and sort the results, and at the same time get some sense of whether attendance correlates with school size. It could look something like this:

Rows: SCHOOL_DBN

Values: REGISTER (Average, to two decimals)

%_OF_ATTD_TAKEN (Average, to two decimals)

Remember first of all that a given school’s enrollment is by no means constant, swinging slightly both within and across school years. Remember as well that because you can’t average calculated field totals, I’ve reverted to the %_OF_ATTD_TAKEN field that’s native to the data set.
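
(An AVERAGEIF makes for a handy spot-check of any one school’s figure outside the pivot table – again assuming the hypothetical placements of SCHOOL_DBN in column B and %_OF_ATTD_TAKEN in column F, =AVERAGEIF(B:B,"02M475",F:F) should reproduce Stuyvesant’s pivot-table reading.)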

Sort by %_OF_ATTD_TAKEN from Largest to Smallest and I get, in excerpt:

attend4

That’s the Christa McAuliffe School (named after the teacher-astronaut who died in the Challenger disaster) in Bensonhurst, Brooklyn on the valedictorian’s podium, followed very closely by the Vincent D. Grippo School (physically close to the McAuliffe school, too). And if you’re wondering, I find Stuyvesant High School in the 907th position, with its daily attendance average put at 90.87. Tottenville High, interestingly enough, pulls in at 1,155 out of the 1,590 schools, its average figuring to a below-average 87.43. At the low end, the Research and Service High School in Brooklyn’s Crown Heights reports a disturbing 41.84% reading, topped only slightly by the Bronx’s Crotona Academy High School (remember that you can learn more about each school by entering its code in Google). These readings merit a more determined look, too.

And as for that correlation between school size and attendance: because my pivot table’s data begin on row 4, I could enter, somewhere:

=CORREL(B4:B1593,C4:C1593)

I get .170, a small positive association between the parameters. That is, with increased school size comes a trifling increment in attendance rates.

But if you want to learn more about correlations you’ll have to come to class.

Campaign Contribution Data, Part 3: More New York and National Numbers

11 Oct

Just one, or perhaps 390,000, more things about the New York Presidential contribution data: might not the analysis be buoyed by a counting of monies apportioned by contributors’ occupations? It might; start with

Row: contbr_occupation

But stop there and you’ll already understand. Prepare to gaze upon 12,355 uniquely labelled vocation descriptions – but it isn’t their formidable heft that’s the rub. What’s adulterating the mix – again – is the manner(s) of inconsistencies among the labels, and how they impede the prospects for any follow-through across the data. What, for example, are we to do with these cognate employments:

campa1

Of course we’ve beheld this complication before, but the problem set here is perhaps slightly different. For one thing, the occupations have in all likelihood been furnished by the position holders themselves, and so we should properly withhold “blame” from the spreadsheet compiler. And with self-description comes classificatory idiosyncrasy, the bane of the grouping enterprise. Thus what editorial decisions are we to make about these kindred callings?

campa2

Does the analyst subject all of the above to a unifying, equivalent proxy occupation? There are 12,000 of those determinations to make here, and you’ll have to decide in turn if you have the time and/or money to devote to the challenge. The answer might be no.
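
One partial, labour-intensive strategy – and the sheet and column references here are hypothetical, not features of the actual workbook – is to build a two-column lookup table on a sheet called, say, Map, with the variant spellings in its column A and your chosen proxy occupations in column B, and then stamp a standardized field onto the data with

=IFERROR(VLOOKUP(G2,Map!$A$2:$B$1000,2,FALSE),G2)

assuming the reported occupations occupy column G; labels you haven’t gotten around to mapping simply pass through untouched.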

In any event, if you’re enjoying the confusion and the complexity and want to go national, pop open the sheet detailing overall top fundraising periods in election years (it’s the very last Excel link on the page). The compendium, recording candidate receipts and contributions (the former is a superset of all monies including public funding, and hence is very often the greater sum; in 29 cases, however, contributions exceed receipts, often by a very small differential – see the notes on rows 3664-5) for presidential contests tracking back to 1980 and including the current tilt, comprises an assortment of parameters that seem designed to inspire unease. Again, throw the Presidential Candidate field into a pivot table Rows area and survey the ensuing list. How about

campa3

(Yes, the two are one and the same.)

Or more famously:

campa4

The point won’t be belaboured further (note, on the other hand, that the Bush, George and Bush, George W entries indeed signify father and son, and are meaningfully distinct).

Then replace Presidential Candidate with Report Type:

campa5

First, I don’t know what all those markers mean – e.g. 10P and 12C. Second, you do know what the redundant M6, M7, and M8s mean – uninvited spaces that have to go (the M data’s less-than-fully-numerical sort order follows from the fact, of course, that they aren’t numeric at all; the M demotes these entries to label status, and if you want M10 through M12 to fall in behind M9 rather than M1, you’ll need to work around that text-driven sort).
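
Those uninvited spaces, at least, surrender to a helper column of =TRIM(C2) – assuming, hypothetically, that Report Type occupies column C – which collapses the redundant M6, M7, and M8 entries back into single items; and padding the single-digit months (M03 rather than M3, via a find-and-replace or a small formula) is one workaround for the errant sort order.
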
Note as well that the M, or month, items advance their chronological references by an increment of one. That is, M3 reports contributions for February, and M12 counts monies for November. December data, then, are anomalously assigned YE, which I take to denote year end – but in fact if you filter all the data set entries for the YE’s, numerous pre-December 1 Coverage Start Dates loom, along with some Coverage End Dates that pull into January. Anyone have the FEC phone number?

Third, I’m disquieted by the potential apples-and-oranges mishmash of monthly and quarterly contributions that spill across the entries. Turn to the data and autofilter the receipt data for, say, William Jefferson Clinton (you remember him). You’ll see, in excerpt:

campa6

On the other hand, you’ll learn that the above receipts emanate from different Clinton election committees, and so if the quarterly and monthly numbers properly and strictly line up with different organizational entities – that is, if the totals can be deemed mutually exclusive, and we’re likewise satisfied that we’ve resolved any name discrepancies – then we should at least be able to aggregate a candidate’s intake by year:

Rows: Report Year

Values: Total Receipts

Slicer: Presidential Candidate

Thus Hillary Clinton’s numbers read:

campa7
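
By way of a formulaic cross-check – and these column letters are hypothetical stand-ins; swap in the sheet’s actual ones – a SUMIFS could corroborate any of those yearly figures, e.g.

=SUMIFS(G:G,B:B,"Clinton, Hillary Rodham",D:D,2016)

where G holds Total Receipts, B the Presidential Candidate names (as they’re actually rendered in the sheet), and D the Report Year.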

Remember of course the 2016 totals come to rest at August 31.

If you want to play it safe, you could at least globally array receipts by election year:

Rows: Report Year

Values: Total Receipts

I get:

campa8

(It appears that all the dates are authentically numeric, by the way.) Remember that the 2016 total stops at August 31 (really September 20 for one failed candidate, Lindsey Graham); thus with two-thirds of the compiled year having elapsed, the current total projects linearly to around 1.044 billion dollars, a small downsizing from the two previous presidential years – though it may be reasonable to suppose that a late-in-the-game contribution rush could hoist the final numbers above and beyond. The significant contraction of the 2012 sum from 2008’s aggregate warrants study, however; in any case 2008 seems to have incited a quantum leap in campaign inflow. And as for those small non-Presidential-year 1990 and 1998 totals, they were attributable to hopefuls Willie Carter (something of a perennial candidate, apparently) and Frank Starr, respectively, and about them I know nothing more – nor am I certain about the reasons for those years’ inclusion. A sidebar awaits.
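
(That projection, for the record, is nothing grander than a pro-rating of the year-to-date figure: assuming the 2016 sum lands in, say, cell B12 of the pivot, =B12*12/8 scales eight months of receipts up to a full year.)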

And if you were wondering, Santa Claus raised $180.00 in 2012.

And I’m on vacation now; I just can’t find WordPress’s out-of-office reply.