[I'm not entirely sure why I'm turning this into a post. It's essentially my final project for my Advanced GIS class. I think it's rather provocative, however, and it shows a few immediate possible further directions for analysis.]
In my earlier geospatial analysis of the U.S.A. trilogy by John Dos Passos, I was left unsatisfied with my initial findings regarding the distribution of some data points within the United States.[1] This project attempts to refine the analysis, so that I may see if I can make bolder statements about how events in the text are distributed. To do this, I extended the “Average Nearest Neighbor” analysis built into ArcGIS and ran a Monte Carlo simulation with two different variables to see if I could get an expected value that was closer to the observed value that I get when running the analysis on my data.
The U.S.A. trilogy, published between 1930 and 1936, is astonishingly dense with geographical information. In addition to tracking the movements of 12 different characters across four different continents, the novels include news from around the world in the “Newsreel” sections, which are made up of pasted-together snippets of newspapers from the first 30 years of the 20th Century, as well as biographical portraits of 27 different influential Americans, which also include spatial data. A simple random sample of 30 pages yielded a 95% confidence interval of 2.7–6.6 geocodable observations per page.
For my analysis, however, I decided to focus on three thematically important sections: the “U. S. A.” section, which opens the trilogy in the form of a preface to the first novel, The 42nd Parallel; “The Body of an American,” which closes the second novel, 1919; and “Vag,” which closes the final novel of the trilogy, The Big Money. These three sections, though they run to fewer than ten pages in total, embody Dos Passos’s desire to create a work that captures the totality of the United States, and they include over 70 geocodable observations, of which 50 fall within the United States of the first three decades of the 20th Century (the time of the trilogy). In fact, he ends the first section with a paragraph of statements about what the U.S. is, beginning with “U. S. A. is the slice of a continent.” Dos Passos, however, moves from the spatial to the linguistic by closing his list, and the section, with “But mostly U. S. A. is the speech of the people”:
U. S. A. is the slice of a continent. U. S. A. is a group of holding companies, some aggregations of trade unions, a set of laws bound in calf, a radio network, a chain of moving picture theatres, a column of stockquotations rubbed out and written in by a Western Union boy on a blackboard, a publiclibrary full of old newspapers and dogeared historybooks with protests scrawled on the margins in pencil. U. S. A. is the world’s greatest rivervalley fringed with mountains and hills, U. S. A. is a set of bigmouthed officials with too many bankaccounts. U. S. A. is a lot of men buried in their uniforms in Arlington Cemetery. U. S. A. is the letters at the end of an address when you are away from home. But mostly U. S. A. is the speech of the people.
An initial glance at the distribution on the map of the U.S. suggests that the points are dispersed rather widely. There seems to be some clustering near New York City, but otherwise the data is rather spread out. Zooming in on Manhattan, however, we can count three observations there, including two practically on top of each other—on opposite sides of Stuyvesant Square.
A simple average nearest neighbor analysis of this data, however, gives a nearest neighbor ratio of 0.58 (114,000 m/196,000 m), for a Z score of -5.64. Despite what our eyes may suggest, this distribution shows intense clustering, with a vanishingly small p-value. The usual caveats about nearest neighbor analysis apply, most notably edge effects.[2] Many of the points fall into metropolitan centers, which often sit on the edge of the U.S., and this points to a different problem.
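For readers who want to see what the tool is computing, here is a minimal sketch of the average nearest neighbor statistic in plain Python. It assumes the standard formulas for n points in a study area A (expected mean distance 0.5/sqrt(n/A), standard error 0.26136/sqrt(n²/A)), planar coordinates, and a brute-force neighbor search. The function names are my own for illustration, and ArcGIS's implementation may differ, particularly in how it derives the study area.

```python
import math

def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point (brute force)."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)

def average_nearest_neighbor(points, area):
    """Return (observed, expected, ratio, z_score) for planar points in a study area."""
    n = len(points)
    observed = mean_nearest_neighbor_distance(points)
    # Expected mean distance for a random pattern of n points in the area.
    expected = 0.5 / math.sqrt(n / area)
    # Standard error of the mean nearest neighbor distance under randomness.
    std_error = 0.26136 / math.sqrt(n * n / area)
    ratio = observed / expected
    z_score = (observed - expected) / std_error
    return observed, expected, ratio, z_score

# Toy example (not the trilogy data): four corners of a unit square,
# evaluated against an assumed study area of 4 square units.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
observed, expected, ratio, z_score = average_nearest_neighbor(corners, area=4.0)
```

A ratio below 1 with a negative Z score indicates clustering, which is what the 0.58 and -5.64 above report; the toy square, by contrast, comes out dispersed (ratio above 1, positive Z score).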
The average nearest neighbor analysis compares the results of the dataset to a randomly distributed set of points, from which it calculates an “Expected mean distance.” But considering that my points correspond to Americans, who are not randomly distributed within the space of the U.S. (they tend to cluster in metropolitan areas and, especially in 1920, in the northeast), it makes sense that my results should show intense clustering. Might there be a way, then, I wondered, for me to change the expected mean distance and then measure my results against that?
I decided the answer would lie in weighting the distribution of the random points. If I could weight locations by their population in 1920, they would attract more points and create mini-clusters around New York City, Chicago, Philadelphia, and Los Angeles (cities in the largest counties of the time).[3] If I then ran this simulation a thousand times, it would create a dataset of expected mean distances that are more human-aware.
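The weighting idea can be sketched in a few lines of Python. This is an illustration under loud assumptions, not the script I ran: the three regions, their weights, and their rectangular extents are invented, rectangles stand in for real state or county polygons, and the standard library's `random.choices` stands in for the weighted-choice routine and for ArcGIS's random point placement.

```python
import math
import random

# Hypothetical regions: (name, weight, (xmin, ymin, xmax, ymax)).
# Weights stand in for 1920 populations; rectangles stand in for real polygons.
REGIONS = [
    ("A", 10.0, (0.0, 0.0, 1.0, 1.0)),
    ("B", 5.0, (1.0, 0.0, 3.0, 1.0)),
    ("C", 1.0, (0.0, 1.0, 3.0, 3.0)),
]

def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point (brute force)."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)

def simulate_once(regions, n_points, rng):
    """Pick a region n_points times by weight, then drop one random point per pick."""
    weights = [weight for _, weight, _ in regions]
    chosen = rng.choices(regions, weights=weights, k=n_points)
    points = [
        (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        for _, _, (xmin, ymin, xmax, ymax) in chosen
    ]
    return mean_nearest_neighbor_distance(points)

def run_simulation(regions, n_points=50, iterations=1000, seed=0):
    """Repeat the weighted toss and collect the mean nearest neighbor distances."""
    rng = random.Random(seed)
    return [simulate_once(regions, n_points, rng) for _ in range(iterations)]

# A short run; the summary statistics play the role of the simulated
# mean and standard deviation of the observed distance.
results = run_simulation(REGIONS, n_points=50, iterations=200)
mean_distance = sum(results) / len(results)
std_dev = math.sqrt(
    sum((r - mean_distance) ** 2 for r in results) / (len(results) - 1)
)
```

Heavily weighted regions attract more of the 50 points, so the simulated mean nearest neighbor distance shrinks relative to a uniform toss over the whole area, which is the effect I was after.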
Having downloaded both state-level and county-level 1920 U.S. Census data from NHGIS, I started my simulations at the state level. I gave each state a weight, its “Total White Population.” Using these weights, I let the script pick a state 50 times, to correspond to the number of observations from the trilogy. Then I used ArcGIS’s “Create Random Points” function to place a point at random within the chosen state for each of the 50 picks. I then ran the average nearest neighbor analysis and recorded the observed mean distance, the expected mean distance, and the Z score. I repeated the simulation 1000 times. The results:
| | Observed values (m) | Expected values (m) | Z score |
|---|---|---|---|
| Mean | 182,000 | 196,000 | -0.940 |
| Std. Dev. (fraction of mean) | 19,700 (0.108) | 15,300 (0.078) | 1.39 |
| Median | 183,000 | 198,100 | -0.956 |
As we can see, on average, the points veered toward the clustered, but not significantly so. Perhaps, I reasoned, the states were simply too large an area in which to toss my random points. After all, most of my points from the U.S.A. trilogy are specific at the city level, so perhaps I should use smaller chunks of the U.S. population and toss my points at the county level. This time I used the Census’s “Total Population” data field and ran the simulation another 1000 times:
| | Observed values (dd) | Expected values (dd) | Z score |
|---|---|---|---|
| Mean | 1.70 | 1.99 | -1.91 |
| Std. Dev. (fraction of mean) | 0.214 (0.126) | 0.178 (0.089) | 1.49 |
| Median | 1.70 | 2.01 | -1.95 |
Using county-level population data, the clustering was much more intense, but still not significant at the 95% level. On average, the simulation gave a Z score of -1.91, just missing the -1.96 threshold for 95% significance. Compare this result, again, to the -5.64 I received for the novel’s data. Also of note, however, are the larger standard deviations (as a fraction of the mean) at the county level, though I am not certain why that might be.
What these simulations give, however, are two sets of new expected values that I can compare to the data from U.S.A. That is, the “Observed values” above now become the “Expected values,” and I can calculate new Z scores for my trilogy’s data, based on these simulations:
| Scale | Obs. U.S.A. Value | Exp. Value | Std. Dev. | New Z score |
|---|---|---|---|---|
| State-level | 114,000 m | 182,000 m | 19,700 | -3.45 |
| County-level | 1.14 dd | 1.70 dd | 0.214 | -2.616 |
The results are clear, although not astonishing: the trilogy’s points remain clustered, even in comparison to random points distributed with population weights, but they are far less clustered than the unweighted analysis suggested. More notably, the comparison against the county-level simulation moves the clustering much closer to the edge of statistical significance.
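The new Z scores are simply the trilogy's observed mean distance measured against the mean and standard deviation of the simulated distances. A small sketch, using the rounded figures reported in the table above (so the last digit may differ slightly from values computed on unrounded data); the function name is my own:

```python
def simulated_z_score(observed, simulated_mean, simulated_std):
    """Z score of an observed value against a simulated distribution."""
    return (observed - simulated_mean) / simulated_std

# State level, in meters.
state_z = simulated_z_score(114_000, 182_000, 19_700)
# County level, in decimal degrees.
county_z = simulated_z_score(1.14, 1.70, 0.214)

print(round(state_z, 2))   # -3.45
print(round(county_z, 2))  # -2.62
```

This treats the simulated distances as roughly normal, an assumption I have not tested.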
Yet these results are still frustrating. A return to the novel’s data might prove useful here. Of the 50 points, three reference the Arlington National Cemetery, which means there are three points whose nearest neighbors are a distance of 0 away. That is intensely unlikely, even in a weighted distribution. In all, eight of the 50 points are duplicates of some sort, which means that they, too, have nearest neighbors at a distance of 0. If I remove those from the stack and run the nearest neighbor analysis again with only 42 points, I get an observed value of 1.40 dd, which generates, using the county-level values in the table above, a Z score of -1.40. In other words, with the duplicates removed, the points still trend toward the clustered, but they do not do so significantly at the 95% level. If I go one step further and remove one of the two points that refer to places within Stuyvesant Square in New York City, the Z score falls to -1.21, and the points look ever more randomly distributed. Of course, there is a tradeoff to intentionally removing the duplicates: Dos Passos as a novelist consciously and repeatedly referred to the Arlington National Cemetery (it is the focus of “The Body of an American”). Still, it is interesting that he managed to create a distribution of points in these three sections that the computer suspects, when weighting for population distribution within the U.S., is randomly distributed.
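Removing the duplicates amounts to keeping only the first point at each coordinate. A minimal sketch of that step, with the caveat that the coordinates below are rough illustrative values rather than my geocoded data, and that the rounding precision used to decide when two points coincide is an assumption:

```python
def drop_duplicate_points(points, precision=6):
    """Keep the first occurrence of each coordinate pair, compared after rounding."""
    seen = set()
    unique = []
    for x, y in points:
        key = (round(x, precision), round(y, precision))
        if key not in seen:
            seen.add(key)
            unique.append((x, y))
    return unique

# Illustrative (longitude, latitude) pairs: one location mentioned three
# times, plus two other places mentioned once each.
sample = [
    (-77.07, 38.88),
    (-77.07, 38.88),
    (-77.07, 38.88),
    (-73.98, 40.73),
    (-87.63, 41.88),
]
unique_points = drop_duplicate_points(sample)
print(len(sample), len(unique_points))  # 5 3
```

Each coincident point contributes a nearest neighbor distance of 0 to the average, so dropping the repeats raises the observed mean distance.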
There’s much more I could do with this analysis, including running the Create Random Points command with the matrix set to toss a number of points within each state and county that correspond to the states and counties mentioned (so each mention of the Cemetery would be scattered across all of Virginia, for example). I could then measure those results against the simulations. Furthermore, I could investigate the distributions created by my simulations and develop new scores for them that take into account the nature of their distributions (I did no tests to see how normal the distributions are). My suspicion is that this would only bring the Z score ever closer to 0, although I am not yet certain what the cost of that sort of analysis could be.
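One way to score the trilogy's value without assuming the simulated distances are normally distributed would be an empirical, rank-based p-value: the share of simulated mean distances that are at least as small as the observed one. The sketch below is a suggestion rather than something I ran; the function name and the toy numbers are invented for illustration.

```python
def empirical_p_value(observed, simulated):
    """One-sided p-value: share of simulated values at least as small as observed.

    The +1 in numerator and denominator counts the observed value itself,
    so the result is never exactly zero.
    """
    at_least_as_clustered = sum(1 for value in simulated if value <= observed)
    return (at_least_as_clustered + 1) / (len(simulated) + 1)

# Toy simulated mean distances (decimal degrees), not real results.
toy_simulated = [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 1.6, 1.9, 1.4]
p = empirical_p_value(1.3, toy_simulated)
print(p)  # 0.2
```

With 1000 iterations the smallest p-value this can report is 1/1001, so it is coarse in the tails but makes no claim about the shape of the distribution.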
Alternatively, I could sample my 1000 iterations and run them through a more sophisticated bit of clustering software, like CrimeStat III. This would give me a sense of where the clusters of people are in the US (preliminary guess: NYC, Pennsylvania, New England, and the Rust Belt). I could then compare those second- and third-order clusters with the clusters I find with my data and see if there are deviations there (like, perhaps, the DC area is over-represented).
1. I showed that the points in question were more widely distributed than the population of the U.S., suggesting that Dos Passos encompassed a larger swath of the U.S. than one would expect had he picked places to mention in proportion to the distribution of people within the U.S. in 1920.
2. Later in the analysis, I’ll address an even more remarkable effect, that of points on top of each other.
3. The script I used for weighted random choice I found on this Python discussion. I used David’s “compiling version of 2.”
Tags: digital humanities, GIS, John Dos Passos, monte carlo, nearest neighbor, statistics