Post45 HathiTrust Dataset Exploration Notes
Part 1: Spreadsheet Woes
I will begin this section by explicating the information provided in the relevant links below (specifically, the Post45 HathiTrust Database) and the rationale motivating the presentation of the two pivot tables.
Relevant Links:
Post45 HathiTrust Database
Pivot Table 3: This pivot table graphically displays the frequency and usage of “War” in the titles of texts released after 1945. The sheet and accompanying graph use the inferreddate metadata tab (the earliest date of publication for a given text) and the title metadata tab (the list of all entry titles in the HathiTrust database) to track the frequency of the keyword “War” in titles between 1945 and 2000.
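The counting logic behind Pivot Table 3 can be sketched in plain Python. The sample rows below are invented stand-ins for an export of the inferreddate and title tabs (a real export would have many more rows and columns), and the word-boundary match is my assumption about how the keyword filter should behave:

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample rows standing in for a CSV export of the
# Post45 HathiTrust metadata (inferreddate and title tabs).
sample_csv = """inferreddate,title
1948,The War Within
1948,Postwar Poetics
1962,A Season of War
1962,Quiet Mornings
1985,War and After
"""

# \b keeps "Postwar" from counting as a hit for "War".
pattern = re.compile(r"\bwar\b", re.IGNORECASE)

war_counts = Counter()
for row in csv.DictReader(io.StringIO(sample_csv)):
    year = int(row["inferreddate"])
    if 1945 <= year <= 2000 and pattern.search(row["title"]):
        war_counts[year] += 1

print(sorted(war_counts.items()))  # [(1948, 1), (1962, 1), (1985, 1)]
```

Whether a title like “Postwar Poetics” should count is exactly the kind of filtering decision a spreadsheet pivot table makes silently; spelling it out in code makes the choice explicit.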
Pivot Table 4: This pivot table graphically displays the place of publication for science fiction texts in the HathiTrust database between 1945 and 2000. Given the limited quantity of science fiction texts within the HathiTrust database, the largest concentrations of science fiction publishing appear to be located in New York and England. Although the data does not explain why this is the case, the graphic is a useful starting point for further exploration of the science fiction publishing scene in New York and England. As a note, the place metadata tab is organized and coded using the standardized MARC code list.
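Decoding the place tab amounts to a lookup against the MARC code list followed by a tally. The three codes below are a tiny hand-copied subset of the Library of Congress list, and the record values are invented for illustration:

```python
from collections import Counter

# A tiny subset of the MARC code list; the full list is published
# by the Library of Congress.
marc_places = {
    "nyu": "New York (State)",
    "enk": "England",
    "cau": "California",
}

# Hypothetical 'place' values drawn from science fiction records.
records = ["nyu", "enk", "nyu", "cau", "nyu", "enk"]

# Unknown codes fall through unchanged rather than raising an error.
place_counts = Counter(marc_places.get(code, code) for code in records)
print(place_counts.most_common())
# [('New York (State)', 3), ('England', 2), ('California', 1)]
```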
Part 2: Observations
The HathiTrust database is comprehensive and meticulously designed. Yet this meticulous design philosophy produces a rigidity in scope that limits researchers’ ability to perform batch analyses across large swathes of data. For example, in constructing Pivot Table 4 I sought to focus on the science fiction genre, but the HathiTrust database divides the genre “Science Fiction” into multiple sub-genres. While I am not opposed to the construction of sub-genres, these sub-genres should still be folded into the broader category of Science Fiction. Within the dataset, all genres, subjects, and sub-genres are identified at the same level, so in large-scale data analyses there is no way to subsume all sub-genres under a root genre. This produces a confusing and difficult situation in which researchers are unable to visualize large segments of the dataset in a uniform manner. In Pivot Table 4, one may notice that the rows are organized around subject and place. Examples of subjects include: Science Fiction, Science Fiction & Fantasy, American Science Fiction, 20th Century American Science Fiction, and so on. The titles that appear under “20th Century American Science Fiction” do not appear under the subject heading “American Science Fiction”. This produces a batching problem whereby the data becomes too segmented. A potential solution lies in the creation of a sub-genre metadata tab.
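The proposed fix can be sketched as a normalization pass: map each fine-grained subject string onto a root genre before pivoting. The subject strings are the examples named above; the substring-matching rule is my assumption, not an existing HathiTrust feature:

```python
# Subject strings taken from the Pivot Table 4 discussion above.
subjects = [
    "Science fiction",
    "Science fiction & fantasy",
    "American science fiction",
    "20th century American science fiction",
    "Dystopias",
]

def root_genre(subject: str) -> str:
    """Fold any subject mentioning 'science fiction' into one root genre.
    This rule is a sketch; a real fix would use a curated mapping table."""
    if "science fiction" in subject.lower():
        return "Science fiction"
    return subject

roots = [root_genre(s) for s in subjects]
print(roots)
# ['Science fiction', 'Science fiction', 'Science fiction',
#  'Science fiction', 'Dystopias']
```

With a pass like this, all four science fiction variants would batch together in a pivot table instead of appearing as separate rows.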
Further, the distinction between Subject and Genre is equally ambiguous and requires further clarification by the database organizers. An example: Carlton Mellick’s 2008 novel The Egg Man is labeled under the genre metadata tab as “Novel” and under the subjects tab as “Horror tales, American/Dystopias/Science fiction, American”. What “Novel” means as a genre remains unclear, but let us continue. Klas Östergren’s 2009 novel The Hurricane Party is, similarly, labeled under the genres category as “Novel/Dystopias” and under the subjects line as “Fathers and sons/Gods, Norse”. The question is, then: what constitutes a genre and what constitutes a subject under the parameters of the HathiTrust database? “Dystopias” appears to float freely between subject and genre, and “Novel” does not appear to be a uniform category applied to all novels. Why the database is constructed in this manner is beyond my understanding, but it produces immediate problems for researchers looking to work through a coherent compilation of data and metadata.
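The inconsistency is easy to demonstrate mechanically. The genre and subject strings below are copied from the two records discussed above; splitting on the slash delimiter is my assumption about how these compound fields are meant to be read:

```python
# Genre and subject strings copied from the two records above.
records = {
    "The Egg Man": {
        "genre": "Novel",
        "subjects": "Horror tales, American/Dystopias/Science fiction, American",
    },
    "The Hurricane Party": {
        "genre": "Novel/Dystopias",
        "subjects": "Fathers and sons/Gods, Norse",
    },
}

for title, meta in records.items():
    # Slash-delimited fields appear to pack several labels into one cell.
    genres = meta["genre"].split("/")
    subjects = meta["subjects"].split("/")
    print(title, genres, subjects)

# "Dystopias" appears as a subject for one record and a genre for the
# other -- the categorical drift described above, made visible.
```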
Supplement: After attending Office Hours with Professor Thomas, I learned some additional information about the HathiTrust database that, while not excusing the legitimate categorical faults with the system, contextualizes the existence of these discrepancies. The database is an amalgamation of a variety of library categorization systems. This non-uniform amalgam of different systems lacks categorical coherence precisely because there is no categorical coherence by design. It is, as Professor Thomas explained, a starting point for additional research. It is not the place to find answers to quasi-complex questions (such as those relations and intersections explored above in Pivot Table 3 and Pivot Table 4).
Part 3: Reflections
In “Markup Bodies: Black [Life] Studies and Slavery [Death] Studies at the Digital Crossroads,” Jessica Marie Johnson writes, “Data is the evidence of terror, and the idea of data as fundamental and objective… obscures rather than reveals the scene of the crime” (70). I argue that the Trans-Atlantic Slave Trade database, while operating as an exceptional research tool, is itself implicated by design in the obfuscation of the Trans-Atlantic Slave Trade as “the scene of [a] crime” (70).
The Trans-Atlantic Slave Trade database is an incredible feat of archival and historical research. Data is categorized meticulously. Metadata fields track ports of purchase, year of arrival, year of disembarkation, the captains of a given voyage, and so forth. The database offers a wealth of data, yet the horror, “the evidence of terror,” is not adequately translated through the visual form of rigid, clearly defined textual fields and spreadsheets. Undoubtedly, the information within the database provides researchers with the broad navigational history and temporal scope of the Trans-Atlantic slave trade. This is, as mentioned, a monumental achievement. Researchers are capable of charting, in excellent detail, the movements of specific voyages and crews. Yet within this scope, the legacy of generational and bodily violence and of material pain is nullified. Take, for example, this statistic: “Percentage of slaves embarked who died during voyage.” The data is presented as a raw numerical value: 12.2%. While this affords a certain numeric god’s-eye view of the situation, the figure can and should be supplemented by sensitive experiential accounts of death on these vessels. The brutality of these deaths (sickness, viral infection, murder, etc.) sinks beneath the numeric facticity. As Johnson states, “the idea of data as fundamental and objective… obscures rather than reveals the scene of the crime” (70). Johnson advocates, in part, a shift in our relation to this kind of statistical and archival documentation. The bluntness of its numbers hides the scene of a horrific crime.
The Trans-Atlantic Slave Trade database does provide a number of essays designed to supplement and contextualize one’s encounter with the database. Yet there is a lack of what Catherine D’Ignazio and Lauren F. Klein, in Chapter 3 of their book Data Feminism, call visceralization. In other words, the database does not implicate and trouble the body and affective sensibilities of the observer. A researcher or student is capable of maintaining a neutral, strictly observational relationship to the information in question. We must remember, though, that this “neutral” relation is the product of a vast network of framing effects that color the database itself. One is able, in this way, to see numbers and not people. This position is not unlike that of the ship captains who saw the violently displaced enslaved people aboard their ships as cargo. The effect in question can be troubled through visceralization procedures that directly politicize the neutral, observational stance as itself implicated in the “people as cargo” position.