Lab Notebook 5
Questions 1-5
- What do you notice about how this function has split the string “Okay, okay, ladies, now let’s get in formation, cause I slay”? What has it done that isn’t quite right, and why has it done this? Write down your response in your notes document.
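- This regular expression matches runs of non-word characters and splits the string on them. In other words, \W matches anything that is not a letter, digit, or underscore (spaces, punctuation, and so on; digits count as word characters, so numbers are kept rather than split on), and the + tells the pattern to match one or more such characters in a row, so a comma followed by a space is treated as a single split point.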
- This is also why the function splits “let’s” into “let” and “s”: it does not treat “let’s” as a single word or a contraction, because the apostrophe is a non-word character and so becomes a split point.
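A minimal sketch of the split described above. The lab's own function is not reproduced in these notes, so this assumes it lowercases the text and calls re.split with the \W+ pattern; the variable names are my own.

```python
import re

# The line from the question, including the curly apostrophe in "let’s".
line = "Okay, okay, ladies, now let’s get in formation, cause I slay"

# Assumption: the lab function lowercases the text, then splits on runs of
# non-word characters (anything that is not a letter, digit, or underscore).
tokens = re.split(r"\W+", line.lower())
print(tokens)

# Digits are word characters, so a number survives as its own token.
number_tokens = re.split(r"\W+", "room 101")
print(number_tokens)
```

The apostrophe is treated like any other punctuation mark, which is why “let” and “s” come out as two separate items.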
- What happened? Did it work as you expected? If not, what happened that you didn’t expect? Write down your response in your notes document.
- My any_chunk_of_text selection was the opening line of Ernest Hemingway’s The Sun Also Rises: “Robert Cohn was once middleweight boxing champion of Princeton.” The text mostly split as expected but, interestingly, the punctuation at the end of the sentence produced an extra item. The 9 words in the sentence were identified correctly, yet the final period left an empty string at the end of the list.
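- Example: [‘robert’, ‘cohn’, ‘was’, ‘once’, ‘middleweight’, ‘boxing’, ‘champion’, ‘of’, ‘princeton’, ‘’]. According to Professor Thomas, this is possibly an issue with the \W+ pattern: the final period matches \W+, and since nothing follows it, the split returns an empty string as the last item.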
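A small sketch of where the empty string comes from, with two possible ways around it. This again assumes the lab function lowercases the text and splits on \W+; the two workarounds are my own suggestions, not part of the lab script.

```python
import re

sentence = "Robert Cohn was once middleweight boxing champion of Princeton."

# Splitting on \W+ treats the final period as a delimiter. Nothing follows
# it, so the last item in the list is an empty string.
tokens = re.split(r"\W+", sentence.lower())
print(tokens)

# Workaround 1: drop empty strings after splitting.
words = [token for token in tokens if token]

# Workaround 2: match the words themselves instead of splitting on the gaps.
found = re.findall(r"\w+", sentence.lower())

print(len(words), len(found))
```

Either workaround leaves only the nine words, so word counts based on the list are not thrown off by the trailing period.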
- Describe the output of this script (the dataframe that displays after the above cell finishes running). Remember that this is the same output as the “vir-ver-counts-specific” spreadsheet in our Lab5 Google drive folder, only for just 10 texts. What is this dataframe showing us? Write down your response in your notes document.
- The script output is a table with 9 rows. In it, we see the XML file associated with each document, the title and author of the document (as well as the author’s dates of birth and death), the date when the document was written or published, and the counts of “virtu” and “vertu” in that document.
- Look at the below lines from the compare_counts_specific function above. These lines use regular expressions to do something to the value of the <date> field in an xml file (if the contents of the <date> field meet certain conditions, that is). What are these lines doing?
- The above lines, specifically the ^20, identify dates in the EEBO test data that correspond to the year in which a given text was uploaded to the database (that is, after 2000). For the question Professor Thomas was looking to answer with this script, the date the document was uploaded is not as relevant as the date when the text was written. The ^ is an anchor that matches only at the beginning of a line or string, and 20 is the text that must appear at that position, so together they match any date that begins with “20”. These lines, then, tell Python to remove all dates that begin with “20”.
Reflections
I refer to the Data Sitters Club in nearly every reflection and with good reason: their work is accessible and relatable. In certain instances we may want the more technically loaded and specifically argumentative capacity of a peer-reviewed journal article but, and I may speak only for myself here, when it comes to working with code, scripts, and algorithms, reading that kind of writing can be an especially alienating experience.
There is a moment in DSC #12 that exemplifies this alienated feeling. In the section “Sneakers, shoelaces, and code,” Quinn and Katia are messaging about terms like “package” and “library”. Katia writes, “The training I have had on this training should not be called training… more like a confusing trainwreck.” Quinn replies in a series of short messages explaining what packages and libraries are in rather plain terms. To this, Katia responds, “See, why has no one so far been able to explain this in a way that makes sense? It took you 4 text messages.” And here, I believe, lies the core of the frustration around learning how to code: there is a great deal of techno-babble (a term that is, for one, dismissive of the complex vocabulary within the coding field and, mostly, tied to science fiction stories) and not a lot of translation work. Encountering something like Python for the first time can be daunting, but many times this experience is not recognized as daunting. I appreciate the honesty of the DSC: learning to code is difficult and tedious and frustrating and it will drive you up a wall, but most times it isn’t as intense as it looks and, especially in the humanities, there are few times that our work will push us to the utter limits of computation. It simply requires a helping hand and some guiding words capable of translating what we need to do on-screen into what we are actually aiming to do.
I also appreciated Benjamin Schmidt’s piece, as it offered a salient piece of advice akin to the thrust of DSC #12: we need to learn what these tools output (the transformations they perform) and a bit of what is happening under the hood, but we don’t need to know it all. In this way, we avoid getting pushed to the limits of computational work and can continue focusing on what we love so much in the humanities: reading texts! We need to know enough to readily deploy these tools and gather workable data. I guess it’s just reassuring that some in the DH field are comfortable with knowing enough to creatively use tools to interpret data. We don’t need to know all about computers to use a computer.