
Explorations in Text Analysis

Motivations:

  • Accomplishing specific research or pedagogical goals
  • Exploring a text or set of texts
  • Simply seeing what’s out there

First, you must have some text(s):

Consider—

  • Metadata, paratexts, page numbers, end-of-line hyphens
  • Stop words (these can be period-specific)
  • Genre-specific features (such as speaker tags in drama or line numbers in poetry)
  • Outcomes of the digitization process (such as potential need for OCR cleanup)
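
Several of the items above can be handled with a few regular expressions. The following is only a minimal sketch in Python, not a complete cleanup routine: the function name clean_text is hypothetical, and it assumes page numbers sit alone on their own lines (optionally in square brackets) and that every end-of-line hyphen marks a split word. A genuinely hyphenated compound that happens to break at a line end would be merged too, so spot-check the output.

```python
import re

def clean_text(raw: str) -> str:
    """Hypothetical helper: light cleanup of a plain-text transcription."""
    # Rejoin words split by end-of-line hyphens: "gover-\nnor" -> "governor"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines holding only a page number, optionally in square brackets
    text = re.sub(r"(?m)^[ \t]*\[?\d+\]?[ \t]*$\n?", "", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "The gover-\nnor spoke.\n[12]\nHe   left."
print(clean_text(sample))
```

Poetry with numbered lines or drama with speaker tags would need patterns of their own; the same re.sub approach applies, but the expressions depend on how your particular transcription marks those features.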

Tips and next steps—

  • Regular expressions can facilitate much of this work (cheat sheets here and here)
  • You will need text editing software: for Mac users, TextWrangler is free and allows regex find-and-replace; LibreOffice is another option and will work on Linux, Windows, and Mac
  • Alan Liu’s DH resources page collects tools and tutorials for text preparation (as well as most of the other aspects of text analysis discussed here)
  • The Programming Historian has several lessons on data manipulation (as well as many other aspects of text analysis)
  • The data preparation you will need to perform will depend on what you plan to do with your texts—sometimes, it will be enough to remove metadata and page numbers; sometimes, you’ll need to perform much more advanced transformations
  • It’s a good idea to test your data throughout the preparation process, to see where additional changes might be needed and discover any issues you haven’t anticipated
  • You can also see whether there are existing datasets or corpora that will work for your needs
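
One quick way to test your data during preparation is to list the most frequent words once stop words are removed; leftover running headers, speaker tags, or OCR debris tend to surface near the top. The sketch below is an illustration under stated assumptions: the function name top_words and the tiny stop word list (which mixes modern and early modern forms) are hypothetical, and a real project would use a fuller list suited to its period.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; note the period-specific forms
STOP_WORDS = {"the", "and", "of", "to", "a", "in",
              "thee", "thou", "thy", "hath", "doth"}

def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words that are not stop words."""
    # Lowercase, then keep runs of letters and apostrophes as tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

print(top_words("The prince and the slave; the Prince hath spoken.", n=2))
```

Running this after each cleanup pass and comparing the lists is a lightweight way to see whether a change did what you expected.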

Try before you buy (in):

  • The TAPoR (Text Analysis Portal for Research) tools page will let you sort tools by the type of analysis they can provide, as well as by those that will run in a browser
  • The DiRT (Digital Research Tools) Directory is another great place to look for tools; it sorts by what you want to do with your research objects and will allow filtering for browser-based tools
  • For example, if you are interested in network visualization, Textexture is a free, web-based tool for visualizing textual networks (registration is required)
  • If you want to go further with network visualization: Gephi is also free, but requires installation and some practice; Palladio is another option—it’s web-based and can be easier to use than Gephi, but it works best with simpler visualizations
  • If you’re interested in keyword analysis, Jason Davies’s word tree tool is another free web-based option for finding keyword patterns in your text (just paste your text in); Voyant was recently updated and includes several new tools and an improved interface
  • If you want to go further with keyword-based analysis of textual corpora, try AntConc (free, installation required, tutorial here)
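
To get a feel for what keyword tools such as concordancers produce, here is a small keyword-in-context (KWIC) sketch in Python. It is not how AntConc or any of the tools above are implemented; the function name kwic and the width parameter are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """List each occurrence of keyword with `width` characters of context."""
    # Flatten line breaks so the context reads as one continuous line
    flat = " ".join(text.split())
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

for line in kwic("The king spoke.\nThen the King left the court.", "king", width=12):
    print(line)
```

Matching is case-insensitive and limited to whole words, so "king" finds "King" but not "kingdom"; whether that is the right choice depends on your research question.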

Play in a pre-constructed sandbox:

  • For keyword analysis across a large set of early modern texts: Early Modern Print (using materials from EEBO–TCP)
  • If you want to use Ngrams to analyze textual trends over time, the Bookworm tool has several pre-loaded corpora; you can create your own Bookworm corpus if you register and follow the instructions here
  • This approach to beginning your research is particularly helpful for cases when you want to see results over a large set of texts or when it’s not as easy to start with a lightweight web-based tool; for example, if you want to see what topic modeling produces, Serendip has some sample models
  • It also helps to see what others have produced with a research process you’re interested in. Continuing with topic modeling as an example, Cameron Blevins has a blog post explaining what he found when examining a late eighteenth-/early nineteenth-century diary. Looking at word embedding models instead of topic modeling, Ben Schmidt’s post on using WEMs to analyze language use and gender is another good example of applied analysis. And, for one last example, Douglas Duhaime has a recent article on using “combinatorial ngrams” to identify Eliza Haywood’s quotation sources and thus examine intertextuality in her work
  • If you decide to go further with topic modeling, you can download Serendip to create your own corpora and models; you can also work with MALLET for a robust range of text analysis tools or download RStudio and work in R, a programming language that is a very powerful, though not immediately accessible, tool for text analysis
  • Essentially, you can engage with text analysis at many levels—from experimentation and exploration using lightweight web interfaces to writing your own code; you can decide what level of involvement works best for you by playing with what’s readily available and looking at what the processes you’re considering can produce
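
If you do move toward writing your own code, counting ngrams (sequences of n adjacent words) is a gentle first step. The sketch below is a generic illustration, not the method used by Bookworm or in the work cited above; the function name ngram_counts is hypothetical, and the tokenization is intentionally crude.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count every sequence of n adjacent words in the text."""
    # Lowercase, then keep runs of letters and apostrophes as tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    # Zipping the token list against shifted copies of itself yields the ngrams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

counts = ngram_counts("the royal slave and the royal prince", n=2)
print(counts.most_common(3))
```

Comparing such counts across texts from different dates is the basic idea behind trend tools, though those tools also normalize by corpus size, which this sketch does not.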

Text samples:

  • Aphra Behn, Oroonoko, 1688: EEBO–TCP transcription
  • Text file of Oroonoko (paratexts, metadata, end-of-line hyphens, and page numbers removed; click to view in browser or control/right click to save)
  • Aphra Behn, The Fair Jilt, 1688: EEBO–TCP transcription
  • Text file of The Fair Jilt