Text Mining - Big Data
The Digital Convergence Lab is seeking History or other humanities majors to take part in text-mining research, a form of digital humanities that applies techniques often associated with Big Data. The project involves collaboration with faculty and graduate students in the Department of Computer Science.
Teaching in the humanities traditionally emphasizes the close reading of a relatively limited number of documents. Text mining emphasizes the broader analysis (sometimes called “distant reading”) of bodies of text so large that no individual could read them in a reasonable period of time. In the study of literature, this approach has proven fruitful in identifying patterns of language within very large sets of text that previous scholarship had not detected. I would like to explore how these techniques may contribute to the study of history.
One potential application of text-mining technology has recently come to my attention. The United States Central Intelligence Agency regularly releases very large volumes of documentary material for public review in an online, digital format (PDFs), but fails to provide the information infrastructure needed to investigate them at the appropriate scale. The collection is so large that it will likely frustrate any student’s or scholar’s attempt to review and analyze it, or even the portion of it relevant to any but the narrowest research question.
Text mining technology may help students and scholars in the field make sense of some part of this vast collection of data. I propose an experiment in which a team of students works with NIU faculty to identify a collection of CIA materials of potential interest to researchers, convert them to machine-readable text using Optical Character Recognition (OCR) technology, and attempt to determine basic elements of their content that can inform future student and scholarly work.
Professor Hamed Alhoori of the Department of Computer Science has recommended clustering and topic modeling as particularly appropriate for this task. Cluster analysis uses algorithms to sort texts into groups based on the words that appear in them, so that the texts within a group are more similar to each other than to texts in other groups. Topic modeling identifies groups of words that tend to occur together in a body of text. This allows researchers to gain insight into what themes, or topics, appear most commonly in a defined set of materials. Topic modeling can also be used to track how these relationships between words change over time.
Student interns will work as a team of up to seven members, composed of history students and programmers, to produce a short report discussing the results of the above experiment, with an emphasis on how well text mining technology identifies sets of documents that a scholar in the field judges useful for future research.
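As a rough illustration of the cluster analysis described above, the following Python sketch groups a handful of short texts by the words they share. It is a simplified stand-in, not the method the project will necessarily use: the sample texts are invented, the function names and the similarity threshold of 0.3 are assumptions made for this example, and a real experiment would run on OCR output from the released documents, most likely with an established text-mining library.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its words (a simple bag-of-words model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Similarity of two word-count vectors, from 0.0 (no shared words) to 1.0."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_documents(texts, threshold=0.3):
    """Greedy one-pass clustering (an assumption chosen for brevity).

    Each text joins the first cluster whose first member it resembles
    closely enough; otherwise it starts a new cluster. Returns a list of
    clusters, each a list of indices into `texts`.
    """
    vectors = [word_counts(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Invented sample texts, standing in for OCR output of real documents.
sample_docs = [
    "Report on grain harvest and agricultural output in the region",
    "Agricultural output and grain harvest estimates revised",
    "Missile test observed at the northern launch site",
    "Second missile test at the launch site confirmed",
]

if __name__ == "__main__":
    for group in cluster_documents(sample_docs):
        print([sample_docs[i] for i in group])
```

With these sample texts, the two agricultural reports fall into one group and the two missile reports into another. Comparing raw word counts treats common words such as "the" as evidence of similarity, which is one reason practical tools usually down-weight or remove them.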
Drew VandeCreek will be recruiting applicants for these positions (three or four History students and two or three programmers) at the NIU Internship Fair in December 2017. Please ask interested parties to attend and look for the Digital Convergence Lab booth, or have them contact Drew directly at:
Drew E. VandeCreek
Director of Digital Initiatives
University Libraries
Northern Illinois University
DeKalb, IL 60115
drew@niu.edu
(815) 753-7179