Applications of data analysis on scholarly long documents

Bipasha Banerjee, Edward Fox

Abstract

Theses and dissertations record the work of graduate students and are typically a requirement at the culmination of the graduate degree. Thus, they contain important information that reflects a graduate student’s exploration of their research topic. Although print submission was commonplace early on, most universities now require students to submit an electronic version. The electronic document, referred to henceforth as an ETD (electronic thesis or dissertation), has become the primary way of submitting, storing, and distributing graduate work. Millions of such documents have been created in the past two decades. They are maintained and stored by university libraries, digital repositories, and academic publishing companies. These online repositories have increased access to such documents. Nonetheless, the documents still fail to meet the needs of researchers, who find it challenging to locate and extract knowledge from such long texts. The worldwide ETD collection has grown in volume to become what is known as ‘scholarly big data’. Apart from the text body, these documents contain a myriad of other pieces of knowledge, such as tables, figures, definitions, literature reviews, and references. There is a growing demand among researchers across various domains to make this collection of scholarly documents more computationally driven. We use ideas from natural language processing, information retrieval, and machine learning to excavate knowledge from this rich information source. In this paper, we examine some of the challenges we face, identify some key areas of exploration, and discuss our methods to mitigate the challenges.

Publication Details

Date of publication: January 25, 2023

Conference: Big Data

Page number(s): 2473-2481

Publication Note: Bipasha Banerjee, William A. Ingram, Jian Wu, Edward A. Fox: Applications of data analysis on scholarly long documents. IEEE Big Data 2022: 2473-2481