When Bipasha Banerjee was looking for a Ph.D. program, she had one major criterion: it had to give the highest importance to research. With her continuing passion for knowing more, she wanted to delve deeper into open questions and learn how to solve them.

“The quality of research in computer science at Virginia Tech is unparalleled, and professors associated with the Sanghani Center are involved in projects that encompass a large range of real-world issues,” said Banerjee. “I realized this was the right fit for me and, thankfully, I was accepted and started an exciting journey of research.”

Advised by Edward Fox, she is a member of the Digital Library Research Laboratory (DLRL) and serves as a graduate research assistant on a project funded by the Institute of Museum and Library Services (IMLS).

Banerjee’s keen interest in natural language processing goes back to her undergraduate years at West Bengal University of Technology, Kolkata, India, where she graduated with a bachelor’s degree.

In her research, Banerjee works with long documents, especially book-length documents like Electronic Theses and Dissertations (ETDs). Generally, an ETD, which averages 100 pages, is hosted by the university from which the author graduated. Finding all ETDs related to a particular query requires searching thousands of repositories, as there are no global full-text search sites covering the worldwide set of ETDs.

In addition to making ETDs more accessible, she aims to add services that make it easier to engage with such book-length documents and that are tailored specifically to each class of stakeholder.

“Most of the theoretical and applied experimentation is focused on short documents like webpages, journal articles, or papers in conference proceedings. While each of the articles in a journal volume or conference proceedings has its own abstract, there are no summaries for the chapters of an ETD. Aggregating such works in a shared space and performing applied research like segmentation and summarization would prove to be extremely valuable for readers,” Banerjee said.

Banerjee’s focus area was a direct result of her own experience at the beginning of her graduate work at Virginia Tech when she found that most professors urged her to read the theses and dissertations of past graduates in their labs.

“I found reading these documents very useful in understanding the research as opposed to reading a paper, which often, because of page limitations, contains only certain important portions of the research,” she said. “I quickly realized that although the documents contain detailed information, I was only able to parse through the documents quickly if a comprehensive summary was available for sections. Hence, it was a natural fit to work with long documents as my research topic.”

Banerjee said she greatly appreciates the work culture in the Sanghani Center. “It is easy both to approach other students and to seek guidance from faculty members,” she said.

She has collaborated with Fox and another Ph.D. student at the Sanghani Center on the paper “Summarizing ETDs with deep learning,” published in Cadernos BAD 1 (2020).

She is on track to get her Ph.D. in 2023. Her goal after that, Banerjee said, is to remain in academia in a position where she can teach and continue her research.