Probabilistic Topic Modeling for Comparative Analysis of Document Collections

Chandan Reddy

Abstract

Probabilistic topic models, which can discover hidden patterns in documents, have been extensively studied. However, rather than learning from a single document collection, numerous real-world applications demand a comprehensive understanding of the relationships among various document sets. To address such needs, this article proposes a new model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method is a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, we demonstrate the effectiveness of our method compared with state-of-the-art models. The proposed model can be useful for “comparative thinking” analysis in real-world document collections.
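To make the document representation described in the abstract concrete, the following is a minimal sketch of a generative process in which each document mixes topics shared by all collections with topics owned by its own collection. It is an illustration under stated assumptions, not the model from the paper: the function names (`sample_dirichlet`, `generate_corpus`), the symmetric Dirichlet priors, and all hyperparameter values are hypothetical choices made only for this example.

```python
import random


def sample_dirichlet(rng, alpha, size):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_collections=2, n_common=2, n_distinct=1,
                    docs_per_collection=3, doc_len=8, seed=0):
    """Generate toy documents from common and collection-specific topics.

    All priors and sizes here are assumptions for illustration only.
    """
    rng = random.Random(seed)
    vocab_size = len(vocab)

    # Common topics: word distributions shared by every collection.
    common = [sample_dirichlet(rng, 0.5, vocab_size) for _ in range(n_common)]

    corpus = []
    for c in range(n_collections):
        # Distinctive topics: word distributions used only by collection c.
        distinct = [sample_dirichlet(rng, 0.5, vocab_size)
                    for _ in range(n_distinct)]
        topics = common + distinct

        for _ in range(docs_per_collection):
            # Per-document mixture over common + distinctive topics.
            theta = sample_dirichlet(rng, 0.5, len(topics))
            words = []
            for _ in range(doc_len):
                z = rng.choices(range(len(topics)), weights=theta)[0]
                words.append(rng.choices(vocab, weights=topics[z])[0])
            corpus.append({"collection": c, "theta": theta, "words": words})
    return corpus


if __name__ == "__main__":
    toy_vocab = ["model", "topic", "word", "corpus", "news"]
    for doc in generate_corpus(toy_vocab):
        print(doc["collection"], " ".join(doc["words"]))
```

In this sketch a document from one collection can never draw words from another collection's distinctive topics, which is the property that separates common from discriminative aspects. Inference (recovering the topics from observed documents) is not shown.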

People

Chandan Reddy

Associate Professor of Computer Science

Publication Details

Date of publication: March 4, 2020
Journal: ACM Transactions on Knowledge Discovery from Data (TKDD)
Page number(s): 1-27
Volume: 14
Issue Number: 2
Publication note: Ting Hua, Chang-Tien Lu, Jaegul Choo, Chandan K. Reddy: Probabilistic Topic Modeling for Comparative Analysis of Document Collections. ACM Trans. Knowl. Discov. Data 14(2): 24:1-24:27 (2020)