Probabilistic Topic Modeling for Comparative Analysis of Document Collections

Ting Hua, Chandan Reddy

Abstract

Probabilistic topic models, which can discover hidden patterns in documents, have been extensively studied. However, rather than learning from a single document collection, numerous real-world applications demand a comprehensive understanding of the relationships among various document sets. To address such needs, this article proposes a new model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method is a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, we demonstrate the effectiveness of our method compared with state-of-the-art models. The proposed model can be useful for “comparative thinking” analysis in real-world document collections.
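To make the idea of common versus distinctive topics concrete, the following is a toy generative sketch, not the paper's actual model or inference procedure. It assumes a symmetric Dirichlet prior, a made-up vocabulary, and two hypothetical collections named "biology" and "legal"; the function names `sample_dirichlet` and `generate_document` are illustrative only. Each document mixes the topics shared by all collections with the topics reserved for its own collection.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw a symmetric Dirichlet sample by normalizing Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(collection, common_topics, distinctive_topics,
                      vocab, doc_len, rng, alpha=0.5):
    """Generate one toy document for the given collection.

    common_topics: word distributions shared by every collection.
    distinctive_topics: dict mapping a collection name to the word
        distributions available only to that collection.
    Returns the document's topic mixture and its sampled words.
    """
    # A document may use the shared topics plus its own collection's topics,
    # but never the distinctive topics of another collection.
    topics = common_topics + distinctive_topics[collection]
    mixture = sample_dirichlet(alpha, len(topics), rng)

    words = []
    for _ in range(doc_len):
        # Pick a topic for this word, then pick a word from that topic.
        topic = rng.choices(topics, weights=mixture, k=1)[0]
        words.append(rng.choices(vocab, weights=topic, k=1)[0])
    return mixture, words


# Hypothetical setup: 2 common topics, 1 distinctive topic per collection.
rng = random.Random(0)
vocab = ["data", "model", "topic", "gene", "protein", "court", "law"]
common_topics = [sample_dirichlet(0.5, len(vocab), rng) for _ in range(2)]
distinctive_topics = {
    "biology": [sample_dirichlet(0.5, len(vocab), rng)],
    "legal": [sample_dirichlet(0.5, len(vocab), rng)],
}

mixture, words = generate_document(
    "biology", common_topics, distinctive_topics, vocab, doc_len=10, rng=rng
)
print(mixture)
print(words)
```

In this sketch the topics themselves are drawn at random rather than learned; an actual comparative topic model would infer both kinds of topics from the observed documents.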

People

Ting Hua

Alumni

Chandan Reddy

Associate Professor of Computer Science

Publication Details

Date of publication: March 4, 2020
Journal: ACM Transactions on Knowledge Discovery from Data (TKDD)
Page number(s): 1-27
Volume: 14
Issue Number: 2
Publication note: Ting Hua, Chang-Tien Lu, Jaegul Choo, Chandan K. Reddy: Probabilistic Topic Modeling for Comparative Analysis of Document Collections. ACM Trans. Knowl. Discov. Data 14(2): 24:1-24:27 (2020)