Probabilistic Topic Modeling for Comparative Analysis of Document Collections
Ting Hua, Chang-Tien Lu, Jaegul Choo, Chandan K. Reddy
Abstract
Probabilistic topic models, which can discover hidden patterns in documents, have been extensively studied. However, rather than learning from a single document collection, numerous real-world applications demand a comprehensive understanding of the relationships among various document sets. To address such needs, this article proposes a new model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method is a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, we demonstrate the effectiveness of our method compared with state-of-the-art models. The proposed model can be useful for “comparative thinking” analysis in real-world document collections.
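The idea of representing a document as a mixture of common and distinctive topics can be illustrated with a small generative sketch. This is not the model from the article: the abstract does not specify the priors, the number of topics, or how a word is assigned to a common versus a distinctive topic. The function name `generate_corpus`, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and the single per-document mixture over both topic types are all assumptions made for illustration.

```python
import numpy as np

def generate_corpus(n_collections=2, docs_per_collection=3, doc_len=20,
                    vocab_size=50, n_common=2, n_distinct=2,
                    alpha=0.5, beta=0.1, seed=0):
    """Hypothetical generative sketch: common topics are shared by every
    collection, distinctive topics belong to exactly one collection."""
    rng = np.random.default_rng(seed)

    # Common topics: word distributions shared across all collections.
    common_topics = rng.dirichlet(np.full(vocab_size, beta), size=n_common)

    # Distinctive topics: a separate set of word distributions per collection.
    distinct_topics = rng.dirichlet(
        np.full(vocab_size, beta), size=(n_collections, n_distinct)
    )

    corpus = []
    n_topics = n_common + n_distinct
    for c in range(n_collections):
        # Topics visible to collection c: the common ones, then its own.
        topics = np.vstack([common_topics, distinct_topics[c]])
        for _ in range(docs_per_collection):
            # Assumed: one Dirichlet mixture over both topic types per document.
            theta = rng.dirichlet(np.full(n_topics, alpha))
            # Topic assignment for each word position.
            z = rng.choice(n_topics, size=doc_len, p=theta)
            # Each word is drawn from the word distribution of its topic.
            words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
            corpus.append({"collection": c, "topics": z, "words": words})
    return corpus

corpus = generate_corpus()
```

In this sketch, topic indices below `n_common` refer to shared topics and the remaining indices refer to topics owned by the document's collection, so comparing which indices dominate a document gives a rough sense of how much of it is common versus distinctive.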
Publication Details
- Date of publication: March 4, 2020
- Journal: ACM Transactions on Knowledge Discovery from Data (TKDD)
- Page number(s): 1-27
- Volume: 14
- Issue Number: 2
- Publication note: Ting Hua, Chang-Tien Lu, Jaegul Choo, Chandan K. Reddy: Probabilistic Topic Modeling for Comparative Analysis of Document Collections. ACM Trans. Knowl. Discov. Data 14(2): 24:1-24:27 (2020)