Foundations of statistical natural language processing
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Single-shot detection of multiple categories of text using parametric mixture models
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
The Journal of Machine Learning Research
Pattern Recognition and Machine Learning (Information Science and Statistics)
Inferring parameters and structure of latent variable models by variational bayes
UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
Named entity mining from click-through data using weakly supervised latent dirichlet allocation
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Named entity recognition in query
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Topic-Based Hard Clustering of Documents Using Generative Models
ISMIS '09 Proceedings of the 18th International Symposium on Foundations of Intelligent Systems
A statistical model for topically segmented documents
DS'11 Proceedings of the 14th international conference on Discovery science
Documents, such as those found on Wikipedia and in folksonomies, are increasingly assigned multiple topics as metadata. It is therefore increasingly important to analyze the relationship between a document and the topics assigned to it. In this paper, we propose a novel probabilistic generative model of documents with multiple topics as metadata. By focusing on modeling the generation process of a document with multiple topics, we can extract properties specific to such documents. The proposed model is an extension of an existing probabilistic generative model, the Parametric Mixture Model (PMM). PMM models documents with multiple topics by mixing the model parameters of each single topic. However, because PMM assigns the same mixture ratio to every topic, it cannot account for the bias of individual topics within a document. To address this problem, we propose a model that places a Dirichlet distribution as a prior on the mixture ratio. We adopt the Variational Bayes method to infer the bias of each topic within a document. We evaluate the proposed model and PMM on the MEDLINE corpus. The results for F-measure, precision, and recall show that the proposed model is more effective than PMM for multiple-topic classification. Moreover, we demonstrate the potential of the proposed model to extract topics and document-specific keywords using information about the assigned topics.
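The core contrast described in the abstract can be sketched as follows: PMM scores a multi-topic document by mixing per-topic word distributions with equal weights, whereas the proposed model allows a per-document (Dirichlet-distributed) mixture ratio. The sketch below is purely illustrative — the topic names, vocabulary, word distributions, and mixture weights are invented assumptions, and the inference step (Variational Bayes) is omitted; it only shows why a biased mixture can fit a topically skewed document better than a uniform one.

```python
import math

# Hypothetical per-topic word distributions (theta_k) over a tiny vocabulary.
# All names and numbers here are illustrative, not taken from the paper.
vocab = ["gene", "protein", "neuron", "brain"]
theta = {
    "genetics":     [0.45, 0.45, 0.05, 0.05],
    "neuroscience": [0.05, 0.05, 0.45, 0.45],
}

def doc_log_likelihood(word_counts, mix):
    """log p(doc) under a mixture of topic word distributions:
    p(w) = sum_k mix[k] * theta_k[w]  (PMM-style parameter mixing)."""
    ll = 0.0
    for i, count in enumerate(word_counts):
        p_w = sum(mix[k] * dist[i] for k, dist in enumerate(theta.values()))
        ll += count * math.log(p_w)
    return ll

# A document tagged with both topics but dominated by neuroscience terms.
counts = [1, 0, 4, 3]  # occurrences of "gene", "protein", "neuron", "brain"

uniform = [0.5, 0.5]   # PMM: the same mixture ratio for every assigned topic
biased  = [0.15, 0.85] # proposed model: a per-document topic bias
                       # (which would be inferred via Variational Bayes)

print(doc_log_likelihood(counts, uniform))  # lower log-likelihood
print(doc_log_likelihood(counts, biased))   # higher log-likelihood
```

Under these toy numbers the biased mixture assigns the document a higher log-likelihood, which is the intuition behind replacing PMM's fixed equal ratios with a Dirichlet prior over the mixture ratio.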