ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over terms. Such models have traditionally treated a document as an indivisible unit of the generative process, which may not be appropriate for documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents into smaller, topically cohesive text blocks (segments). A new variable is introduced to model the within-document segments: with this variable at the document level, word generation depends not only on the topics but also on the segments, while the topic latent variable is associated directly with the segments rather than with the document as a whole. Experimental results show that, compared to existing generative models, the proposed model achieves better language-modeling perplexity and better supports effective clustering of documents.
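
The dependency structure described above can be illustrated with a minimal Python sketch. It is not the paper's formulation: the abstract does not give the exact factorization, so the sketch assumes one plausible reading in which a segment s is drawn for the document, a topic z is drawn given the segment, and a word w is drawn given both the topic and the segment. The function names, array shapes, and parameter names are all hypothetical.

```python
import numpy as np


def sample_document(p_seg, p_topic_given_seg, p_word_given_seg_topic, n_words, rng):
    """Sample (segment, topic, word) index triples for one document.

    Assumed (hypothetical) parameterization:
      p_seg                  -- shape (S,),      P(s | d)
      p_topic_given_seg      -- shape (S, K),    P(z | s)
      p_word_given_seg_topic -- shape (S, K, V), P(w | z, s)
    """
    n_segs, n_topics, vocab_size = p_word_given_seg_topic.shape
    triples = []
    for _ in range(n_words):
        s = rng.choice(n_segs, p=p_seg)                          # segment within the document
        z = rng.choice(n_topics, p=p_topic_given_seg[s])         # topic tied to the segment
        w = rng.choice(vocab_size, p=p_word_given_seg_topic[s, z])  # word from topic and segment
        triples.append((int(s), int(z), int(w)))
    return triples


def word_probability(w, p_seg, p_topic_given_seg, p_word_given_seg_topic):
    """Marginal P(w | d) = sum_s P(s | d) * sum_z P(z | s) * P(w | z, s)."""
    return float(
        np.einsum("s,sk,sk->", p_seg, p_topic_given_seg, p_word_given_seg_topic[:, :, w])
    )
```

Under these assumptions, a standard document-level topic model corresponds to the special case of a single segment per document, where the segment variable carries no information.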