A statistical model for topically segmented documents

Authors:
Giovanni Ponti; Andrea Tagarelli; George Karypis
Affiliations:
ENEA - Portici Research Center, Italy; Department of Electronics, Computer and Systems Sciences, University of Calabria, Italy; Department of Computer Science & Engineering, University of Minnesota, Minneapolis
Venue:
DS'11 Proceedings of the 14th international conference on Discovery science
Year:
2011

Abstract

Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over terms. Such models have traditionally treated a document as an indivisible unit of the generative process, which may not be appropriate for documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents into smaller, topically cohesive text blocks (segments). A new variable is introduced to model the within-document segments: using this variable at the document level, word generation is related not only to the topics but also to the segments, while the topic latent variable is directly associated with the segments rather than with the document as a whole. Experimental results show that, compared to existing generative models, the proposed model achieves better language-modeling perplexity and better supports effective clustering of documents.
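
The generative story outlined in the abstract can be illustrated with a small sketch. It follows one possible reading of the abstract, in which a document first selects a segment, the segment selects a topic, and the word is drawn conditioned on both the topic and the segment. The exact factorization used in the paper is not given here, so the factorization P(s|d) P(z|s) P(w|z,s), the toy vocabulary, the probability tables, and the function names below are all assumptions made for illustration only.

```python
import random

# Hypothetical toy parameters (not taken from the paper): one document with
# two segments, two latent topics, and a four-word vocabulary.
VOCAB = ["goal", "match", "stock", "market"]

P_SEGMENT_GIVEN_DOC = [0.5, 0.5]        # assumed P(s | d)
P_TOPIC_GIVEN_SEGMENT = [               # assumed P(z | s), one row per segment
    [0.9, 0.1],
    [0.2, 0.8],
]
P_WORD_GIVEN_TOPIC_SEGMENT = {          # assumed P(w | z, s), keyed by (z, s)
    (0, 0): [0.5, 0.4, 0.05, 0.05],
    (1, 0): [0.1, 0.1, 0.4, 0.4],
    (0, 1): [0.4, 0.5, 0.05, 0.05],
    (1, 1): [0.05, 0.05, 0.5, 0.4],
}


def sample_index(rng, weights):
    """Draw an index according to the given discrete distribution."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_document(n_words, seed=0):
    """Sample (segment, topic, word) triples for one document."""
    rng = random.Random(seed)
    triples = []
    for _ in range(n_words):
        s = sample_index(rng, P_SEGMENT_GIVEN_DOC)        # segment from document
        z = sample_index(rng, P_TOPIC_GIVEN_SEGMENT[s])   # topic from segment
        w = sample_index(rng, P_WORD_GIVEN_TOPIC_SEGMENT[(z, s)])  # word from both
        triples.append((s, z, VOCAB[w]))
    return triples


def word_probability(w):
    """Marginal P(w | d) = sum_s P(s|d) * sum_z P(z|s) * P(w|z,s)."""
    total = 0.0
    for s, p_s in enumerate(P_SEGMENT_GIVEN_DOC):
        for z, p_z in enumerate(P_TOPIC_GIVEN_SEGMENT[s]):
            total += p_s * p_z * P_WORD_GIVEN_TOPIC_SEGMENT[(z, s)][w]
    return total
```

Calling `generate_document(5)` returns five `(segment, topic, word)` triples, and `word_probability` marginalizes over segments and topics to give the probability of a word in the document. A perplexity computation of the kind mentioned in the abstract would typically be built on such per-word probabilities evaluated on held-out text.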