Topic Models (TM) such as Latent Dirichlet Allocation (LDA) are increasingly used in Natural Language Processing applications. However, the model parameters and the influence of randomized sampling and inference are rarely examined; usually, the recommendations from the original papers are adopted. In this paper, we examine the parameter space of LDA topic models with respect to the application of Text Segmentation (TS), specifically targeting error rates and their variance across different runs. We find that the recommended settings result in error rates far from optimal for our application. We show substantial variance in the results for different runs of model estimation and inference, and give recommendations for increasing the robustness and stability of topic models. Running the inference step several times and selecting the topic ID assigned last to each token yields considerable improvements. Similar improvements are achieved with the mode method: we store all topic IDs assigned during each inference iteration and select the most frequent topic ID assigned to each word. These recommendations not only apply to TS, but are generic enough to transfer to other applications.
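The mode method described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that repeated inference runs (or iterations) have already been collected as per-token topic ID assignments, and the function name is hypothetical.

```python
from collections import Counter


def mode_topic_assignments(assignment_runs):
    """Stabilize per-token topic assignments across repeated inference runs.

    assignment_runs: list of runs, where each run is a list of topic IDs,
    one per token position (all runs cover the same token sequence).
    Returns one topic ID per token: the most frequent ID assigned to that
    token across all runs (the mode).
    """
    n_tokens = len(assignment_runs[0])
    stabilized = []
    for pos in range(n_tokens):
        # Count how often each topic ID was assigned to this token
        counts = Counter(run[pos] for run in assignment_runs)
        # Keep the most frequent assignment
        stabilized.append(counts.most_common(1)[0][0])
    return stabilized


# Three hypothetical inference runs over a three-token document:
# token 0 is always topic 0; tokens 1 and 2 are mostly topic 2.
runs = [[0, 2, 1],
        [0, 2, 2],
        [0, 1, 2]]
print(mode_topic_assignments(runs))  # → [0, 2, 2]
```

The "last assignment" variant from the abstract is simply `assignment_runs[-1]`; the mode variant trades extra bookkeeping during the iterations for a vote over all of them.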