We investigate the impact of pre-extracting and tokenizing bigram collocations on topic models. Through extensive experiments on four different corpora, we show that incorporating bigram collocations into the document representation yields more parsimonious models and improves topic coherence. We point out problems with using test likelihood and test perplexity to compare model fit, and suggest an alternative measure that penalizes model complexity. We show that the Akaike information criterion (AIC) is a more appropriate measure, and that under it, using a modest number (up to 1,000) of top-ranked bigrams is the optimal topic modelling configuration. Using these 1,000 bigrams also improves topic quality over unigram tokenization. Topic quality can be raised further by using up to 10,000 bigrams, but at the cost of a more complex model. We also show that multiword (bigram and longer) named entities give consistent results, indicating that they should be represented as single tokens. This is the first work to explicitly study the effect of n-gram tokenization on LDA topic models, and the first to make empirical recommendations to topic modelling practitioners, challenging the standard practice of unigram-based tokenization.
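As a minimal sketch of the pipeline the abstract describes, the example below pre-extracts top-ranked bigram collocations, merges them into single tokens, trains an LDA model on the merged representation, and computes AIC = 2k − 2 ln L̂. It assumes NLTK for collocation ranking and gensim for LDA; the toy corpus, the PMI ranking measure, the frequency filter, and the parameter count k are illustrative assumptions, not the authors' exact configuration.

```python
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus; in practice this would be thousands of documents.
docs = [
    "support vector machine learning for text classification".split(),
    "topic models and support vector machines for information retrieval".split(),
    "named entity recognition with topic models".split(),
]

# 1. Rank bigram collocations over the whole corpus. PMI is one
#    common ranking measure (an assumption here, not necessarily
#    the measure used in the paper).
all_tokens = [tok for doc in docs for tok in doc]
finder = BigramCollocationFinder.from_words(all_tokens)
finder.apply_freq_filter(2)  # drop very rare pairs
top_bigrams = set(finder.nbest(BigramAssocMeasures().pmi, 1000))

# 2. Re-tokenize: greedily merge each top-ranked bigram into a
#    single underscore-joined token.
def merge_bigrams(tokens, bigrams):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

merged_docs = [merge_bigrams(d, top_bigrams) for d in docs]

# 3. Train LDA on the merged representation.
dictionary = corpora.Dictionary(merged_docs)
bow = [dictionary.doc2bow(d) for d in merged_docs]
num_topics = 2
lda = LdaModel(bow, num_topics=num_topics, id2word=dictionary)

# 4. AIC = 2k - 2 ln(L-hat). gensim's log_perplexity() returns a
#    per-word variational lower bound, so the corpus log-likelihood
#    below is an approximation. k counts only the topic-word
#    parameters, T * (V - 1) -- one plausible choice, not
#    necessarily the paper's definition.
n_words = sum(cnt for doc in bow for _, cnt in doc)
log_likelihood = lda.log_perplexity(bow) * n_words
k = num_topics * (len(dictionary) - 1)
aic = 2 * k - 2 * log_likelihood
print("AIC:", aic)
```

Underscore-joining merged bigrams is a common convention for representing collocations (and multiword named entities) as single tokens; gensim's Phrases class automates a similar merge if a ready-made tool is preferred.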