Word order matters: measuring topic coherence with lexical argument structure

Authors: Steve Spagnola; Carl Lagoze
Affiliations: Cornell University, Ithaca, NY, USA; Cornell University, Ithaca, NY, USA
Venue: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Year: 2011

Abstract

Topic models are emerging tools for improved browsing and searching within digital libraries. These techniques collapse words within documents into unordered "bags of words," ignoring word order. In this paper, we present a method that examines syntactic dependency parse trees from Wikipedia article titles to learn expected patterns between relative lexical arguments. This process is highly dependent on the global word ordering of a sentence, modeling how each word interacts with other words to gain an aggregate perspective on how words interact over all 3.2 million titles. Using this information, we analyze how coherent a given topic is by comparing the relative usage vectors between the top 5 words in a topic. Results suggest that this technique can identify poor topics based on how well the relative usages align with each other within a topic, potentially aiding digital library indexing.
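
The coherence check described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the relative usage vectors are compared, so this sketch assumes mean pairwise cosine similarity; the function names, the sparse dictionary representation, and the toy dependency features are all hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = sqrt(sum(weight * weight for weight in u.values()))
    norm_v = sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares no usage with anything
    return dot / (norm_u * norm_v)


def topic_coherence(top_words, usage_vectors):
    """Mean pairwise cosine similarity over a topic's top words.

    ASSUMPTION: averaging pairwise cosine similarity is a stand-in for
    whatever comparison the paper actually uses.
    """
    vectors = [usage_vectors.get(word, {}) for word in top_words]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two words: nothing to compare
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: counts of dependency relations each word takes part in.
usage_vectors = {
    "river": {"amod": 2.0, "nsubj": 1.0},
    "lake": {"amod": 2.0, "nsubj": 1.0},
    "tax": {"dobj": 3.0},
}

coherent = topic_coherence(["river", "lake"], usage_vectors)
mixed = topic_coherence(["river", "lake", "tax"], usage_vectors)
print(coherent > mixed)
```

In this toy setup, a topic whose top words are used in similar syntactic roles scores higher than one that mixes in a word with unrelated usage, which is the kind of signal the abstract says can flag poor topics. In practice the list passed as `top_words` would hold the top 5 words of a topic, and the vectors would be aggregated from the parsed Wikipedia titles.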