Word order matters: measuring topic coherence with lexical argument structure

  • Authors: Steve Spagnola; Carl Lagoze

  • Affiliations: Cornell University, Ithaca, NY, USA (both authors)

  • Venue: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
  • Year: 2011

Abstract

Topic models are emerging tools for improved browsing and searching within digital libraries. These techniques collapse the words in each document into an unordered "bag of words," discarding word order. In this paper, we present a method that examines syntactic dependency parse trees of Wikipedia article titles to learn the expected patterns between relative lexical arguments. The process depends heavily on the global word ordering of a sentence: it models how each word interacts with the other words in a title and aggregates those interactions over all 3.2 million titles. Using this information, we assess how coherent a given topic is by comparing the relative usage vectors of its top 5 words. Results suggest that this technique can identify poor topics based on how well the relative usages of their words align, potentially aiding digital library indexing.
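
The comparison step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure or vector layout the authors use, so everything here is an assumption: usage vectors are taken to be sparse counts keyed by dependency relation, coherence is taken to be the mean pairwise cosine similarity, and the names `cosine`, `topic_coherence`, and `usage_vectors` as well as the toy counts are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a word with no recorded usage matches nothing
    return dot / (norm_u * norm_v)


def topic_coherence(top_words, usage_vectors):
    """Mean pairwise cosine similarity of the top words' usage vectors.

    Assumption: this averaging scheme stands in for whatever comparison
    the paper actually performs; it is not taken from the source.
    """
    pairs = list(combinations(top_words, 2))
    if not pairs:
        return 0.0
    total = sum(
        cosine(usage_vectors.get(a, {}), usage_vectors.get(b, {}))
        for a, b in pairs
    )
    return total / len(pairs)


# Hypothetical toy counts of dependency relations per word (invented data).
usage_vectors = {
    "river": {"amod": 4, "pobj": 2},
    "lake": {"amod": 3, "pobj": 2},
    "stream": {"amod": 5, "pobj": 1},
    "tax": {"dobj": 4, "nn": 3},
}

coherent_score = topic_coherence(["river", "lake", "stream"], usage_vectors)
mixed_score = topic_coherence(["river", "lake", "tax"], usage_vectors)
print(f"coherent topic: {coherent_score:.3f}")
print(f"mixed topic:    {mixed_score:.3f}")
```

Under these assumptions, a topic whose words are used in similar syntactic roles scores higher than one containing a word with a disjoint usage pattern. The toy topics contain three words for brevity; the same function accepts the top 5 words mentioned in the abstract.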