An effective coherence measure to determine topical consistency in user-generated content

  • Authors:
  • Jiyin He; Wouter Weerkamp; Martha Larson; Maarten de Rijke

  • Affiliations:
  • University of Amsterdam, ISLA, Science Park 107, 1098GX, Amsterdam, The Netherlands; University of Amsterdam, ISLA, Science Park 107, 1098GX, Amsterdam, The Netherlands; Delft University of Technology, EEMCS, Mekelweg 4, 2628 CD, Delft, The Netherlands; University of Amsterdam, ISLA, Science Park 107, 1098GX, Amsterdam, The Netherlands

  • Venue:
  • International Journal on Document Analysis and Recognition - Special Issue NOISY
  • Year:
  • 2009

Abstract

When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. In order to present users with better blog feed search results, we developed a measure of topical consistency that is able to capture whether or not a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to provide a faithful reflection of topic clustering structure. The properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure, are discussed. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously in order to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query. An appropriate weighting scheme is able to improve retrieval performance. Finally, we show that the coherence score can be reliably estimated with a sample exceeding 20 posts in size. Consistent with this finding, experiments show that the best retrieval performance is achieved if coherence scores are used when a blog contains more than 20 posts.
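The abstract describes the coherence score only in words, as the tightness of a data set's clustering structure relative to a background collection. The Python sketch below is one possible reading of that idea, not the authors' formulation: it assumes the score is the fraction of post pairs whose cosine similarity exceeds a threshold, and that the threshold is a high quantile of pairwise similarities in the background collection. The function names, the whitespace tokenization, the raw term-count vectors, and the 0.95 quantile are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarities(docs):
    """Cosine similarity for every unordered pair of documents."""
    # Assumption: naive whitespace tokenization and raw term counts.
    vectors = [Counter(d.lower().split()) for d in docs]
    return [cosine(u, v) for u, v in combinations(vectors, 2)]


def background_threshold(background_docs, quantile=0.95):
    """Similarity value at the given quantile of background pairs.

    Assumption: the background collection sets the bar for what
    counts as an unusually similar pair of documents.
    """
    sims = sorted(pairwise_similarities(background_docs))
    if not sims:
        raise ValueError("need at least two background documents")
    index = min(len(sims) - 1, int(quantile * len(sims)))
    return sims[index]


def coherence_score(posts, threshold):
    """Fraction of post pairs whose similarity exceeds the threshold."""
    sims = pairwise_similarities(posts)
    if not sims:
        return 0.0  # fewer than two posts: no pairs to compare
    return sum(s > threshold for s in sims) / len(sims)


focused = ["cat cat dog", "cat dog dog", "cat dog"]
diffuse = ["cat dog", "stock market", "rain snow"]
print(coherence_score(focused, 0.5), coherence_score(diffuse, 0.5))
```

Under these assumptions a topically focused blog scores close to 1 and a diffuse one close to 0. A score built from only a handful of pairs is noisy, which fits the abstract's finding that reliable estimates need a sample of more than 20 posts.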