Detecting topic labels for tweets by matching features from pseudo-relevance feedback

  • Authors:
  • Jing Zhang;Derek Liu;Kok-Leong Ong;Zhijie Li;Ming Li

  • Affiliations:
  • Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria

  • Venue:
  • AusDM '12 Proceedings of the Tenth Australasian Data Mining Conference - Volume 134
  • Year:
  • 2012

Abstract

Detecting a suitable topic label for short texts, e.g., tweets from Twitter, is an important component in many applications, including diversity ranking, clustering, information retrieval, and information filtering. Detecting topic labels automatically, however, is a major challenge. The character limit of a short text means that it lacks a feature space large enough to adequately describe its content in relation to other short texts in a given collection. Therefore, methods such as LDA, TF-IDF, or similarity measures all fail due to their sensitivity to a small feature space. And when a collection of related short texts is considered, e.g., from a Twitter search, the result set collectively exhibits sparsity and high dimensionality, a nightmare for information processing. A solution to this problem is to expand the feature space through a process known as pseudo-relevance feedback. Unfortunately, existing pseudo-relevance feedback methods disappoint when subjected to real-world conditions. The fundamental problem lies in the level of noise present in both the short texts and the feedback source, which is often the World Wide Web. We propose a novel pseudo-relevance feedback algorithm to accurately identify topic labels for short texts. Our algorithm robustly handles noise in both the short texts and the feedback source through a method called 'feature matching'. Empirical results confirm the efficacy of our algorithm.