Detecting topic labels for tweets by matching features from pseudo-relevance feedback

Authors:
Jing Zhang;Derek Liu;Kok-Leong Ong;Zhijie Li;Ming Li
Affiliations:
Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria;Deakin University, Burwood, Victoria
Venue:
AusDM '12 Proceedings of the Tenth Australasian Data Mining Conference - Volume 134
Year:
2012

Abstract

Detecting a suitable topic label for short texts, e.g., tweets from Twitter, is an important component of many applications, including diversity ranking, clustering, information retrieval, and information filtering. Detecting topic labels automatically, however, is a major challenge. The character limit of a short text means that it lacks a feature space large enough to adequately describe its content in relation to other short texts in a given collection. Consequently, methods such as LDA, TF-IDF, and similarity measures all fail because of their sensitivity to a small feature space. And when a collection of related short texts is considered, e.g., the results of a Twitter search, the result set collectively exhibits sparsity and high dimensionality, a nightmare for information processing. One solution to this problem is to expand the feature space through a process known as pseudo-relevance feedback. Unfortunately, existing pseudo-relevance feedback methods disappoint when subjected to real-world conditions. The fundamental problem lies in the level of noise present in both the short texts and the feedback source, which is often the World Wide Web. We propose a novel pseudo-relevance feedback algorithm to accurately identify topic labels for short texts. Our algorithm robustly handles noise in both the short texts and the feedback source through a method called 'feature matching'. Empirical results confirm the efficacy of our algorithm.
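
To make the idea of feature-space expansion concrete, the following is a minimal sketch of generic pseudo-relevance feedback, not the paper's 'feature matching' algorithm, whose details the abstract does not give. It assumes the feedback documents (for example, snippets returned by a web search for the tweet) have already been retrieved; the function names `tokenize` and `expand_features`, the small stopword list, and the parameter `k` are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "at", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def expand_features(short_text, feedback_docs, k=5):
    """Return the short text's terms plus the k most common new feedback terms.

    Terms are ranked by how many feedback documents contain them
    (document frequency); ties are broken alphabetically.
    """
    original = tokenize(short_text)
    seen = set(original)
    counts = Counter()
    for doc in feedback_docs:
        # Count each new term at most once per feedback document.
        counts.update(set(tokenize(doc)) - seen)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return original + [term for term, _ in ranked[:k]]
```

Because every frequent feedback term is admitted regardless of its relevance, a sketch like this is exposed to exactly the noise problem the abstract describes: off-topic terms in the feedback source enter the expanded feature set unfiltered.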