One-class clustering in the text domain

  • Authors:
  • Ron Bekkerman; Koby Crammer

  • Affiliations:
  • HP Laboratories, Palo Alto, CA; University of Pennsylvania, Philadelphia, PA

  • Venue:
  • EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing

  • Year:
  • 2008

Abstract

Having seen the news title "Alba denies wedding reports", how do we infer that it is primarily about Jessica Alba, rather than about weddings or reports? We probably realize that, in a randomly drawn sentence, the word "Alba" is less anticipated than "wedding" or "reports", which makes the word "Alba" more informative when it does appear. Such anticipation can be modeled as the ratio between the empirical probability of a word (in a given corpus) and its estimated probability in general English. Aggregated over all words in a document, this ratio can serve as a measure of the document's topicality. Assuming that the corpus consists of on-topic and off-topic documents (which we call the core and the noise), our goal is to determine which documents belong to the core. We propose two unsupervised methods for doing this. First, we assume that words are sampled i.i.d. and propose an information-theoretic framework for determining the core. Second, we relax the independence assumption and use a simple graphical model to rank documents according to their likelihood of belonging to the core. We discuss theoretical guarantees of the proposed methods and show their usefulness for Web mining and Topic Detection and Tracking (TDT).
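
To make the scoring idea concrete, the following is a minimal Python sketch of the per-document topicality measure described above, assuming unigram counts, a summed-log-ratio aggregation, and a precomputed general-English (background) probability table; the function names, the smoothing constant eps, and these representational choices are illustrative assumptions, not the authors' exact formulation.

    import math
    from collections import Counter

    def unigram_probs(tokenized_docs):
        # Empirical unigram distribution over a corpus of tokenized documents.
        counts = Counter(w for doc in tokenized_docs for w in doc)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def topicality(doc_tokens, corpus_prob, background_prob, eps=1e-9):
        # Sum over the document's words of log(p_corpus(w) / p_background(w)).
        # Words that are rare in general English but frequent in the corpus
        # (e.g. "Alba") contribute large positive terms; everyday words such
        # as "wedding" or "reports" contribute little.
        score = 0.0
        for w in doc_tokens:
            p_c = corpus_prob.get(w, eps)      # empirical probability in the corpus
            p_b = background_prob.get(w, eps)  # estimated probability in general English
            score += math.log(p_c / p_b)
        return score

Under this reading, documents with a high aggregate ratio are candidates for the core; the paper's two methods, the information-theoretic framework and the graphical model, refine this intuition, with the latter dropping the i.i.d. word assumption.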