Information Filtering in TREC-9 and TDT-3: A Comparative Analysis

Authors:
Thomas Galen Ault; Yiming Yang
Affiliations:
Language Technologies Institute, Carnegie Mellon University, tomault@cs.cmu.edu; Language Technologies Institute, Carnegie Mellon University, yiming@cs.cmu.edu
Venue:
Information Retrieval
Year:
2002

Abstract

Much work on automated information filtering has been done in the TREC and TDT domains, but differences in corpora, in the nature of TREC topics versus TDT events, in the constraints imposed on training and testing, and in the choice of performance measures confound any meaningful comparison between the two. We attempt to bridge this gap by evaluating the k-nearest-neighbor (kNN) classification system on the corpus and categories of one domain under the constraints of the other. To maximize comparability and to understand the effect of each domain's evaluation metrics, we optimize kNN separately for the iF1, iT9P (the preferred metric for TREC-9), and iCtrk (the official metric for TDT-3) metrics. A thorough comparison of our within-domain and cross-domain results demonstrates that the TREC-9 corpus is more challenging for an information filtering system than the TDT-3 corpus, and strongly suggests that the TDT-3 event tracking task is itself more difficult than the TREC batch filtering task. We also show that optimizing performance for TREC-9 and for TDT-3 tends to produce systems with different performance characteristics, which further complicates comparison between the two domains, and that iT9P and iCtrk both have properties that make them undesirable as general information filtering metrics.
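The abstract names three metrics without defining them. As a rough illustration only, the sketch below computes the base measures (F1, T9P, Ctrk) from a single confusion matrix, using the forms in which they are commonly stated for the TREC-9 filtering track and the TDT tracking evaluation. The function names are hypothetical. The constants (a minimum of 50 retrieved documents for T9P; miss cost 1.0, false-alarm cost 0.1, and target prior 0.02 for Ctrk) are assumptions taken from those evaluations' usual settings, not from this abstract. The meaning of the "i" prefix is not given here, so it is not modeled.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def t9p(tp: int, fp: int, min_retrieved: int = 50) -> float:
    """Precision with a floor on the denominator.

    Assumption: min_retrieved=50, as commonly stated for TREC-9.
    A system that retrieves fewer than min_retrieved documents is
    scored as if it had retrieved exactly that many.
    """
    return tp / max(min_retrieved, tp + fp)


def ctrk(tp: int, fp: int, fn: int, tn: int,
         c_miss: float = 1.0, c_fa: float = 0.1,
         p_target: float = 0.02) -> float:
    """Tracking cost: weighted sum of miss and false-alarm rates.

    Assumption: default costs and target prior follow the usual TDT
    settings. Lower is better, unlike f1 and t9p.
    """
    p_miss = fn / (tp + fn) if (tp + fn) else 0.0
    p_fa = fp / (fp + tn) if (fp + tn) else 0.0
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


if __name__ == "__main__":
    # Hypothetical counts for one category, chosen only for illustration.
    tp, fp, fn, tn = 10, 5, 5, 980
    print("F1  :", f1(tp, fp, fn))
    print("T9P :", t9p(tp, fp))
    print("Ctrk:", ctrk(tp, fp, fn, tn))
```

Under these assumed definitions, T9P depends only on the retrieved set and Ctrk depends on error rates relative to fixed priors, so a decision threshold tuned for one can behave quite differently under the other. This is consistent with the abstract's observation that optimizing for each domain yields systems with different performance characteristics.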