Information Filtering in TREC-9 and TDT-3: A Comparative Analysis

  • Authors: Thomas Galen Ault; Yiming Yang
  • Affiliations: Language Technologies Institute, Carnegie Mellon University (both authors); tomault@cs.cmu.edu; yiming@cs.cmu.edu
  • Venue: Information Retrieval
  • Year: 2002

Abstract

Much work on automated information filtering has been done in the TREC and TDT domains, but differences in corpora, the nature of TREC topics vs. TDT events, the constraints imposed on training and testing, and the choices of performance measures confound any meaningful comparison between these domains. We attempt to bridge the gap between them by evaluating the performance of the k-nearest-neighbor (kNN) classification system on the corpus and categories from one domain using the constraints of the other. To maximize comparability and understand the effect of the evaluation metrics specific to each domain, we optimize the performance of kNN separately for the F1, T9P (preferred metric for TREC-9) and Ctrk (official metric for TDT-3) metrics. A thorough comparison of our within-domain and cross-domain results demonstrates that the corpus used for TREC-9 is more challenging for an information filtering system than the TDT-3 corpus, and strongly suggests that the TDT-3 event tracking task itself is more difficult than the TREC batch filtering task. We also show that optimizing performance in TREC-9 and TDT-3 tends to result in systems with different performance characteristics, which further confounds comparison between the two domains, and that T9P and Ctrk both have properties that make them undesirable as general information filtering metrics.
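
The abstract does not define the three metrics it names (F1, T9P and Ctrk), so the following is only an illustrative sketch of how they are commonly computed from a per-topic contingency table. The function names, the minimum-delivery threshold of 50 for T9P, and the Ctrk constants (miss cost 1.0, false-alarm cost 0.1, target prior 0.02) are assumptions taken from the usual TREC-9 and TDT evaluation conventions, not values confirmed by this page.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def t9p(tp: int, fp: int, min_delivered: int = 50) -> float:
    """Precision with a floor on the number of delivered documents.

    Assumed form: relevant delivered / max(min_delivered, total delivered).
    """
    return tp / max(min_delivered, tp + fp)


def c_trk(tp: int, fp: int, fn: int, tn: int,
          c_miss: float = 1.0, c_fa: float = 0.1,
          p_target: float = 0.02) -> float:
    """Tracking cost (lower is better), weighting misses and false alarms.

    The cost constants and the target prior are assumed defaults.
    """
    p_miss = fn / (tp + fn) if (tp + fn) else 0.0
    p_fa = fp / (fp + tn) if (fp + tn) else 0.0
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


if __name__ == "__main__":
    # One hypothetical topic: 10 hits, 10 false alarms, 10 misses, 970 correct rejections.
    print(f1(10, 10, 10), t9p(10, 10), c_trk(10, 10, 10, 970))
```

Under these assumed definitions, a system that delivers nothing scores a cost of only 0.02 on Ctrk, while T9P penalizes any system that delivers fewer than 50 documents regardless of how precise it is. That is one way to see why tuning for one metric can produce a system that looks quite different from one tuned for the other.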