Multi-strategy learning for topic detection and tracking: a joint report of CMU approaches to multilingual TDT

  • Authors:
  • Yiming Yang; Jaime Carbonell; Ralf Brown; John Lafferty; Thomas Pierce; Thomas Ault

  • Affiliations:
  • School of Computer Science, Carnegie Mellon University (CMU), Pittsburgh, PA (all authors)

  • Venue:
  • Topic detection and tracking
  • Year:
  • 2002

Abstract

This chapter reports on CMU's work in all five TDT-1999 tasks: segmentation (story boundary identification), topic tracking, topic detection, first story detection, and story link detection. We addressed these tasks as supervised or unsupervised classification problems and applied a variety of statistical learning algorithms to each problem for comparison. For segmentation we used exponential language models and decision trees. For topic tracking we used primarily k-nearest-neighbors classification, along with language models, decision trees, and a variant of the Rocchio approach. For topic detection we used a combination of incremental clustering and agglomerative hierarchical clustering. For first story detection and story link detection we used a measure based on cosine similarity. We also studied the effect of combining the outputs of alternative methods to produce joint classification decisions in topic tracking, and found that combining multiple methods typically improved the classification of new topics compared with using any single method. We examined our approaches on multilingual corpora, including stories in English, Mandarin, and Spanish, and on multimedia corpora consisting of newswire texts and the output of automated speech recognition for broadcast news sources. The methods worked reasonably well under all of these conditions.
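
To make the cosine-similarity idea mentioned for first story detection concrete, the following is a minimal illustrative sketch, not the chapter's implementation. It assumes raw term counts from whitespace tokenization and a fixed novelty threshold; the function names `cosine` and `detect_first_stories`, the threshold value of 0.3, and the sample stories are all hypothetical choices made for this example. The abstract does not specify term weighting, thresholds, or how past stories are stored.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(stories, threshold=0.3):
    """Return one flag per story: True if it resembles no earlier story.

    The threshold is a hypothetical value chosen for illustration only.
    """
    seen = []   # term-count vectors of all stories processed so far
    flags = []
    for text in stories:
        vec = Counter(text.lower().split())
        # Similarity to the closest earlier story (0.0 if there is none).
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags


# Made-up sample stream, processed in arrival order.
stories = [
    "earthquake strikes coastal city",
    "coastal city earthquake damage reported",
    "election results announced tonight",
]
flags = detect_first_stories(stories)
print(flags)
```

In this sketch a story is flagged as a first story when its best match among earlier stories falls below the threshold. The same similarity function could in principle be applied to a single pair of stories for story link detection, again under the assumptions stated above.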