Corpora for topic detection and tracking

  • Authors:
  • Christopher Cieri; Stephanie Strassel; David Graff; Nii Martey; Kara Rennert; Mark Liberman

  • Affiliations:
  • Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA (all authors)

  • Venue:
  • Topic detection and tracking
  • Year:
  • 2002

Abstract

The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from 8 English and 3 Chinese sources. The published corpora contain audio; reference text, including written news text and transcripts of the broadcast audio; boundary tables segmenting the broadcasts into stories; and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first-story, and story-link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use, and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.