The NXT-format Switchboard Corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue

  • Authors:
  • Sasha Calhoun; Jean Carletta; Jason M. Brenier; Neil Mayo; Dan Jurafsky; Mark Steedman; David Beaver

  • Affiliations:
  • School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, Scotland, UK EH8 9JZ; School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK; Nuance Communications, Inc., Sunnyvale, USA; School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK; Department of Linguistics, Stanford University, Stanford, USA; School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK; Department of Linguistics, University of Texas at Austin, Austin, USA

  • Venue:
  • Language Resources and Evaluation
  • Year:
  • 2010

Abstract

This paper describes a recently completed common resource for the study of spoken discourse, the NXT-format Switchboard Corpus. Switchboard is a long-standing corpus of telephone conversations (Godfrey et al. in SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of ICASSP-92, pp. 517–520, 1992). We have brought together transcriptions with existing annotations for syntax, disfluency, speech acts, animacy, information status, coreference, and prosody, along with substantial new annotations of focus/contrast, more prosody, syllables and phones. The combined corpus uses the format of the NITE XML Toolkit, which allows these annotations to be browsed and searched as a coherent set (Carletta et al. in Lang Resour Eval J 39(4):313–334, 2005). The resulting corpus is a rich resource for investigating the linguistic features of dialogue and how they interact. As well as describing the corpus itself, we discuss our approach to overcoming the issues involved in such a data integration project, which is relevant both to users of the corpus and to others in the language resource community undertaking similar projects.