The NXT-format Switchboard Corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue

Authors:
Sasha Calhoun;Jean Carletta;Jason M. Brenier;Neil Mayo;Dan Jurafsky;Mark Steedman;David Beaver
Affiliations:
School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, Scotland, UK EH8 9JZ;School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK;Nuance Communications, Inc., Sunnyvale, USA;School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK;Department of Linguistics, Stanford University, Stanford, USA;School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK;Department of Linguistics, University of Texas at Austin, Austin, USA
Venue:
Language Resources and Evaluation
Year:
2010

Abstract

This paper describes a recently completed common resource for the study of spoken discourse, the NXT-format Switchboard Corpus. Switchboard is a long-standing corpus of telephone conversations (Godfrey et al. in SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of ICASSP-92, pp. 517–520, 1992). We have brought together transcriptions with existing annotations for syntax, disfluency, speech acts, animacy, information status, coreference, and prosody; along with substantial new annotations of focus/contrast, more prosody, syllables and phones. The combined corpus uses the format of the NITE XML Toolkit, which allows these annotations to be browsed and searched as a coherent set (Carletta et al. in Lang Resour Eval J 39(4):313–334, 2005). The resulting corpus is a rich resource for the investigation of the linguistic features of dialogue and how they interact. As well as describing the corpus itself, we discuss our approach to overcoming issues involved in such a data integration project, relevant to both users of the corpus and others in the language resource community undertaking similar projects.