Reconstructing spontaneous speech

  • Authors: Frederick Jelinek; Erin Colleen Fitzgerald
  • Affiliations: The Johns Hopkins University; The Johns Hopkins University
  • Venue: Reconstructing spontaneous speech
  • Year: 2009


Abstract

The output of a speech recognition system is often not what is required for subsequent processing, in part because speakers themselves make mistakes (e.g. stuttering, self-correcting, or using filler words). A system would accomplish speech reconstruction of its spontaneous speech input if its output were to represent, in flawless, fluent, and content-preserved English, the message that the speaker intended to convey. These cleaner speech transcripts would allow for more accurate language processing as needed for natural language tasks such as machine translation and conversation summarization, which often assume a grammatical sentence as input.

Before attempting to reconstruct speech automatically, we seek to comprehensively understand the problem itself. We quantify the range, complexity, and frequencies of speaker errors common in spontaneous speech given a set of manual reconstruction annotations. This empirical analysis indicates the most frequent and problematic errors and thus suggests areas of focus for future reconstruction research.

The surface transformations recorded in our reconstruction annotations reflect underlying influences in the psycholinguistic and speech production models of spontaneous speakers. We review standard theories and seek empirical evidence of both model assumptions and conclusions given the manual reconstructions and corresponding shallow semantic labeling annotation we collect. This investigation of naturally occurring spontaneous speaker errors with manual semantico-syntactic analysis yields additional insight into the impact of spoken language on semantic structure and how these features can be used in future reconstruction efforts.

Finally, given our accumulated knowledge about the types, frequencies, and drivers of speaker-generated errors in spontaneous speech, we build a set of systems to automatically identify and correct a subset of the most frequent errors. Using a conditional random field classification model with lexical, syntactic, and shallow semantic features to train both word-level and utterance-level error classifiers, we show improvement in the correction of these errors over a state-of-the-art system.
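
The word-level error classifier mentioned in the last paragraph lends itself to a compact illustration. The following sketch is not the dissertation's implementation: it assumes the third-party sklearn-crfsuite package, a toy filler lexicon, and purely lexical features (the system described above also uses syntactic and shallow semantic features), and it trains a linear-chain CRF to tag each token of an utterance as KEEP or DELETE. All utterances, labels, and helper names are hypothetical.

# A minimal sketch (not the dissertation's code) of a word-level speaker-error
# classifier using a linear-chain CRF from the sklearn-crfsuite package.
import sklearn_crfsuite

FILLERS = {"uh", "um", "like", "well"}  # illustrative filler lexicon


def word_features(tokens, i):
    """Lexical features for token i of a tokenized utterance."""
    word = tokens[i].lower()
    return {
        "word": word,
        "is_filler": word in FILLERS,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        # A crude cue for simple repetitions (one kind of speaker self-correction).
        "repeats_prev": i > 0 and word == tokens[i - 1].lower(),
    }


def featurize(utterance):
    tokens = utterance.split()
    return [word_features(tokens, i) for i in range(len(tokens))]


# Each token is labeled KEEP (belongs in the reconstruction) or DELETE
# (a filler word or part of an edited region).
train_utterances = [
    "um i i want to go to boston",
    "we went to uh to the store",
]
train_labels = [
    ["DELETE", "DELETE", "KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "KEEP"],
    ["KEEP", "KEEP", "DELETE", "DELETE", "KEEP", "KEEP", "KEEP"],
]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit([featurize(u) for u in train_utterances], train_labels)

# Tokens predicted DELETE are dropped to produce a rough "reconstructed" transcript.
test = "well i uh i think that works"
labels = crf.predict([featurize(test)])[0]
print(" ".join(tok for tok, lab in zip(test.split(), labels) if lab == "KEEP"))

An utterance-level classifier of the kind mentioned above could be framed analogously, assigning a single label per utterance rather than per token.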