Learning to identify fragmented words in spoken discourse

  • Authors: Piroska Lendvai
  • Affiliations: Tilburg University, The Netherlands
  • Venue: EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2
  • Year: 2003

Abstract

Disfluent speech makes spoken language utterances harder to process. In this paper we concentrate on identifying one disfluency phenomenon: fragmented words. Our data, from the Spoken Dutch Corpus, samples nearly 45,000 sentences of human discourse, ranging from spontaneous chat to media broadcasts. We classify each lexical item in a sentence as either a completely uttered or an incompletely uttered (i.e., fragmented) word. The task is carried out with both the IB1 and RIPPER machine learning algorithms, trained on a variety of features with an extensive optimization strategy. Our best classifier achieves an F-score of 74.9%, a significant improvement over the baseline. We discuss why memory-based learning is more successful than rule induction at correctly classifying fragmented words.
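
To make the task setup concrete, the following is a minimal sketch of memory-based classification in the spirit of IB1, together with the F-score used for evaluation. It is an illustration under stated assumptions, not the paper's implementation: the three symbolic features and the toy instances are hypothetical, the distance is a plain unweighted overlap count, and no parameter optimization is performed.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(memory, query, k=1):
    """Label a query by majority vote among the k closest stored instances.

    Simplification: ties in distance are broken by storage order.
    """
    ranked = sorted(memory, key=lambda inst: overlap_distance(inst[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def f_score(gold, predicted, positive="fragment"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical features per token (assumed for illustration only):
# (token is short, next word starts with this token, token is in a lexicon)
MEMORY = [
    (("yes", "yes", "no"), "fragment"),
    (("yes", "no", "no"), "fragment"),
    (("no", "no", "yes"), "complete"),
    (("yes", "no", "yes"), "complete"),
    (("no", "yes", "yes"), "complete"),
]

if __name__ == "__main__":
    queries = [("yes", "yes", "no"), ("no", "no", "yes")]
    for q in queries:
        print(q, "->", classify(MEMORY, q))
```

The classifier stores all training instances and defers generalization to query time, which is the defining trait of memory-based learning; a rule inducer such as RIPPER would instead compress the training data into a compact rule set.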