Spoken language annotation and data-driven modelling of phone-level pronunciation in discourse context

Authors:
Per-Anders Jande
Affiliations:
Department of Speech, Music and Hearing, School of Computer Science and Communication, KTH, Lindstedtsvägen 24, SE-100 44 Stockholm, Sweden
Venue:
Speech Communication
Year:
2008

Abstract

A detailed description of the discourse context of a word can be used to predict the pronunciation of the word in that context, and it also enables studies of how various types of information interact in shaping, for example, phone-level pronunciation. The work presented in this paper aims at modelling the systematic variation in the phone-level realisation of words that is inherent to a language variety. A data-driven approach based on access to detailed discourse context descriptions is used. The discourse context descriptions are constructed by annotating spoken language with a large variety of linguistic and related variables in multiple layers. Decision tree pronunciation models are induced from the annotation. The effects of using different types and different amounts of information for model induction are explored. Models generated in a tenfold cross-validation experiment produce on average 8.2% errors on the phone level when they are trained on all available information. Models trained on phoneme-level information only have an average phone error rate of 14.2%. This means that including information above the phoneme level in the context description can reduce the phone error rate by 42.2% in relative terms.
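
The closing figure is a relative reduction in error rather than a difference in percentage points. The following minimal sketch shows the arithmetic, assuming only the two rounded error rates given in the abstract; the function and variable names are illustrative and do not come from the paper.

```python
def relative_error_reduction(baseline_error: float, improved_error: float) -> float:
    """Return the fraction of the baseline error removed by the improved model."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Rounded phone error rates (in percent) as stated in the abstract.
phoneme_only_error = 14.2      # models trained on phoneme-level information only
all_information_error = 8.2    # models trained on all available information

# Difference in percentage points versus reduction relative to the baseline.
absolute_reduction = phoneme_only_error - all_information_error
relative_reduction = relative_error_reduction(phoneme_only_error, all_information_error)

print(f"absolute reduction: {absolute_reduction:.1f} percentage points")
print(f"relative reduction: {100 * relative_reduction:.1f}%")
```

With the rounded inputs this gives about 42.3%, slightly above the 42.2% stated in the abstract. The difference presumably arises because the reported value was computed from unrounded error rates, which is an assumption here, since the abstract does not give them.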