Design, creation, and analysis of Czech corpora for structural metadata extraction from speech

Authors:
Jáchym Kolář
Affiliations:
Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic 306 14
Venue:
Language Resources and Evaluation
Year:
2011

Abstract

Structural metadata extraction (MDE) research aims to develop techniques for automatically converting raw speech recognition output into forms that are more useful to humans and to downstream automatic processes. MDE annotation includes inserting boundaries of sentence-like units into the flow of speech, labeling non-content words such as filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper describes the design, creation, and analysis of data resources for structural MDE from spoken Czech. The annotation is based on the LDC's MDE annotation standard for English, with changes applied to accommodate phenomena specific to Czech. In addition to the necessary language-dependent modifications, we proposed and applied several language-independent modifications that slightly refine the original annotation scheme. We created two Czech MDE speech corpora: one in the domain of broadcast news and the other in the domain of broadcast conversations. Both corpora have already been published through the LDC. The analysis section of this paper presents a variety of statistics about fillers, edit disfluencies, and sentence-like units. The two Czech corpora are compared not only with each other but also with the available English MDE corpora. We also report statistics indicating that the part-of-speech (POS) distribution of edit disfluencies differs from the overall POS distribution. The findings from the corpus analysis should help guide strategies for developing automatic MDE systems.