Annotations and tools for an activity based Spoken Language Corpus

Authors:
Jens Allwood;Leif Grönqvist
Affiliations:
Göteborg University, Göteborg, Sweden;Göteborg University, Göteborg, Sweden
Venue:
SIGDIAL '01 Proceedings of the Second SIGdial Workshop on Discourse and Dialogue - Volume 16
Year:
2001

Abstract

The paper describes the Spoken Language Corpus of Swedish at the Department of Linguistics, Göteborg University (GSLC), and summarizes the types of analysis and the tools that have been developed for work on this corpus. Work on the corpus started in the late 1970s. It is growing incrementally and presently consists of 1.3 million words from about 25 different social activities. The corpus was initiated to meet a growing interest in naturalistic spoken language data. It is based on the fact that spoken language varies considerably across social activities with regard to pronunciation, vocabulary, grammar and communicative functions. The goal of the corpus is to include spoken language from as many social activities as possible, in order to get a more complete understanding of the role of language and communication in human social life. This type of spoken language corpus is still fairly unusual, even for English, since many spoken language corpora (certainly for Swedish) have been collected for special purposes, such as speech recognition, phonetics, dialectal variation or interaction with a computerized dialog system in a very narrow domain, e.g. Map Task (Isard and Carletta 1995), TRAINS (Heeman and Allen 1994) and Waxholm (Blomberg et al. 1993). Among English corpora, the Göteborg corpus is most similar to the Wellington Corpus of Spoken New Zealand English (Holmes, Vine and Johnson 1998), but it also has traits in common with the BNC, the London/Lund corpus (Svartvik 1990) and the Danish BySoc corpus (Gregersen 1991, Henrichsen 1997).