Semantic labeling of nonspeech audio clips

  • Authors:
  • Xiaojuan Ma; Christiane Fellbaum; Perry Cook

  • Affiliations:
  • Computer Science Department, Princeton University, Princeton, NJ (all authors)

  • Venue:
  • EURASIP Journal on Audio, Speech, and Music Processing - Special issue on scalable audio-content analysis
  • Year:
  • 2010

Abstract

Human communication about entities and events is primarily linguistic in nature. While visual representations of information have also been shown to be highly effective, relatively little is known about the communicative power of auditory nonlinguistic representations. We created a collection of short nonlinguistic auditory clips encoding familiar human activities, objects, animals, natural phenomena, machinery, and social scenes. We presented these sounds to a broad spectrum of anonymous human workers on Amazon Mechanical Turk and collected verbal sound labels. We analyzed the human labels in terms of their lexical and semantic properties to verify that the audio clips evoke the information suggested by their pre-defined captions. We then measured the level of agreement among the semantically compatible labels for each sound clip. Finally, we examined which kinds of entities and events, when captured by nonlinguistic acoustic clips, appear well-suited to conveying information, and which ones are less discriminable. Our work is motivated by the broader goal of creating resources that facilitate communication for people with certain types of language loss. Furthermore, our data should prove useful for future research in machine analysis and synthesis of audio, such as computational auditory scene analysis, and in annotating and querying large collections of sound effects.
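
The abstract does not specify how semantic compatibility between worker labels and a clip's pre-defined caption was scored. Below is a minimal, hypothetical sketch of one way such a measurement could be set up, using WordNet path similarity via NLTK; the function names, the similarity measure, and the threshold value are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: quantify how many worker-supplied labels are
# semantically compatible with a clip's caption word, using WordNet
# path similarity. Requires nltk with the WordNet corpus downloaded
# (nltk.download('wordnet')). Not the authors' actual procedure.
from nltk.corpus import wordnet as wn


def max_path_similarity(word_a: str, word_b: str) -> float:
    """Best path similarity over all noun-sense pairs of two words (0.0 if none)."""
    best = 0.0
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            sim = syn_a.path_similarity(syn_b)
            if sim is not None and sim > best:
                best = sim
    return best


def label_agreement(caption_word: str, labels: list[str], threshold: float = 0.25) -> float:
    """Fraction of worker labels judged semantically compatible with the caption word.

    The threshold is an arbitrary illustrative cutoff, not a value from the paper.
    """
    if not labels:
        return 0.0
    compatible = sum(
        1 for lab in labels if max_path_similarity(caption_word, lab) >= threshold
    )
    return compatible / len(labels)


# Hypothetical usage: labels collected for a "dog barking" clip.
print(label_agreement("dog", ["dog", "puppy", "wolf", "thunder"]))
```

In practice one might also lemmatize multi-word labels and compare against every content word in the caption rather than a single caption word; the sketch keeps only the core idea of thresholded lexical-semantic similarity.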