iDVT: an interactive digital violin tutoring system based on audio-visual fusion
MM '08 Proceedings of the 16th ACM international conference on Multimedia
Computer-assisted violin tutoring requires accurate violin transcription. For pitched non-percussive (PNP) sounds such as those produced by the violin, note segmentation is a much harder task than pitch detection. This difficulty is accentuated when the audio is recorded during a practice session at home, where acoustic conditions are far inferior to those of a professional recording studio. This paper presents a new approach to the problem that exploits the correlation between different media streams for e-learning applications. We design a capture mechanism to record one audio stream and two video streams simultaneously, and exploit the relationships among them for enhanced transcription. State-of-the-art audio methods for note segmentation and pitch estimation are implemented as the audio-only baseline. Two web cameras are employed to track the right hand (bowing) and the left hand's four fingers on the fingerboard (fingering), respectively. The audio and visual information is then fused in the feature space. Our approach is evaluated on an audio-visual violin music database containing 16 complete pieces of different styles with 2157 notes in total. Experimental results show that, compared with the audio-only baseline, the multimodal approach achieves a 10% increase in true positives and an 8% reduction in false positives in overall transcription performance.
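The abstract describes fusing audio and visual information at the feature level. As a minimal sketch of that idea (not the paper's actual implementation), the snippet below extracts per-frame pitch and onset-strength features from the audio with librosa and concatenates them with per-frame bowing and fingering trajectories, which are assumed here to be plain coordinate arrays obtained from the two camera streams; all function and variable names are illustrative.

```python
import numpy as np
import librosa


def audio_features(wav_path, hop_length=512):
    """Per-frame audio features: fundamental frequency (pYIN) and onset strength."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    f0, _, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("G3"),   # lowest open string of the violin
        fmax=librosa.note_to_hz("E7"),   # upper end of the playing range
        sr=sr,
        hop_length=hop_length,
    )
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    n = min(len(f0), len(onset_env))     # align the two feature tracks
    return np.column_stack([np.nan_to_num(f0[:n]), onset_env[:n]])


def fuse_features(audio_feats, bow_track, finger_track):
    """Feature-level fusion: resample the visual tracks to the audio frame
    count and concatenate everything frame by frame."""
    n = audio_feats.shape[0]

    def resample(track):
        track = np.asarray(track, dtype=float)
        idx = np.linspace(0, len(track) - 1, n).round().astype(int)
        return track[idx]

    return np.hstack([audio_feats, resample(bow_track), resample(finger_track)])
```

A downstream note-segmentation classifier could then be trained on the fused feature matrix instead of on the audio features alone, which is the general pattern the abstract attributes to the improved transcription results.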