iDVT: an interactive digital violin tutoring system based on audio-visual fusion
MM '08 Proceedings of the 16th ACM international conference on Multimedia
Computer-assisted violin tutoring requires accurate violin transcription. For pitched non-percussive (PNP) sounds such as those produced by the violin, note segmentation is a much harder task than pitch detection. This difficulty is accentuated when the audio is recorded during a practice session at home, where acoustic conditions are far inferior to those of a professional recording studio. This paper presents a new approach to the problem that exploits the correlation between different media streams for e-learning applications. We design a capture mechanism to record one audio stream and two video streams simultaneously, and exploit the relationships among them for enhanced transcription. State-of-the-art audio methods for note segmentation and pitch estimation are implemented as the audio-only baseline. Two web cameras are employed to track the right hand (bowing) and the left hand's four fingers on the fingerboard (fingering), respectively. The audio and visual information is then fused in the feature space. Our approach is evaluated on an audio-visual violin music database containing 16 complete pieces of different styles with 2157 notes in total. Experimental results show that, compared with the audio-only baseline, the multimodal approach achieves a 10% increase in true positives and an 8% reduction in false positives in overall transcription performance.
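The abstract describes fusing audio and visual information at the feature level. As a minimal sketch of that idea (not the paper's actual implementation), the snippet below extracts per-frame pitch and onset-strength features from the audio with librosa and concatenates them with per-frame bowing and fingering trajectories, which are assumed here to be plain coordinate arrays obtained from the two camera streams; all function and variable names are illustrative.

```python
import numpy as np
import librosa


def audio_features(wav_path, hop_length=512):
    """Per-frame audio features: fundamental frequency (pYIN) and onset strength."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    f0, _, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("G3"),   # lowest open string of the violin
        fmax=librosa.note_to_hz("E7"),   # upper end of the playing range
        sr=sr,
        hop_length=hop_length,
    )
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    n = min(len(f0), len(onset_env))     # align the two feature tracks
    return np.column_stack([np.nan_to_num(f0[:n]), onset_env[:n]])


def fuse_features(audio_feats, bow_track, finger_track):
    """Feature-level fusion: resample the visual tracks to the audio frame
    count and concatenate everything frame by frame."""
    n = audio_feats.shape[0]

    def resample(track):
        track = np.asarray(track, dtype=float)
        idx = np.linspace(0, len(track) - 1, n).round().astype(int)
        return track[idx]

    return np.hstack([audio_feats, resample(bow_track), resample(finger_track)])
```

A downstream note-segmentation classifier could then be trained on the fused feature matrix instead of on the audio features alone, which is the general pattern the abstract attributes to the improved transcription results.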