Robot with two ears listens to more than two simultaneous utterances by exploiting harmonic structures

Authors:
Yasuharu Hirasawa;Toru Takahash;Tetsuya Ogata;Hiroshi G. Okuno
Affiliations:
Graduate School of Informatics, Kyoto University, Kyoto, Japan;Graduate School of Informatics, Kyoto University, Kyoto, Japan;Graduate School of Informatics, Kyoto University, Kyoto, Japan;Graduate School of Informatics, Kyoto University, Kyoto, Japan
Venue:
IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
Year:
2011

Citing 4
Cited 0

Independent component analysis: algorithms and applications

Neural Networks
Analysis of sparse representation and blind source separation

Neural Computation
Blind separation of speech mixtures via time-frequency masking

IEEE Transactions on Signal Processing
Underdetermined blind source separation based on sparse representation

IEEE Transactions on Signal Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

In real-world situations, people often hear more than two simultaneous sounds. For robots, when the number of sound sources exceeds that of sensors, the situation is called under-determined, and robots with two ears need to deal with this situation. Some studies on under-determined sound source separation use L1-norm minimization methods, but the performance of automatic speech recognition with separated speech signals is poor due to its spectral distortion. In this paper, a two-stage separation method to improve separation quality with low computational cost is presented. The first stage uses a L1-norm minimization method in order to extract the harmonic structures. The second stage exploits reliable harmonic structures to maintain acoustic features. Experiments that simulate three utterances recorded by two microphones in an anechoic chamber show that our method improves speech recognition correctness by about three points and is fast enough for real-time separation.