Detection and modeling of transient audio signals with prior information

  • Authors:
  • Julius O. Smith, III; Harvey Thornburg

  • Affiliations:
  • Stanford University; Stanford University

  • Year:
  • 2005

Abstract

Many musical audio signals are well represented as a sum of sinusoids with slowly varying parameters. This representation has uses in audio coding, time and pitch scale modification, and automated music analysis, among other areas. Transients (events where the spectral content changes abruptly, or regions for which spectral content is best modeled as undergoing persistent change) pose particular challenges for these applications. We aim to detect abrupt-change transients, identify transient region boundaries, and develop new representations utilizing these detection capabilities to reduce perceived artifacts in time and pitch scale modifications. In particular, we introduce a hybrid sinusoidal/source-filter model which faithfully reproduces attack transient characteristics under time and pitch modifications.

The detection tasks prove difficult for sufficiently complex and heterogeneous musical signals. Fortunately, musical signals are highly structured—both at the signal level, in terms of the spectrotemporal structure of note events, and at higher levels, in terms of melody and rhythm. These structures generate context useful in predicting attributes such as pitch content, the presence and location of abrupt-change transients associated with musical onsets, and the boundaries of transient regions. To this end, a dynamic Bayesian framework is proposed for which contextual predictions may be integrated with signal information in order to make optimal decisions concerning these attributes. The result is a joint segmentation and melody retrieval for nominally monophonic signals. The system detects note event boundaries and pitches, also yielding a frame-level sub-segmentation of these events into transient/steady-state regions. The approach is successfully applied to notoriously difficult examples like bowed string recordings captured in highly reverberant environments.
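The opening idea — a signal represented as a sum of sinusoids — can be illustrated with a minimal synthesis sketch. This is an illustrative simplification only: it uses fixed frequencies and amplitudes, whereas the model described in the abstract allows the parameters to vary slowly over time; all names here are hypothetical, not from the work itself.

```python
import math

def synth_sinusoids(freqs, amps, n_samples, sr=44100):
    """Render a sum of fixed-parameter sinusoids (a simplification:
    the abstract's model lets frequencies and amplitudes drift slowly)."""
    out = []
    for n in range(n_samples):
        t = n / sr
        out.append(sum(a * math.sin(2 * math.pi * f * t)
                       for f, a in zip(freqs, amps)))
    return out

# A crude harmonic "note" at 220 Hz with decaying partial amplitudes.
signal = synth_sinusoids([220.0, 440.0, 660.0], [1.0, 0.5, 0.25], 1024)
```

A real sinusoidal model would estimate these parameters per analysis frame from the signal and interpolate them across frames; the point here is only the additive structure that transients violate.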
The proposed transcription engine is driven by a probabilistic model of short-time Fourier transform peaks given pitch content hypotheses. The model proves robust to missing and spurious peaks as well as uncertainties about timbre and inharmonicity. The peaks' likelihood evaluation marginalizes over a number of observation-template linkages exponential in the number of observed peaks; to remedy this, a Markov-chain Monte Carlo (MCMC) traversal is developed which yields virtually identical results with greatly reduced computation.
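The exponential marginalization described above can be made concrete with a toy sketch. The snippet below uses a placeholder Gaussian match score between observed and template peak frequencies (the actual observation model is far richer), enumerates every observation-to-template linkage exactly, and then estimates the same sum by plain uniform Monte Carlo sampling. The paper's MCMC traversal is more sophisticated; this is only a hedged illustration of why sampling can replace exhaustive enumeration. All function names and parameter values are hypothetical.

```python
import itertools
import math
import random

def peak_likelihood(obs_freq, template_freq, sigma=5.0):
    """Gaussian match score between an observed and a template peak;
    a stand-in for the paper's observation model (in Hz)."""
    return math.exp(-0.5 * ((obs_freq - template_freq) / sigma) ** 2)

def exact_marginal(obs, templates):
    """Sum the joint score over every observation-template linkage.
    Cost is |templates| ** |obs|: exponential in the number of peaks."""
    total = 0.0
    for linkage in itertools.product(range(len(templates)), repeat=len(obs)):
        score = 1.0
        for o, t in zip(obs, linkage):
            score *= peak_likelihood(o, templates[t])
        total += score
    return total

def mc_marginal(obs, templates, n_samples=20000, seed=0):
    """Estimate the same sum by uniform sampling of linkages.
    (The paper instead develops a tailored MCMC traversal.)"""
    rng = random.Random(seed)
    m = len(templates)
    acc = 0.0
    for _ in range(n_samples):
        score = 1.0
        for o in obs:
            score *= peak_likelihood(o, templates[rng.randrange(m)])
        acc += score
    # Rescale the sample mean to an estimate of the full sum.
    return acc / n_samples * m ** len(obs)

obs = [221.0, 439.0, 662.0]        # observed STFT peak frequencies (Hz)
templates = [220.0, 440.0, 660.0]  # template peaks under a pitch hypothesis
```

For three peaks the exact sum is cheap, but with dozens of observed peaks the enumeration becomes infeasible while the sampling estimate stays fixed-cost, which is the motivation the abstract gives for the MCMC approach.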