Speech Enhancement

NMF meets Kalman filter dynamics for high-quality speech enhancement in non-stationary noise.

MERL Researchers: Jonathan Le Roux (Speech & Audio).
Joint work with: Cedric Fevotte (CNRS).

Search MERL publications by keyword: Speech & Audio, speech enhancement, denoising, dynamical system, non-negative matrix factorization, hidden Markov model.

Denoising of a mixture of female speech and helicopter sound.

Non-negative data arise in a variety of important signal processing domains, such as power spectra of signals, pixels in images, and count data. We introduce a novel non-negative dynamical system for sequences of such data, and describe its application to modeling speech and audio power spectra.

This work bridges two active fields, dynamical systems and non-negative matrix factorization (NMF). Dynamical systems are a longstanding area of research with applications in many scientific fields. A large body of literature is devoted to linear dynamical systems (LDS), which describe an observed sequence as a linear transform of some latent variables perturbed by an additive random (observation) innovation. The latent variables themselves are modeled as a continuous-state Markov chain: their value at a given time is a linear transform of their previous value, again perturbed by an additive random (state) innovation. A well-known example is the state-space model underlying the Kalman filter. The LDS model, however, does not naturally apply when the observations and the latent variables are non-negative.
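The LDS recursion described above can be sketched in a few lines. This is a minimal illustration, assuming Gaussian innovations; the function name `simulate_lds`, the matrices `A` and `C`, and the noise levels are hypothetical values chosen for the example, not taken from this work.

```python
import random

def simulate_lds(A, C, n_steps, state_std=0.1, obs_std=0.1, seed=0):
    """Simulate x_t = A x_{t-1} + w_t and y_t = C x_t + v_t.

    w_t and v_t are additive Gaussian innovations (an assumption of this sketch).
    """
    rng = random.Random(seed)
    k = len(A)       # latent dimension
    x = [1.0] * k    # arbitrary initial state
    states, observations = [], []
    for _ in range(n_steps):
        # State update: linear transform of the previous state plus additive noise.
        x = [sum(A[i][j] * x[j] for j in range(k)) + rng.gauss(0.0, state_std)
             for i in range(k)]
        # Observation: linear transform of the current state plus additive noise.
        y = [sum(C[f][j] * x[j] for j in range(k)) + rng.gauss(0.0, obs_std)
             for f in range(len(C))]
        states.append(x)
        observations.append(y)
    return states, observations

# Hypothetical toy parameters: 2 latent variables, 3 observed dimensions.
A = [[0.9, 0.1], [0.0, 0.8]]
C = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
states, observations = simulate_lds(A, C, n_steps=200)

# Even with non-negative A and C, additive innovations produce negative values.
has_negative = any(v < 0 for y in observations for v in y)
```

In this toy run the additive noise pushes some observations below zero, which is the mismatch with power spectra and other non-negative data noted above.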

NMF, on the other hand, is a more recent research area that has attracted a lot of attention in the signal processing and machine learning communities. In the general case, NMF is the problem of approximating an observed matrix with non-negative entries as the product of two matrices with non-negative entries. In some settings, the columns of the observed matrix form a sequence with evolving dynamics, in the sense of statistical dependencies between elements in the sequence, which standard forms of NMF fail to capture. Our work brings probabilistic dynamics to NMF, comparable to those of the traditional LDS.
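As a concrete instance of the factorization problem, here is a minimal sketch of standard NMF using the classic multiplicative updates for the squared Euclidean cost. This is one common variant, chosen here for brevity and not necessarily the cost used in this work. The helper names and the toy matrix `V` are hypothetical.

```python
import random

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Approximate V (F x N) by W (F x rank) times H (rank x N), all non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Toy non-negative matrix that admits an exact rank-2 non-negative factorization.
V = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [1.0, 0.0, 1.0, 0.0]]
W, H = nmf(V, rank=2)
error = frobenius_error(V, W, H)
```

Each column of `H` is estimated without any reference to its neighbors, so permuting the columns of `V` simply permutes the columns of `H`. That is the sense in which standard NMF ignores temporal dynamics.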


The discrete-state hidden Markov model (HMM) is another dynamical system that has been commonly used to handle the dynamics of speech, most famously in automatic speech recognition, but also in speech synthesis and speech separation. In this setting, the speech features are usually taken to be cepstral coefficients or other log-spectrum-based features. However, HMMs lead to combinatorial complexity due to the discrete state space, especially when several speakers are active at the same time. Because of the discreteness of the state space and the state-conditional independence of adjacent frames, HMMs also do not easily handle gain adaptation and continuity over time. In contrast, standard NMF addresses both the computational cost (it has linear complexity per iteration) and the gain adaptation problem, but it does not handle continuous dynamics. By bringing continuous dynamics to an NMF-like formulation, we hope to obtain the best of both worlds.

Denoising of a mixture of female speech and sound recorded at a railroad crossing.

The model we propose is called the non-negative dynamical system (NDS). Its formulation follows that of linear dynamical systems, but the observations and the latent variables are assumed to be non-negative, the linear transforms are assumed to involve non-negative coefficients, and the additive random innovations, both for the observations and for the latent variables, are replaced by multiplicative random innovations.
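The generative structure described above can be sketched as a simulation. This is a minimal illustration only: the unit-mean Gamma innovations, the function name `simulate_nds`, the shape parameters, and the toy matrices are assumptions made for this example, since the text above only specifies that the innovations are multiplicative.

```python
import random

def simulate_nds(A, W, n_steps, state_shape=10.0, obs_shape=1.0, seed=0):
    """Simulate h_t = (A h_{t-1}) * eps_t and v_t = (W h_t) * xi_t.

    eps_t and xi_t are element-wise multiplicative innovations, drawn here
    from Gamma distributions with mean 1 (an assumed choice for this sketch).
    """
    rng = random.Random(seed)
    k = len(A)       # number of latent components
    h = [1.0] * k    # arbitrary non-negative initial state

    def unit_mean_gamma(shape):
        # Gamma with the given shape and scale 1/shape has mean 1.
        return rng.gammavariate(shape, 1.0 / shape)

    activations, spectra = [], []
    for _ in range(n_steps):
        # State update: non-negative linear transform times multiplicative noise.
        h = [sum(A[i][j] * h[j] for j in range(k)) * unit_mean_gamma(state_shape)
             for i in range(k)]
        # Observation: non-negative linear transform times multiplicative noise.
        v = [sum(W[f][j] * h[j] for j in range(k)) * unit_mean_gamma(obs_shape)
             for f in range(len(W))]
        activations.append(h)
        spectra.append(v)
    return activations, spectra

# Hypothetical toy parameters with non-negative coefficients.
A = [[0.9, 0.1], [0.1, 0.9]]
W = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
activations, spectra = simulate_nds(A, W, n_steps=100)
```

Because every factor in both updates is non-negative, the simulated activations and spectra stay non-negative by construction, unlike in an LDS with additive innovations.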

We apply our proposed model to the task of speech enhancement. We learn the parameters of the linear transforms on training speech data. At test time, we observe a mixture of speech and noise. We assume that the speech follows our NDS model with the parameters learned at training time, and we estimate the remaining parameters. All elements that are not well modeled by the learned speech model are considered to be part of the noise. By modeling the dynamics of the speech, our method separates noise from speech better than previous methods such as the optimally modified log-spectral amplitude (OMLSA) estimator, especially in very challenging environments that involve non-stationary noise. We give two denoising examples here: a mixture of speech with helicopter noise, and a mixture of speech with noise recorded at a railroad crossing.
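The train-then-fix workflow can be illustrated with a simplified stand-in. The sketch below uses plain NMF with a Euclidean cost and no temporal dynamics, so it is not the NDS method itself. A speech dictionary `W_speech` (assumed already learned) is held fixed, a noise dictionary and all activations are estimated from the mixture, and the speech estimate is obtained with a Wiener-like soft mask. The masking step, the function names, and the toy data are all assumptions of this example.

```python
import random

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def scale(M, num, den, eps=1e-12):
    # Element-wise multiplicative update M * num / den (keeps entries non-negative).
    return [[m * n / (d + eps) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, num, den)]

def enhance(V, W_speech, n_noise=1, n_iter=200, seed=0, eps=1e-12):
    """Fit V ~ [W_speech, W_noise] H with W_speech fixed, then soft-mask V."""
    rng = random.Random(seed)
    n_freq, n_frames, n_speech = len(V), len(V[0]), len(W_speech[0])
    W_noise = [[rng.random() + 0.1 for _ in range(n_noise)] for _ in range(n_freq)]
    H = [[rng.random() + 0.1 for _ in range(n_frames)]
         for _ in range(n_speech + n_noise)]
    for _ in range(n_iter):
        # Full dictionary: fixed speech columns followed by free noise columns.
        W = [ws + wn for ws, wn in zip(W_speech, W_noise)]
        Wt = transpose(W)
        H = scale(H, matmul(Wt, V), matmul(matmul(Wt, W), H))
        # Only the noise columns of the dictionary are re-estimated.
        Hn_t = transpose(H[n_speech:])
        W_noise = scale(W_noise, matmul(V, Hn_t), matmul(matmul(W, H), Hn_t))
    speech = matmul(W_speech, H[:n_speech])
    noise = matmul(W_noise, H[n_speech:])
    # Wiener-like mask: share of each time-frequency bin attributed to speech.
    return [[v * s / (s + n + eps) for v, s, n in zip(rv, rs, rn)]
            for rv, rs, rn in zip(V, speech, noise)]

# Hypothetical toy data: 4 frequency bins, 3 frames, 2 speech components.
W_speech = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 0.2]]
clean = matmul(W_speech, [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
noise = [[0.3] * 3, [0.0] * 3, [0.0] * 3, [2.0] * 3]
V = [[c + n for c, n in zip(rc, rn)] for rc, rn in zip(clean, noise)]
estimate = enhance(V, W_speech)
```

Whatever the fixed speech dictionary cannot explain is absorbed by the free noise component, which mirrors the idea that elements not well modeled by the speech model are treated as noise.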

Audio Samples

Helicopter: noisy [ wav mp3 ogg ]

Helicopter: OMLSA [ wav mp3 ogg ]

Helicopter: NDS [ wav mp3 ogg ]

Railroad: noisy [ wav mp3 ogg ]

Railroad: OMLSA [ wav mp3 ogg ]

Railroad: NDS [ wav mp3 ogg ]

MERL Publications

Fevotte, C., Le Roux, J., Hershey, J.R., "Non-negative Dynamical System with Application to Speech and Audio", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
BibTeX TR2013-021 PDF Software
@inproceedings{Fevotte2013may,
  author = {Fevotte, C. and {Le Roux}, J. and Hershey, J.R.},
  title = {{Non-negative Dynamical System with Application to Speech and Audio}},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year = 2013,
  month = may,
  url = {https://www.merl.com/publications/TR2013-021}
}