On real-time multi-stage speech enhancement systems
Lingjun Meng
Jozef Coldenhoff
Paul Kendrick
Tijana Stojkovic
Andrew Harper
Kiril Ratmanski
Milos Cernak
Logitech Europe S.A., Lausanne, Switzerland




Abstract

Recently, multi-stage systems have stood out among deep learning-based speech enhancement methods. However, these systems are typically highly complex, requiring millions of parameters and substantial computational resources, which limits their use for real-time processing on low-power devices. Moreover, the contribution of the various factors behind the success of multi-stage systems remains unclear, which makes it challenging to reduce their size. In this paper, we extensively investigate a lightweight two-stage network with only 560k parameters in total. It consists of a Mel-scale magnitude masking model in the first stage and a complex spectrum mapping model in the second stage. We first provide a consolidated view of the roles of the gain power factor, the post-filter, and the training labels for the Mel-scale masking model. We then explore several training schemes for the two-stage network and offer insights into its superiority. We show that the proposed two-stage network, trained with the optimal scheme, achieves performance similar to DeepFilterNet2, an open-source model four times its size.
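To make the two-stage structure concrete, below is a minimal inference sketch: a first-stage network predicts a real-valued gain on the Mel-scale magnitude, and a second-stage network refines the result by mapping the complex spectrum. The STFT/Mel settings, tensor layouts, and the `stage1`/`stage2` module interfaces are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of a two-stage enhancement pipeline as described in the abstract.
# All sizes, STFT/Mel parameters, and the stage1/stage2 network interfaces are
# illustrative assumptions, not the paper's actual configuration.
import torch
import torchaudio

N_FFT, HOP, N_MELS, SR = 512, 128, 64, 16000

stft_window = torch.hann_window(N_FFT)
mel_fb = torchaudio.functional.melscale_fbanks(
    n_freqs=N_FFT // 2 + 1, f_min=0.0, f_max=SR / 2,
    n_mels=N_MELS, sample_rate=SR)                       # (n_freqs, n_mels)

def enhance(noisy: torch.Tensor, stage1, stage2) -> torch.Tensor:
    """noisy: (time,) waveform; stage1/stage2: trained networks (assumed)."""
    spec = torch.stft(noisy, N_FFT, HOP, window=stft_window,
                      return_complex=True)               # (freq, frames)
    mag = spec.abs()

    # Stage 1: predict a real-valued gain per Mel band, expand it back to the
    # linear frequency axis, and mask the noisy magnitude (noisy phase is kept).
    mel_mag = mel_fb.T @ mag                             # (n_mels, frames)
    mel_gain = stage1(mel_mag.unsqueeze(0)).squeeze(0)   # (n_mels, frames) in [0, 1]
    gain = mel_fb @ mel_gain                             # back to (freq, frames)
    # (The paper also studies a gain power factor and a post-filter for this
    #  stage, e.g. gain ** beta; omitted here for brevity.)
    spec_s1 = gain * spec

    # Stage 2: refine the stage-1 output by mapping its real/imaginary parts to
    # an enhanced complex spectrum, which can also correct the phase.
    ri = torch.stack([spec_s1.real, spec_s1.imag], dim=0)
    ri_out = stage2(ri.unsqueeze(0)).squeeze(0)          # (2, freq, frames)
    spec_s2 = torch.complex(ri_out[0], ri_out[1])

    return torch.istft(spec_s2, N_FFT, HOP, window=stft_window,
                       length=noisy.shape[-1])
```

Intuitively, operating on Mel bands keeps the first-stage network small, while the second stage works on the full complex spectrum and can refine detail, including phase, that a real-valued magnitude mask alone cannot correct.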


DNS Challenge test set

PESQ scores for four samples: the noisy mixture, the output after the 1st stage (Mel-scale masking), and the output after the 2nd stage (complex spectrum learning). [Audio clips omitted.]

Sample   Mixture   After 1st stage   After 2nd stage
1        1.27      1.98              2.11
2        1.23      2.17              2.31
3        1.33      2.16              2.37
4        1.28      2.54              2.71
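For reference, wideband PESQ scores like those above can be computed with the open-source `pesq` package; this is an assumption about tooling for illustration, as the page does not state which PESQ implementation was used, and the file names are placeholders.

```python
# Hedged example: computing wideband PESQ (ITU-T P.862.2) for one clean/enhanced
# pair with the open-source `pesq` package (pip install pesq soundfile).
# File names are placeholders, not files from the paper.
import soundfile as sf
from pesq import pesq

clean, sr = sf.read("clean.wav")        # reference signal, 16 kHz expected
enhanced, _ = sf.read("enhanced.wav")   # output of the two-stage system

# 'wb' selects wideband PESQ; the function returns a MOS-LQO score.
print(pesq(sr, clean, enhanced, "wb"))
```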