Recently, the paper "Innovative Directional Encoding in Speech Processing: Leveraging Spherical Harmonics Injection for Multi-Channel Speech Enhancement," by the research team led by Prof. Zhang Xueliang at the National and Local Joint Engineering Research Center for Intelligent Information Processing Technology in Mongolian Language at IMU, was accepted by the International Joint Conference on Artificial Intelligence (IJCAI 2024). All authors are from IMU: Pan Jiahui (doctoral student, class of 2021), Shen Pengjie (doctoral student, class of 2022), Zhang Hui (associate professor), and Zhang Xueliang (professor). The research was supported by the National Natural Science Foundation of China.
IJCAI (International Joint Conference on Artificial Intelligence) is one of the world's premier conferences in artificial intelligence and is classified as a Class A international conference by the China Computer Federation (CCF). Since its founding in 1969, it has driven theoretical and practical advances in AI. Each year the conference attracts top researchers and practitioners from around the world to share their latest findings in cutting-edge areas of AI, earning it a high academic reputation and wide influence.
The paper focuses on multi-channel speech enhancement, which aims to extract the target speech signal from background noise using multiple microphones; effective use of spatial cues is pivotal to this task. Although deep learning has shown great potential in multi-channel speech processing, most existing methods operate directly on Short-Time Fourier Transform (STFT) coefficients, which encode the spatial distribution of the sound field only implicitly. To address this, Pan Jiahui proposes applying the Spherical Harmonics Transform (SHT) to multi-channel speech signals. The team evaluated the approach on the TIMIT dataset under various signal-to-noise ratios and reverberation conditions; the proposed model outperformed existing baselines while also generalizing better and requiring less computation and fewer parameters. Further experiments on the MS-SNSD dataset confirmed the method's efficacy. The technology has broad application prospects, bringing new research directions and solutions to the field of multi-channel speech enhancement.
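To make the SHT step concrete, the sketch below shows one common way to encode a multi-channel STFT into spherical harmonic coefficients: evaluate the spherical harmonic basis at each microphone direction and solve a least-squares fit per time-frequency bin. The array geometry, the first-order truncation, and the least-squares encoding are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch of a spherical harmonics transform (SHT) over a
# multi-channel STFT. Mic directions and the least-squares encoding
# are assumptions for illustration only.
import numpy as np
from scipy.special import sph_harm

def sht_matrix(order, azimuth, colatitude):
    """Spherical-harmonic basis evaluated at each mic direction.
    Returns shape (Q, (order+1)**2) for Q microphones."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            # scipy convention: sph_harm(m, n, azimuth, colatitude)
            cols.append(sph_harm(m, n, azimuth, colatitude))
    return np.stack(cols, axis=1)

def sht_coefficients(stft_frames, azimuth, colatitude, order=1):
    """Encode a (Q, F, T) complex STFT into (K, F, T) SHT coefficients
    by solving Y @ c ~= p in the least-squares sense per bin."""
    Y = sht_matrix(order, azimuth, colatitude)        # (Q, K)
    Q, F, T = stft_frames.shape
    p = stft_frames.reshape(Q, F * T)
    c, *_ = np.linalg.lstsq(Y, p, rcond=None)         # (K, F*T)
    return c.reshape(-1, F, T)

# Example: a hypothetical 8-mic array with random directions.
rng = np.random.default_rng(0)
az = rng.uniform(0, 2 * np.pi, 8)
col = rng.uniform(0, np.pi, 8)
x = rng.standard_normal((8, 257, 100)) + 1j * rng.standard_normal((8, 257, 100))
coeffs = sht_coefficients(x, az, col, order=1)
print(coeffs.shape)  # (4, 257, 100): 4 first-order coefficients per bin
```

Note how the output dimension depends only on the chosen order, not on the number of microphones, which is exactly the property that lets a single model serve different array layouts.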
Using spherical harmonic transform coefficients as an auxiliary model input concisely expresses the spatial distribution of the sound field and converts signals from any number of microphones into coefficients of a unified dimension, so a single model can adapt to different microphone array configurations without designing a separate model for each layout. The team designed two architectures around the SHT auxiliary input: parallel and serial. The parallel model uses two encoders that process the STFT and SHT data separately, and estimates the enhanced STFT by merging the two encoders' outputs in the decoder, effectively integrating spatial context. The serial model first applies the SHT to the signal and then uses the STFT of the transformed signal as the network input. The main contributions of this study are: first, integrating the spherical harmonics transform into deep learning methods to improve spatial processing for multi-channel speech enhancement; second, introducing two novel network architectures, a parallel model that handles STFT and SHT coefficients independently and a serial model that processes spatial and spectral data jointly; and third, demonstrating that the proposed models perform well under varied environmental conditions and adapt effectively to different microphone array configurations.
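As a rough illustration of the parallel design, the PyTorch sketch below fuses a spectral (STFT) encoder and a spatial (SHT) encoder in a shared decoder. The layer choices, channel counts, and concatenation-based fusion are assumptions for illustration, not the paper's actual configuration.

```python
# A minimal PyTorch sketch of the parallel architecture: separate
# encoders for STFT and SHT inputs, merged in the decoder. All
# hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, in_ch, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
    def forward(self, x):  # x: (B, in_ch, F, T)
        return self.net(x)

class ParallelSHTModel(nn.Module):
    def __init__(self, num_mics=8, num_sht=4, hidden=64):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.stft_enc = ConvEncoder(2 * num_mics, hidden)  # spectral stream
        self.sht_enc = ConvEncoder(2 * num_sht, hidden)    # spatial stream
        # Decoder merges both streams and estimates the enhanced STFT
        # (real + imaginary parts of one reference channel).
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2, kernel_size=3, padding=1),
        )
    def forward(self, stft_ri, sht_ri):
        fused = torch.cat([self.stft_enc(stft_ri), self.sht_enc(sht_ri)], dim=1)
        return self.decoder(fused)  # (B, 2, F, T) enhanced STFT

# Example: batch of 4 utterances, 8 mics, first-order SHT (4 coefficients).
model = ParallelSHTModel()
stft_ri = torch.randn(4, 16, 257, 100)  # 8 mics x (real, imag)
sht_ri = torch.randn(4, 8, 257, 100)    # 4 SHT coeffs x (real, imag)
print(model(stft_ri, sht_ri).shape)     # torch.Size([4, 2, 257, 100])
```

The serial variant needs no second encoder: it applies the SHT to the signal first and feeds the STFT of the transformed channels to a single network, trading the explicit fusion step for a simpler pipeline.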
