The 2024 Annual Conference of the International Speech Communication Association (Interspeech 2024) will be held in Greece from September 1 to 5. Organized by the International Speech Communication Association, Interspeech is one of the top conferences in speech research and the world's largest comprehensive event on speech signal processing; it enjoys a high international reputation and is followed closely by speech and language researchers worldwide. Conference website: https://interspeech2024.org/
The National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, in the School of Computer Science of our University, has had 8 papers accepted by Interspeech 2024, covering speech translation, acoustic echo cancellation, multimodal speech enhancement, and other topics. A brief introduction to each paper follows.
Parameter-Efficient Adapter Based on Pre-trained Models for Speech Translation
Authors: Chen Nan, Wang Yonghe, Fei Long*
Authors’ Affiliation: Inner Mongolia University
Abstract:
Multi-task learning (MTL) approaches that leverage pre-trained speech and machine translation models have significantly advanced speech-to-text translation. However, they introduce a large number of parameters, which increases training cost. Most parameter-efficient fine-tuning (PEFT) methods train only additional modules, effectively reducing the number of trainable parameters. Nevertheless, in multilingual speech translation settings, the growth in trainable parameters under PEFT is still not negligible. In this paper, we first propose a parameter-sharing adapter that uses 7/8 fewer parameters than a regular adapter while losing only about 0.7% in performance. To strike a balance between the number of model parameters and performance, we further propose a model based on neural architecture search (NAS). Experimental results show that the adapter comes closest to full fine-tuning in performance, while LoRA performs worst.
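
To make the parameter-sharing idea concrete, here is a minimal PyTorch sketch (not the authors' code) of a bottleneck adapter whose down- and up-projections are shared across all Transformer layers; the dimensions, the per-layer LayerNorm, and the module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedAdapter(nn.Module):
    """Bottleneck adapter whose projections are shared across layers.

    Only a per-layer LayerNorm stays layer-specific, so covering more layers
    adds almost no trainable parameters (illustrative sketch only).
    """
    def __init__(self, d_model=512, bottleneck=64, num_layers=8):
        super().__init__()
        # Shared down/up projections reused by every Transformer layer.
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        # Cheap layer-specific parts: one LayerNorm per layer.
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_layers)])

    def forward(self, hidden, layer_idx):
        residual = hidden
        x = self.norms[layer_idx](hidden)
        x = self.up(torch.relu(self.down(x)))
        return residual + x

adapter = SharedAdapter()
h = torch.randn(2, 100, 512)      # (batch, frames, d_model)
out = adapter(h, layer_idx=3)     # applied after the 4th Transformer layer
```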

Sign Value Constraint Decomposition for Efficient 1-Bit Quantization of Speech Translation Tasks
Authors: Chen Nan, Wang Yonghe, Fei Long*
Authors’ Affiliation: Inner Mongolia University
Abstract:
Speech-to-text translation converts speech input into text output in a different language. While combining pre-trained speech and machine translation models can improve translation quality, it also increases the number of parameters, leading to significantly higher hardware costs for model training and deployment. To address this challenge, we propose a 1-bit quantization method for linear layers based on sign value constraint decomposition (SVCD). SVCD approximates the weight matrix of a linear layer as a sign matrix and two trainable vectors, which preserves a higher information capacity at a smaller storage cost. In addition, we use knowledge distillation to transfer the capability of the original fine-tuned model to the quantized model. Experimental results show that the decoder's attention module is crucial to the performance of quantized speech translation models.
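
As an illustration of the decomposition described above, the following PyTorch sketch approximates a linear layer's weight matrix as a fixed sign matrix rescaled by two trainable vectors; the exact factorization, initialization, and training recipe in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVCDLinear(nn.Module):
    """1-bit linear layer: W is approximated by sign(W) rescaled by two vectors.

    W_hat[i, j] = a[i] * sign(W[i, j]) * b[j]. Only `a` and `b` remain trainable,
    while the sign matrix can be stored with one bit per weight (sketch of the
    idea in the abstract, not the authors' code).
    """
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                      # (out_features, in_features)
        self.register_buffer("sign_w", torch.sign(w))
        # Initialize the scale vectors from the magnitudes of W.
        self.a = nn.Parameter(w.abs().mean(dim=1))  # (out_features,)
        self.b = nn.Parameter(torch.ones(w.shape[1]))
        self.bias = linear.bias

    def forward(self, x):
        w_hat = self.a.unsqueeze(1) * self.sign_w * self.b.unsqueeze(0)
        return F.linear(x, w_hat, self.bias)

layer = SVCDLinear(nn.Linear(512, 512))
y = layer(torch.randn(2, 10, 512))
```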

Knowledge-Preserving Pluggable Modules for Multilingual Speech Translation Tasks
Authors: Chen Nan, Wang Yonghe*, Fei Long
Authors’ Affiliation: Inner Mongolia University
Abstract:
Multilingual speech translation systems usually add new languages through retraining, regularization, or resampling. Retraining the model significantly increases training time and cost, while balancing performance between the new language and the original languages with regularization or resampling can lead to catastrophic forgetting, which degrades translation performance on the existing languages. To alleviate these problems, we store the knowledge of the new language in separate models and introduce them as pluggable modules into the existing multilingual speech translation model. This approach neither significantly increases the training cost nor affects the translation performance of the existing model. Experimental results show that our approach improves translation performance on new languages without affecting existing translation tasks.
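
One plausible way to realize such pluggable modules is a frozen base layer with per-language plug-ins selected at run time, as in the hypothetical PyTorch sketch below; the actual module design in the paper may differ.

```python
import torch
import torch.nn as nn

class PluggableLayer(nn.Module):
    """Frozen base layer plus per-language plug-in modules.

    New languages are added by registering a new plug-in; the frozen base and
    the plug-ins of existing languages are untouched, so their translation
    quality cannot degrade. Hypothetical sketch, not the paper's architecture.
    """
    def __init__(self, d_model=512):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.base.requires_grad_(False)      # existing knowledge stays frozen
        self.plugins = nn.ModuleDict()

    def add_language(self, lang, bottleneck=64):
        d = self.base.in_features
        self.plugins[lang] = nn.Sequential(
            nn.Linear(d, bottleneck), nn.ReLU(), nn.Linear(bottleneck, d))

    def forward(self, x, lang=None):
        y = self.base(x)
        if lang is not None and lang in self.plugins:   # plug in the new-language module
            y = y + self.plugins[lang](x)
        return y

layer = PluggableLayer()
layer.add_language("mn")                     # e.g. adding Mongolian
out = layer(torch.randn(2, 50, 512), lang="mn")
```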

FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency
Authors: Liu Rui, Xi Jiatian, Jiang Ziyue, Li Haizhou
Authors’ Affiliation: Inner Mongolia University
Abstract:
Text-based speech editing (TSE) edits the output audio by modifying the input text rather than editing the audio directly. Although neural TSE methods have made great progress, current approaches mainly focus on reducing the difference between the speech generated in the edited region and the reference target, while neglecting its local and global fluency with respect to the context and the original utterance.
Inspired by the traditional unit-selection speech synthesis architecture, this paper proposes FluentEditor, a fluency-aware speech editing model that introduces fluency-oriented training criteria into TSE training to improve the smoothness of editing boundaries from both the acoustic and the prosodic perspective. Specifically, the acoustic consistency loss smooths the transition between the edited region and its adjacent acoustic segments so that it is consistent with real audio, while the prosody consistency loss keeps the prosodic properties of the edited region consistent with the overall style of the original utterance. Subjective and objective results on VCTK show that FluentEditor outperforms all advanced baseline methods in naturalness and fluency. Audio samples and code are available at https://github.com/Ai-S2-Lab/FluentEditor.
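
For intuition, the sketch below gives one possible (simplified, assumed) formulation of the two consistency terms: an acoustic term that compares frame-to-frame variation around the editing boundaries with that of real audio, and a prosody term that matches the statistics of prosody features in the edited region to those of the whole utterance. It is not the exact loss used in FluentEditor.

```python
import torch
import torch.nn.functional as F

def acoustic_consistency_loss(mel_pred, mel_ref, left, right, ctx=3):
    """Penalize abrupt spectral jumps at the two editing boundaries.

    mel_*: (batch, frames, n_mels); left/right: boundary frame indices.
    Hypothetical formulation, not the exact loss in the paper.
    """
    def boundary_delta(mel, t):
        seg = mel[:, t - ctx:t + ctx]         # frames around the boundary
        return seg[:, 1:] - seg[:, :-1]       # frame-to-frame variation
    loss = 0.0
    for t in (left, right):
        loss = loss + F.l1_loss(boundary_delta(mel_pred, t),
                                boundary_delta(mel_ref, t))
    return loss

def prosody_consistency_loss(prosody_edit, prosody_utt):
    """Match mean/std of prosody features (e.g. pitch, energy) in the edited
    region to the statistics of the whole original utterance."""
    return (F.l1_loss(prosody_edit.mean(dim=1), prosody_utt.mean(dim=1))
            + F.l1_loss(prosody_edit.std(dim=1), prosody_utt.std(dim=1)))
```
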
Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge
Authors: Liu Rui, Ma Zening
Authors’ Affiliation: Inner Mongolia University
Abstract:
Speech self-supervised learning (SSL) has shown considerable efficacy on a variety of downstream tasks. However, popular self-supervised models tend to ignore emotion-related prior information, missing the opportunity to improve emotion understanding tasks with the prior emotional knowledge contained in speech. In this paper, we propose a method for learning emotion-aware speech representations using intensity knowledge. Specifically, we use an established speech emotion understanding model to extract frame-level emotion intensity. We then propose a novel emotion masking strategy (EMS) that incorporates emotion intensity into the masking process. We select two representative models, the Transformer-based Mockingjay and the CNN-based non-autoregressive predictive coding (NPC) model, and conduct experiments on the IEMOCAP dataset. Experimental results show that on the speech emotion recognition (SER) task, the representations produced by our method outperform those of the original models.
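
A minimal sketch of an intensity-weighted masking step is shown below; the scaling rule and the base masking probability are assumptions, not the paper's exact strategy.

```python
import torch

def emotion_masking(features, intensity, base_p=0.15):
    """Emotion-aware masking: frames with higher emotion intensity are masked
    more often, so the model must reconstruct emotionally salient regions.

    features: (batch, frames, dim); intensity: (batch, frames) in [0, 1].
    Hypothetical sketch of the strategy described in the abstract.
    """
    # Scale the base masking probability by per-frame emotion intensity.
    p = (base_p * (1.0 + intensity)).clamp(max=1.0)
    mask = torch.bernoulli(p).bool()          # True = masked frame
    masked = features.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked, mask

feats = torch.randn(2, 200, 80)
inten = torch.rand(2, 200)
masked_feats, mask = emotion_masking(feats, inten)
```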

Deep Echo Path Modeling for Acoustic Echo Cancellation
Authors: Zhao Fei, Zhang Chenggang, He Shulin, Liu Jinjiang, Zhang Xueliang
Authors’ Affiliation: Inner Mongolia University, Inner Mongolia MINZU University
Abstract:
Acoustic echo cancellation (AEC) is a key audio processing technique that removes echoes from the microphone signal, enabling full-duplex communication. In recent years, deep learning has shown great potential for advancing AEC. However, deep learning methods struggle to generalize to complex environments, especially to unseen conditions that are not represented in training. In this paper, we propose a deep learning-based method that predicts the echo path in the time-frequency domain. Specifically, we first estimate echo paths in single-talk scenarios without near-end signals, and then use these predicted echo paths as auxiliary labels to train models in double-talk scenarios with near-end signals. Experimental results show that our method outperforms strong baseline models and generalizes well to unseen acoustic scenes. By using deep learning to estimate echo paths, this work improves AEC performance under complex conditions.
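
The following sketch illustrates the general idea in simplified form: an echo-path label computed from a single-talk segment in the time-frequency domain, and a double-talk training loss that treats the echo path as an auxiliary target. The label definition and loss weighting are assumptions rather than the paper's exact formulation.

```python
import torch

def echo_path_label(ref_stft, echo_stft, eps=1e-8):
    """Time-frequency echo-path label from a single-talk segment (no near-end
    speech): H(t, f) ~= Echo(t, f) / Reference(t, f).  Illustrative only."""
    return echo_stft / (ref_stft + eps)

def double_talk_loss(est_near, near, est_path, path_label, alpha=0.1):
    """Main near-end estimation loss plus an auxiliary echo-path loss,
    weighted by an assumed factor alpha."""
    main = torch.mean(torch.abs(est_near - near))
    aux = torch.mean(torch.abs(est_path - path_label))
    return main + alpha * aux
```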

SDAEC: Signal Decoupling for Advancing Acoustic Echo Cancellation
Authors: Zhao Fei, Liu Jinjiang, Zhang Xueliang
Authors’ Affiliation: Inner Mongolia University
Abstract:
In deep learning-based acoustic echo cancellation, the neural network implicitly learns the echo path in order to cancel the echo. However, under low signal-to-noise ratio (SNR) conditions, the large energy difference between the microphone signal and the reference signal limits the network's capability and leads to poor performance. In this study, we propose a single-channel acoustic echo cancellation method based on signal decoupling, called SDAEC. Specifically, we model the energy of the reference signal and the microphone signal to obtain an energy scale factor. The reference signal is multiplied by this factor and then fed into the subsequent echo cancellation network. This reduces the difficulty of the subsequent echo cancellation step and thereby improves overall cancellation performance. Experimental results show that the method improves the performance of multiple baseline models.
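
As a rough illustration (not the paper's module), the sketch below predicts a per-frame energy scale factor from the microphone and reference magnitude spectra and rescales the reference before it enters the echo cancellation network.

```python
import torch
import torch.nn as nn

class EnergyScaler(nn.Module):
    """Predict a positive per-frame scale factor from the microphone and
    reference magnitudes and rescale the reference accordingly (a simplified
    stand-in for the decoupling module described above)."""
    def __init__(self, n_freq=257, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())   # positive scale factor

    def forward(self, mic_mag, ref_mag):
        # mic_mag, ref_mag: (batch, frames, n_freq) magnitude spectrograms
        scale = self.net(torch.cat([mic_mag, ref_mag], dim=-1))  # (batch, frames, 1)
        return ref_mag * scale      # rescaled reference fed to the AEC network

scaler = EnergyScaler()
aligned_ref = scaler(torch.rand(2, 100, 257), torch.rand(2, 100, 257))
```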

Unified Audio Visual Cues for Target Speaker Extraction
Authors: Wu Tianci, He Shulin, Pan Jiahui, Huang Haifeng, Mo Zhijian, Zhang Xueliang
Authors’ Affiliation: Inner Mongolia University, Lenovo
Abstract:
Target speaker extraction aims to separate the target speaker's speech from the speech of interfering speakers. Usually, pre-recorded speech or facial video is used as auxiliary information to guide the neural network to focus on the target speaker. Existing methods use one of these cues, or fuse the two through attention mechanisms to produce fused features of the target speaker. Although both cues represent the same speaker, they do so from different angles: audio cues capture the speaker's timbre characteristics, while lip movements provide synchrony features. To combine the strengths of the different cues and mitigate conflicts between them, we propose a unified target speaker extraction network, called Uni-Net, which uses a divide-and-conquer strategy, feeding the audio and visual cues into separate networks so as to exploit the unique information in each cue. The speech extracted with the different cues is then used as prior information and further refined by a post-processing network. Experiments on the public VoxCeleb2 dataset show that Uni-Net achieves the best performance compared with the baseline methods.
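
The toy PyTorch sketch below conveys only the divide-and-conquer structure: one branch conditioned on an audio (enrollment) cue, one on a visual (lip) cue, and a post-processing network that refines the two preliminary estimates. All layer choices are placeholders; the real Uni-Net architecture is considerably more elaborate.

```python
import torch
import torch.nn as nn

class UniNetSketch(nn.Module):
    """Divide-and-conquer target speaker extraction (placeholder layers only)."""
    def __init__(self, dim=256):
        super().__init__()
        self.audio_branch = nn.GRU(2 * dim, dim, batch_first=True)
        self.visual_branch = nn.GRU(2 * dim, dim, batch_first=True)
        self.post_net = nn.GRU(2 * dim, dim, batch_first=True)

    def forward(self, mix, spk_emb, lip_feat):
        # mix: (B, T, dim); spk_emb: (B, dim); lip_feat: (B, T, dim)
        a_in = torch.cat([mix, spk_emb.unsqueeze(1).expand_as(mix)], dim=-1)
        est_a, _ = self.audio_branch(a_in)                 # estimate driven by the audio cue
        est_v, _ = self.visual_branch(torch.cat([mix, lip_feat], dim=-1))
        est, _ = self.post_net(torch.cat([est_a, est_v], dim=-1))  # refine the two estimates
        return est

model = UniNetSketch()
out = model(torch.randn(2, 100, 256), torch.randn(2, 256), torch.randn(2, 100, 256))
```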
