Saturday 01 March 2025
The researchers set out to compare two different approaches to integrating speech into large language models (LLMs) for speech-to-text tasks. The first method, dense feature prepending (DFP), maps the encoded speech representations into the LLM's input embedding space via a modality adapter and then prepends them to the embedded textual prompt. The second approach is cross-attention, which uses a standard encoder-decoder architecture in which the decoder attends to the speech encoder's output through attention mechanisms.
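To make the DFP mechanics concrete, here is a minimal PyTorch sketch. All module names, layer sizes, and shapes below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Maps encoded speech frames into the LLM's text embedding space.
    The two-layer MLP and its sizes are illustrative, not from the paper."""
    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, n_frames, speech_dim)
        return self.proj(speech_feats)

def build_dfp_inputs(speech_feats, prompt_embeds, adapter):
    """Dense feature prepending: adapted speech embeddings are placed
    in front of the embedded textual prompt along the sequence axis."""
    adapted = adapter(speech_feats)                    # (B, T_speech, llm_dim)
    return torch.cat([adapted, prompt_embeds], dim=1)  # (B, T_speech + T_text, llm_dim)

# Toy usage with made-up shapes.
adapter = ModalityAdapter(speech_dim=512, llm_dim=1024)
speech = torch.randn(2, 150, 512)   # speech encoder output frames
prompt = torch.randn(2, 12, 1024)   # embedded text prompt tokens
inputs = build_dfp_inputs(speech, prompt, adapter)
print(inputs.shape)                 # torch.Size([2, 162, 1024])
```

The resulting sequence is then fed to the LLM like any other token sequence, which is what makes DFP attractive: the language model itself needs no architectural change.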
The researchers trained all models from scratch using comparable data and parameter budgets. They evaluated both methods on automatic speech recognition (ASR) and speech translation (ST) using the MuST-C v1.0 and CoVoST2 datasets.
Interestingly, the results do not show a clear advantage of DFP over cross-attention; in some configurations, the cross-attention approach performed better. This challenges the widespread adoption of DFP as the default method for integrating speech into LLMs.
One possible explanation is that cross-attention lets the decoder dynamically focus on the relevant parts of the speech signal at each generation step, whereas DFP relies on a fixed adapter that maps the speech representations once, up front. This could make cross-attention more robust and flexible in real-world scenarios.
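The contrast is easiest to see in a schematic decoder layer. The sketch below is a generic transformer encoder-decoder layer, not the paper's exact architecture; all dimensions and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossAttentionDecoderLayer(nn.Module):
    """One encoder-decoder layer: the decoder queries the speech encoder
    output at every step instead of consuming it as prepended tokens.
    Schematic only; real systems add dropout, padding masks, caching, etc."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, speech_memory):
        # Causal self-attention over the partial text hypothesis.
        T = tgt.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tgt.device), 1)
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal, need_weights=False)
        tgt = self.norm1(tgt + x)
        # Cross-attention: queries come from text, keys/values from speech,
        # so the weights can shift to different audio frames for each token.
        x, _ = self.cross_attn(tgt, speech_memory, speech_memory, need_weights=False)
        tgt = self.norm2(tgt + x)
        return self.norm3(tgt + self.ffn(tgt))

layer = CrossAttentionDecoderLayer(d_model=256, n_heads=4)
text_states = torch.randn(2, 10, 256)     # decoder states for 10 tokens
speech_states = torch.randn(2, 150, 256)  # speech encoder output frames
print(layer(text_states, speech_states).shape)  # torch.Size([2, 10, 256])
```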
Another important finding is that both methods struggled with out-of-vocabulary words and non-native accents. This highlights the need for better handling of these issues in future research.
Overall, this study shows that there is no one-size-fits-all solution for integrating speech recognition into LLMs, and that further exploration of different approaches is necessary to achieve state-of-the-art performance.
The researchers also explored using CTC compression and sequence-level knowledge distillation to improve the models’ performance. These techniques showed promise, but more research is needed to fully understand their benefits and limitations.
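CTC compression shortens the speech sequence by collapsing runs of encoder frames that receive the same greedy CTC label, typically dropping blank-labeled runs; sequence-level knowledge distillation, by contrast, trains the student on a teacher model's decoded outputs rather than the reference transcripts. Below is a sketch of the compression idea for a single utterance; the exact variant used in the paper may differ, and the blank index and shapes are assumptions.

```python
import torch

def ctc_compress(encoder_out, ctc_logits, blank_id: int = 0):
    """Collapse consecutive frames sharing the same greedy CTC label into
    one averaged vector, dropping blank-only runs.
    encoder_out: (T, d); ctc_logits: (T, vocab) for one utterance."""
    preds = ctc_logits.argmax(dim=-1)  # greedy label per frame
    compressed, start = [], 0
    for t in range(1, len(preds) + 1):
        # A run ends at the sequence boundary or when the label changes.
        if t == len(preds) or preds[t] != preds[start]:
            if preds[start] != blank_id:  # skip blank runs entirely
                compressed.append(encoder_out[start:t].mean(dim=0))
            start = t
    return torch.stack(compressed) if compressed else encoder_out[:0]

# Toy usage with made-up shapes and a tiny vocabulary.
enc = torch.randn(8, 16)
logits = torch.randn(8, 5)
print(ctc_compress(enc, logits).shape)  # (n_non_blank_runs, 16)
```

The appeal is that a shorter speech sequence reduces the length mismatch between speech frames and text tokens, which eases both attention computation and the prepending strategy.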
As the field continues to evolve, it will be important to develop more sophisticated methods for handling spoken language input. This study represents an important step in that direction, and its findings will likely influence future research in this area.
Cite this article: “Comparing Approaches for Integrating Speech Recognition into Large Language Models”, The Science Archive, 2025.
Large Language Models, Speech Recognition, Dense Feature Prepending, Cross-Attention, ASR, ST, MuST-C, CoVoST2, Out-of-Vocabulary Words, Non-Native Accents