As LLMs become the flagship products with which major internet companies showcase their strengths, car manufacturers are also joining the trend of training and developing their own LLMs. Integrating an LLM into a car company's intelligent voice assistant can indeed greatly enhance the user experience, but many challenges remain to be addressed.
Barriers to Deploying LLMs in Automotive Environments
There are several technical barriers to deploying an LLM such as GPT-4, or a speech model such as Whisper, in an in-car voice assistant system.
• Computational Resource Limitations
When deploying models like GPT-4 or Whisper in in-car voice assistant systems, a key technical challenge is the limited computational budget combined with the need for real-time responses. Large deep learning models usually require significant computational resources for real-time data processing. Compared with servers or cloud computing platforms, however, automotive systems typically have limited computing power, including lower processor performance and less memory capacity. This means in-car systems may struggle to support the real-time computational demands of an LLM. A car voice assistant must respond quickly to user commands and queries, which places high demands on the model's inference speed. Yet LLMs, owing to their complex network structures, can exhibit slow inference. This latency degrades the user experience, especially in driving environments where quick decisions and responses are needed.
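One common way to shrink a model's memory footprint and compute cost for such constrained on-device deployment is weight quantization. The sketch below is a generic illustration, not taken from any particular inference toolkit (the function names are hypothetical): it maps float32 weights to int8 with a single per-tensor scale, cutting storage by 4x at the cost of a small, bounded reconstruction error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(weights)

# int8 storage is 4x smaller than float32
print(weights.nbytes // q.nbytes)  # 4
# rounding error per weight is bounded by half the quantization step
print(float(np.abs(dequantize(q, scale) - weights).max()) <= scale / 2)  # True
```

Real deployments usually go further (per-channel scales, activation quantization, or 4-bit schemes), but the memory-versus-accuracy trade-off shown here is the core idea.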
• Domain Mismatch—Scarcity of Relevant Vertical Domain Data
Even though advanced speech recognition models like Whisper achieve, or come close to, human-level accuracy on some English datasets, directly applying them in real car environments still exposes a series of challenges and limitations.
These challenges mainly include domain mismatch, differences in acoustic environments, and the specific voice characteristics and data scarcity of the car environment. Although Whisper performs well on some standard English speech datasets, those datasets are usually recorded in controlled, relatively quiet environments.
Speech captured in a car has very different characteristics, and the Whisper model may have been exposed to little in-car data during training, so recognition performance in a real vehicle can fall short of what training results suggest. The various noises present in car environments, such as engine noise, road noise, and wind noise, differ greatly in acoustic properties from the standard datasets used for training, and such background noise can severely degrade recognition accuracy.
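A widely used mitigation for this acoustic mismatch is noise augmentation: mixing recorded car noise into clean training speech at controlled signal-to-noise ratios. The sketch below is a generic illustration (the function name and the stand-in signals are hypothetical; it assumes mono float waveforms at the same sample rate):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested speech-to-noise ratio."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

rng = np.random.default_rng(0)
# stand-ins: a tone for "speech", white noise for "car noise"
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.normal(size=8000)
noisy = mix_at_snr(speech, noise, snr_db=5.0)  # simulate a 5 dB in-car SNR
```

In practice, the noise would come from real cabin recordings at various speeds and road surfaces, and the SNR would be randomized per utterance so the model sees a range of conditions.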
Voice interactions in in-car systems often contain specific commands and terminology that may differ from the corpus originally used to train the model. Furthermore, high-quality, diverse in-car speech data can be difficult to obtain, limiting the model's training and optimization in this specific domain.
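The size of such a domain gap is usually quantified with word error rate (WER): word substitutions, insertions, and deletions divided by the reference length, comparing scores on clean benchmarks against in-car recordings. A minimal edit-distance implementation (the example transcripts are illustrative, not drawn from any dataset):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# one deletion ("the") and one substitution ("stations") over 6 reference words
print(wer("navigate to the nearest charging station",
          "navigate to nearest charging stations"))
```

Comparing WER on a clean read-speech test set against WER on in-car data like the corpora below gives a concrete measure of how much adaptation is needed.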
King-ASR-642: American English Speech Recognition Corpus (Incar)
The data is recorded in an environment with vehicle noise and collected from a total of 40 speakers, including 20 males and 20 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The transcriptions cover domains such as navigation, text messages and news.
https://en.dataoceanai.com/dataset/c52-5794.htm
King-ASR-643: France French Speech Recognition Corpus (Incar)
The data is recorded in an environment with vehicle noise and collected from a total of 41 speakers, including 22 males and 19 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The transcriptions cover domains such as navigation, text messages and news.
https://en.dataoceanai.com/dataset/c52-5796.htm
King-ASR-645: German Speech Recognition Corpus (Incar)
The data is recorded in an environment with vehicle noise and collected from a total of 43 speakers, including 21 males and 22 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The transcriptions cover domains such as navigation, text messages and media.
https://en.dataoceanai.com/dataset/c52-5800.htm
King-ASR-873: Chinese and English Mixed Speech Recognition Corpus
The data is recorded in an indoor environment, with a total of about 500 hours of Chinese and 1400 hours of Chinese-English mixed data, covering a wide range of scenarios (car control, music, general purpose, maps, and casual conversation scenarios).