Enhancing Voice Assistant Intelligence through LLM
Release time: 2023/10/26
"Hi Siri, what do you think of the iPhone 15?"
" Hi Siri, why do fish have eyes on both sides?"
...
Voice assistants are present on almost every brand's smartphone, and interacting with them has become an essential part of our daily lives. But are our voice assistants truly 'intelligent'?
For most of these questions, almost all voice assistants simply run a web search and display the results on the smartphone screen.
So what was promised as a voice assistant often becomes little more than an interface to a web browser, and this is where the limitation lies.
While these assistants can understand spoken language and give brief responses, they cannot draw on knowledge graphs or summarize information, and so fail to demonstrate any real 'intelligence.'
As we all know, large language models are developing at high speed and getting ever smarter, so why not let them improve voice assistants?
According to internal information obtained by Axios [1], Google is planning to incorporate its latest Large Language Model (LLM) technology into Google Assistant to enhance its content generation capabilities. Earlier, following its collaboration with OpenAI, Microsoft embedded generative AI assistants into the Edge browser, Microsoft Office, and Azure cloud services to make them more intelligent.
More recently, Apple has built its own LLM framework, called "Ajax", and has already applied it to features such as Maps and Siri.
With an LLM inside, a voice assistant inherits the model's intelligence and becomes genuinely "smarter". Combining the two, however, is not simple, and several challenges arise.
• Computing resource limitations:
LLMs require substantial computing resources, including high-performance CPUs and GPUs. Mobile phones have relatively limited computing power, so running an LLM on a phone can cause performance issues such as latency and reduced battery life.
• Model size:
LLMs usually take up a great deal of storage space. Phones have limited storage capacity, so a way must be found to fit and manage these models within it.
• Real-time and response speed:
Mobile assistants need to respond to user requests in near real time, so the model must generate responses very quickly. This may require optimizing the model for responsiveness.
• Privacy and data security:
LLMs require large amounts of data to train, including user-generated text. In mobile assistants, protecting user privacy and data security is crucial, so measures must be taken to ensure that user data is not misused or leaked.
• Multi-language and multi-dialect support:
Mobile assistants often need to support multiple languages and dialects, which requires the model to handle multilingual input and output and adds complexity.
• Online/offline usage:
Users may invoke the assistant without an internet connection, so there must be a way to handle requests offline.
• Text and voice input:
Mobile assistants usually need to process both text and voice input, so the model must have multi-modal processing capabilities.
• Customization and personalization:
Users expect mobile assistants to understand their individual needs and preferences, which requires the model to support customized, personalized interaction.
Fortunately, each of these challenges has corresponding mitigations:
• Compute Resource Constraints:
○ Model Compression and Pruning: Utilizing model compression and pruning techniques can reduce the model's size and computational requirements while maintaining high performance (a pruning sketch follows this list).
○ Hardware Optimization: Leveraging hardware acceleration features of mobile chips, such as Neural Processing Units (NPUs), can enhance the model's runtime speed and reduce battery consumption.
• Model Size:
○ Cloud Deployment: Deploying LLM in the cloud allows the mobile assistant to communicate with cloud services, eliminating the need to store the entire model locally.
○ Model Selection: Choosing an appropriately sized model strikes a balance between performance and storage requirements.
• Real-time Responsiveness and Response Speed:
○ Model Optimization: Employing model optimization techniques such as lightweight models or quantization can improve the model's response speed (a quantization sketch follows this list).
○ Local Caching: Caching commonly used responses or data on the mobile device can reduce communication latency with cloud services (a caching sketch follows this list).
• Privacy and Data Security:
○ End-to-End Encryption: Utilizing end-to-end encryption ensures the security of user data during transmission and storage.
○ Data Anonymization: Employing data anonymization techniques when handling user data helps mitigate the risk of data leakage (an anonymization sketch follows this list).
• Multi-language and Dialect Support:
○ Multi-lingual Models: Using multi-lingual models enables support for various languages and dialects.
○ Translation and Conversion: Translation or conversion techniques can be used for interactions between different languages.
• Online/Offline Usage:
○ Offline Mode: Providing offline functionality allows the mobile assistant to perform basic tasks even without an internet connection.
• Text and Voice Input:
○ Multimodal Processing: Supporting simultaneous handling of both text and voice input enhances the user experience.
• Customization and Personalization:
○ User Profiles: Collecting and storing user profiles and preferences enables personalized services.
○ Transfer Learning: Employing transfer learning techniques applies known user preferences to new users.
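To make a few of these mitigations concrete, the sketches below illustrate pruning, quantization, response caching, and query anonymization in Python. First, magnitude pruning: a minimal sketch using PyTorch's built-in pruning utilities, where a single Linear layer stands in for a full on-device model.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A single Linear layer stands in for a full on-device model (illustrative).
layer = nn.Linear(512, 512)

# Zero out the 50% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Fold the pruning mask into the weights permanently.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")  # roughly 50%
```

Combined with sparse storage formats or structured pruning, this reduces both the model file size and the per-inference compute.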
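Next, post-training dynamic quantization, one of the lightweight-model techniques mentioned above. This sketch applies PyTorch's dynamic quantization API to a toy two-layer network; a real assistant would apply it to the LLM's linear layers.

```python
import torch
import torch.nn as nn

# Toy stand-in for an on-device model (illustrative).
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
)

# Store Linear weights as int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement at inference time.
x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])
```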
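Local caching can be as simple as a small LRU map keyed on the normalized query, so repeated questions are answered without a cloud round trip. A minimal sketch (the class and method names are illustrative):

```python
from collections import OrderedDict
from typing import Optional

class ResponseCache:
    """Tiny LRU cache for frequent assistant queries (illustrative)."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, query: str) -> Optional[str]:
        key = query.strip().lower()
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def put(self, query: str, response: str) -> None:
        key = query.strip().lower()
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = ResponseCache()
cache.put("what time is it in tokyo", "It is 9:00 AM in Tokyo.")
print(cache.get("What time is it in Tokyo"))  # served locally, no round trip
```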
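Finally, a toy anonymization pass that masks obvious personally identifiable information before a query is logged or sent to the cloud. The regular expressions here are deliberately simple and purely illustrative; production systems use far more thorough PII detection.

```python
import re

# Mask e-mail addresses and phone numbers before text leaves the device.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def anonymize(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(anonymize("Call me at +44 20 7946 0958 or mail jane@example.com"))
# -> "Call me at [PHONE] or mail [EMAIL]"
```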
Among these challenges, personalization deserves special emphasis. Everyone wishes for their phone to be their most loyal servant, one that understands their preferences, tastes, and habits, and can surface relevant information based on them. To achieve this, generic LLMs require extensive training on mobile-side data to adapt them to on-device usage scenarios. This adaptation improves the accuracy of recognition and generation on mobile devices, ultimately improving the user experience.
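Even before any on-device fine-tuning, a simple first step toward personalization is to condition each request on a locally stored preference profile. In the minimal sketch below, the profile keys and prompt format are invented for illustration, and the resulting prompt would be fed to whatever on-device or cloud LLM the assistant uses.

```python
# A locally stored profile; in practice this would be learned from usage.
user_profile = {
    "language": "en-GB",
    "home_city": "London",
    "music_preference": "jazz",
}

def personalize_prompt(query: str, profile: dict) -> str:
    """Prefix the user's query with their stored preferences."""
    context = "; ".join(f"{k}={v}" for k, v in profile.items())
    return f"[user preferences: {context}]\nUser: {query}\nAssistant:"

prompt = personalize_prompt("recommend something to listen to", user_profile)
print(prompt)  # sent to the on-device or cloud LLM for a tailored reply
```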
Our company, DataOcean, provides a wealth of data for mobile-side speech recognition and speech synthesis. Examples are as follows:
King-TTS-262 Dutch Speech Synthesis Dataset - Male
This is a high-quality audio dataset ideal for speech synthesis applications, recorded in 48 kHz, 16-bit, single-channel PCM WAV format. The audio has been carefully processed to ensure clean, clear sound, with a background noise level below -60 dB, a signal-to-noise ratio greater than 35 dB, and a reverberation time (RT60) of less than 150 ms, ensuring optimal conditions for speech synthesis. The dataset suits a wide range of applications, including text-to-speech systems and voice assistants.
King-TTS-066 British English Multi-speaker Speech Synthesis Corpus
This is a single-channel British English multi-speaker TTS (text-to-speech) dataset with a total size of about 6.50 GB. It contains recordings and labels for 9,610 sentences (125,628 words). The CMU phone set is used for script design and labeling.
King-ASR-885 Greek Conversational Speech Recognition Dataset
This is a Greek telephony conversational speech dataset collected over internet telephony. The corpus contains 53 pairs of spontaneous Greek conversations from 106 speakers; for each pair, the two speakers recorded in separate quiet rooms. The data covers 23 topics.
King-ASR-835 Japanese Business Meeting Conversational Speech Recognition Corpus
The recognition data was recorded in both quiet and noisy environments and collected from a total of 300 speakers, 144 male and 156 female, all carefully screened to ensure standard, clear pronunciation. The scripts cover business meeting content such as conference recordings.
Reference
[1] https://www.axios.com/2023/07/31/google-assistant-artificial-intelligence-news