1. Introduction to personalized speech recognition technology
In an industrial voice assistant, speech recognition is usually handled by a general recognition model: a large-scale deep neural network trained on the pronunciation patterns shared by regular users. Such a model, however, ignores individual differences. When a user's speaking style departs from the norm, for example through a dialect-influenced accent, slurred speech, or dysarthria, recognition degrades and the voice interaction experience becomes poor.
Personalized speech recognition technology uses representative speech recordings and the corresponding text from a single user to adjust the connections of the general model, so that the network learns that user's pronunciation characteristics. The result is a customized neural network that significantly improves recognition accuracy and the interactive experience for that user.
2. The underlying theory
With the rise of deep learning and neural networks, representative methods for the acoustic adaptation problem include fine-tuning, adding an input linear transformation layer (LIN), adding a linear output layer, multi-task learning, training with a KLD (Kullback-Leibler divergence) loss function, adversarial learning, transfer learning, and the LHUC (Learning Hidden Unit Contributions) chain model used in this work.
Figure 1 LHUC chain adaptive model
The network structure of the LHUC model is shown in Figure 1. The principle is to keep all the original parameters of the general model unchanged and to add a layer of learnable adjustment factors after each hidden layer, which scale that layer's output. The adaptation model is then trained iteratively on the specific user's speech features to learn these factors. The entire architecture is built on the chain model pioneered by Dr. Daniel Povey, the chief speech scientist of Xiaomi.
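The idea of per-layer adjustment factors can be illustrated with a minimal sketch. It assumes the commonly used LHUC parameterization, in which each hidden unit's output is multiplied by 2·sigmoid(r) for a learnable value r; the article does not state which form Xiaomi uses. The function names, the toy layer, and the squared-error loss are all illustrative assumptions (a real chain model is trained with a sequence-level objective, not this toy loss).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frozen_hidden_layer(x, weights, bias):
    """General-model layer, ReLU(Wx + b). Its parameters are never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def apply_lhuc(h, r):
    """Scale each hidden unit by 2*sigmoid(r_i), a factor in (0, 2).

    With r_i = 0 the factor is exactly 1, so the adapted model starts out
    identical to the general model.
    """
    return [2.0 * sigmoid(ri) * hi for ri, hi in zip(r, h)]


def squared_error(y, target):
    # Toy loss used only for this illustration.
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def lhuc_gradient_step(h, r, target, lr):
    """One gradient-descent step on the adjustment parameters r only."""
    y = apply_lhuc(h, r)
    new_r = []
    for hi, ri, yi, ti in zip(h, r, y, target):
        s = sigmoid(ri)
        # d(loss)/d(r_i) = (y_i - t_i) * d(y_i)/d(r_i), y_i = 2*s(r_i)*h_i
        grad = (yi - ti) * 2.0 * s * (1.0 - s) * hi
        new_r.append(ri - lr * grad)
    return new_r


# Toy example: a 2-unit frozen layer and hypothetical user-specific targets.
W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]
x = [1.0, 1.0]
h = frozen_hidden_layer(x, W, b)
r = [0.0, 0.0]            # start from the unmodified general model
target = [1.5, 1.0]       # assumed targets, for illustration only

loss_before = squared_error(apply_lhuc(h, r), target)
r = lhuc_gradient_step(h, r, target, lr=0.5)
loss_after = squared_error(apply_lhuc(h, r), target)
```

In this sketch the weights `W` and `b` stay fixed, and the only per-user state is the small vector `r`, one value per hidden unit.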
3. The technological innovation
The speech team of Xiaomi AI Lab has extensive research experience in speech recognition and speaker adaptation. The team has developed advanced dynamic decoding technology for personalized speech recognition and has previously built a personalized model for Mr. Lei Jun, Chairman of Xiaomi Group, whose accuracy in product launch scenarios significantly exceeds that of external speech recognition products. A paper on this technology has been published at Interspeech, a top academic conference in the speech field, and has gained a certain academic influence. Xiaomi's advantages are its very large user base, deep technical accumulation, and strong engineering execution. Specifically, this work contains several innovations:
The first is a novel anti-overfitting technique. The most representative sentences are selected automatically for training, using the richness of modeling units as the optimization criterion, and more accurate phoneme-level language models are trained on very large-scale data to alleviate the sparsity of a single user's corpus. The second is to provide different technical solutions for different degrees of accent. For users with slight or medium accents, the LHUC chain model is used so that the adapted model retains as much of the general model's performance as possible. For users with severe accents or dysphonia, fine-tuning is applied instead. The third is faster deployment. When the user starts the service, only the adjustment parameters of the personalized model are loaded, which speeds up both deployment and inference.
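One way to picture sentence selection driven by the richness of modeling units is a greedy coverage heuristic: repeatedly pick the sentence that adds the most units not yet covered. The sketch below is an assumption for illustration, not the selection algorithm actually used by Xiaomi; the function name, the sentence ids, the unit labels, and the budget parameter are all hypothetical.

```python
def select_sentences(candidates, budget):
    """Greedily choose up to `budget` sentences that maximize unit coverage.

    candidates: dict mapping a sentence id to the list of modeling units
                (for example, phoneme labels) that the sentence contains.
    Returns the chosen ids in selection order and the set of covered units.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Pick the sentence contributing the most units not yet covered.
        best_id = max(remaining,
                      key=lambda sid: len(set(remaining[sid]) - covered))
        gain = set(remaining[best_id]) - covered
        if not gain:
            break  # no remaining sentence adds a new unit
        chosen.append(best_id)
        covered |= gain
        del remaining[best_id]
    return chosen, covered


# Hypothetical candidate sentences with made-up unit labels.
candidates = {
    "s1": ["a", "b"],
    "s2": ["a", "b", "c"],
    "s3": ["d"],
    "s4": ["c"],
}
chosen, covered = select_sentences(candidates, budget=2)
```

Here `s2` is picked first because it covers three units, and `s3` second because it is the only remaining sentence that adds a new one.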
4. Application prospects
The most valuable applications of personalized speech recognition technology help users with dysarthria and users with heavy accents. By improving recognition accuracy for these users, the technology promises to give them more efficient speech interaction with AI products, so that they can live as equals and enjoy the dividends of the digital age just as regular users do.
In addition, XiaoAi, Xiaomi's AI voice assistant, has been building an intelligent world on top of a powerful IoT (Internet of Things) ecosystem, with more than 1 million DAU (Daily Active Users). Xiaomi is also considering how to maximize the value of personalized ASR technology for users, especially the elderly and children, whose pronunciation has its own characteristics. Experiments on data from real online XiaoAi users, mostly elderly users with heavy accents, show that personalized ASR technology can significantly improve speech recognition accuracy for the target users. In the future, Xiaomi hopes that AI technology can help more users in need, so that everyone in the world can enjoy the intelligent and wonderful life brought by innovation.
Copyright © World Internet Conference. All rights Reserved
Presented by China Daily. 京ICP备13028878号-23