Tips on the hottest AI technology

Yige AI technology tips

In the era of the Internet of Things, speech recognition is regarded as a new entry point for human-computer interaction, and natural language interaction between humans and robots has become possible. This week, the Yige science and technology station introduces the front-end processing of speech recognition from a technical perspective.

Front-end speech processing uses signal processing methods to detect and denoise the speaker's speech, so as to obtain the speech best suited to processing by the speech recognition engine. Its main functions are endpoint detection (VAD), intelligent sentence breaking for streaming speech, and noise elimination.

I. Endpoint detection

Voice endpoint detection analyzes the input audio stream and determines the start and end points of the customer's speech. Once the customer is detected to be speaking, the audio begins to flow to the recognition engine, and it keeps flowing until the customer is detected to have stopped speaking. This lets the recognition engine start recognizing while the customer is still talking, which maximizes real-time processing.

1 Endpoint detection process

1. Based on the characteristics of speech signals, parameters such as energy, zero-crossing rate, entropy, and pitch, together with parameters derived from them, are used to classify the signal stream into speech and non-speech.

2. Once speech is detected in the signal stream, decide whether it marks the start or the end of an utterance. In commercial speech systems, variable signal backgrounds and a natural dialogue style make pauses (non-speech) inside sentences more likely; in particular, there is always a silent gap before the burst of a plosive initial consonant. This makes the start/end decision particularly important.
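The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular engine: the frame length, the thresholds, and the names `frame_features`, `classify`, and `find_endpoints` are all hypothetical, and only energy and zero-crossing rate are used (entropy and pitch are left out).

```python
import math

def frame_features(samples, frame_len=80):
    """Return (energy, zero_crossing_rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (frame_len - 1)))
    return feats

def classify(feats, energy_min=0.01, zcr_max=0.5):
    """Step 1: flag a frame as speech if it is loud enough and not noise-like.
    Both thresholds are assumed values chosen for this toy signal."""
    return [e > energy_min and z < zcr_max for e, z in feats]

def find_endpoints(flags, min_speech=3, min_silence=5):
    """Step 2: turn per-frame flags into (start_frame, end_frame) utterances.
    A start needs min_speech consecutive speech frames; an end needs
    min_silence consecutive non-speech frames, so short pauses are bridged."""
    segments, start, run_speech, run_silence = [], None, 0, 0
    for i, speech in enumerate(flags):
        if start is None:
            run_speech = run_speech + 1 if speech else 0
            if run_speech >= min_speech:
                start, run_silence = i - min_speech + 1, 0
        else:
            run_silence = 0 if speech else run_silence + 1
            if run_silence >= min_silence:
                segments.append((start, i - min_silence))
                start, run_speech = None, 0
    if start is not None:  # stream ended while an utterance was still open
        segments.append((start, len(flags) - 1 - run_silence))
    return segments

def tone(n_frames, frame_len=80):
    """Synthetic stand-in for voiced speech: a 200 Hz tone at 8 kHz sampling."""
    return [0.5 * math.sin(math.pi * n / 20) for n in range(n_frames * frame_len)]

# Silence, speech, a 3-frame pause inside the sentence, more speech, silence.
signal = [0.0] * 800 + tone(5) + [0.0] * 240 + tone(5) + [0.0] * 800
segments = find_endpoints(classify(frame_features(signal)))
print(segments)  # [(10, 22)]
```

The 3-frame pause in the middle is shorter than `min_silence`, so it is treated as part of the same utterance rather than as an end point.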

2 Purpose of endpoint detection

Reduce the amount of data the recognizer must process. Endpoint detection can greatly reduce the volume of transmitted signal and the computational load on the recognizer, which matters for real-time recognition of voice conversations.

Reject non-speech signals. Recognizing non-speech signals not only wastes resources but may also change the state of the dialogue, causing trouble for users.

Provide the speech start point for systems that require barge-in. When endpoint detection finds the start of speech, the system stops playing the prompt tone, which completes the interrupt function.
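As a rough sketch of the barge-in behaviour described above, the function below simulates playing a prompt frame by frame and stops it once a speech start point is confirmed. The name `play_prompt`, the frame-based timing, and the `min_speech` confirmation count are assumptions made for illustration only.

```python
def play_prompt(vad_flags, prompt_frames, min_speech=3):
    """Simulate prompt playback with barge-in.

    vad_flags[i] is True when the endpoint detector flags frame i as speech.
    The prompt stops as soon as min_speech consecutive speech frames are seen.
    Returns ("barge-in", frame_index) or ("finished", prompt_frames).
    """
    run = 0
    for i in range(prompt_frames):
        speech = i < len(vad_flags) and vad_flags[i]
        run = run + 1 if speech else 0
        if run >= min_speech:
            return ("barge-in", i)  # stop the prompt, hand over to recognition
    return ("finished", prompt_frames)

flags = [False, False, True, False, True, True, True, False]
print(play_prompt(flags, prompt_frames=20))         # ('barge-in', 6)
print(play_prompt([False] * 8, prompt_frames=20))   # ('finished', 20)
```

Requiring several consecutive speech frames means the isolated speech flag at frame 2 does not stop the prompt; only the sustained run starting at frame 4 does.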

3 Impact of endpoint detection on the recognition system

With the development of speech recognition applications, more and more systems treat the interruption (barge-in) function as a convenient and effective mode of interaction, and this function depends directly on endpoint detection.

Endpoint detection affects the interruption function whenever the speech/non-speech decision is wrong. When endpoint detection is overly sensitive, false alarms on the speech signal produce false interruptions.

For example, the prompt tone may be interrupted by strong background noise or by other people's speech, because endpoint detection mistakes these signals for valid speech. Conversely, if endpoint detection misses the actual speech and detects nothing, the system appears unresponsive: the prompt tone keeps playing while the user is speaking.

Endpoint detection also has a great impact on the recognition results. If the start or end point of the speech signal is judged incorrectly, useful data at the beginning or end of the utterance may be dropped and the signal is no longer complete. Incomplete information of this kind is likely to reduce the recognition rate.

4 Features that commercial endpoint detection should have

High endpoint detection accuracy.

Better background noise and speech models, so that the system rejects background noise, other speakers, and non-speech sounds well.

Default system parameters that apply well in most cases. Where a real environment requires it, the system can be tuned to the call environment to improve endpoint detection.

Adaptation to the channel: the system quickly adapts to the current channel characteristics after the conversation begins, which further improves endpoint detection accuracy.

Dual end-point determination that combines feedback from the recognition server with the duration of non-speech. This effectively improves detection of the speech end point, especially for longer sentences.

Intelligent interruption built on reliable endpoint detection and intelligent feedback. It should work well in ordinary environments and also reject ambient noise, high-intensity non-speech noise (breathing, doors closing, etc.), and the voices of other people.
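The article does not say how the dual end-point determination combines its two signals, so the following is only one plausible reading, written as a hedged sketch. The function name `utterance_ended`, the boolean `recognizer_says_complete`, and both millisecond thresholds are hypothetical.

```python
def utterance_ended(silence_ms, recognizer_says_complete,
                    short_ms=300, long_ms=1200):
    """Decide whether the speaker has finished.

    Assumed rule: a long silence always ends the utterance; a shorter
    silence ends it only if the recognition server reports that the text
    recognized so far looks complete.
    """
    if silence_ms >= long_ms:
        return True
    return silence_ms >= short_ms and recognizer_says_complete

print(utterance_ended(400, True))    # True: short pause, sentence looks complete
print(utterance_ended(400, False))   # False: keep listening through the pause
print(utterance_ended(1300, False))  # True: silence alone is long enough
```

Under this reading, a speaker who pauses in the middle of a long sentence is not cut off at the short threshold, while a clearly finished sentence gets a fast response.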

II. Intelligent sentence breaking for streaming voice

The existing voice processing scheme uses a voice activity detection (VAD) module to segment the audio and then runs automatic recognition on each segment. In voice interaction scenarios, however, VAD faces two problems:

01 How to detect speech at the lowest energy levels (sensitivity)

02 How to detect speech reliably in a changeable, complex noise environment (missed detection rate and false detection rate)

Missed detection means that speech was present but was not detected, while the false detection rate reflects the probability that a signal which is not speech is detected as speech. Relatively speaking, missed detection is unacceptable. False detections can be filtered further by the back-end ASR and NLP algorithms, but they increase system resource usage and delay the response.
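To make the two rates concrete, here is a minimal sketch that computes them from per-frame labels. The function name `detection_rates` and the toy label sequences are illustrative assumptions; real evaluations may count at the segment level instead of the frame level.

```python
def detection_rates(reference, detected):
    """Return (missed_detection_rate, false_detection_rate).

    reference[i] is 1 if frame i truly contains speech, else 0.
    detected[i] is 1 if the VAD flagged frame i as speech, else 0.
    """
    speech = sum(reference)
    non_speech = len(reference) - speech
    missed = sum(1 for r, d in zip(reference, detected) if r and not d)
    false_hits = sum(1 for r, d in zip(reference, detected) if d and not r)
    miss_rate = missed / speech if speech else 0.0
    false_rate = false_hits / non_speech if non_speech else 0.0
    return miss_rate, false_rate

reference = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
detected  = [0, 1, 1, 1, 1, 0, 0, 0, 1, 0]
miss_rate, false_rate = detection_rates(reference, detected)
print(miss_rate)             # 0.25  (1 of 4 speech frames missed)
print(round(false_rate, 3))  # 0.333 (2 of 6 non-speech frames flagged)
```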

A false detection before the target speaker's interactive speech mainly increases the amount of data that ASR has to process, as shown in the following figure:

A false detection after the target speaker's interactive speech not only increases the amount of data that ASR has to process, but also delays the response.

Existing speech processing schemes segment sentences inaccurately, with two main shortcomings:

First, noise and invalid speech cannot be filtered out.

Second, they place high demands on the speaker, who cannot pause mid-sentence. If the pause duration between sentences is set too short, utterances are easily truncated; if it is set too long, the response is delayed.
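The trade-off in the pause duration setting can be shown with a toy example. The sketch below segments a sequence of per-frame speech flags using a fixed pause length; the function name `segment` and the frame counts are assumptions chosen only to expose the effect.

```python
def segment(flags, pause_frames):
    """Split speech flags into (start, end, decided_at) segments.

    A segment is closed once pause_frames consecutive non-speech frames
    follow it; decided_at is the frame at which that decision is made.
    Segments still open when the flags run out are not emitted.
    """
    segments, start, silence = [], None, 0
    for i, speech in enumerate(flags):
        if speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence == pause_frames:
                segments.append((start, i - pause_frames, i))
                start, silence = None, 0
    return segments

# One sentence with a 4-frame pause in the middle, then trailing silence.
flags = [1] * 6 + [0] * 4 + [1] * 6 + [0] * 10

print(segment(flags, pause_frames=3))  # [(0, 5, 8), (10, 15, 18)]
print(segment(flags, pause_frames=8))  # [(0, 15, 23)]
```

With a 3-frame setting the sentence is cut in two at the pause. With an 8-frame setting it stays whole, but the end is only declared at frame 23, eight frames after the speaker actually stopped at frame 15.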

As shown in the figure below:

The streaming voice intelligent sentence breaking module consists mainly of a speech recognition module, an information flow aggregation module, a dynamic window setting module, and a sentence breaking recognition module. Among them:

The speech recognition module receives and recognizes the real-time speech stream and outputs time-stamped speech recognition results at a specified frequency.

The information flow aggregation module optimizes the time-stamped speech recognition results and integrates the optimized results into a speech recognition result sequence.

The dynamic window setting module selects the text in a specified range from the speech recognition result sequence, and that text is then used for sentence breaking analysis.

The sentence breaking recognition module analyzes the semantics of the text in the specified range and decides, based on the semantics, whether to break the sentence.
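The data flow between the four modules can be sketched roughly as follows. Everything here is a stand-in: the speech recognition module is simulated by a list of time-stamped partial results, "optimizing" is assumed to mean ordering by time and dropping empty or duplicate results, and the semantic analysis is replaced by a placeholder rule based on a hypothetical `END_WORDS` set. A real system would use an actual recognizer and a semantic model.

```python
from dataclasses import dataclass

@dataclass
class Partial:
    time_ms: int  # time stamp of this recognition result
    text: str     # words recognized since the previous result

END_WORDS = {"today", "please", "thanks"}  # hypothetical sentence-final cues

def aggregate(partials):
    """Information flow aggregation: order by time, drop empty and
    duplicate results, and flatten into a (time_ms, word) sequence."""
    seq, seen = [], set()
    for p in sorted(partials, key=lambda p: p.time_ms):
        if p.text and p.time_ms not in seen:
            seen.add(p.time_ms)
            seq.extend((p.time_ms, w) for w in p.text.split())
    return seq

def window(seq, start, end, max_words):
    """Dynamic window: the most recent max_words words since the last break."""
    return [w for _, w in seq[start:end][-max_words:]]

def should_break(words):
    """Sentence breaking recognition (placeholder for semantic analysis)."""
    return len(words) >= 2 and words[-1] in END_WORDS

def break_sentences(partials, max_words=8):
    seq = aggregate(partials)
    sentences, start = [], 0
    for end in range(1, len(seq) + 1):
        if should_break(window(seq, start, end, max_words)):
            sentences.append(" ".join(w for _, w in seq[start:end]))
            start = end
    remainder = " ".join(w for _, w in seq[start:])
    return sentences, remainder

# Simulated output of the speech recognition module, with one duplicate.
partials = [
    Partial(200, "check the weather"),
    Partial(400, "today"),
    Partial(400, "today"),
    Partial(700, "and play"),
    Partial(900, "some music please"),
]
sentences, remainder = break_sentences(partials)
print(sentences)  # ['check the weather today', 'and play some music please']
```

Because the break decision is made on text and not on silence alone, the sentence boundary after "today" is found even though the simulated stream contains no long pause there.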

III. Noise elimination

The engine provides noise elimination. In practical applications, background noise is a real challenge for speech recognition. Even if the speaker is in a quiet office environment, some noise is inevitable during a voice call. The speech recognition system needs efficient noise elimination to meet customers' requirements in diverse environments.

That's all for this introduction to speech recognition front-end processing. Yige Technology will bring you more technical explanations in follow-up articles. Please look forward to them.

Copyright © 2011 JIN SHI