International Journal of Performability Engineering, 2018, 14(12): 2994-3004 doi: 10.23940/ijpe.18.12.p9.29943004

Reliability Modeling of Speech Recognition Tasks

Hui Qiu, Xiangbin Yan, Rui Peng*, Kaiye Gao, and Langtao Wu

Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing, 100083, China

*Corresponding author. E-mail address: pengrui1988@ustb.edu.cn


Abstract

Speech recognition is becoming a key technology of man-machine interfaces in information technology. The application of voice technology has become a competitive new high-tech industry. However, due to the large vocabulary, continuous speech, and personalized accents, it is hard to make speech recognition completely accurate. In this paper, a reliability model is proposed to measure the performance of speech recognition. In particular, two types of task failures are suggested and an iterative approach is adopted. Numerical examples are presented for illustrative purposes.

Keywords: reliability model; speech recognition task; iterative approach



1. Introduction

Speech recognition technology has applications in different systems, such as automatic translation telephones, question-and-answer machines, and intelligent decision support systems [1-4]. The mechanism of speech recognition lies in the separation of the words and the matching of patterns between the words in the speech and the words in a dictionary [5-6]. Although many measures have been put forward to improve speech recognition accuracy, research on quantitatively evaluating the performance and reliability of speech recognition tasks is relatively limited [7].

With the advancement of artificial intelligence, the accuracy of speech recognition has been greatly improved, and speech recognition is already applied in everyday life. For example, the WeChat app can transform voice messages into text messages with acceptable accuracy. When one receives a voice message on WeChat and is not willing to listen, such as in a noisy environment, one can choose to transform it into a text message in less than five seconds. Although such progress has been made, achieving high accuracy in speech recognition is still quite difficult for three reasons [8-10].

The first reason that constrains the accuracy of speech recognition is the extensive vocabulary. Whatever language the speech is in, the language contains at least tens of thousands of words, of which at least a few thousand are frequently used. The more words a language has and the more words with similar pronunciations, the more likely one word is to be wrongly recognized as another. For instance, it may be hard to distinguish "what" from "water", "constellation" from "consternation", "bed" from "bird", etc. The second reason is that the pronunciations of different words may be hard to separate, especially when the speech is very fast. The third reason, which is even harder to cope with, is the personalized accents of different people. It is hard to find any two people who have exactly the same pronunciation for all words. Even the same person may pronounce words differently at different times, such as pronouncing words more nasally after catching a cold.

In this paper, we assume that word separation is perfect, but each word has some probability of being recognized wrongly. A success judging criterion is suggested for a speech recognition task. In particular, two failure modes are considered for judging whether a speech recognition task is successful: 1) if too many consecutive words are recognized wrongly, the speech recognition task is regarded as failed. Indeed, too many wrongly recognized words may make it difficult to grasp the accurate meaning of a speech. 2) If the first failure mode does not happen, a total score is calculated for the errors that have occurred in the speech recognition. In particular, the more consecutive words that are recognized wrongly at a place, the bigger the increment added to the score. If the total score of the speech recognition is no less than a threshold, the speech recognition task is regarded as unsuccessful.

This paper is organized as follows: in Section 2, a reliability model is proposed considering the two types of failures. In Section 3, specific numerical examples are given and a comparative analysis is carried out. In Section 4, a summary of this article is given.

2. The Modeling Framework

Consider a speech consisting of N words, where each word has a probability ${{p}_{i}}$, $i\in \{1,2,\cdots ,N\}$, $(0\le {{p}_{i}}\le 1)$ of being recognized wrongly. To start, let all ${{p}_{i}}$'s be equal and independent, with all ${{p}_{i}}$'s replaced by p. In the future, this assumption can be readily relaxed to adapt to specific situations.

In the case where at least K ($K\le N$) consecutive words are recognized wrongly, the speech recognition is regarded as unsuccessful. In fact, too many consecutively wrongly recognized words in a speech recognition task may make one miss important information. If one considers only this failure mode, then the reliability of the speech recognition task is the reliability of a linear consecutive K-out-of-N: F system, where "F" denotes speech recognition task failure. The linear consecutive K-out-of-N: F system consists of N components and fails as long as at least K consecutive components fail. The linear consecutive K-out-of-N system has been studied by many researchers and widely applied [11-15].

In the case where the longest run of consecutive wrongly recognized words is smaller than K, the recognized content of a speech may still be hard to understand correctly if too many places are recognized wrongly. The more places that are recognized wrongly and the more consecutive words that are recognized wrongly at each place, the harder it is to understand the speech correctly. In order to capture this feature, we suggest a scoring mechanism for the speech recognition task. The initial score is set to zero. For each place recognized wrongly, an increment is added to the score depending on how many consecutive words are recognized wrongly at this place.

The reliability of a speech recognition task is expressed as $R(N;p;s(1),\cdots ,s(N);S)$. Here, N is the number of words in the speech of concern and p is the probability of each word being recognized wrongly. In particular, the increment $s(i)>0$ is added to the score if i consecutive words are recognized wrongly at a place, where $i=1,2,\cdots ,N$, $s(i)$ is strictly increasing, and $s(0)=0$. The task is deemed failed if the total score is no less than a threshold S. Note that it is implicitly assumed that all the words make the same contribution to the score. This assumption can be relaxed to adapt to specific occasions in the future. For example, a noun may be more important than an adjective, and thus a bigger increment should be added if a noun is recognized wrongly. It can be seen that this judging criterion actually takes both failure modes into account. If a single string of consecutive wrongly recognized words contributes a score no less than S, the task fails. On the other hand, if every string of consecutive wrongly recognized words contributes less than S but the summation of these contributions reaches S, the task also fails.
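To make this criterion concrete, the following sketch (a minimal Python illustration, not part of the original model; the function name and argument layout are our own) checks the success criterion for one given pattern of recognition errors. A run whose own increment already reaches S, i.e., a run of length at least K as in the first failure mode, is automatically captured, because that single increment pushes the total score past the threshold.

```python
def task_succeeds(errors, s, S):
    """Success criterion for one recognized speech (illustrative sketch).

    errors -- list of booleans, True where a word was recognized wrongly
    s      -- score increments; s[i] is added for a maximal run of i wrong
              words, with s[0] = 0 and s(i) strictly increasing
    S      -- failure threshold on the total score
    """
    total, run = 0, 0
    for wrong in errors + [False]:       # sentinel flushes a trailing run
        if wrong:
            run += 1
            continue
        if run > 0:                      # a maximal run of wrong words just ended
            total += s[run] if run < len(s) else s[-1]
            run = 0
    return total < S                     # a total of S or more means failure
```

For example, with s = [0, 10, 30, 60, 90, 120, 150, 180, 210] and S = 100, an error pattern containing one run of two wrong words and one isolated wrong word scores 30 + 10 = 40 and is still judged successful.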

In order to introduce the general model form of $R(N;p;s(1),\cdots ,s(N);S)$, the following cases need to be distinguished:

If $S\le 0$, $R(N;p;s(1),\cdots ,s(N);S)=0$. Actually, $S\le 0$ implies that the threshold has already been reached.

If $S>0$ and $s(N)<S$, then even N consecutive wrongly recognized words cannot reach the threshold; in this case $R(N;p;s(1),\cdots ,s(N);S)=1$, where $s(0)=0$. Actually, when $s(N)<S$, the task is still regarded as successful even if all the words are recognized wrongly.

If $S>0$ and $s(N)\ge S$, an explicit form for $R(N;p;s(1),\cdots ,s(N);S)$ is difficult to obtain; however, an iterative approach can readily solve it. Use $M\le N+1$ to denote the position of the first correctly recognized word in the speech, where $M=N+1$ corresponds to the case where all the words are recognized wrongly. The probability of any $M=i$ can be expressed as

$\text{pr}\left( M=i \right)={{p}^{i-1}}\left( 1-p \right),\text{ }i=1,\cdots ,N$ (1)

In particular, when $M=N+1$, $\text{pr}\left( M=N+1 \right)={{p}^{N}}$.

In the case where $M=i$, the conditional reliability of the speech recognition task is $R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right)$. Therefore, it is easy to obtain

$\begin{matrix} & R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)=\sum\limits_{i=1}^{N}{{{p}^{i-1}}\left( 1-p \right)}R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right) \\ & +{{p}^{N}}R\left( 0;p;s\left( 1 \right),\cdots ,s\left( N \right);S-s\left( N \right) \right) \end{matrix}$ (2)

Because $s(N)\ge S$, it is easy to obtain

$R\left( 0;p;s\left( 1 \right),\cdots ,s\left( N \right);S-s\left( N \right) \right)=0$

Define $K=\min \left\{ i\left| s\left( i \right)\ge S,\text{ }i=1,2,\cdots ,N \right. \right\}$. Whenever at least K consecutive words are recognized wrongly, the speech recognition is regarded as unsuccessful.

Thus, $R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)$ can be degenerated into $R\left( N;p;K;s\left( 1 \right),\cdots ,s\left( K-1 \right);S \right)$.

In short, since $s(K)\ge S$, the speech recognition task fails directly once K consecutive words are recognized wrongly, so in any successful recognition there can be at most $K-1$ consecutive errors and only the increments $s(1),\cdots ,s(K-1)$ are ever used. The sum over N terms is thus decomposed into a sum over K terms, and the degenerate formula depends on the parameter K: the K terms correspond to the first word being correct, the first word wrong and the second correct, and so on up to the first $K-1$ words wrong and the K-th word correct.

It is easy to see that when $i-1\ge K$, $S-s\left( i-1 \right)\le 0$ and thus $R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right)=0$.

As a result, M must be one of $1,2,\cdots ,K$ in order for the task to be successful.

Through the above analysis, Equation (2) can be simplified to:

$\begin{matrix} & R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right) \\ & =R\left( N;p;K;s\left( 1 \right),\cdots ,s\left( K-1 \right);S \right) \\ & =\sum\limits_{i=1}^{K}{{{p}^{i-1}}\left( 1-p \right)}R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( K-1 \right);S-s\left( i-1 \right) \right) \\ \end{matrix}$ (3)

In Equation (3), we see that $R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)$ has been decomposed into a form consisting of $R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right)$.

In particular, $R\left( N;p;K;s\left( 1 \right),\cdots ,s\left( K-1 \right);S \right)$ can be obtained by iteratively decomposing it. In order to illustrate the procedures, numerical examples are presented in the next section.
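The iterative decomposition is straightforward to mechanize. The sketch below (a minimal Python implementation written for illustration; it is not the authors' code) evaluates $R(N;p;s(1),\cdots ,s(N);S)$ by conditioning on M, the position of the first correctly recognized word, exactly as in Equations (2) and (3). Sub-problems are memoized, and the summation stops once $s(i-1)\ge S$, i.e., once $i-1\ge K$.

```python
def reliability(N, p, s, S):
    """R(N; p; s(1), ..., s(N); S): probability that the speech recognition
    task succeeds, given per-word error probability p and score table s,
    where s[i] is the increment for i consecutive wrong words and s[0] = 0."""
    def score(i):
        # Runs longer than the supplied table score at least the last entry.
        return s[i] if i < len(s) else s[-1]

    memo = {}

    def R(n, thr):
        if thr <= 0:                      # threshold already reached: failure
            return 0.0
        if n == 0 or score(n) < thr:      # even all-wrong cannot reach the threshold
            return 1.0
        if (n, thr) in memo:
            return memo[(n, thr)]
        total = 0.0
        for i in range(1, n + 1):         # the first correct word is the i-th word
            if score(i - 1) >= thr:       # i - 1 >= K: that error run alone fails
                break
            total += p ** (i - 1) * (1 - p) * R(n - i, thr - score(i - 1))
        # The all-wrong term p**n * R(0, thr - score(n)) vanishes since score(n) >= thr.
        memo[(n, thr)] = total
        return total

    return R(N, S)
```

For instance, reliability(8, 0.1, [0, 10, 30, 60, 90, 120, 150, 180, 210], 100) evaluates the first numerical example of the next section at p = 0.1.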

3. Numerical Examples

In order to get a thorough understanding of the given reliability model, we propose two strategic numerical examples. For the first strategic numerical example, we analyse the same $s(i)$ score array under different thresholds. For the second strategic numerical example, we analyse different $s(i)$ score arrays under the same threshold.

3.1 A Numerical Example Analysis of the First Strategy

Given a speech example of N=8 words, assume that $\{s(0),s(1),\cdots ,s(8)\}=\{0,10,30,60,90,120,150,180,210\}$. The speech recognition task fails if the total score reaches S=100; that is, the threshold is S=100. Thus, the speech recognition task fails if at least K=5 consecutive words are recognized wrongly.

$R\left( 8;p;10,30,60,90,120,150,180,210;100 \right)$ can be degenerated into $R\left( 8;p;5;10,30,60,90;100 \right)$, which denotes the reliability of the speech recognition task. It can be further decomposed into

$\begin{matrix} & R\left( 8;p;5;10,30,60,90;100 \right) \\ & =\left( 1-p \right)R\left( 7;p;10,30,60,90;100 \right)+p\left( 1-p \right)R\left( 6;p;10,30,60,90;90 \right) \\ & +{{p}^{2}}\left( 1-p \right)R\left( 5;p;10,30,60,90;70 \right)+{{p}^{3}}\left( 1-p \right)R\left( 4;p;10,30,60,90;40 \right) \\ & +{{p}^{4}}\left( 1-p \right)R\left( 3;p;10,30,60,90;10 \right)\text{ } \\ \end{matrix}$ (4)

Here, $R\left( 7;p;10,30,60,90;100 \right)$ can be degenerated into $R\left( 7;p;5;10,30,60,90;100 \right)$, $R\left( 6;p;10,30,60,90;90 \right)$ can be degenerated into $R\left( 6;p;4;10,30,60;90 \right)$, $R\left( 5;p;10,30,60,90;70 \right)$ can be degenerated into $R\left( 5;p;4;10,30,60;70 \right)$, $R\left( 4;p;10,30,60,90;40 \right)$ can be degenerated into $R\left( 4;p;3;10,30;40 \right)$, and $R\left( 3;p;10,30,60,90;10 \right)$ can be degenerated into $R\left( 3;p;1;0;10 \right)$.

Equation (4) is converted as follows:

$\begin{matrix} & R\left( 8;p;5;10,30,60,90;100 \right) \\ & =\left( 1-p \right)R\left( 7;p;5;10,30,60,90;100 \right)+p\left( 1-p \right)R\left( 6;p;4;10,30,60;90 \right) \\ & +{{p}^{2}}\left( 1-p \right)R\left( 5;p;4;10,30,60;70 \right)+{{p}^{3}}\left( 1-p \right)R\left( 4;p;3;10,30;40 \right) \\ & +{{p}^{4}}\left( 1-p \right)R\left( 3;p;1;0;10 \right)\text{ } \\ \end{matrix}$

Note that $R\left( 3;p;1;0;10 \right)$ corresponds to the probability that no more words are recognized wrongly starting from the 6th word, and it equals ${{\left( 1-p \right)}^{3}}$. Thus, we have

$\begin{matrix} & R\left( 8;p;5;10,30,60,90;100 \right) \\ & =\left( 1-p \right)R\left( 7;p;5;10,30,60,90;100 \right)+p\left( 1-p \right)R\left( 6;p;4;10,30,60;90 \right) \\ & +{{p}^{2}}\left( 1-p \right)R\left( 5;p;4;10,30,60;70 \right)+{{p}^{3}}\left( 1-p \right)R\left( 4;p;3;10,30;40 \right)+{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{ } \\ \end{matrix}$

Furthermore, we have

$\begin{matrix} & R\left( 4;p;3;10,30;40 \right) \\ & =\left( 1-p \right)R\left( 3;p;3;10,30;40 \right)+p\left( 1-p \right)R\left( 2;p;2;10;30 \right)+{{p}^{2}}\left( 1-p \right)R\left( 1;p;1;0;10 \right) \\ & =\left( 1-p \right)\left( 1-{{p}^{3}} \right)+p\left( 1-p \right)\left( 1-{{p}^{2}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{2}}\text{ } \\ \end{matrix}$

$\begin{matrix} & R\left( 5;p;4;10,30,60;70 \right) \\ & =\left( 1-p \right)R\left( 4;p;4;10,30,60;70 \right)+p\left( 1-p \right)R\left( 3;p;3;10,30;60 \right)+{{p}^{2}}\left( 1-p \right)R\left( 2;p;3;10,30;40 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 1;p;1;0;10 \right) \\ & =\left( 1-p \right)\left( 1-{{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{2}}\left( 1-p \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\text{ } \\ \end{matrix}$

$\begin{matrix} & R\left( 6;p;4;10,30,60;90 \right) \\ & =\left( 1-p \right)R\left( 5;p;4;10,30,60;90 \right)+p\left( 1-p \right)R\left( 4;p;4;10,30,60;80 \right)+{{p}^{2}}\left( 1-p \right)R\left( 3;p;3;10,30;60 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 2;p;2;10;30 \right) \\ & =\left( 1-p \right)\left( 1-{{p}^{5}}-2\left( 1-p \right){{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{4}} \right)+{{p}^{2}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)+{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{2}} \right)\text{ } \\ \end{matrix}$

$\begin{matrix} & R\left( 7;p;5;10,30,60,90;100 \right) \\ & =\left( 1-p \right)R\left( 6;p;5;10,30,60,90;100 \right)+p\left( 1-p \right)R\left( 5;p;4;10,30,60;90 \right)+{{p}^{2}}\left( 1-p \right)R\left( 4;p;4;10,30,60;70 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 3;p;3;10,30;40 \right)+{{p}^{4}}\left( 1-p \right)R\left( 2;p;1;0;10 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}+{{p}^{4}}-3{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-3{{p}^{4}} \right)\text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{3}}\text{ } \\ \end{matrix}$

Lastly, the reliability of the speech recognition task can be expressed as follows:

$\begin{matrix} & R\left( 8;p;5;10,30,60,90;100 \right) \\ & ={{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}+{{p}^{4}}-3{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{3}}+p\left( 1-p \right)\left( 1-p \right)\left( 1-{{p}^{5}}-2\left( 1-p \right){{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{4}} \right)+{{p}^{2}}\left( 1-p \right)\left( 1-{{p}^{3}} \right) \\ & +{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{2}} \right)+{{p}^{2}}\left( 1-p \right)\left( 1-p \right)\left( 1-{{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{2}}\left( 1-p \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}} \\ & +{{p}^{3}}\left( 1-p \right)\left( 1-p \right)\left( 1-{{p}^{3}} \right)+p\left( 1-p \right)\left( 1-{{p}^{2}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{2}}+{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{ } \\ \end{matrix}$
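The closed-form expression above can be checked numerically. The sketch below (an illustrative cross-check reusing the task_succeeds and reliability helpers sketched earlier) compares the recursive value with a Monte Carlo estimate obtained by sampling word errors independently with probability p.

```python
import random

def simulate_reliability(N, p, s, S, trials=200_000, seed=1):
    """Monte Carlo estimate of the task reliability (illustrative cross-check)."""
    rng = random.Random(seed)
    successes = sum(
        task_succeeds([rng.random() < p for _ in range(N)], s, S)
        for _ in range(trials)
    )
    return successes / trials

s = [0, 10, 30, 60, 90, 120, 150, 180, 210]
for p in (0.05, 0.1, 0.2, 0.3):
    print(p, reliability(8, p, s, 100), simulate_reliability(8, p, s, 100))
```

The two values should agree up to sampling noise, and the same check applies to each of the threshold-specific formulas derived below.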

Figure 1 shows the reliability of a speech consisting of N=8 words with threshold S=100 as a function of p, the probability of wrongly recognizing a word. It can be seen that the smaller the probability of each word being recognized wrongly, the greater the reliability. This is consistent with intuition.

Figure 1.   Reliability of speech recognition task (N=8, S=100)


In order to further analyse the properties of the reliability, we compare the reliability under different thresholds. For the above numerical example, we keep the values of N and $\{s(1),\cdots ,s(N)\}$ but try threshold values S={10,20,30,40,50,60,70,80,90,100}. We calculate the reliability under each threshold. The calculation process is as follows:

Based on the above example, we assume that the other conditions are constant, and the threshold value becomes S=90.

That is to say, the speech recognition task fails if the total score reaches S=90. Thus, the speech recognition task fails if at least K=4 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;90 \right)$ can be degenerated into $R\left( 8;p;4;10,30,60;90 \right)$. We can obtain the reliability formula when the threshold value S=90.

$\begin{matrix} & R\left( 8;p;4;10,30,60;90 \right) \\ & =\left( 1-p \right)R\left( 7;p;4;10,30,60;90 \right)+p\left( 1-p \right)R\left( 6;p;4;10,30,60;80 \right)+{{p}^{2}}\left( 1-p \right)R\left( 5;p;3;10,30;60 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 4;p;2;10;30 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-2{{p}^{4}}-2{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}-{{p}^{3}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1+p-{{p}^{2}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right) \\ & \text{+}{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}-{{p}^{3}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right)+{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right) \\ & +{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1+p-{{p}^{2}} \right)+{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{ } \\ \end{matrix}$

Similarly, assume the threshold value becomes S=80, that is, the speech recognition task fails if the total score reaches S=80. Thus, the speech recognition task fails if at least K=4 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;80 \right)$ can be degenerated into $R\left( 8;p;4;10,30,60;80 \right)$. We can get the following formula when the threshold value S=80:

$\begin{matrix} & R\left( 8;p;4;10,30,60;80 \right) \\ & =\left( 1-p \right)R\left( 7;p;4;10,30,60;80 \right)+p\left( 1-p \right)R\left( 6;p;4;10,30,60;70 \right)+{{p}^{2}}\left( 1-p \right)R\left( 5;p;3;10,30;50 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 4;p;2;10;20 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-3{{p}^{4}}-8{{p}^{5}}+7{{p}^{6}} \right)\text{+}p\left( 1-p \right)\left( 1-9{{p}^{4}}+12{{p}^{5}}-4{{p}^{6}} \right)\text{+}{{p}^{2}}\left( 1-p \right)\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right) \\ & \text{+}{{p}^{3}}\left( 1-p \right)\left( 1-6{{p}^{2}}+8{{p}^{3}}-3{{p}^{4}} \right) \\ \end{matrix}$

Suppose the threshold value S=70 and the other conditions do not change. The speech recognition task fails if the total score reaches S=70. Thus, the speech recognition task fails if at least K=4 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;70 \right)$ can be degenerated into $R\left( 8;p;4;10,30,60,90;70 \right)$. We can get the following formula when the threshold value S=70:

$\begin{matrix} & R\left( 8;p;4;10,30,60,90;70 \right) \\ & =\left( 1-p \right)R\left( 7;p;4;10,30,60;70 \right)+p\left( 1-p \right)R\left( 6;p;3;10,30;60 \right)+{{p}^{2}}\left( 1-p \right)R\left( 5;p;3;10,30;40 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 4;p;2;10;20 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{2}}\left( 1-9{{p}^{4}}+12{{p}^{5}}-4{{p}^{6}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}}\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{3}}+{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)+{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ \end{matrix}$

Suppose the threshold value S=60 and the other conditions do not change. The speech recognition task fails if the total score reaches S=60. Thus, the speech recognition task fails if at least K=3 consecutive words are recognized wrongly.

$R\left( 8;p;10,30,60,90,120,150,180,210;60 \right)$ can be degenerated into $R\left( 8;p;3;10,30;60 \right)$. We can get the following formula when the threshold value S=60:

$\begin{matrix} & R\left( 8;p;3;10,30;60 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{3}}+{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right) \\ & \text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right) \\ \end{matrix}$

Suppose the threshold value S=50 and the other conditions do not change. The speech recognition task fails if the total score reaches S=50. Thus, the speech recognition task fails if at least K=3 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;50 \right)$ can be degenerated into $R\left( 8;p;3;10,30;50 \right)$. We can get the following formula when the threshold value S=50:

$\begin{matrix} & R\left( 8;p;3;10,30;50 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right) \\ & \text{+}p{{\left( 1-p \right)}^{3}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{5}}\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-6{{p}^{2}}+8{{p}^{3}}-3{{p}^{4}} \right) \\ & \text{+}p{{\left( 1-p \right)}^{3}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)+2{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ \end{matrix}$

Suppose the threshold value S=40 and the other conditions do not change. The speech recognition task fails if the total score reaches S=40. Thus, the speech recognition task fails if at least K=3 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;40 \right)$ can be degenerated into $R\left( 8;p;3;10,30;40 \right)$. We can get the following formula when the threshold value S=40:

$\begin{matrix} & R\left( 8;p;3;10,30;40 \right) \\ & \text{=}{{\left( 1-p \right)}^{4}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{6}} \\ & \text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{6}} \\ & \text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+2}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ & +{{p}^{2}}{{\left( 1-p \right)}^{5}}\text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+3}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right) \\ & +2{{p}^{3}}{{\left( 1-p \right)}^{5}}+{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{6}} \\ \end{matrix}$

Suppose the threshold value S=30 and the other conditions do not change. The speech recognition task fails if the total score reaches S=30. Thus, the speech recognition task fails if at least K=2 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;30 \right)$ can be degenerated into $R\left( 8;p;2;10;30 \right)$. Therefore, we get the reliability formula as follows when the threshold value S=30:

$\begin{matrix} & R\left( 8;p;2;10;30 \right) \\ & \text{=}{{\left( 1-p \right)}^{5}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+4}p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)+3{{p}^{2}}{{\left( 1-p \right)}^{6}}\text{+2}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right) \\ & +p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)+3{{p}^{2}}{{\left( 1-p \right)}^{6}}+{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right) \\ \end{matrix}$

Suppose the threshold value S=20 and the other conditions do not change. The speech recognition task fails if the total score reaches S=20. Thus, the speech recognition task fails if at least K=2 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;20 \right)$ can be degenerated into $R\left( 8;p;2;10;20 \right)$. Thus, we get the reliability formula as follows when the threshold value S=20:

$\begin{matrix} & R\left( 8;p;2;10;20 \right) \\ & \text{=}{{\left( 1-p \right)}^{5}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)+5p{{\left( 1-p \right)}^{7}} \\ \end{matrix}$

Suppose the threshold value S=10 and the other conditions do not change. The speech recognition task fails if the total score reaches S=10. Thus, the speech recognition task fails if at least K=1 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;10 \right)$ can be degenerated into $R\left( 8;p;1;0;10 \right)$. Therefore, we get the reliability formula as follows when the threshold value S=10:

$R\left( 8;p;1;0;10 \right)\text{=}{{\left( 1-p \right)}^{8}}$

Figure 2 compares the reliability of a speech consisting of N=8 words under 10 different thresholds, where S={10,20,30,40,50,60,70,80,90,100}. We can see that as the threshold decreases, the reliability of the speech recognition task decreases. When S=10, the reliability declines fastest as p grows. When the threshold is very small, only a few wrongly recognized words are tolerated, so the reliability is smaller. Comparing these 10 curves, it can be seen that the reliability with threshold S=100 is the largest.
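The threshold comparison can be reproduced with the recursive sketch from Section 2 (an illustrative computation; the value p = 0.1 is an arbitrary sample point, not taken from the figure):

```python
s = [0, 10, 30, 60, 90, 120, 150, 180, 210]
for S in range(10, 101, 10):
    # Reliability of the N = 8 task at p = 0.1 for each threshold S
    print(S, round(reliability(8, 0.1, s, S), 6))
```

The printed values do not decrease as S grows, which matches the ordering of the curves in Figure 2.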

Figure 2.   Comparison of the reliability of speech recognition tasks under 10 different thresholds


The above is a comparative analysis of the same $s(i)$ score array under different thresholds, where $\{s(i)\}=\{s(0),s(1),\cdots ,s(8)\}=\{0,10,30,60,90,120,150,180,210\}$.

In order to further study the reliability of speech recognition tasks, we try to analyse the conditions under the same threshold and different $s(i)$ score arrays.

3.2 A Numerical Example Analysis of the Second Strategy

Below, we analyse the conditions under the same threshold and different $s(i)$ score arrays. The above score array $\{s(i)\}=\{s(0),s(1),\cdots ,s(8)\}=\{0,10,30,60,90,120,150,180,210\}$ is increasing, but it increases irregularly. Now, we analyse three other regular increments, in which $s(i)$ is given by an explicit function. The analysis for these three types is as follows:

Take the first numerical example $\{s(i)\}=\{s(0),s(1),\cdots ,s(8)\}=\{0,10,30,60,90,120,150,180,210\}$ as the first type. We then analyse the second type: suppose that $s(i)$ is a linear function, $s(i)=22.5i$. In that way, $\{s(1),\cdots ,s(5)\}=\{22.5,45,67.5,90,112.5\}$. The speech recognition task fails if the total score reaches S=100. Thus, the speech recognition task fails if at least K=5 consecutive words are recognized wrongly.

$R\left( 8;p;22.5,45,67.5,90;100 \right)$ can be degenerated into $R\left( 8;p;5;22.5,45,67.5,90;100 \right)$. The reliability of the speech recognition task can be expressed as follows:

$\begin{matrix} & R\left( 8;p;5;22.5,45,67.5,90;100 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right)\left( 1+{{p}^{2}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}}+2{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+2}p{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{4}} \right) \\ & +4{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+2{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right)+4{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+}p{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{4}} \right) \\ & +2{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right)+3{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+4{{p}^{4}}{{\left( 1-p \right)}^{4}}+{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right) \\ & +2{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+3{{p}^{4}}{{\left( 1-p \right)}^{4}}+{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+3{{p}^{4}}{{\left( 1-p \right)}^{4}} \\ \end{matrix}$

Next, we analyse the third type. Suppose that $s(i)$ is a quadratic function, $s(i)=5.625{{i}^{2}}$; in that way, $\{s(1),\cdots ,s(5)\}=\{5.625,22.5,50.625,90,140.625\}$. The speech recognition task fails if the total score reaches S=100. Thus, the speech recognition task fails if at least K=5 consecutive words are recognized wrongly.

$R\left( 8;p;5.625,22.5,50.625,90,140.625;100 \right)$ can be degenerated into $R\left( 8;p;5;5.625,22.5,50.625,90;100 \right)$. Therefore, we get the reliability formula as follows:

$\begin{matrix} & R\left( 8;p;5;5.625,22.5,50.625,90;100 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{5}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{5}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}\left( 1-p \right)\left( 1-{{p}^{2}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{+}{{p}^{5}}{{\left( 1-p \right)}^{3}} \\ \end{matrix}$

Finally, we analyse the fourth type. Assume that $s(i)$ is a square root function, $s(i)=45{{i}^{1/2}}$, so that $\{s(1),\cdots ,s(5)\}=\{45,63.6396,77.9423,90,100.6231\}$. The speech recognition task fails if the total score reaches S=100. Thus, the speech recognition task fails if at least K=5 consecutive words are recognized wrongly.

$R\left( 8;p;45,63.6396,77.9423,90,100.6231;100 \right)$ can be degenerated into $R\left( 8;p;5;45,63.6396,77.9423,90;100 \right)$. The reliability of the speech recognition task can be denoted as:

$\begin{matrix} & R\left( 8;p;5;45,63.6396,77.9423,90;100 \right) \\ & \text{=}{{\left( 1-p \right)}^{4}}\text{+3}p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)\text{+9}{{p}^{2}}{{\left( 1-p \right)}^{6}}\text{+3}{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ & \text{+3}{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+}p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)\text{+4}{{p}^{2}}{{\left( 1-p \right)}^{6}} \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{6}}\text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}}\text{+}{{p}^{4}}{{\left( 1-p \right)}^{4}} \\ \end{matrix}$

Figure 3 shows the curves of the four types of score functions. The first curve corresponds to the $s(i)$ values of the first numerical example, where $\{s\left( i \right)\}=\{s(0),s(1),s(2),s(3),s(4),s(5),s(6),s(7),s(8)\}=\{0,10,30,60,90,120,150,180,210\}$. These four types of curves represent different situations. Though all of these curves are increasing, some increase faster and faster, some increase slower and slower, and one increases at a constant rate.


Figure 3.   Four types of score functions


Figure 4 shows the reliability curves for the four types of score functions given the same threshold. It can be seen from the graph that the fourth type of score function yields the minimum reliability. That is because the values of the fourth type of function, $s(i)=45{{i}^{1/2}}$, are greater than those of the other types from $s(1)$ to $s(3)$. Actually, a string of consecutively wrongly recognized words is more likely to contain only a few words, so the values of $s(1)$ to $s(3)$ may have a dominant effect on the task reliability. The reliability under the third type of score function is the largest, because the values of the third type of function, $s(i)=5.625{{i}^{2}}$, are smaller than those of the other types from $s(1)$ to $s(3)$. It can be seen that the reliability of the speech recognition task depends on both the threshold value and the values of $s(i)$.
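The comparison in Figure 4 can be reproduced numerically. The sketch below (illustrative only; reliability is the recursion sketched in Section 2 and p = 0.1 is an arbitrary sample value) builds the four score arrays, derives K = min{i : s(i) ≥ S} for each, and evaluates the task reliability.

```python
import math

score_arrays = {
    "type 1 (irregular)":   [0, 10, 30, 60, 90, 120, 150, 180, 210],
    "type 2 (linear)":      [0] + [22.5 * i for i in range(1, 9)],
    "type 3 (quadratic)":   [0] + [5.625 * i ** 2 for i in range(1, 9)],
    "type 4 (square root)": [0] + [45 * math.sqrt(i) for i in range(1, 9)],
}

S = 100
for name, s in score_arrays.items():
    K = min(i for i in range(1, 9) if s[i] >= S)   # K = min{i : s(i) >= S}
    print(name, "K =", K, "R(p=0.1) =", round(reliability(8, 0.1, s, S), 6))
```

All four arrays give K = 5 under S = 100, so the differences in reliability come entirely from the increments s(1) to s(4), which is the effect discussed above.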


Figure 4.   Reliability curves for the four types of score functions


4. Conclusions

The reliability of speech recognition tasks is discussed in this paper. A reliability model for speech recognition task failure is obtained through the analysis. In particular, a success criterion is proposed based on two types of failure modes. If too many consecutive words are recognized wrongly in the speech recognition task, the task is deemed failed. Otherwise, one calculates a total score for the speech recognition task. In particular, for every group of consecutive words recognized wrongly, an increment is added to the score; the more wrongly recognized consecutive words there are, the bigger the increment. The task is regarded as failed if the score reaches a threshold. An iterative approach is proposed to evaluate the reliability of the speech recognition task, and numerical examples are presented to illustrate the application.

In the future, the error of word separation can be taken into account. The words can be classified into different types based on the difficulty of recognition and the influence on speech understanding. Also, the reliability of different speech recognition technologies can be investigated. Finally, historical training data of realistic speech recognition tasks can be analysed to assess the probability of wrongly recognizing different types of words and their influence on speech understanding. This information can be further utilized to predict the reliability of new speech recognition tasks.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (No.71671016) and the Fundamental Research Funds for the Central Universities (No. FRF-GF-17-B14). The suggestions from reviewers are very much appreciated.

References

1. A. Abdelaziz and D. Kolossa, "General Hybrid Framework for Uncertainty-Decoding-based Automatic Speech Recognition Systems," Speech Communication, Vol. 79, pp. 1-13, 2016

2. H. Dui, S. Si, and R. Yam, "Importance Measures for Optimal Structure in Linear Consecutive-K-out-of-N Systems," Reliability Engineering & System Safety, Vol. 169, pp. 339-350, 2018

3. Y. He, J. Han, T. Zhang, and G. Sun, "A New Framework for Robust Speech Recognition in Complex Channel Environments," Digital Signal Processing, Vol. 32, pp. 109-123, 2014

4. M. Khademian and M. Mehdi Homayounpour, "Monaural Multi-Talker Speech Recognition Using Factorial Speech Processing Models," Speech Communication, Vol. 98, pp. 1-16, 2018

5. J. Matthews and J. Cheng, "Recognition of High Frequency Words from Speech as a Predictor of L2 Listening Comprehension," System, Vol. 52, pp. 1-13, 2015

6. S. Mattys and L. Wiget, "Effects of Cognitive Load on Speech Recognition," Journal of Memory and Language, Vol. 65, No. 2, pp. 145-160, 2011

7. F. Mohammadi, E. Sáenz-de-Cabezón, and H. Wynn, "Efficient Multicut Enumeration of K-out-of-N:F and Consecutive K-out-of-N:F Systems," Pattern Recognition Letters, Vol. 102, pp. 82-88, 2018

8. N. U. Nair and T. V. Sreenivas, "Joint Evaluation of Multiple Speech Patterns for Speech Recognition and Training," Computer Speech & Language, Vol. 25, No. 2, pp. 307-340, 2010

9. A. Ogawa and A. Nakamura, "Joint Estimation of Confidence and Error Causes in Speech Recognition," Speech Communication, Vol. 54, No. 9, pp. 1024-1028, 2012

10. A. Ogawa and T. Hori, "Error Detection and Accuracy Estimation in Automatic Speech Recognition Using Deep Bidirectional Recurrent Neural Networks," Speech Communication, Vol. 89, pp. 70-83, 2017

11. J. Ooster, R. Huber, B. Kollmeier, and B. T. Meyer, "Evaluation of an Automated Speech-Controlled Listening Test with Spontaneous and Read Responses," Speech Communication, Vol. 98, pp. 85-94, 2018

12. R. Peng, M. Xie, S. H. Ng, and G. Levitin, "Element Maintenance and Allocation for Linear Consecutively Connected Systems," IIE Transactions, Vol. 44, No. 11, pp. 964-973, 2012

13. S. Srinivasan and D. Wang, "Robust Speech Recognition by Integrating Speech Separation and Hypothesis Testing," Speech Communication, Vol. 52, No. 1, pp. 72-81, 2010

14. H. Xiao, R. Peng, W. Wang, and F. Zhao, "Optimal Element Loading for Linear Sliding Window Systems," Journal of Risk and Reliability, Vol. 230, No. 1, pp. 75-84, 2015

15. Y. Xiang and G. Levitin, "Combined m-Consecutive and K-out-of-N Sliding Window Systems," European Journal of Operational Research, Vol. 219, No. 1, pp. 105-113, 2012
