This graph shows that the decoding time increases linearly with the beam, while the WER decreases until it reaches a plateau. Choosing a beam is therefore a trade-off between accuracy and speed; here, a value of 9 seems reasonable.
The size of the language model also has a measurable impact on the WER. In the graph below, we study the effect of the beam on the WER for language models of different sizes. A language model (LM) defines a probability distribution over sequences of words. In automatic speech recognition, the most commonly used model is the N-gram model: it predicts the probability of a word given the N-1 previous words.
For example, a 3-gram model gives the probability of a word given the two preceding words. If it is well trained, it may conclude, for instance, that the word “soon” is more likely than the word “sentence” to appear after “see you”.
The language model is thus built by estimating the probabilities of occurrence of groups of words, with N typically ranging from 1 to 4.
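As a minimal sketch (not taken from this study, and without the smoothing that real toolkits apply), trigram probabilities can be estimated by simple counting over a toy corpus:

```python
from collections import Counter

def train_trigram_lm(sentences):
    """Estimate P(word | two previous words) by simple counting (no smoothing)."""
    trigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - 2):
            context = (tokens[i], tokens[i + 1])
            context_counts[context] += 1
            trigram_counts[context + (tokens[i + 2],)] += 1
    return {tri: n / context_counts[tri[:2]] for tri, n in trigram_counts.items()}

# Toy corpus: "soon" follows the context ("see", "you") more often than other words
corpus = ["see you soon", "see you soon maybe", "see you tomorrow", "this is a sentence"]
lm = train_trigram_lm(corpus)
print(lm.get(("see", "you", "soon"), 0.0))      # 2/3 on this toy corpus
print(lm.get(("see", "you", "sentence"), 0.0))  # 0.0, never observed
```

A real LM would of course be trained on far more text and would smooth these estimates so that unseen N-grams do not get a probability of zero.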
However, these language models can be very expensive in terms of storage space. They increase the size of the decoding graph, and therefore the time required to produce a transcription. They can be reduced, for instance, by limiting the size of the vocabulary or by setting a threshold to eliminate N-grams with too low a probability (a technique called pruning), but such reductions also degrade the quality of the transcription, as illustrated in the graph below:
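As a rough illustration of the idea before looking at the graph, the sketch below simply drops N-grams whose estimated probability falls under a fixed threshold; the probabilities are hypothetical, and real toolkits use more principled criteria such as count cutoffs or entropy-based pruning:

```python
def prune_lm(ngram_probs, threshold=0.05):
    """Keep only n-grams whose estimated probability is at least the threshold."""
    return {ngram: p for ngram, p in ngram_probs.items() if p >= threshold}

# Hypothetical trigram probabilities (e.g. produced by the counting sketch above)
lm = {
    ("see", "you", "soon"): 0.67,
    ("see", "you", "tomorrow"): 0.33,
    ("is", "a", "sentence"): 0.02,
}
pruned = prune_lm(lm, threshold=0.05)
print(f"kept {len(pruned)} of {len(lm)} trigrams")  # kept 2 of 3 trigrams
```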