This article is cited in 12 scientific papers
Artificial Intelligence, Knowledge and Data Engineering
An analytic survey of end-to-end speech recognition systems
N. M. Markovnikov (a), I. S. Kipyatkova (b, a)
a St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS)
b Saint Petersburg State University of Aerospace Instrumentation (SUAI)
Abstract:
This article presents an analytic survey of end-to-end speech recognition
systems and of approaches to their construction, training, and optimization. We
consider models that use connectionist temporal classification (CTC) as a loss function for
neural networks, as well as models based on an encoder-decoder architecture with an attention
mechanism. We also describe neural network models built using conditional random fields (CRFs),
which are a generalization of hidden Markov models and make it possible to overcome some
drawbacks of standard hybrid speech recognition systems, such as the assumption that the
elements of speech frame sequences are independent. We further describe the possibilities for
integrating language models at the decoding stage of end-to-end systems. In addition, we
describe various modifications and improvements of standard end-to-end models, such as
generalizations of connectionist temporal classification and regularization applied to
attention-based encoder-decoder models. We see that such approaches significantly reduce
recognition error rates for end-to-end models. A survey of research works in this subject
area reveals that end-to-end systems achieve results close to those of state-of-the-art
hybrid models. At the same time, end-to-end models have a simple configuration and
demonstrate high training and decoding speed. Finally, we consider popular frameworks and
toolkits for building speech recognition systems, such as TensorFlow, Eesen, Kaldi, and
others. They are compared in terms of the simplicity and accessibility of implementing an
end-to-end speech recognition system.
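As a brief illustration of the CTC models mentioned in the abstract, the sketch below shows the standard CTC collapsing rule used in greedy decoding: consecutive repeated labels are merged, and then blank symbols are dropped. It is a minimal sketch, not code from the surveyed paper; the function name `ctc_greedy_collapse`, the blank index `0`, and the sample path are assumptions chosen for illustration only.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Map a per-frame label path to an output label sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove the blank symbol.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Hypothetical per-frame argmax labels for an 8-frame utterance.
path = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_collapse(path))  # [3, 3, 5]
```

The blank between the two runs of label 3 is what allows a repeated output symbol to survive the merge step.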
Keywords:
speech recognition, end-to-end models, neural networks, deep learning.
Received: 28.11.2017
Citation:
N. M. Markovnikov, I. S. Kipyatkova, “An analytic survey of end-to-end speech recognition systems”, Tr. SPIIRAN, 58 (2018), 77–110
Linking options:
https://www.mathnet.ru/eng/trspy1007 https://www.mathnet.ru/eng/trspy/v58/p77