How to Build an Automatic Speech Recognition System
Implementing a state-of-the-art ASR (automatic speech recognition) system is challenging in two respects: (1) how to collect and label a large amount of data, and (2) how to train effectively on that data.
Although general-purpose deep learning toolkits (e.g., TensorFlow, Torch, Theano, MXNet, Caffe) are GPU-accelerated and offer friendly interfaces for stochastic optimization (refer to my previous notes), there are circumstances in which these toolkits are not efficient enough. Can you imagine what they are?
Sequence Alignment for Acoustic Features
As discussed in Xiaodong’s lecture, Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) are powerful tools for modeling acoustic features. However, it is not trivial to extract acoustic features (e.g., 40-dimensional MFCC+LDA+MLLT+fMLLR features) or to implement a robust and reliable HMM. Pioneering companies, including IBM, Microsoft, and Google, have built in-house GMM/HMM tools as part of their ASR systems, but those tools are not open source. Instead, consider the Kaldi package, which was created by our ex-IBM colleague Daniel Povey. Kaldi has been a popular ASR toolkit in academia and among start-ups over the past decade.
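To make the alignment idea concrete, here is a minimal sketch of Viterbi forced alignment for a left-to-right HMM. It is not how Kaldi implements alignment. The function name `viterbi_align`, the fixed stay/advance transition probabilities, and the toy emission scores are all assumptions made for illustration; in a real system the emission scores would come from GMMs or a neural network.

```python
import math

NEG_INF = float("-inf")

def viterbi_align(log_emit, log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Align T frames to S left-to-right states.

    log_emit[t][s] is the log-likelihood of frame t under state s.
    Each step either stays in the same state or advances by one.
    Returns a list with one state index per frame.
    """
    T, S = len(log_emit), len(log_emit[0])
    if T < S:
        raise ValueError("need at least as many frames as states")
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_emit[0][0]  # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_next if s > 0 else NEG_INF
            if move > stay:
                score[t][s], back[t][s] = move + log_emit[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + log_emit[t][s], s
    # The path must end in the last state; follow back-pointers from there.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy example: 5 frames, 2 states; early frames favor state 0.
emit = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]
log_emit = [[math.log(p) for p in row] for row in emit]
alignment = viterbi_align(log_emit)
```

The resulting frame-to-state alignment is the kind of label sequence that can then be used as a per-frame training target for a separate deep network.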
You may build ASR systems using Kaldi alone. Alternatively, if you find Kaldi’s DNN interface hard to use, you can combine Kaldi for the HMM task with Theano/Torch/Keras/MXNet for the deep network task, as in the following examples:
End-to-end Learning: Connectionist Temporal Classification
Another direction in ASR is to give up the traditional HMM with acoustic features and pursue end-to-end learning instead. However, bear in mind that end-to-end learning only works when you have enough data (e.g., thousands of hours of annotated speech). The following figure is borrowed from Adam Coates’ talk; it suggests that end-to-end learning may yield inferior performance when the training set is small.
In our class, Markus gave a nice lecture on end-to-end ASR. One important module of end-to-end learning, Connectionist Temporal Classification (CTC), is very difficult to implement, especially on GPUs. The following are some open-source implementations you can refer to:
- Baidu’s warp CTC, which is efficient on both GPU and CPU
- Kaldi’s CTC
- Eesen: CTC + WFST decoding
- TensorFlow’s CTC implementation: only works with sparse tensors
- Keras has a nice example of OCR using CTC
- Paddle has a very easy-to-use interface for CTC
Since Paddle is less popular, I’ll include an example to show how easy it is to implement CTC:
ctc = ctc_layer(input=output, label=label, size=class_dim+1)
outputs(ctc)  # prediction
eval = ctc_error_evaluator(input=output, label=label)  # evaluating CTC
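To see what a CTC layer computes under the hood, the following is a minimal pure-Python sketch of the CTC forward recursion and of greedy decoding. It is not taken from any of the libraries listed above. The function names, the choice of index 0 for the blank symbol, and the toy probabilities are assumptions for illustration. It works in probability space for readability; practical implementations work in log space to avoid underflow on long utterances.

```python
import math

def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated labels, then drop blanks."""
    out, prev = [], None
    for k in frame_labels:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out

def ctc_neg_log_likelihood(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    probs[t][k] is the softmax output for label k at frame t.
    target is the label sequence without blanks.
    """
    # Extended target: a blank before, between, and after the labels.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    T, S = len(probs), len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new[s] = total * probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else float("inf")

# Toy example: 2 frames, labels {0: blank, 1: "a"}, uniform outputs.
uniform = [[0.5, 0.5], [0.5, 0.5]]
nll = ctc_neg_log_likelihood(uniform, [1])
decoded = ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])
```

In the toy example, three frame-level paths collapse to the single label "a" (a-a, a-blank, blank-a), so the likelihood is 3 × 0.25 = 0.75. The repeated target [1, 1] cannot be emitted in two frames because a blank is needed between the two identical labels.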
Seq2Seq and Attention Models
In addition to CTC, another popular approach is the attention model. Google has devoted a lot of effort to it, so it is safe to focus on TensorFlow’s code and the corresponding Keras wrapper:
We will discuss seq2seq models further in later classes.
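As a rough preview of the attention idea, here is a minimal pure-Python sketch of a single attention step. It assumes dot-product scoring, which is only one common choice and not necessarily what the TensorFlow code uses. The names `attend`, `query`, and `encoder_states` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, encoder_states):
    """One attention step.

    query: decoder state vector (list of floats).
    encoder_states: list of encoder vectors, one per input frame.
    Returns (weights, context), where context is the weighted
    average of the encoder states.
    """
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy example: the query points along the second encoder state.
weights, context = attend([0.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
```

Unlike CTC, which assumes a monotonic frame-to-label alignment, the decoder here recomputes a soft alignment over all input frames at every output step.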
This is one of a series of notes taken while teaching Columbia course E6489.