Volume 15, Issue 3 (12-2018)                   JSDP 2018, 15(3): 13-30



Zoughi T, Homayounpour M M. Adaptive Windows Convolutional Neural Network for Speech Recognition. JSDP 2018; 15 (3) :13-30
URL: http://jsdp.rcisp.ac.ir/article-1-706-en.html
Amirkabir University of Technology
Abstract:
Although speech recognition systems are widely used and their accuracy is continuously improving, a considerable performance gap remains between their accuracy and human recognition ability. This is partly due to the high speaker variability in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, hybrid systems combining a deep neural network with a hidden Markov model (HMM) have led to considerable performance gains in speech recognition, because deep networks model complex correlations between features. The main aim of this paper is to achieve better acoustic modeling by changing the structure of the deep convolutional neural network (CNN) so that it adapts to speaking variations. In this way, existing models and the corresponding inference task are improved and extended.
Here, we propose the adaptive windows convolutional neural network (AWCNN) to analyze joint temporal-spectral feature variation. AWCNN changes the structure of the CNN and estimates the probabilities of HMM states. The adaptive windows make the model more robust against speech signal variations, both within a single speaker and across speakers, and thus allow it to model speech signals better. AWCNN is applied to the speech spectrogram and models time-frequency variations.
This network handles speaker feature variations, speech signal variations, and variations in phone duration. The results and analysis on the FARSDAT and TIMIT datasets show that, for the phone recognition task, the proposed structure achieves 1.2% and 1.1% absolute error reduction with respect to CNN models, respectively, which is a considerable improvement for this problem. Based on the results of the conducted experiments, we conclude that the use of speaker information is very beneficial for recognition accuracy.
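As an illustration only, and not the authors' exact AWCNN architecture, the following PyTorch sketch shows the general shape of a hybrid CNN-HMM acoustic model in which several parallel convolution branches with different time-frequency window sizes process a spectrogram patch and produce per-frame scores over HMM states. The class name MultiWindowCNN, the window sizes, the channel and layer widths, and the number of HMM states are all assumptions made for the example.

    # Hypothetical sketch only -- not the authors' exact AWCNN. It illustrates a
    # hybrid CNN-HMM acoustic model in which parallel convolution branches with
    # different time-frequency window sizes score a spectrogram patch against
    # HMM states. All sizes and names below are assumptions for illustration.
    import torch
    import torch.nn as nn

    class MultiWindowCNN(nn.Module):
        def __init__(self, n_hmm_states=183,
                     window_sizes=((3, 3), (5, 5), (9, 9)), channels=32):
            super().__init__()
            # One branch per window size; "same" padding keeps feature maps the
            # same size across branches so they can be concatenated channel-wise.
            self.branches = nn.ModuleList([
                nn.Sequential(
                    nn.Conv2d(1, channels, kernel_size=k,
                              padding=(k[0] // 2, k[1] // 2)),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=(1, 2)),  # pool along frequency only
                )
                for k in window_sizes
            ])
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d((4, 4)),
                nn.Flatten(),
                nn.Linear(len(window_sizes) * channels * 4 * 4, 512),
                nn.ReLU(),
                nn.Linear(512, n_hmm_states),  # per-frame HMM-state scores
            )

        def forward(self, x):
            # x: (batch, 1, time_frames, mel_bins), e.g. an 11-frame context window
            feats = torch.cat([branch(x) for branch in self.branches], dim=1)
            return self.classifier(feats)

    # Usage: posteriors over HMM states for an 11-frame, 40-bin spectrogram patch;
    # in a hybrid system these would be divided by state priors and Viterbi-decoded.
    model = MultiWindowCNN()
    patch = torch.randn(8, 1, 11, 40)
    posteriors = model(patch).softmax(dim=-1)  # shape: (8, 183)

In such a hybrid setup, the network replaces the Gaussian mixture emission model: its per-frame state posteriors are converted to scaled likelihoods and passed to the HMM decoder, while the parallel window sizes stand in for the paper's idea of adapting the convolution window to temporal-spectral variation.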
Full-Text [PDF 6851 kb]
Type of Study: Research | Subject: Paper
Received: 2018/04/4 | Accepted: 2018/12/17 | Published: 2018/12/19 | ePublished: 2018/12/19


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
