Establishing a New and Efficient Framework for Persian Speech Recognition

BabaAli, Bagher

doi:10.18869/acadpub.jsdp.13.3.51

Volume 13, Issue 3 (12-2016) JSDP 2016, 13(3): 51-62

BabaAli B. Establishing a New and Efficient Framework for Persian Speech Recognition. JSDP 2016; 13(3): 51-62
URL: http://jsdp.rcisp.ac.ir/article-1-348-en.html


Bagher BabaAli*

University of Tehran

Abstract:

Although research on Persian speech recognition in Iran has a thirty-year history and has made considerable progress, the lack of a well-defined experimental framework means that the results of many of these studies cannot be compared with one another or assessed accurately. Such a framework consists of an ASR toolkit and a speech database partitioned into training, development, and test sets. In recent years, Kaldi, a state-of-the-art open-source ASR toolkit, has been very well received by the international speech research community. Considering all aspects, Kaldi is the best option among the available ASR toolkits for establishing a research framework in any language, including Persian.
In this paper, we chose Farsdat as the speech database. It is the Persian counterpart of TIMIT, it has a standard form, and it is accessible to researchers around the world. Following the recipe for the TIMIT database, we defined three sets on Farsdat: training, development, and test. After surveying Kaldi's components and features, we applied most of the state-of-the-art ASR techniques available in Kaldi to Farsdat using this three-set definition. The best phone error rates on the development and test sets were 20.3% and 19.8%, respectively. All of the code and the recipe written by the author have been submitted to the Kaldi repository and are freely accessible, so all reported results can be easily replicated by anyone with access to the Farsdat database.
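
The abstract does not spell out how the phone error rate is computed. As a minimal sketch, assuming the usual definition (total edit distance between reference and hypothesized phone sequences, divided by the total number of reference phones), it could look like the following. The function names and the toy phone sequences are hypothetical illustrations. They are not taken from the paper or from the Kaldi recipe, which has its own scoring tools.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions + deletions + insertions)."""
    # prev[j] is the distance between the ref prefix processed so far
    # and the first j phones of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs: List[Sequence[str]],
                     hyps: List[Sequence[str]]) -> float:
    """Phone error rate in percent over a set of utterances (assumed definition)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference set contains no phones")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / total


# Toy example with made-up phone labels: one deletion and one substitution
# over eight reference phones.
refs = [["s", "a", "l", "a", "m"], ["x", "u", "b"]]
hyps = [["s", "a", "l", "m"], ["x", "o", "b"]]
per = phone_error_rate(refs, hyps)
```

Pooling errors over the whole set, rather than averaging per-utterance rates, keeps long utterances from being under-weighted. Whether the reported 20.3% and 19.8% figures were pooled this way is an assumption here.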

Keywords: Persian Continuous Speech Recognition, FarsDat Database, Kaldi Toolkit


Type of Study: Research | Subject: Paper
Received: 2015/03/26 | Accepted: 2016/03/05 | Published: 2017/04/23 | ePublished: 2017/04/23



Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
