Volume 13, Issue 3 (9-1395)                   Vol. 13, No. 3, Pages 51-62





BabaAli, Bagher. Establishing a new and efficient platform for Persian speech recognition. Signal and Data Processing. 1395 (2016); 13(3): 51-62.

URL: http://jsdp.rcisp.ac.ir/article-1-348-fa.html


University of Tehran
Abstract:   (5913 views)

Despite a thirty-year history of research on Persian speech recognition in Iran and considerable progress, the results of most of this work cannot be compared or evaluated precisely because no common platform exists. Such a platform consists chiefly of a recognition system and a corpus with explicitly defined training, development, and evaluation sets. Although the open-source Kaldi toolkit is relatively new, it has unique features that in recent years have attracted most of the world's leading speech-processing laboratories, and, all things considered, it is the best available choice for building such a platform for any language, Persian included. In this paper, after reviewing the characteristics, capabilities, and components of the Kaldi toolkit, we select the FARSDAT corpus as the other part of the platform, because it is officially registered and accessible to researchers worldwide, and, following the conventions adopted for the TIMIT corpus, we define its training, development, and evaluation sets. Finally, nearly all of the techniques and methods available in Kaldi were tested on FARSDAT according to this definition. The best phone recognition error rate obtained was 20.3% on the development set and 19.8% on the evaluation set. The scripts written to build this platform are included in the Kaldi toolkit; since Kaldi is open source, they can readily be used to reproduce the results reported in this paper, provided the FARSDAT corpus is at hand.
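
The phone error rate reported above is the standard Levenshtein (edit-distance) measure between the reference and hypothesized phone sequences, normalized by the number of reference phones, as produced by Kaldi's scoring scripts. Below is a minimal Python sketch of that computation; the function names and phone labels are illustrative assumptions, not taken from the paper's actual recipe:

    # Minimal sketch of phone error rate (PER): Levenshtein edit distance
    # (substitutions + deletions + insertions) between reference and
    # hypothesized phone sequences, divided by the total number of
    # reference phones. Names and phone labels are illustrative only.

    def edit_distance(ref, hyp):
        """Levenshtein distance between two phone sequences."""
        prev = list(range(len(hyp) + 1))   # row for the empty ref prefix
        for i, r in enumerate(ref, 1):
            curr = [i]                     # deleting all i phones of ref[:i]
            for j, h in enumerate(hyp, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution/match
            prev = curr
        return prev[-1]

    def phone_error_rate(pairs):
        """PER (%) over a list of (reference, hypothesis) phone-sequence pairs."""
        errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
        total = sum(len(ref) for ref, _ in pairs)
        return 100.0 * errors / total

    # Illustrative usage: one substitution and one insertion against five
    # reference phones gives a PER of 40%.
    ref = "s a l aa m".split()
    hyp = "s o l aa m e".split()
    print(phone_error_rate([(ref, hyp)]))   # -> 40.0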

Full Text [PDF 2790 kb]   (1357 downloads)
Article type: Research | Subject: Speech Processing
Received: 1394/1/6 | Accepted: 1394/12/15 | Published: 1396/2/3 | Published online: 1396/2/3

References
[1] B. BabaAli and H. Sameti, The Sharif speaker-independent large vocabulary speech recognition system, in Proc. of the 2nd Workshop on Information Technology & Its Disciplines (WITID '04), pp. 24–26, Iran, 2004.
[2] H. Sameti, H. Veisi, M. Bahrani, B. BabaAli, and K. Hosseinzadeh, A large vocabulary continuous speech recognition system for Persian language, in EURASIP Journal on Audio, Speech, and Music Processing, 2011:6, 2011.
[3] F. Almasganj, S.A. Seyyed Salehi, M. Bijankhan, H. Razizade, and M. Asghari, Shenava 2: a Persian continuous speech recognition software, in The First Workshop on Persian Language and Computer, pp. 77–82, Tehran, 2004.
[4] M. Sheikhan, M. Tebyani, and M. Lotfizad, Continuous speech recognition and syntactic processing in Iranian Farsi language, in International Journal of Speech Technology, vol. 1, no. 2, p. 135, 1997. [DOI:10.1007/BF02277194]
[5] S.M. Ahadi, Recognition of continuous Persian speech using a medium-sized vocabulary speech corpus, in European Conference on Speech Communication and Technology (Eurospeech'99), pp. 863–866, Switzerland, 1999.
[6] N. Srinivasamurthy and S.S. Narayanan, Language-adaptive Persian speech recognition, in European Conference on Speech Communication and Technology (Eurospeech'03), Switzerland, 2003.
[7] H. Sameti, H. Veisi, M. Bahrani, B. BabaAli, and K. Hosseinzadeh, Nevisa, a Persian continuous speech recognition system, in Communications in Computer and Information Science, Springer Berlin Heidelberg, pp. 485–492, 2008. [DOI:10.1007/978-3-540-89985-3_60]
[8] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, The Kaldi speech recognition toolkit, in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE Signal Processing Society, 2011.
[9] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, in IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012. [DOI:10.1109/MSP.2012.2205597]
[10] F. Seide, G. Li, and D. Yu, Conversational speech transcription using context-dependent deep neural networks, in Proc. of Interspeech, pp. 437–440, 2011.
[11] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, Sequence-discriminative training of deep neural networks, in Proc. of Interspeech, pp. 2345–2349, 2013.
[12] D. Povey, X. Zhang, and S. Khudanpur, Parallel training of deep neural networks with natural gradient and parameter averaging, in Proc. of the 3rd International Conference on Learning Representations (ICLR 2015), USA, 2015.
[13] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, Recurrent neural network based language model, in Proc. of Interspeech, pp. 1045–1048, Japan, 2010.
[14] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4), Cambridge University Engineering Department, 2009.
[15] B. Pellom, SONIC: the University of Colorado continuous speech recognizer, Technical Report TR-CSLR-2001-01, Center for Spoken Language Research, University of Colorado, USA, 2001.
[16] B. Pellom and K. Hacioglu, Recent improvements in the CU SONIC ASR system for noisy speech: the SPINE task, in Proc. of ICASSP, Hong Kong, 2003. [DOI:10.1109/ICASSP.2003.1198702]
[17] K.F. Lee, H.W. Hon, and R. Reddy, An overview of the SPHINX speech recognition system, in IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 1, pp. 35–45, 1990. [DOI:10.1109/29.45616]
[18] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, Sphinx-4: a flexible open source framework for speech recognition, Sun Microsystems Inc., Technical Report SMLI TR-2004-0811, 2004.
[19] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, The RWTH Aachen University open source speech recognition system, in Proc. of Interspeech, pp. 2111–2114, 2009.
[20] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer, Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, RASR - the RWTH Aachen University open source speech recognition toolkit, in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), USA, 2011.
[21] A. Lee, T. Kawahara, and K. Shikano, Julius - an open source real-time large vocabulary recognition engine, in Proc. of Interspeech, pp. 1691–1694, 2001.
[22] K. Demuynck, J. Roelens, D. Van Compernolle, and P. Wambacq, SPRAAK: an open source SPeech Recognition and Automatic Annotation Kit, in Proc. of Interspeech, pp. 495–498, Australia, 2008.
[23] D. Bolaños, The Bavieca open-source speech recognition toolkit, in Proc. of IEEE Workshop on Spoken Language Technology (SLT), Miami, FL, USA, 2012.
[24] M. Bijankhan and M.J. Sheikhzadegan, FARSDAT: the speech database of Farsi spoken language, in Proceedings of the 5th Australian International Conference on Speech Science and Technology (SST '94), pp. 826–829, Perth, Australia, December 1994.
[25] V. Zue, S. Seneff, and J. Glass, Speech database development at MIT: TIMIT and beyond, in Speech Communication, vol. 9, no. 4, pp. 351–356, 1990. [DOI:10.1016/0167-6393(90)90010-7]
[26] K.F. Lee and H.W. Hon, Speaker-independent phone recognition using hidden Markov models, in IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989. [DOI:10.1109/29.46546]
[27] M. Mohri, F. Pereira, and M. Riley, Weighted finite-state transducers in speech recognition, in Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002. [DOI:10.1006/csla.2001.0184]
[28] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, OpenFst: a general and efficient weighted finite-state transducer library, in Proc. of CIAA, 2007. [DOI:10.1007/978-3-540-76336-9_3]
[29] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow, R.C. Rose, P. Schwarz, and S. Thomas, The subspace Gaussian mixture model: a structured model for speech recognition, in Computer Speech & Language, vol. 25, no. 2, pp. 404–439, 2011. [DOI:10.1016/j.csl.2010.06.003]
[30] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, in Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990. [DOI:10.1121/1.399423] [PMID]
[31] F. Grezl, M. Karafiát, S. Kontar, and J. Cernocký, Probabilistic and bottle-neck features for LVCSR of meetings, in Proc. of ICASSP, pp. 757–760, 2007. [DOI:10.1109/ICASSP.2007.367023]
[32] M.J.F. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, in Computer Speech and Language, vol. 12, no. 2, pp. 75–98, 1998. [DOI:10.1006/csla.1998.0043]
[33] S. Furui, Cepstral analysis technique for automatic speaker verification, in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 2, pp. 254–272, 1981. [DOI:10.1109/TASSP.1981.1163530]
[34] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, in IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011. [DOI:10.1109/TASL.2010.2064307]
[35] L. Lee and R. Rose, Speaker normalization using efficient frequency warping procedures, in Proc. of ICASSP, pp. 353–356, Atlanta, USA, 1996. [DOI:10.1109/ICASSP.1996.541105]
[36] D. Kim, S. Umesh, M. Gales, T. Hain, and P. Woodland, Using VTLN for broadcast news transcription, in Proc. of the 8th ICSLP, Jeju Island, Korea, 2004.
[37] L. Burget, Combination of speech features using smoothed heteroscedastic linear discriminant analysis, in Proc. of the 8th ICSLP, pp. 2549–2552, Jeju Island, Korea, 2004.
[38] K. Visweswariah, S. Axelrod, and R.A. Gopinath, Acoustic modeling with mixtures of subspace constrained exponential models, in Proc. of Eurospeech'03, pp. 2613–2616, Geneva, Switzerland, 2003.
[39] X. Lei, Modeling lexical tones for Mandarin large vocabulary continuous speech recognition, Ph.D. thesis, University of Washington, 2006.
[40] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, A pitch extraction algorithm tuned for automatic speech recognition, in Proc. of ICASSP, pp. 2494–2498, Italy, 2014.
[41] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco, Connectionist probability estimators in HMM speech recognition, in IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, pp. 161–174, 1994. [DOI:10.1109/89.260359]
[42] A.J. Robinson, An application of recurrent nets to phone probability estimation, in IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[43] G.E. Dahl, D. Yu, L. Deng, and A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012. [DOI:10.1109/TASL.2011.2134090]
[44] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, Application of pretrained deep neural networks to large vocabulary speech recognition, in Proc. of Interspeech, September 2012.
[45] T.N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A. Mohamed, Making deep belief networks effective for large vocabulary continuous speech recognition, in Proc. of IEEE ASRU, pp. 30–35, December 2011. [DOI:10.1109/ASRU.2011.6163900]
[46] G.E. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, in Neural Computation, vol. 18, pp. 1527–1554, 2006. [DOI:10.1162/neco.2006.18.7.1527] [PMID]
[47] M. Gales, The generation and use of regression class trees for MLLR adaptation, University of Cambridge, Department of Engineering, 1996.
[48] D. Povey, G. Zweig, and A. Acero, The exponential transform as a generic substitute for VTLN, in Proc. of IEEE ASRU, December 2011.
[49] S.J. Young and P.C. Woodland, The use of state tying in continuous speech recognition, in European Conference on Speech Communication and Technology (Eurospeech'93), pp. 2203–2206, Germany, September 1993.
[50] M. Federico, N. Bertoldi, and M. Cettolo, IRSTLM: an open source toolkit for handling large scale language models, in Proc. of Interspeech, Brisbane, Australia, 2008.
[51] A. Stolcke, SRILM - an extensible language modeling toolkit, in Proc. of Interspeech, 2002.
[52] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, Boosted MMI for model and feature-space discriminative training, in Proc. of ICASSP, pp. 4057–4060, 2008. [DOI:10.1109/ICASSP.2008.4518545]



Reuse
This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.
