Volume 15, Issue 3 (2018), Pages 13-30





Zoughi T, Homayounpour M M. Adaptive Windows Convolutional Neural Network for Speech Recognition. JSDP 2018; 15(3): 13-30.
URL: http://jsdp.rcisp.ac.ir/article-1-706-fa.html


Amirkabir University of Technology
Abstract:   (4537 Views)
While speech recognition systems are continuously improving and are now in widespread use, their accuracy still falls far short of human recognition performance, and the gap widens under adverse conditions. A major cause of this problem is the high variability of the speech signal. In recent years, deep neural networks combined with hidden Markov models have achieved notable success in speech processing. This paper aims to model speech more faithfully by modifying the structure of the deep convolutional neural network so that it better accommodates the variability of speakers' utterances in the speech signal. To this end, we improve and extend existing models and the inference performed on them. By introducing a deep convolutional network with adaptive windows, we make the speech recognition system robust both to pronunciation differences across speakers and to variation among utterances of a single speaker. Analyses and experiments on the FARSDAT and TIMIT speech corpora show that the proposed method reduces the absolute phone recognition error by 2.1 and 1.1 percentage points, respectively, compared with a deep convolutional network, which is a substantial improvement for speech recognition.
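The abstract describes, at a high level, a convolutional network whose analysis windows adapt to speaker variability; the exact mechanism is defined only in the full text, and no code accompanies this page. As a minimal, hypothetical sketch of one way an adaptive-window convolution could be realized, assuming a multi-branch design in which parallel convolutions with different window sizes are mixed through learned softmax weights (the class name, window sizes, and mixing scheme below are illustrative assumptions, not the authors' method):

# Illustrative sketch only, not the method from the paper: parallel
# convolutions with different window sizes over the time-frequency input,
# combined by learned, softmax-normalized mixing weights so the effective
# window can adapt during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveWindowConv(nn.Module):  # hypothetical name
    def __init__(self, in_ch, out_ch, window_sizes=(3, 5, 9)):
        super().__init__()
        # "Same" padding (odd window sizes) keeps all branch outputs the
        # same shape so they can be combined by a weighted sum.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=w, padding=w // 2)
             for w in window_sizes]
        )
        # One learnable logit per candidate window size.
        self.mix = nn.Parameter(torch.zeros(len(window_sizes)))

    def forward(self, x):  # x: (batch, channels, freq, time)
        alpha = F.softmax(self.mix, dim=0)  # normalized window weights
        out = sum(a * b(x) for a, b in zip(alpha, self.branches))
        return F.relu(out)

# Usage: a batch of 8 utterance chunks, 40 filterbank bands, 100 frames.
feats = torch.randn(8, 1, 40, 100)
layer = AdaptiveWindowConv(in_ch=1, out_ch=32)
print(layer(feats).shape)  # -> torch.Size([8, 32, 40, 100])

In this reading, training shifts the softmax weight toward the window sizes that best match the observed variability, which is one plausible interpretation of "adaptive windows"; the paper's actual formulation should be taken from the full text.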
Full Text [PDF 6851 kb]   (1466 Downloads)
Article Type: Research | Subject: Speech Processing Articles
Received: 1397/1/15 | Accepted: 1397/9/26 | Published: 1397/9/28 | Published Online: 1397/9/28




Rights and permissions
This article may be republished under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.
