Persian Phone Recognition Using Acoustic Landmarks and Neural Network-Based Variability Compensation Methods

Reza, Shaghayegh; Seyyedsalehi, Ali; Seyyedsalehi, Zohreh

doi:10.61186/jsdp.19.4.173

Volume 19, Issue 4 (12-1401), pages 173-196



Reza S, Seyyedsalehi A, Seyyedsalehi Z. Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods. JSDP 2023; 19 (4) : 12
URL: http://jsdp.rcisp.ac.ir/article-1-1172-fa.html



Shaghayegh Reza, Ali Seyyedsalehi^*, Zohreh Seyyedsalehi

Abstract:

Evidence from speech experiments shows that information in the speech signal is distributed non-uniformly, and that humans recognize speech robustly by focusing on its information-rich regions. Accordingly, this study presents a Persian phone recognition system that focuses on robust recognition of information-rich, discrete acoustic regions. These regions are called acoustic landmarks. To this end, a suitable set of acoustic landmarks is first selected for the Persian speech signal, and a deep neural network is trained on them. Then, to remove the variability of the acoustic landmarks, the model structure and its training procedure are modified in four different schemes. The first scheme uses a separate neural network, and the second a multi-task learning structure, for nonlinear compensation of acoustic-landmark variability. The third scheme uses a recurrent connection in the hidden layer of the network to reconstruct the input, and the fourth uses a structure based on deep attractor networks to reduce unwanted variability. The experiments are carried out on the Persian speech dataset FARSDAT, and the results are reported as phone recognition error. The best trained model is a feedforward neural network with five hidden layers. Its phone recognition error on the test data is 21.74 percent. The four variability-filtering schemes reduce the phone recognition error by 0.39, 0.58, 0.43, and 1.3 percent absolute, respectively.
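The abstract reports its results as phone recognition error but does not state the formula. A common convention, assumed here rather than taken from the paper, is the Levenshtein (edit) distance between the reference and recognized phone sequences, divided by the reference length. The sketch below illustrates that convention only; the function names and the toy phone sequences are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    # prev[j] holds the distance between the first i-1 reference phones
    # and the first j hypothesis phones.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference phones
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Phone error rate in percent, relative to the reference length (assumed convention)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Toy example: one phone of a five-phone reference is missing in the hypothesis.
reference = ["s", "a", "l", "a", "m"]
hypothesis = ["s", "a", "l", "m"]
print(phone_error_rate(reference, hypothesis))
```

Under this convention, an absolute reduction such as the 1.3 percent reported for the fourth scheme is a difference between two such percentages, not a relative improvement.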

Article number: 12

Keywords: phone recognition, acoustic landmarks, deep learning, robust recognition, nonlinear filtering


Type of study: Research | Subject: Speech processing articles
Received: 1399/6/17 | Accepted: 1400/6/3 | Published: 1401/12/29 | Published online: 1401/12/29



Redistribution
	This article may be redistributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.
