Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods

Reza, Shaghayegh; Seyyedsalehi, Ali; Seyyedsalehi, Zohreh

doi:10.61186/jsdp.19.4.173

Volume 19, Issue 4 (3-2023) JSDP 2023, 19(4): 173-196


Reza S, Seyyedsalehi A, Seyyedsalehi Z. Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods. JSDP 2023; 19 (4) : 12
URL: http://jsdp.rcisp.ac.ir/article-1-1172-en.html

Shaghayegh Reza, Ali Seyyedsalehi*, Zohreh Seyyedsalehi

Abstract:

Speech recognition is a subfield of artificial intelligence that develops technologies for converting spoken utterances into transcriptions. Various methods, such as hidden Markov models and artificial neural networks, have been used to build speech recognition systems. Most of these systems process the speech signal frames uniformly, even though information is not evenly distributed across them. Auditory experiments have also shown that the human brain pays more attention to information-rich regions. By focusing on these regions instead of processing the signal uniformly, the brain recognizes speech more robustly under intrinsic and environmental variations such as speaker differences and noise. In contrast, the performance of most speech recognition systems degrades dramatically under these conditions. Therefore, to improve robustness, some researchers have developed speech recognition systems that model these informative parts of the speech signal, called landmarks. Similarly, in this article, we implement a landmark-based system, inspired by human perception, to obtain a robust Persian speech recognition system. We also apply neural network-based variability compensation methods to improve its performance.
In this article, acoustic landmarks are classified into two categories, events and states. Events are regions of the speech signal in which the spectral characteristics change drastically while their duration varies little. The transition regions between certain adjacent pairs of phones (phone boundaries) are primarily selected as events. States are regions of the speech signal in which the spectral characteristics do not change significantly; here, the nuclei of phones are taken as the states. Previous research, linguistic sources, and implementation results were used to determine appropriate landmarks for Persian. Finally, a set of 313 landmarks was selected and used in our acoustic landmark-based phone recognition system.
The neural network used to recognize acoustic landmarks is a fully connected feed-forward structure with ReLU activations in its hidden layers and a linear activation in its final layer. The number of layers and neurons was determined experimentally; the best structure is composed of 5 fully connected layers with 1000 neurons per layer. In this study, instead of assigning one output neuron to each of the 313 landmarks, a heuristic labeling method is used to reduce the number of output neurons and exploit the information shared between landmarks. In the test phase, the landmark recognition model slides over the speech feature sequence to produce the output landmark sequence. Finally, three rule-based post-processing steps convert the obtained landmark sequence into a phone sequence.
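The feed-forward structure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the input or output dimensions, so the toy sizes below are assumptions, as is the reading of "5 fully connected layers" as five hidden layers followed by a linear output layer. The function names `init_mlp`, `affine`, and `forward` are hypothetical.

```python
import random


def init_mlp(layer_sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # common initialization scale for ReLU layers
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def affine(weights, biases, x):
    """Compute W x + b for one fully connected layer."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(params, x):
    """ReLU on every hidden layer, linear (identity) activation on the last."""
    for weights, biases in params[:-1]:
        x = [max(0.0, v) for v in affine(weights, biases, x)]
    weights, biases = params[-1]
    return affine(weights, biases, x)


# Toy sizes so the sketch runs quickly in pure Python (assumptions).
# The abstract reports 1000 neurons per layer; input and output sizes
# are not given there.
INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM = 12, 16, 6
HIDDEN_LAYERS = 5  # assumed reading of "5 fully connected layers"

params = init_mlp([INPUT_DIM] + [HIDDEN_DIM] * HIDDEN_LAYERS + [OUTPUT_DIM])
scores = forward(params, [0.1] * INPUT_DIM)
```

Here `scores` holds one unbounded real value per output neuron, which is what a linear final layer produces. How those outputs encode the 313 landmarks depends on the heuristic labeling method, which the abstract does not detail.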
Variability is one of the main sources of quality degradation in speech recognition; therefore, we propose two approaches to reduce it and improve phone recognition quality in our landmark-based system. To this end, we exploit the nonlinear filtering capability of neural networks by implementing four neural network schemes. In scheme 1, a feed-forward neural network is first trained to map training landmarks to their corresponding well-recognized samples; this structure then acts as a nonlinear filter before the landmark recognition block. In scheme 2, a unified structure is trained to learn the landmark labels and the filtering part simultaneously. In both of these schemes, we use a recursive loop to increase the chance of attractor manipulation in the structures. In scheme 3, a recursive loop is added to one hidden layer. This loop acts as an input variability simulator and forces the network to correctly recognize both the input data and its variations. Finally, in scheme 4, a deep attractor neural network-based structure is proposed to shape the hidden layer components of the structure so that it can compensate for variabilities.
The experiments are conducted on the Persian database Farsdat, and the results are reported in terms of phone error rate (PER). From every 25-millisecond speech frame, an acoustic feature called LHCB is extracted and combined with the delta and delta-delta features of that frame. The features of each frame are concatenated with those of fourteen adjacent frames and fed to our neural network-based landmark extraction model. The best-trained model obtained a PER of 21.74% on the test data. Using schemes 1 to 4, we achieved absolute PER reductions of 0.39, 0.58, 0.43, and 1.30 percentage points, respectively. A comparison of our landmark-based system with other Persian phone recognition systems shows that this method can perform efficiently as a Persian phone recognition system.
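The frame concatenation and the PER criterion mentioned above can be illustrated with a short sketch. It rests on several assumptions that the abstract does not confirm: the fourteen adjacent frames are taken as seven on each side, edge frames are padded by repetition, and PER is computed in the usual way as the Levenshtein distance between reference and recognized phone sequences divided by the reference length. The function names and the example phone strings are hypothetical.

```python
def splice_frames(frames, context=7):
    """Concatenate each frame with `context` neighbours on each side.

    Assumption: 14 adjacent frames = 7 left + 7 right; the first and last
    frames are repeated to pad the edges of the utterance.
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), last)])
        spliced.append(window)
    return spliced


def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """PER in percent: edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference phone sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one phone deleted from a five-phone reference.
example_per = phone_error_rate(["s", "a", "l", "a", "m"], ["s", "a", "l", "m"])

# PER after each scheme, from the figures reported in the abstract.
BASELINE_PER = 21.74
REDUCTIONS = [0.39, 0.58, 0.43, 1.30]
per_after_scheme = [round(BASELINE_PER - r, 2) for r in REDUCTIONS]
```

With the reported reductions, the resulting PERs for schemes 1 to 4 come to 21.35%, 21.16%, 21.31%, and 20.44%, so scheme 4 gives the largest improvement.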
In future work, we intend to compare the performance of our acoustic landmark-based phone recognition system with conventional methods such as CTC in noisy conditions. In addition, acoustic landmarks appear usable for aligning the input speech sequence with the output transcription. We will therefore present a combination of CTC-based methods and acoustic landmarks to exploit the complementary information that landmarks provide. This information might improve the performance and speed of CTC-based speech recognition methods, particularly in low-resource languages.

Article number: 12

Keywords: Phone Recognition, Acoustic Landmarks, Deep Learning, Robust Recognition, Nonlinear Filtering

Type of Study: Research | Subject: Paper
Received: 2020/09/07 | Accepted: 2021/08/25 | Published: 2023/03/20 | ePublished: 2023/03/20

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
