Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Veisi, Hadi; Ghoreishi, Sayed Akbar; Bastanfard, Azam

doi:10.29252/jsdp.17.4.67

Volume 17, Issue 4 (2-2021) JSDP 2021, 17(4): 67-88

Veisi H, Ghoreishi S A, Bastanfard A. Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting. JSDP 2021; 17(4): 67-88
URL: http://jsdp.rcisp.ac.ir/article-1-922-en.html

Hadi Veisi*, Sayed Akbar Ghoreishi, Azam Bastanfard

Faculty of New Sciences and Technologies, University of Tehran

Abstract:

The Islamic Republic of Iran Broadcasting (IRIB), one of the largest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB archive is one of the richest archives in Iran and contains a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieving content from the archive, is one of the key issues for the broadcaster. The aim of this research is to design a content retrieval engine for IRIB media and productions using spoken term detection (STD), also known as keyword spotting. The goal of an STD system is to search for a set of keywords in a set of speech documents. One approach to STD uses a speech recognition system: the speech is recognized and converted into text, and the text is then searched for the keywords. The variety of speech documents and the limited vocabulary of the speech recognizer are two challenges of this approach. Large vocabulary continuous speech recognition (LVCSR) systems have a large but finite vocabulary and cannot recognize out-of-vocabulary (OOV) words. Therefore, LVCSR-based STD systems suffer from the OOV problem and cannot spot OOV keywords. Methods such as sub-word units (e.g., phonemes or syllables) and proxy words have been introduced to overcome the vocabulary limitation and to handle OOV keywords.
This paper proposes a Persian (Farsi) STD system based on speech recognition that uses the proxy words method to deal with OOV keywords. To improve the performance of this method, we use a Long Short-Term Memory-Connectionist Temporal Classification (LSTM-CTC) network.
In our experiments, we designed and implemented a large vocabulary continuous speech recognition system for the Farsi language. The Large FarsDat dataset is used to train the speech recognition system; it contains 80 hours of speech from 100 speakers. The Kaldi toolkit is used to implement the speech recognition system. Because the dataset is limited, Subspace Gaussian Mixture Models (SGMM) are used to train the acoustic model. The acoustic model is trained on context-dependent triphones, and the language model is a word trigram model. The word error rate (WER) of the speech recognition system is 2.71% on the FARSDAT test set and 28.23% on the Persian news collected from IRIB data.
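As a reminder of how a WER figure such as those above is usually obtained, the following is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the scoring tool actually used in the paper is not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.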
Term detection is based on weighted finite-state transducers (WFST). In this method, a speech document is first converted to a lattice by the speech recognizer (the lattice contains the competing recognition hypotheses with their probabilities, rather than only the most probable one), and the lattice is then converted to a WFST. This WFST contains the probabilities of the words computed by the speech recognizer. Text retrieval is then used to index and search the WFST output. The proxy words method is used to deal with OOV keywords. In this method, OOV words are represented by in-vocabulary words with similar pronunciations. To improve the performance of the proxy words method, an LSTM-CTC network is proposed. This LSTM-CTC network is trained on the characters of isolated words (not continuous sentences). It recomputes the probabilities and re-verifies the proxy outputs. This improves the proxy words method because that method suffers from false alarms. Since the LSTM-CTC network is end-to-end and trained on characters, it does not need a phonetic lexicon and can support OOV words. Because it is trained on isolated words, it reduces the weight of the language model and relies mainly on the acoustic evidence.
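The idea of representing an OOV keyword by in-vocabulary words with similar pronunciations can be illustrated with a simplified sketch. Proxy-word systems are typically built with WFST composition and a phone confusion model; the code below is only a stand-in that ranks lexicon entries by unweighted phone-level edit distance. The toy lexicon, the phone symbols, and the names `phone_edit_distance` and `proxy_words` are all hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence


def phone_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unweighted Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,             # delete a phone
                           cur[j - 1] + 1,          # insert a phone
                           prev[j - 1] + (pa != pb)))  # match or substitute
        prev = cur
    return prev[-1]


def proxy_words(oov_phones: Sequence[str],
                lexicon: Dict[str, List[str]],
                k: int = 3) -> List[str]:
    """Return the k in-vocabulary words closest in pronunciation to the OOV term.

    Ties are broken alphabetically so the result is deterministic.
    """
    scored = sorted(
        (phone_edit_distance(oov_phones, phones), word)
        for word, phones in lexicon.items()
    )
    return [word for _, word in scored[:k]]


if __name__ == "__main__":
    # Hypothetical toy lexicon with made-up phone strings.
    toy_lexicon = {
        "kar": ["k", "a", "r"],
        "bar": ["b", "a", "r"],
        "kard": ["k", "a", "r", "d"],
        "miz": ["m", "i", "z"],
    }
    print(proxy_words(["k", "a", "r", "t"], toy_lexicon, k=2))
```

The returned proxies would then be searched in the index in place of the OOV keyword. Because acoustically similar but different words are matched, this kind of substitution tends to produce false alarms, which is the motivation for a second verification stage such as the LSTM-CTC network described above.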
The proposed STD system achieves an Actual Term Weighted Value (ATWV) of 0.9206 for in-vocabulary keywords; for OOV keywords, the ATWV is 0.2 using the proxy words method. Applying the proposed LSTM-CTC network improves this ATWV to 0.3058. On the Persian news dataset, the proposed method achieves an ATWV of 0.8008.
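For readers unfamiliar with the metric, the sketch below computes a term-weighted value following the definition used in the NIST 2006 STD evaluation: one minus the average over terms of P_miss + β · P_FA, with β = 999.9, where P_FA is the number of false alarms divided by the number of non-target trials (seconds of speech minus the number of true occurrences). It is an assumption that the paper follows exactly this definition; the class `TermCounts` and the function `atwv` are illustrative names.

```python
from dataclasses import dataclass
from typing import Iterable

BETA = 999.9  # weight used in the NIST 2006 STD evaluation


@dataclass
class TermCounts:
    n_true: int         # reference occurrences of the term
    n_correct: int      # correctly detected occurrences
    n_false_alarm: int  # detections with no matching reference occurrence


def atwv(terms: Iterable[TermCounts], speech_seconds: float,
         beta: float = BETA) -> float:
    """Term-weighted value at the system's actual decision threshold."""
    losses = []
    for t in terms:
        if t.n_true == 0:
            continue  # terms that never occur in the reference are excluded
        p_miss = 1.0 - t.n_correct / t.n_true
        # Non-target trials: one per second of speech, minus true occurrences.
        p_fa = t.n_false_alarm / (speech_seconds - t.n_true)
        losses.append(p_miss + beta * p_fa)
    if not losses:
        raise ValueError("no term with at least one reference occurrence")
    return 1.0 - sum(losses) / len(losses)


if __name__ == "__main__":
    counts = [TermCounts(n_true=10, n_correct=9, n_false_alarm=0),
              TermCounts(n_true=5, n_correct=5, n_false_alarm=0)]
    print(atwv(counts, speech_seconds=3600.0))
```

A perfect system scores 1.0, a system that outputs nothing scores 0.0, and the large β means that even a few false alarms per term can push the value below zero.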

Keywords: Persian Spoken Term Detection, IRIB, Persian News, Keyword Spotting, Speech Recognition, Kaldi

Type of Study: Applicable | Subject: Paper
Received: 2018/10/30 | Accepted: 2020/08/18 | Published: 2021/02/22 | ePublished: 2021/02/22

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
