Volume 14, Issue 4 (3-2018)                   JSDP 2018, 14(4): 143-157

Jafari H S, Homayounpour M M. A comparison of machine learning techniques for Persian Extractive Speech to Speech Summarization without Transcript. JSDP. 2018; 14 (4): 143-157
URL: http://jsdp.rcisp.ac.ir/article-1-491-en.html
Amirkabir University of Technology
Abstract:

In this paper, extractive speech summarization using different machine learning algorithms is investigated. Speech summarization extracts the important, salient segments of a speech signal so that speech files can be accessed, searched, and browsed more easily and at lower cost. We propose a new method for speech summarization that does not rely on an automatic speech recognition (ASR) system; ASR systems usually have high error rates, especially in adverse acoustic environments and for low-resource languages. Our goal was to answer the question: is it possible to summarize Persian speech without ASR, using little or no training data? The proposed method discovers salient parts directly from the speech signal with a semi-supervised algorithm and consists of three main stages: feature extraction, key-pattern identification, and sentence selection.
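The three-stage pipeline above can be sketched as follows. Function names, the feature set, and the placeholder scoring are illustrative assumptions, not the authors' exact implementation:

```python
# Minimal sketch of the three-stage pipeline (assumed structure):
# feature extraction -> key-pattern identification -> sentence selection.

def extract_features(sentences):
    """Stage 1: per-sentence features (duration, position in the speech)."""
    n = len(sentences)
    return [
        {
            "duration": s["end"] - s["start"],
            "is_first": i == 0,
            "is_last": i == n - 1,
        }
        for i, s in enumerate(sentences)
    ]

def identify_key_patterns(sentences):
    """Stage 2: pairwise acoustic similarity (placeholder identity matrix)."""
    n = len(sentences)
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def select_sentences(features, similarity, k):
    """Stage 3: pick the k top-scoring sentences (here, the longest ones)."""
    order = sorted(range(len(features)),
                   key=lambda i: features[i]["duration"], reverse=True)
    return sorted(order[:k])

# Toy input: sentence boundaries in seconds.
sentences = [{"start": 0.0, "end": 4.2}, {"start": 4.2, "end": 6.0},
             {"start": 6.0, "end": 11.5}]
summary = select_sentences(extract_features(sentences),
                           identify_key_patterns(sentences), k=2)
print(summary)  # indices of the selected sentences
```

In the actual system, stage 2 is filled in by S-DTW pattern discovery and stage 3 by the learning algorithms compared in the paper.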
First, the speech is segmented into sentences manually, to eliminate sentence-segmentation errors and allow a fairer comparison between summarization methods. Features are then extracted from each sentence, such as its duration and whether it is the first or last sentence of the speech. In addition, repetitive patterns between every pair of sentences are discovered directly from the speech signal with the S-DTW algorithm, which finds matching patterns between two speech signals using MFCC features. These repetitive patterns yield a sentence-by-sentence similarity matrix, so the similarity between each pair of sentences can be measured and redundant sentences can be removed from the summary, all without an ASR system.
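As a rough illustration of the similarity-matrix step, the sketch below uses classic DTW over MFCC-like frame sequences in place of segmental S-DTW (which additionally restricts alignments to diagonal bands so that matching sub-sequences such as repeated words can be localized). The toy feature arrays stand in for real MFCC output:

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized DTW distance between two (frames x dims) arrays."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-pair distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def similarity_matrix(mfcc_per_sentence):
    """Symmetric sentence-by-sentence similarity from DTW distances."""
    n = len(mfcc_per_sentence)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            d = dtw_distance(mfcc_per_sentence[i], mfcc_per_sentence[j])
            sim[i, j] = sim[j, i] = 1.0 / (1.0 + d)
    return sim

# Toy "MFCC" sequences: sentences 0 and 2 are near-identical repeats,
# so their similarity should exceed that of the unrelated pair (0, 1).
rng = np.random.default_rng(0)
s0 = rng.normal(size=(20, 13))
s1 = rng.normal(size=(25, 13))
s2 = s0 + 0.01 * rng.normal(size=(20, 13))
sim = similarity_matrix([s0, s1, s2])
```

A high off-diagonal entry in `sim` marks a sentence pair sharing repeated acoustic material, which the summarizer can treat as redundant.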
After computing the similarity between every pair of speech segments and extracting features from each segment, various machine learning algorithms, including unsupervised (MMR, TextRank), supervised (SVM, Naïve Bayes), and semi-supervised (self-training, co-training) methods, are used to extract the salient parts. Experiments were conducted on read Persian news. The results show that with the semi-supervised co-training method and appropriate features, the performance of the speech summarization system on the read Persian news corpus improves by about 3% compared to selecting the first sentences, and by 5% compared to selecting the longest sentences, when ROUGE-3 is used as the evaluation measure.
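Of the selection methods compared, MMR is the simplest to sketch: it greedily picks sentences that balance relevance against redundancy with the summary built so far. The relevance scores, similarity matrix, and trade-off weight below are illustrative values, not the paper's settings:

```python
import numpy as np

def mmr_select(relevance, sim, k, lam=0.5):
    """Greedy MMR: maximize lam * relevance(i)
    minus (1 - lam) * max similarity of i to already-selected sentences."""
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Sentences 0 and 1 are highly relevant but near-duplicates (sim 0.95),
# so MMR takes sentence 0 and then prefers the dissimilar sentence 2.
relevance = [0.9, 0.85, 0.3]
sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(mmr_select(relevance, sim, k=2))  # → [0, 2]
```

In the no-transcript setting, `sim` would come from the S-DTW similarity matrix and `relevance` from the acoustic sentence features, rather than from any text representation.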

Full-Text [PDF 6205 kb]
Type of Study: Applicable | Subject: Paper
Received: 2016/02/18 | Accepted: 2017/06/10 | Published: 2018/03/13 | ePublished: 2018/03/13


© 2015 All Rights Reserved | Signal and Data Processing