Signal and Data Processing

A survey on spectral methods in spoken language identification

Speech processing papers
Research paper

Automatic spoken language identification refers to recognizing a language from the speech signal. Language identification is usually performed with one of two families of methods, phonetic and spectral. In this paper, various spectral methods for spoken language recognition are introduced, and the results of applying them to a conversational telephone speech corpus are compared. The baseline spectral method for language identification is the Gaussian mixture model-universal background model (GMM-UBM). The discriminative MMI method is used to improve the Gaussian model of each language, and the ergodic hidden Markov model (EHMM) is used to model language dynamics. GSV-SVM and the GMM-based tokenizer (GMM Tokenizer) are two other spectral methods that are examined. The paper also applies recent speaker and channel variability modeling methods (joint factor analysis (JFA) and identity vector (i-Vector)), and uses several variability compensation methods to improve their results. In addition, score post-processing is used to ease decision making and reduce the error of the language identification system. This paper is part of seven years of research on spoken language identification at the Khajeh Nasiredin Tousi advanced technology development center, and only a summary of the methods and results obtained is given here.

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories, spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. A statistical model of these features is then obtained for each language. The Gaussian mixture model is the most common statistical model in spectral-based language identification systems. In phonetic-based methods, on the other hand, speech signals are divided into a sequence of tokens using the hidden Markov model (HMM), and a language model is trained on the resulting sequence.
Approaches such as PRLM, PPRLM, and PR-SVM are examples of phonetic-based methods. In research papers, a combination of phonetic-based and spectral-based systems is usually used to achieve a high-quality language identification system. Spectral-based methods have been the focus of researchers, since they need no labeled data and usually achieve better results than phonetic approaches. Therefore, this paper introduces, implements, and compares different spectral methods for spoken language recognition. The basic spectral language identification method is the Gaussian mixture model-universal background model (GMM-UBM). In this paper, the discriminative MMI training method is used to improve the Gaussian model of each language. Moreover, in order to model language dynamics, the GMM is replaced with the ergodic hidden Markov model (EHMM). The GSV-SVM and GMM tokenizer methods are also implemented as two popular spectral approaches. In addition, novel speaker and channel variation modeling methods are used as language identification approaches, including joint factor analysis (JFA) and identity vector (i-Vector), and several variation compensation methods are exploited to improve the results of i-Vector. Furthermore, in order to boost the performance of the language recognition systems, different post-processing methods are applied. Each element of the raw score vector indicates the degree to which the speech signal belongs to a language. A post-processing method is applied to this vector as a classifier and allows better language detection decisions by mapping the raw score vector to the space of the desired languages. Different studies have employed different post-processing methods, including GMM, NN, SVM, and LLR. This study exploits several score post-processing methods to improve the quality of language recognition.
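The mapping from a raw score vector to a decision can be illustrated with a minimal Python sketch. It assumes that each raw score is a log-likelihood and that a flat prior over the competing languages is acceptable. The function names, the "other" out-of-set entry, and the zero threshold are hypothetical choices for illustration, and this simple log-likelihood ratio (LLR) mapping is not necessarily the post-processing used in the paper.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def llr_postprocess(raw_scores):
    """Map raw per-language log-likelihoods to log-likelihood ratios.

    For each language, the ratio compares its own score with the average
    likelihood of all competing languages (flat prior, an assumption).
    Requires at least two entries in raw_scores.
    """
    llrs = {}
    for lang, score in raw_scores.items():
        others = [s for name, s in raw_scores.items() if name != lang]
        competitor = log_sum_exp(others) - math.log(len(others))
        llrs[lang] = score - competitor
    return llrs


def decide(raw_scores, targets, threshold=0.0):
    """Return the best target language, or "other" if no LLR exceeds the threshold."""
    llrs = llr_postprocess(raw_scores)
    best = max(targets, key=lambda lang: llrs[lang])
    return best if llrs[best] > threshold else "other"


if __name__ == "__main__":
    # Made-up scores for one utterance, for illustration only.
    scores = {"farsi": -41.2, "english": -44.0, "arabic": -43.1, "other": -45.5}
    print(decide(scores, targets=["farsi", "english", "arabic"]))
```

Returning "other" when no target exceeds the threshold corresponds to the open-set setting, in which an utterance may belong to none of the target languages.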
The goal of the experiments in this article is to detect and distinguish Farsi, English, and Arabic, both individually and simultaneously, from other languages. The latter is also called open-set language identification. The signals considered in this paper are two-sided conversations, whose quality is usually poor due to strong noise, background speech or music, accents, etc. The Gaussian mixture model-universal background model (GMM-UBM) was implemented as the basic method. With this approach, the mean EER of the three target languages (Farsi, English, and Arabic) was 13.58. Experimental results indicated that training the GMM language identification system with the discriminative MMI training algorithm is more effective than training it with the ML algorithm alone. More specifically, the mean EER of the three target languages was reduced by about 8 percent in comparison to GMM-UBM. The GMM tokenizer method was also tested as a novel spectral approach. With this method, the mean EER of the three target languages was about 5 percent better than with GMM-UBM. The discriminative GSV-SVM method was also used for language recognition. Its results were considerably better than those of common spectral approaches: the mean EER of the three target languages was reduced by 11 percent in comparison to GMM-UBM. The low speed of this method is improved in this study using a model pushing method. This study also implemented two novel methods, JFA and i-Vector. Both provide better results than GMM-UBM, with the mean EER of the three target languages reduced by 1% for JFA and 12% for i-Vector. Overall, the experimental results showed that i-Vector provides better results than the other spectral language identification systems. This study is the result of seven years of research on spoken language identification at the advanced technology development center of Khajeh Nasiredin Tousi.
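The equal error rate (EER) reported above is the operating point at which the miss rate and the false-alarm rate coincide. The following Python sketch shows one simple way to estimate it from detection scores by sweeping the threshold over the observed score values. The function name and the toy scores are hypothetical, and the paper's own evaluation procedure may differ (for example, by interpolating between thresholds).

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction) by sweeping over all observed scores.

    A trial is accepted when its score is at or above the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a target-language trial that falls below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial that reaches the threshold.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (miss + fa) / 2
    return eer


if __name__ == "__main__":
    # Toy detection scores for one target language, for illustration only.
    targets = [2.0, 1.5, 0.5, 3.0]
    nontargets = [-1.0, 0.0, 1.0, -2.0]
    print(equal_error_rate(targets, nontargets))
```

A mean EER over several target languages, as quoted in the abstract, would then be the average of the per-language values returned by such a function.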
Ongoing research includes studying and implementing novel spectral language identification algorithms such as PLDA, together with state-of-the-art phonetic language identification methods, in order to combine the spectral and phonetic systems and eventually achieve a high-quality language identification system.

Automatic Spoken Language Recognition, Acoustic Approaches, Discriminative training, Channel compensation, Identity Vector.

Pages 111-134
http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-798-1&slc_lang=fa&sid=1

Shaghayegh Reza, shaghayegh.reza@gmail.com, Amirkabir University; Data Processing Research Institute, Khajeh Nasiredin Tousi advanced technology development center
Jahanshah Kabudian, kabudian@razi.ac.ir, Razi University, Kermanshah