Signal and Data Processing

Title: Introducing a new information retrieval method applicable for speech recognized texts
Original title (Persian): ارائه یک روش جدید بازیابی اطلاعات مناسب برای متون حاصل از بازشناسی گفتار
Language: Persian (fa)
Section: Text processing papers
Article type: Paper, Applicable

Abstract:
This article introduces a pre-processing step for information retrieval methods that is suited to retrieving texts produced by speech recognition. The pre-processing combines query correction and query expansion. The inputs are a corpus of text documents generated by a speech recognition system and a query; the goal is to find the documents relevant to the query term.

The basic problem is that speech-recognized text always contains some percentage of recognition errors. Words that are in fact relevant, but were altered by a recognition error, may not be identified as relevant, so the documents containing them are erroneously treated as irrelevant.

The idea of the proposed method is to detect error-prone terms and to consider similar words for each term detected as erroneous. To detect an erroneous word, a parameter is defined as the error probability of the word; a larger value indicates a higher chance that a recognition error occurred in that word. To find the similar words for a given term, an initial set of candidates is first chosen based on a criterion called average detection rate (ADR) and the Levenshtein distance. Then the probability that each candidate is converted to the original query word is computed; this conversion probability is defined based on the conversion rate (CR) and the noisy channel model (NCM). The candidates whose conversion probability exceeds a threshold level are selected as the final similar words. In the retrieval process, these similar words are treated as relevant in the search step in addition to the base word.

Implementation results show that this pre-processing improves the F-measure of the information retrieval method by up to 30%.

Keywords: Information retrieval, Speech recognition, Document, Query, Levenshtein distance
Pages: 93-108
URL: http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-733-2&slc_lang=fa&sid=1

Authors:
Rouhollah Dianat (روح الله دیانت), Qom University, rdianat@qom.ac.ir
Morteza Ali Ahmadi (مرتضی علی احمدی), Qom University, morteza.ali.ahmadi@gmail.com (corresponding author)
Yahya Akhlaghi (یحیی اخلاقی), Khatamolnabiyin University, yahya.akhlaghi@gmail.com
Bagher Babaali (باقر باباعلی), Tehran University, babaali@ut.ac.ir
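
The query-expansion step described in the abstract (pick candidate words close to the query term by Levenshtein distance, keep those that pass a threshold, then search with the base word plus its similar words) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give formulas for the error probability, ADR, CR, or the noisy channel model, so the hypothetical function `stand_in_score` uses a simple normalised edit similarity in place of the conversion probability. The names `expand_query` and `retrieve` and the parameters `max_distance` and `threshold` are likewise made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def stand_in_score(query: str, candidate: str) -> float:
    # ASSUMPTION: normalised edit similarity, used here only as a placeholder
    # for the paper's conversion probability (CR / noisy channel model).
    longest = max(len(query), len(candidate)) or 1
    return 1.0 - levenshtein(query, candidate) / longest


def expand_query(query, vocabulary, max_distance=2, threshold=0.6):
    """Return the base word followed by its similar words (sorted)."""
    # Step 1: initial candidates by Levenshtein distance.
    candidates = [w for w in vocabulary
                  if w != query and levenshtein(query, w) <= max_distance]
    # Step 2: keep candidates whose score reaches the threshold.
    similar = sorted(w for w in candidates
                     if stand_in_score(query, w) >= threshold)
    return [query] + similar


def retrieve(query, documents, **kwargs):
    """Indices of documents containing the query word or a similar word."""
    vocabulary = {word for doc in documents for word in doc.split()}
    terms = set(expand_query(query, vocabulary, **kwargs))
    return [i for i, doc in enumerate(documents) if terms & set(doc.split())]
```

For example, with the documents "the speach was long", "a ripe peach" and "search the text", calling `retrieve("speech", documents)` matches the first two: "speach" and "peach" are within two edits of "speech", while "search" is three edits away. The second match also shows why a pure edit-distance filter is too permissive on its own, which is the role the abstract assigns to the conversion probability and its threshold.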