Signal and Data Processing

A Novel One-Sided Feature Selection Method for Imbalanced Text Classification

Section: Text processing papers
Article type: Research paper

<div style="text-align: justify;">An imbalanced data distribution degrades classifier performance. The solutions proposed for this problem fall into several categories, of which sampling-based and algorithm-based methods are the most important. Feature selection has also attracted attention as a way to improve classification performance on imbalanced data. This paper presents a new one-sided feature selection method for imbalanced text classification. The proposed method uses the distribution of features to compute how indicative each feature is. To compare its performance, several feature selection methods were implemented, and the C4.5 decision tree and Naive Bayes were used for evaluation. Experimental results on the Reuters-21578 and WebKB corpora, in terms of Micro F, Macro F, and G-mean, show that the proposed method considerably improves classifier performance compared with the other methods.</div>

Imbalanced data appear in many areas, such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis and monitoring, and biological data analysis. Classification algorithms tend to favor the majority class and may even treat minority-class examples as outliers. Text is one of the areas where imbalance occurs. The amount of text information in the form of books, reports, and papers is increasing rapidly, and processing it quickly and accurately requires efficient automatic methods. Text classification is one of the key processing tools. One of its problems is the high dimensionality of the data, which can make learning algorithms impractical. The problem becomes more severe when the text data are also imbalanced.
An imbalanced data distribution reduces classifier performance. The solutions proposed for this problem fall into several categories, of which sampling-based and algorithm-based methods are among the most important. Feature selection is also regarded as a solution to the imbalance problem. This research presents a new one-sided feature selection method for imbalanced data classification. The proposed method uses the distribution of a feature to calculate its indicator rate, that is, how strongly the feature indicates a class. The documents of a corpus are divided into groups according to whether they contain the feature and whether they belong to the positive class. Based on this partition, a new feature scoring method is proposed that relies on the following observations. <ol style="list-style-type:lower-alpha;"> <li>If a feature occurs in most positive-class documents, it is a good indicator of the positive class and should receive a high score for that class. This can be expressed as the proportion of positive-class documents that contain the feature. In addition, if most of the documents that contain the feature belong to the positive class, the feature should receive a high score as an indicator of the class. This can be expressed as the proportion of documents containing the feature that belong to the positive class. </li> <li>If most of the documents that do not contain a feature are outside the positive class, the feature should receive a high score as a representative of that class. Moreover, if most of the documents outside the positive class do not contain the feature, the feature should receive a high score. </li> </ol> Using the proposed method, a score is assigned to each feature.
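The four proportions in items (a) and (b) can be read off a two-by-two table of document counts. The following is a minimal sketch under stated assumptions, not the authors' formula: the abstract does not say how the proportions are combined, so the unweighted mean in `combined_score` is purely illustrative, and the names `FeatureCounts`, `indicator_proportions`, and `combined_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FeatureCounts:
    """Document counts for one feature and one positive class (hypothetical names)."""
    tp: int  # positive-class documents that contain the feature
    fp: int  # other documents that contain the feature
    fn: int  # positive-class documents that lack the feature
    tn: int  # other documents that lack the feature


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 for an empty denominator instead of raising ZeroDivisionError.
    return numerator / denominator if denominator else 0.0


def indicator_proportions(c: FeatureCounts) -> Dict[str, float]:
    """Return the four proportions described in items (a) and (b)."""
    return {
        # (a) share of positive-class documents that contain the feature
        "pos_docs_with_feature": _ratio(c.tp, c.tp + c.fn),
        # (a) share of documents containing the feature that are positive
        "feature_docs_in_pos": _ratio(c.tp, c.tp + c.fp),
        # (b) share of documents lacking the feature that are not positive
        "no_feature_docs_not_pos": _ratio(c.tn, c.tn + c.fn),
        # (b) share of non-positive documents that lack the feature
        "not_pos_docs_without_feature": _ratio(c.tn, c.tn + c.fp),
    }


def combined_score(c: FeatureCounts) -> float:
    """ASSUMPTION: unweighted mean of the four proportions, not the paper's formula."""
    values = indicator_proportions(c).values()
    return sum(values) / len(values)


# A feature concentrated in a small positive class versus one spread across the corpus.
strong = FeatureCounts(tp=8, fp=2, fn=2, tn=88)
weak = FeatureCounts(tp=2, fp=30, fn=8, tn=60)
print(combined_score(strong) > combined_score(weak))
```

With these made-up counts, the feature that appears in 8 of 10 positive documents and in only 2 of 90 others scores higher than the feature that appears mostly outside the positive class, which matches the intent of items (a) and (b).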
Finally, the features are sorted in descending order of score, and the required number of features is selected from the top of the list. To evaluate the proposed method, several feature selection methods, including Gini, DFS, MI, and FAST, were implemented for comparison, and the C4.5 decision tree and Naive Bayes classifiers were used. Results on the Reuters-21578 and WebKB corpora, in terms of Micro F, Macro F, and G-mean, show that the proposed method improves classifier performance considerably more than the other methods.

Keywords (translated from Persian): Feature selection, Filter method, Imbalanced data, Text classification
Keywords: Feature selection, Imbalanced class, High dimensionality, Text classification

Pages: 21-40
URL: http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-652-2&slc_lang=fa&sid=1

Authors:

Jafar Pouramini (j_pouramini@pnu.ac.ir), corresponding author
Affiliation (English record): Department of Computer & Information Technology Engineering, Faculty of Engineering, University of Qom, Qom, Iran
Affiliation (translated from the Persian record): Department of Information Technology Engineering, Faculty of Engineering, Payame Noor University, Tehran

Behrouze Minaei-Bidgoli (b_minaei@iust.ac.ir)
Affiliation: Faculty of Computer Engineering, Iran University of Science and Technology

Mahdi Esmaeili (m.esmaeili@iaukashan.ac.ir)
Affiliation: Faculty of Computer Engineering, Kashan Islamic Azad University