Volume 16, Issue 1 (3-1398), Pages 21-40


Pouramini J, Minaei-Bidgoli B, Esmaeili M. A Novel One Sided Feature Selection Method for Imbalanced Text Classification. JSDP 2019; 16(1): 21-40
URL: http://jsdp.rcisp.ac.ir/article-1-728-fa.html


Department of Information Technology Engineering, Faculty of Engineering, Payame Noor University, Tehran
Abstract:   (4168 views)
An imbalanced data distribution degrades classifier performance. The solutions proposed for this problem fall into several categories, of which sampling-based and algorithm-based methods are the most prominent. Feature selection has also attracted attention as a way to improve classification performance on imbalanced data. This paper presents a new one-sided feature selection method for imbalanced text classification. The proposed method uses the distribution of features to measure how indicative each feature is of a class. To compare its performance, several existing feature selection methods were implemented, and the C4.5 decision tree and Naive Bayes classifiers were used for evaluation. Experimental results on the Reuters-21578 and WebKB corpora, measured by Micro-F, Macro-F, and G-mean, show that the proposed method improves classifier performance considerably compared with the other methods.
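The abstract does not give the exact scoring formula, so the following Python sketch is only an illustration of what a one-sided, distribution-based feature score might look like: a log-odds-ratio-style score computed from class-conditional document frequencies, truncated at zero so that only terms indicating the minority class are kept. The names one_sided_score and select_features and the choice of a log odds ratio are assumptions for illustration, not the method proposed in the paper.

import math
from collections import Counter

def one_sided_score(docs, labels, term, minority_label, eps=1e-6):
    """Score `term` by how strongly its presence indicates the minority class.

    docs           -- list of tokenised documents (lists of terms)
    labels         -- class labels aligned with docs
    minority_label -- the rare (positive) class
    Returns 0.0 for terms that point toward the majority class (one-sided).
    """
    tp = fp = fn = tn = 0
    for tokens, y in zip(docs, labels):
        present = term in tokens
        if y == minority_label:
            tp += present
            fn += not present
        else:
            fp += present
            tn += not present
    # Hypothetical score: log odds ratio of the term occurring in
    # minority-class vs. majority-class documents (eps avoids log(0)).
    score = math.log(((tp + eps) * (tn + eps)) / ((fp + eps) * (fn + eps)))
    return max(score, 0.0)  # one-sided: discard majority-indicative terms

def select_features(docs, labels, minority_label, k=500):
    """Return the k highest-scoring terms under the one-sided score above."""
    vocab = Counter(t for d in docs for t in set(d))
    scored = {t: one_sided_score(docs, labels, t, minority_label) for t in vocab}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Toy usage: "spam" is the minority class in this tiny corpus.
docs = [["cheap", "pills", "buy"], ["meeting", "agenda", "notes"],
        ["buy", "now", "cheap"], ["lunch", "agenda"], ["notes", "meeting"]]
labels = ["spam", "ham", "spam", "ham", "ham"]
print(select_features(docs, labels, minority_label="spam", k=3))

Truncating negative scores is what makes the filter one-sided: terms that indicate the majority class receive a score of zero and are never selected, which is the general idea behind one-sided metrics for imbalanced text data.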
Full text [PDF 7654 kb]   (1229 downloads)
Article type: Research | Subject: Text Processing
Received: 1396/9/10 | Accepted: 1397/12/5 | Published: 1398/3/20 | Published online: 1398/3/20




Rights and permissions
This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.
