Volume 16, Issue 1 (5-2019)                   JSDP 2019, 16(1): 21-40


Pouramini J, Minaei-Bidgoli B, Esmaeili M. A Novel One Sided Feature Selection Method for Imbalanced Text Classification. JSDP 2019; 16 (1) :21-40
URL: http://jsdp.rcisp.ac.ir/article-1-728-en.html
Department of Computer & Information Technology Engineering, Faculty of Engineering, University of Qom, Qom, Iran
Abstract:

Imbalanced data arise in many areas, such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis and monitoring, and biological data analysis.
Classification algorithms tend to favor the majority class and may even treat minority-class samples as outliers. Text data is one of the areas where imbalance occurs. The amount of textual information is growing rapidly in the form of books, reports, and papers, and fast, precise processing of this volume of information requires efficient automatic methods. One of the key processing tools is text classification. A further problem in text classification is high-dimensional data, which makes learning algorithms impractical; the problem is compounded when the text data are also imbalanced, since an imbalanced data distribution reduces classifier performance. The solutions proposed for this problem fall into several categories, of which sampling-based and algorithm-based methods are among the most important. Feature selection is also considered one of the solutions to the imbalance problem. In this research, a new one-sided feature selection method is presented for imbalanced data classification. The proposed method calculates the indicator rate of each feature using the feature's distribution.
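As a concrete illustration of the sampling-based category mentioned above, the simplest remedy is random oversampling, which duplicates minority-class samples until the classes are balanced. This is a minimal sketch with illustrative function names; the paper itself proposes feature selection rather than sampling:

```python
import random

def random_oversample(docs, labels, seed=0):
    """Duplicate minority-class samples at random until both classes
    have the same size. Illustrative sketch of a sampling-based
    approach to class imbalance; labels are assumed to be 0/1."""
    rng = random.Random(seed)
    pos = [d for d, y in zip(docs, labels) if y == 1]
    neg = [d for d, y in zip(docs, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    minority_label = 1 if minority is pos else 0
    # Draw extra copies from the minority class until sizes match.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return docs + extra, labels + [minority_label] * len(extra)
```

Random undersampling (discarding majority-class samples) and synthetic methods such as SMOTE refine this basic idea.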
In the proposed method, the documents of the collection are divided into different parts based on whether or not they contain a given feature, and whether or not they belong to the positive class. Based on this partitioning, a new feature selection method is proposed that uses the following observations.

  1. If a feature appears in most positive-class documents, it is a good indicator of the positive class and should therefore receive a high score for this class. This can be expressed as the proportion of positive-class documents that contain the feature. Likewise, if most of the documents containing the feature belong to the positive class, the feature should receive a high score as a class indicator. This can be expressed as the proportion of documents containing the feature that belong to the positive class.
  2. If most of the documents that do not contain a feature are outside the positive class, the feature should receive a high score as a representative of this class. Moreover, if most of the documents outside the positive class do not contain the feature, the feature should again receive a high score.

Using the proposed method, a score is assigned to each feature. Finally, the features are sorted in descending order of score, and the required number of features is selected from the top of the list.
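The scoring scheme described above can be sketched from the four document counts it implies (positive/negative class crossed with feature present/absent). The way the four proportions are combined here, a simple product, is an illustrative assumption; the paper's exact formula is not given in this abstract:

```python
def one_sided_feature_scores(docs, labels, k):
    """Score each feature for the positive class (label 1) from the
    four proportions described above, then return the top-k features.
    docs: list of sets of terms; labels: list of 0/1.
    Combining the proportions by product is an assumption."""
    vocab = set()
    for doc in docs:
        vocab.update(doc)
    n = len(docs)
    eps = 1e-9  # avoid division by zero
    scores = {}
    for t in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y == 1 and t in doc)      # positive, contains t
        b = sum(1 for doc, y in zip(docs, labels) if y == 1 and t not in doc)  # positive, lacks t
        c = sum(1 for doc, y in zip(docs, labels) if y == 0 and t in doc)      # negative, contains t
        d = n - a - b - c                                                      # negative, lacks t
        p1 = a / (a + b + eps)  # share of positive docs containing t
        p2 = a / (a + c + eps)  # share of t-docs that are positive
        p3 = d / (c + d + eps)  # share of t-free docs that are negative
        p4 = d / (b + d + eps)  # share of negative docs lacking t
        scores[t] = p1 * p2 * p3 * p4
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

On a toy corpus where a term occurs in every positive document and no negative one, all four proportions approach 1, so that term ranks first.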
To evaluate the performance of the proposed method, it was compared against several feature selection methods, including Gini, DFS, MI, and FAST. The decision tree C4.5 and Naive Bayes classifiers were used for assessment. The results of experiments on the Reuters-21578 and WebKB datasets, measured by the Micro-F1, Macro-F1, and G-mean criteria, show that the proposed method considerably improves classifier performance compared to the other methods.
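Of the evaluation criteria above, G-mean is specific to imbalanced problems: it is the geometric mean of sensitivity and specificity, so a classifier that ignores the minority class scores zero even when its overall accuracy is high. A minimal sketch (function name is illustrative):

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class); labels are 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)
```

For example, a classifier that finds half the positives but makes no false alarms gets sqrt(0.5 * 1.0) ≈ 0.707, while one that predicts the majority class everywhere gets 0.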
 

Full-Text [PDF 7654 kb]
Type of Study: Research | Subject: Paper
Received: 2017/12/1 | Accepted: 2019/02/24 | Published: 2019/06/10 | ePublished: 2019/06/10

References
[1] H. He and E.A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, vol. 21(9), pp. 1263-1284, 2009. [DOI:10.1109/TKDE.2008.239]
[2] P. Yang et al., "Ensemble-based wrapper methods for feature selection," Advances in Knowledge Discovery and Data Mining, Springer, vol. 7818, pp. 544-555, 2013. [DOI:10.1007/978-3-642-37453-1_45]
[3] M. Galar et al., "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42(4), pp. 463-484, 2012. [DOI:10.1109/TSMCC.2011.2161285]
[4] N.V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: special issue on learning from imbalanced data sets," SIGKDD Explorations Newsletter, vol. 6(1), pp. 1-6, 2004. [DOI:10.1145/1007730.1007733]
[5] J. Van Hulse, T.M. Khoshgoftaar, and A. Napolitano, "Experimental perspectives on learning from imbalanced data," in Proceedings of the 24th International Conference on Machine Learning, ACM, Corvallis, Oregon, USA, pp. 935-942, 2007.
[6] H. Ogura, H. Amano, and M. Kondo, "Comparison of metrics for feature selection in imbalanced text classification," Expert Systems with Applications, vol. 38(5), pp. 4978-4989, 2011. [DOI:10.1016/j.eswa.2010.09.153]
[7] S. Maldonado, R. Weber, and F. Famili, "Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines," Information Sciences, pp. 228-246, 2014. [DOI:10.1016/j.ins.2014.07.015]
[8] E. Chen et al., "Exploiting probabilistic topic models to improve text categorization under class imbalance," Information Processing & Management, vol. 47(2), pp. 202-214, 2011. [DOI:10.1016/j.ipm.2010.07.003]
[9] E.L. Iglesias, A. Seara Vieira, and L. Borrajo, "An HMM-based over-sampling technique to improve text classification," Expert Systems with Applications, vol. 40(18), pp. 7184-7192, 2013. [DOI:10.1016/j.eswa.2013.07.036]
[10] R. Barandela et al., "The imbalanced training sample problem: Under or over sampling?" in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2004.
[11] N.V. Chawla et al., "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, pp. 321-357, 2002. [DOI:10.1613/jair.953]
[12] S. Barua et al., "MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning," IEEE Transactions on Knowledge and Data Engineering, vol. 26(2), pp. 405-425, 2014. [DOI:10.1109/TKDE.2012.232]
[13] A. Sun, E.-P. Lim, and Y. Liu, "On strategies for imbalanced text classification using SVM: A comparative study," Decision Support Systems, vol. 48(1), pp. 191-201, 2009. [DOI:10.1016/j.dss.2009.07.011]
[14] C. Sanchez-Hernandez, D.S. Boyd, and G.M. Foody, "One-class classification for mapping a specific land-cover class: SVDD classification of fenland," IEEE Transactions on Geoscience and Remote Sensing, vol. 45(4), pp. 1061-1073, 2007. [DOI:10.1109/TGRS.2006.890414]
[15] S.S. Khan and M.G. Madden, "A survey of recent trends in one class classification," in Irish Conference on Artificial Intelligence and Cognitive Science, Springer, 2009. [DOI:10.1007/978-3-642-17080-5_21]
[16] K.M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proceedings of the 17th International Conference on Machine Learning, 2000. [DOI:10.1007/3-540-45164-1_42]
[17] F. Cheng et al., "Large cost-sensitive margin distribution machine for imbalanced data classification," Neurocomputing, vol. 224, pp. 45-57, 2017. [DOI:10.1016/j.neucom.2016.10.053]
[18] X.-w. Chen and M. Wasikowski, "FAST: a roc-based feature selection metric for small samples and imbalanced data classification problems," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Las Vegas, Nevada, USA, pp. 124-132, 2008. [DOI:10.1145/1401890.1401910]
[19] Y. Xu, "A Comparative Study on Feature Selection in Unbalance Text Classification," in Proceedings of the 2012 Fourth International Symposium on Information Science and Engineering, IEEE Computer Society, pp. 44-47, 2012. [DOI:10.1109/ISISE.2012.19]
[20] T. Lei and L. Huan, "Bias analysis in text classification for highly skewed data," in Fifth IEEE International Conference on Data Mining, 2005.
[21] S. Chua and N. Kulathuramaiyer, "Semantic-based feature selection," Innovations and Advanced Techniques in Systems, Computing Sciences and Software Engineering, Springer Netherlands, pp. 471-476, 2008. [DOI:10.1007/978-1-4020-8735-6_88]
[22] A. Khan, B. Baharudin, and K. Khan, "Efficient Feature Selection and Domain Relevance Term Weighting Method for Document Classification," in 2010 Second International Conference on Computer Engineering and Applications, vol. 2, pp. 398-403, 2010. [DOI:10.1109/ICCEA.2010.228]
[23] R. V. et al., "An Approach for Extraction of Keywords and Weighting Words for Improvement Farsi Documents Classification," JSDP, vol. 14(4), pp. 55-78, 2018. [DOI:10.29252/jsdp.14.4.55]
[24] A.K. Uysal and S. Gunal, "A novel probabilistic feature selection method for text classification," Knowledge-Based Systems, vol. 36, pp. 226-235, 2012. [DOI:10.1016/j.knosys.2012.06.005]
[25] W. Shang et al., "A novel feature selection algorithm for text categorization," Expert Systems with Applications, vol. 33(1), pp. 1-5, 2007. [DOI:10.1016/j.eswa.2006.04.001]
[26] Z. Zheng, X. Wu, and R. Srihari, "Feature Selection for Text Categorization on Imbalanced Data," ACM SIGKDD Explorations Newsletter, 2004. [DOI:10.1145/1007730.1007741]
[27] A. Moayedikia et al., "Feature selection for high dimensional imbalanced class data using harmony search," Engineering Applications of Artificial Intelligence, vol. 57, pp. 38-49, 2017. [DOI:10.1016/j.engappai.2016.10.008]
[28] A. Rehman, K. Javed, and H.A. Babri, "Feature selection based on a normalized difference measure for text classification," Information Processing & Management, vol. 53(2), pp. 473-489, 2017. [DOI:10.1016/j.ipm.2016.12.004]
[29] S. Kansheng et al., "Efficient text classification method based on improved term reduction and term weighting," The Journal of China Universities of Posts and Telecommunications, vol. 18, pp. 131-135, 2011. [DOI:10.1016/S1005-8885(10)60196-3]
[30] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3(Mar), pp. 1289-1305, 2003.
[31] G.S. Yanling and Y. Zhu, "Data imbalance problem in text classification," in Third International Symposium on Information Processing, IEEE, 2010.
[32] P. Bermejo et al., "Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking," Knowledge-Based Systems, vol. 25(1), pp. 35-44, 2012. [DOI:10.1016/j.knosys.2011.01.015]
[33] Z. Zhu, Y.-S. Ong, and M. Dash, "Wrapper-filter feature selection algorithm using a memetic framework," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 37(1), pp. 70-76, 2007. [DOI:10.1109/TSMCB.2006.883267]
[34] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth International Group, 1984.
[35] S. Li et al., "A framework of feature selection methods for text categorization," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 2, Association for Computational Linguistics, 2009. [DOI:10.3115/1690219.1690243]
[36] M. Alibeigi, S. Hashemi, and A. Hamzeh, "DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets," Data & Knowledge Engineering, vols. 81-82, pp. 67-103, 2012. [DOI:10.1016/j.datak.2012.08.001]
[37] H. Jing et al., "A General Framework of Feature Selection for Text Categorization," in Machine Learning and Data Mining in Pattern Recognition: 6th International Conference, MLDM 2009, Leipzig, Germany, July 23-25, 2009, Proceedings, P. Perner, Ed., Springer Berlin Heidelberg, pp. 647-662, 2009. [DOI:10.1007/978-3-642-03070-3_49]

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
