Improving Imbalanced Data Classification Accuracy by using Fuzzy Similarity Measure and Subtractive Clustering

Yasrebi Naeini, Ehsan; Hatami, Mahla

doi:10.52547/jsdp.19.2.27

Volume 19, Issue 2 (9-2022) JSDP 2022, 19(2): 27-38

Yasrebi Naeini E, Hatami M. Improving Imbalanced Data Classification Accuracy by using Fuzzy Similarity Measure and Subtractive Clustering. JSDP 2022; 19(2): 3
URL: http://jsdp.rcisp.ac.ir/article-1-1010-en.html

Ehsan Yasrebi Naeini*, Mahla Hatami

University of Torbat-e-Heydariyeh

Abstract:

Class imbalance, in which the classes of a data set contain very different numbers of samples, is one of the major challenges in classification. In a two-class data set, the distribution is imbalanced when one class (the majority) has a large number of samples while the other (the minority) is represented by only a few. Methods for addressing this problem generally fall into two categories: under-sampling and over-sampling. This research focuses on under-sampling, analyzes its advantages in terms of the efficiency of classifying imbalanced data, and proposes a method for sampling the majority class using subtractive clustering and a fuzzy similarity measure. First, subtractive clustering is applied to cluster the majority class. Then the samples in each cluster are ranked using the fuzzy similarity measure, and appropriate samples are selected based on these rankings. The selected samples, together with the minority class, form the final data set. The method is implemented in MATLAB, the results are evaluated using the AUC criterion, and they are analyzed with standard statistical tools. The experimental results show that the proposed method is superior to other under-sampling methods.
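The pipeline described in the abstract (cluster the majority class, rank the samples in each cluster, keep the top-ranked ones, then merge with the minority class) can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' MATLAB implementation. The function names `subtractive_clustering` and `undersample_majority`, the parameters `ra`, `stop_ratio`, and `n_keep`, and the synthetic data are all hypothetical. The abstract does not state which fuzzy similarity measure is used, so a Gaussian membership to the cluster center stands in for it here, and the stopping rule is a simplified version of the one in Chiu's subtractive clustering.

```python
import numpy as np


def subtractive_clustering(X, ra=0.5, stop_ratio=0.15):
    """Simplified subtractive clustering; returns center indices and scaled data."""
    span = X.max(axis=0) - X.min(axis=0)
    Z = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0)  # scale to unit hypercube
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    alpha, beta = 4.0 / ra**2, 4.0 / (1.5 * ra) ** 2
    potential = np.exp(-alpha * d2).sum(axis=1)  # density-like potential per sample
    first_peak = potential.max()
    centers = []
    while len(centers) < len(Z):
        k = int(potential.argmax())
        # Assumption: stop once the best remaining potential is small
        # relative to the first peak (simplified accept/reject rule).
        if centers and potential[k] < stop_ratio * first_peak:
            break
        centers.append(k)
        # Subtract the influence of the new center from every sample.
        potential = potential - potential[k] * np.exp(-beta * d2[:, k])
    return centers, Z


def undersample_majority(X_maj, n_keep, ra=0.5):
    """Keep roughly n_keep majority samples, ranked per cluster by similarity."""
    centers, Z = subtractive_clustering(X_maj, ra)
    d2 = ((Z[:, None, :] - Z[centers][None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)  # nearest center for each sample
    # Assumption: Gaussian membership to the own cluster center is used
    # as a stand-in for the paper's fuzzy similarity measure.
    similarity = np.exp(-4.0 / ra**2 * d2[np.arange(len(Z)), labels])
    keep = []
    for c in range(len(centers)):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        # Each cluster contributes in proportion to its size.
        quota = max(1, round(n_keep * len(members) / len(Z)))
        ranked = members[np.argsort(-similarity[members])]  # most similar first
        keep.extend(ranked[:quota].tolist())
    return np.array(sorted(keep))


# Synthetic two-class example (hypothetical data): 200 majority, 20 minority.
rng = np.random.default_rng(0)
X_maj = np.vstack([rng.normal(0.0, 0.3, (100, 2)), rng.normal(3.0, 0.3, (100, 2))])
X_min = rng.normal(1.5, 0.3, (20, 2))

keep = undersample_majority(X_maj, n_keep=len(X_min))
X_bal = np.vstack([X_maj[keep], X_min])  # final, roughly balanced data set
y_bal = np.r_[np.zeros(len(keep), dtype=int), np.ones(len(X_min), dtype=int)]
```

Because each cluster's quota is rounded separately, the number of retained majority samples is only approximately `n_keep`. A classifier trained on `X_bal` and `y_bal` could then be scored with AUC, as the abstract describes.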

Article number: 3

Keywords: Imbalanced data, Fuzzy similarity measure, Under-sampling, Subtractive clustering

Type of Study: Research | Subject: Paper
Received: 2019/12/2 | Accepted: 2020/08/18 | Published: 2022/09/30 | ePublished: 2022/09/30

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing