Volume 17, Issue 4 (February 2021) | JSDP 2021, 17(4): 49-66




Computer Department, Faculty of Engineering, Arak University
Abstract:
The growing use of the Internet, together with phenomena such as sensor networks, has led to an enormous increase in the volume of information. Although this data has many benefits, it also creates problems: more storage space and faster processors are required, and the data must be refined to remove unnecessary items. Data reduction methods provide ways to select useful data from a large amount of duplicate, incomplete, and redundant data, and they are often applied in the pre-processing phase of machine learning algorithms. Three types of data reduction can be applied to data: 1) feature reduction, 2) instance reduction, and 3) discretization of feature values.
In this paper, a new algorithm based on ReliefF is introduced to reduce both instances and features. The proposed algorithm handles nominal and numeric features as well as data sets with missing values. In addition, the number of instances selected from each class is proportional to the prior probability of that class. The proposed algorithm can run in parallel on a multi-core CPU, which decreases the runtime significantly and makes it applicable to big data sets.
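As an illustration of the class-proportional selection mentioned above, the following minimal Python sketch shows one way the number of retained instances per class could be tied to the class prior probabilities. It is our own illustration, not the authors' published code; the function name per_class_quota and the reduction_rate parameter are assumptions. Parallelism on a multi-core CPU could then be obtained, for example, by handing different data partitions to separate worker processes.

from collections import Counter

def per_class_quota(labels, reduction_rate):
    # labels: one class label per instance; reduction_rate: fraction of
    # instances to remove (e.g., 0.7 removes 70% of the data).
    total_keep = round(len(labels) * (1.0 - reduction_rate))
    counts = Counter(labels)
    n = len(labels)
    # Each class keeps a share proportional to its prior probability count/n.
    return {c: round(total_keep * count / n) for c, count in counts.items()}

# Example: six 'a' instances and two 'b' instances with 50% reduction
# -> keep three 'a' instances and one 'b' instance.
print(per_class_quota(['a'] * 6 + ['b'] * 2, reduction_rate=0.5))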
One type of instance reduction is instance selection. Designing an instance selection algorithm raises many issues, such as how to represent the reduced set, how to build a subset of instances, which distance function to choose, how to evaluate the designed reduction algorithm, the size of the reduced data set, and how to identify critical and border instances. There are three ways of creating a subset of instances: 1) incremental, 2) decremental, and 3) batch. In this paper, we use the batch approach for selecting instances. Another important issue is measuring the similarity of instances with a distance function; we use the Jaccard index and the Manhattan distance. Deciding how many instances should be removed, and which ones must be removed or retained, is also important. The goal of this paper is to reduce the size of the stored set of instances while maintaining the quality of the data set, so we remove very similar and non-border instances according to the specified reduction rate.
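The two similarity measures named above can be written in a few lines of Python. This is only a sketch under our own assumptions (the abstract does not state exactly how the two measures are combined in the paper): Manhattan distance is applied to numeric feature vectors and the Jaccard index to collections of nominal values.

def manhattan(x, y):
    # Sum of absolute differences over numeric features.
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_index(s, t):
    # |intersection| / |union| of two collections of nominal values.
    s, t = set(s), set(t)
    union = s | t
    return len(s & t) / len(union) if union else 1.0

# Example usage with made-up instances.
print(manhattan([1.0, 2.0, 3.5], [0.5, 2.0, 5.0]))         # 2.0
print(jaccard_index({'red', 'round'}, {'red', 'square'}))  # 0.333...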
The other type of data reduction performed in our algorithm is feature selection. Feature selection methods fall into three categories: wrapper methods, filter methods, and hybrid methods. Many feature selection algorithms have been introduced, and they can be grouped according to several criteria. For example, based on how they search for the optimal subset of features, they can be divided into exponential search, sequential search, and random search. In addition, the usefulness and relevance of a feature, or of a subset of features, is assessed by evaluation measures based on criteria such as distance, accuracy, consistency, and information.
ReliefF is a feature selection algorithm that computes a weight for each feature and ranks the features accordingly; in this paper, however, ReliefF is used to rank both instances and features. The algorithm works as follows: first, the nearest neighbors of each instance are found; then, based on the evaluation function, a weight is calculated for each instance and each feature; finally, the instances and features with the highest weights are retained and the rest are eliminated. The IFSB-ReliefF (Instance and Feature Selection Based on ReliefF) algorithm is tested on two data sets, and the C4.5 algorithm then classifies the reduced data. Finally, the classification results on the reduced data sets are compared with the results of several instance selection and feature selection algorithms that are run separately.
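To make the weighting step concrete, the following Python sketch performs a simplified Relief-style pass over a small numeric data set. It is an illustration only: the abstract does not give IFSB-ReliefF's exact evaluation function, so the feature update follows the classic Relief rule (nearest hit versus nearest miss), and the instance weight shown here simply favors border instances, in line with the selection goal described above.

import numpy as np

def relief_weights(X, y):
    # X: numeric instance matrix (n x d); y: class labels of length n.
    n, d = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span

    feat_w = np.zeros(d)
    inst_w = np.zeros(n)
    for i in range(n):
        dist = np.abs(Xn - Xn[i]).sum(axis=1)   # Manhattan distance to all
        dist[i] = np.inf                        # ignore the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class
        # Classic Relief update: reward features that separate instance i
        # from its nearest miss while keeping it close to its nearest hit.
        feat_w += (np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])) / n
        # Border instances (nearest miss nearly as close as the nearest hit)
        # receive higher weights and are therefore more likely to be kept.
        inst_w[i] = dist[hit] - dist[miss]
    return feat_w, inst_w

X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 5.2], [0.8, 0.8]])
y = np.array([0, 0, 1, 1])
print(relief_weights(X, y))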
Type of Study: Research | Subject: Paper
Received: 2018/09/27 | Accepted: 2019/09/02 | Published: 2021/02/22 | ePublished: 2021/02/22


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.