Jafar Pouramini, Dr. Behrouze Minaei-Bidgoli, Dr. Mahdi Esmaeili,
Volume 16, Issue 1 (5-2019)
Abstract
Imbalanced data appear in many areas, such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis and monitoring, and biological data analysis.
Classification algorithms tend to favor the majority class and may even treat minority-class data as outliers. Text is one of the areas where imbalance occurs. The amount of text information in the form of books, reports, and papers is increasing rapidly, and processing it quickly and accurately requires efficient automatic methods. One of the key processing tools is text classification. One problem with text classification is the high dimensionality of the data, which can make learning algorithms impractical. The problem grows when the text data are also imbalanced, because an imbalanced data distribution reduces the performance of classifiers. The solutions proposed for this problem fall into several categories, of which sampling-based and algorithm-based methods are among the most important. Feature selection is also considered a solution to the imbalance problem. In this research, a new one-way feature selection method is presented for imbalanced data classification. The proposed method calculates the indicator rate of a feature from the feature's distribution.
In the proposed method, the documents of a dataset are divided into groups according to whether they contain a feature and whether they belong to the positive class. Based on this division, a new feature selection method is suggested that relies on the following observations.
- If a feature occurs in most positive-class documents, it is a good indicator of the positive class and should receive a high score for this class. This can be expressed as the proportion of positive-class documents that contain the feature. In addition, if most of the documents containing the feature belong to the positive class, the feature should also receive a high score as an indicator of the class. This can be expressed as the proportion of documents containing the feature that belong to the positive class.
- If most of the documents that do not contain a feature are not in the positive class, the feature should receive a high score as a representative of the positive class. Moreover, if most of the documents that are not in the positive class do not contain the feature, the feature should also receive a high score.
The proposed method assigns a score to each feature. Finally, the features are sorted in descending order of score, and the required number of features is selected from the top of the list.
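The scoring and selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract names the four proportions but does not say how they are combined, so the plain average used here is a placeholder, and the function names `indicator_scores` and `select_top_k` as well as the toy documents are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Doc = Tuple[Set[str], bool]  # (terms in the document, belongs to positive class?)


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def indicator_scores(docs: List[Doc]) -> Dict[str, float]:
    """Score every term as an indicator of the positive class."""
    n_pos = sum(1 for _, is_pos in docs if is_pos)
    n_neg = len(docs) - n_pos
    vocabulary = set().union(*(terms for terms, _ in docs))

    scores = {}
    for term in vocabulary:
        tp = sum(1 for terms, is_pos in docs if is_pos and term in terms)
        fp = sum(1 for terms, is_pos in docs if not is_pos and term in terms)
        fn = n_pos - tp  # positive documents without the term
        tn = n_neg - fp  # negative documents without the term
        proportions = [
            _ratio(tp, tp + fn),  # share of positive documents containing the term
            _ratio(tp, tp + fp),  # share of documents with the term that are positive
            _ratio(tn, tn + fn),  # share of documents without the term that are negative
            _ratio(tn, tn + fp),  # share of negative documents lacking the term
        ]
        # ASSUMPTION: a plain average; the paper's actual combination is not given here.
        scores[term] = sum(proportions) / len(proportions)
    return scores


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Sort terms by descending score (ties broken by name) and keep the first k."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Toy corpus: two positive and three negative documents (made-up data).
toy_docs: List[Doc] = [
    ({"wheat", "price"}, True),
    ({"wheat"}, True),
    ({"price", "bank"}, False),
    ({"bank"}, False),
    ({"bank", "rate"}, False),
]
toy_scores = indicator_scores(toy_docs)
top_terms = select_top_k(toy_scores, 2)
```

In the toy corpus, "wheat" occurs in every positive document and in no negative one, so all four proportions equal 1 and it ranks first.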
To evaluate the performance of the proposed method, other feature selection methods such as Gini, DFS, MI, and FAST were also implemented. The C4.5 decision tree and Naive Bayes classifiers were used for the assessment. Test results on the Reuters-21578 and WebKB datasets, measured by the Micro F, Macro F, and G-mean criteria, show that the proposed method improves classifier performance considerably compared with the other methods.
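For readers unfamiliar with the three criteria, the sketch below computes them from per-class confusion counts. It assumes that "F" means the F1 measure and that G-mean is the geometric mean of sensitivity and specificity for one class treated as positive; the `Counts` type and the numbers are invented for illustration and are not results from the paper.

```python
import math
from typing import List, NamedTuple


class Counts(NamedTuple):
    """One-vs-rest confusion counts for a single class."""
    tp: int
    fp: int
    fn: int
    tn: int


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*tp / (2*tp + fp + fn), defined as 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(per_class: List[Counts]) -> float:
    """Average the per-class F1 values, so every class weighs the same."""
    return sum(f1(c.tp, c.fp, c.fn) for c in per_class) / len(per_class)


def micro_f1(per_class: List[Counts]) -> float:
    """Pool the counts over all classes first, so large classes dominate."""
    return f1(
        sum(c.tp for c in per_class),
        sum(c.fp for c in per_class),
        sum(c.fn for c in per_class),
    )


def g_mean(c: Counts) -> float:
    """Geometric mean of sensitivity and specificity for one class."""
    sensitivity = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    specificity = c.tn / (c.tn + c.fp) if (c.tn + c.fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Made-up counts: a large class that is classified well and a small one that is not.
large_class = Counts(tp=8, fp=2, fn=2, tn=88)
small_class = Counts(tp=1, fp=1, fn=3, tn=95)

macro_value = macro_f1([large_class, small_class])
micro_value = micro_f1([large_class, small_class])
small_class_g_mean = g_mean(small_class)
```

With these counts the macro average is lower than the micro average, because the poorly classified small class counts as much as the large one. This is why macro-averaged and G-mean scores are commonly reported for imbalanced data.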
Eng Seyede Mahboobe Mazarei, Dr Jafar Pouramini,
Volume 20, Issue 3 (12-2023)
Abstract
Turnover of key employees is one of the most important concerns of human resource managers (HRMs), because an organization that loses valuable staff also loses the skills and experience gained over the years. Predicting employee turnover therefore helps HRMs hire and retain permanent employees. Data mining methods are among the effective tools for this purpose, and many researchers have worked in this field. This study reviews recently published articles that apply machine learning models to Kaggle human resource (HR) datasets [1-5] and compares them with the proposed model. In [9], the authors selected 11 of the most important features by collecting the features common to previous articles and filtering them with feature review and selection algorithms. After converting non-numerical variables to numerical ones and normalizing the data to the range [0,1], their attrition prediction approach, based on machine, deep, and ensemble learning models, was tested on a large and a medium-sized simulated HR dataset and on a small real dataset of 450 responses. Their approach achieved higher accuracy than previous solutions (0.96, 0.98, and 0.99, respectively, for the three datasets). In 2021, the authors of [3] examined the relationships between features using the Pearson correlation coefficient and selected the 11 features with the highest correlation coefficients. They then used six machine learning algorithms, including Random Forest (RF), Logistic Regression (LR), …, to predict employee turnover; the highest accuracy they obtained was 0.85, for RF. In [1], the authors used two IBM datasets and a database containing HR information from a regional bank in the USA to predict employee turnover.
After cleaning and preprocessing the data, they evaluated the performance of 10 machine learning algorithms, such as Decision Tree (DT), RF, LR, Neural Network, …, using ROC criteria on 10 small, medium, and large subsets of randomly selected, unassigned primary datasets. The average accuracy of the algorithms was 0.83 on the small datasets, 0.81 on the medium datasets, and 0.86 on the large datasets. The authors of [4] ran three main experiments on the IBM Watson simulated datasets to predict employee turnover. The first experiment trained the following machine learning models on the original class-imbalanced dataset: support vector machine with several kernel functions, random forest, and K-nearest neighbour (KNN). The second experiment used the adaptive synthetic (ADASYN) approach to overcome class imbalance and then retrained the same models on the new dataset. Training KNN (K = 3) on the ADASYN-balanced dataset achieved the highest performance, with an F1-score of 0.93. The turnover prediction approach proposed in this study is based on tree-based ensemble learning models and is tested on a large standard simulated HR dataset (hr_data) with 15,000 samples and 10 features and on a medium-sized dataset (IBM) with 1470 samples and 34 features. The employee turnover rate is 16.1% in IBM and 23.8% in hr_data, so both datasets are imbalanced. To balance the data, the random under-sampling technique and its combination with random over-sampling were used, with a ratio of 0.5965 for IBM and 0.6558 for hr_data. In the preprocessing stage, features with zero variance and samples containing missing values were removed. Categorical (non-numeric) values were then converted to binary fields, and all features were scaled to [0,1] using data normalization.
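The balancing and scaling steps can be illustrated with a small, dependency-free Python sketch. Several points are assumptions rather than details from the abstract: the ratio is read as minority count divided by majority count after resampling, the intermediate over-sampling ratio of 0.4 is hypothetical, and the helper names and synthetic rows are invented for the example.

```python
import random
from typing import List, Tuple

Row = Tuple[List[float], int]  # (feature values, label); label 1 = employee left


def split_by_label(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (minority rows with label 1, majority rows with label 0)."""
    minority = [row for row in rows if row[1] == 1]
    majority = [row for row in rows if row[1] == 0]
    return minority, majority


def over_sample(rows: List[Row], ratio: float, rng: random.Random) -> List[Row]:
    """Duplicate random minority rows until minority/majority is about `ratio`."""
    minority, majority = split_by_label(rows)
    target = max(len(minority), round(len(majority) * ratio))
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return majority + minority + extra


def under_sample(rows: List[Row], ratio: float, rng: random.Random) -> List[Row]:
    """Drop random majority rows until minority/majority is about `ratio`."""
    minority, majority = split_by_label(rows)
    target = min(len(majority), round(len(minority) / ratio))
    return minority + rng.sample(majority, target)


def min_max_scale(matrix: List[List[float]]) -> List[List[float]]:
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*matrix))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    return [
        [(value - low) / span if span else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in matrix
    ]


# Synthetic data: 100 majority rows and 20 minority rows.
rng = random.Random(0)
rows: List[Row] = [([float(i)], 0) for i in range(100)]
rows += [([float(i)], 1) for i in range(20)]

# ASSUMPTION: over-sample to a hypothetical ratio of 0.4, then under-sample
# to the 0.5965 ratio quoted for the IBM dataset.
balanced = under_sample(over_sample(rows, 0.4, rng), 0.5965, rng)
n_minority = sum(1 for _, label in balanced if label == 1)
n_majority = sum(1 for _, label in balanced if label == 0)

scaled = min_max_scale([[1.0, 10.0], [3.0, 10.0], [2.0, 20.0]])
```

Resampling is normally applied to the training portion only, so that the held-out data keep the original class distribution.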
To reduce the feature dimensions of the IBM dataset, we used the Non-negative Matrix Factorization (NMF) technique (n_components=17, max_iter=500), initialized with the non-negative singular value decomposition method with zeros filled with the X value. After the data were reviewed and cleaned, six classification algorithms, including KNN (k=1), RF (number of trees = 1500), DT, ExtraTreesClassifier (number of trees = 1000), and Support Vector Classifier, were trained on 70% of the data in the processing stage. The optimal hyperparameter values for the algorithms were set using the RandomizedSearchCV and GridSearchCV techniques. To investigate the effect of balancing and dimensionality reduction on model performance, experiments were performed in three stages (before balancing; after balancing but before dimensionality reduction; after both balancing and dimensionality reduction) on the remaining 30% of the data. The results shown in Tables 2-4 indicate that the proposed model, which uses optimized tree-based ensemble learning algorithms with data balancing and NMF dimensionality reduction, increases the F1-score of turnover prediction. On the hr_data dataset, the best F1-score was 99.52%, obtained with the RandomForest algorithm, and on the IBM HR dataset the best F1-score was 95.82%, obtained with the ExtraTreesClassifier algorithm; both are higher than in previous research. Table 5 compares the results of previous research with those of this study. Because predicting employee attrition is not sufficient without identifying the characteristics that affect it, the most effective features were selected after the models were built and evaluated. This was done with a combined feature selection method that averages the results of a univariate feature selection method called "SelectKBest" and a wrapper feature selection method called "Recursive Feature Elimination" (RFE) with four learning algorithms: RF, DT, ExtraTreesClassifier, and AdaBoost.
SelectKBest combines the chi2 univariate statistical test with the selection of the K features that score highest on the test between each feature and the target variable. In the RFE method, a machine learning algorithm is trained recursively and the least important features are removed after each round, until the number of features reaches the set number (17 features in this article). The performance of the models based on the selected features is shown in Table 6. The most effective characteristics are "age", "daily rate", "over time", "NumCompaniesWorked", and "monthly income".
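One way to combine the two kinds of selectors is to turn each result into a ranking and average the ranks, as in the sketch below. This is an assumed reading of "averaging the results", not the paper's confirmed procedure. The feature names and numbers are placeholders; in practice the scores might come from the `scores_` attribute of a fitted scikit-learn `SelectKBest` and the rankings from the `ranking_` attribute of fitted `RFE` objects.

```python
from typing import Dict, List


def ranks_from_scores(scores: Dict[str, float]) -> Dict[str, int]:
    """Convert importance scores (higher is better) into ranks (1 is best)."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def average_ranks(rankings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average the rank of every feature over all selection methods."""
    names = rankings[0].keys()
    return {
        name: sum(ranking[name] for ranking in rankings) / len(rankings)
        for name in names
    }


def top_features(mean_ranks: Dict[str, float], k: int) -> List[str]:
    """Keep the k features with the lowest (best) average rank."""
    return sorted(mean_ranks, key=lambda name: (mean_ranks[name], name))[:k]


# Placeholder inputs, not values from the paper.
chi2_scores = {"feat_a": 9.0, "feat_b": 4.0, "feat_c": 1.0, "feat_d": 0.5}
rfe_ranking_model_1 = {"feat_a": 1, "feat_b": 3, "feat_c": 2, "feat_d": 4}
rfe_ranking_model_2 = {"feat_a": 2, "feat_b": 1, "feat_c": 3, "feat_d": 4}

mean_ranks = average_ranks(
    [ranks_from_scores(chi2_scores), rfe_ranking_model_1, rfe_ranking_model_2]
)
selected = top_features(mean_ranks, 2)
```

Averaging ranks rather than raw scores keeps the chi2 statistic and the model-based rankings on a common scale, which is the main reason for this design choice in the sketch.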