Signal and Data Processing


Showing 12 results for Data Mining

A Novel Approach for Exceptional Phenomena Knowledge Detection and Analysis by Data mining

Ms Elahe Hajigol Yazdi, Dr Masood Abessi, Dr Mohammad Bagher Fakhrzad, Dr Hasan Hoseini Nasab,
Volume 14, Issue 1 (6-2017)

Abstract

Learning the logic of exceptions is a substantial challenge in data mining and knowledge discovery. Exceptional phenomena must be detected in databases that contain a large number of normal records and only a few exceptional ones, so it is important to raise the confidence placed in this limited number of exceptional records for learning to be effective. In this study, a new approach based on abnormality theory, information theory, and information granulation theory is presented to detect exceptions and recognize their behavioral patterns. The efficiency of the proposed method was assessed by using it to detect exceptional stocks in the Iranian stock market over a 30-month period and to learn their exceptional behavior. The proposed Enhanced-RISE algorithm (E-RISE), a bottom-up learning approach, was implemented to extract knowledge of normal and exceptional behavior. The extracted knowledge was used to design an expert system, based on the proposed abnormality theory, to predict new exceptions among 6022 stocks. The findings show that the results of the proposed approach in detecting exceptional phenomena are in accordance with experts' opinions.

A New Method to Determine Data Membership and Find Noise and Outlier Data Using Fuzzy Support Vector Machine

Mona Khodagholi, Ardeshir Dolati, Ali Hosseinzadeh, Khashayar Shamsolketabi,
Volume 15, Issue 3 (12-2018)

Abstract

Support Vector Machine (SVM) is one of the important classification techniques and has recently attracted the attention of many researchers. However, the approach has some limitations. Determining the hyperplane that separates classes with the maximum margin and calculating the position of each training point relative to it in a linear SVM classifier can be interpreted as computing class membership with certainty. A question arises here: how far can the certainty of this hyperplane-based classification be trusted? In standard SVM classification, the significance of error is considered equal for all training data, and every datum is assumed to belong to exactly one class. However, in many cases some training data, including outliers and vague data with no defined model, cannot be strictly considered members of a particular class. That is, a training datum may not belong exactly to one class; its features may show 90 percent membership in one class and 10 percent in another. In such cases, by using a fuzzy SVM based on fuzzy logic, we can determine the significance of data in the training phase and finally determine the relative class membership of the data.
The method proposed by Lin and Wang is a basic method that introduces a membership function for the fuzzy support vector machine. Their membership function is based on the distance between a point and the center of its corresponding class.
In this paper, we introduce a new method for assigning membership to training data based on their distance from the separating hyperplane. In this method, SVM classification together with the primary membership of the training data is used to introduce a fuzzy membership function for the whole space using symmetrical triangular fuzzy numbers. Based on this method, the fuzzy membership value of new data is selected with minimum difference from the primary membership of the training data and with the maximum level of fuzzification. In the first step, we define the problem as a nonlinear optimization problem. Then we introduce an efficient algorithm using critical points and obtain the final membership function of the training data. According to the proposed algorithm, data farther from the hyperplane have a higher membership degree. If a datum lies on the hyperplane, it belongs to both classes with the same membership degree. Moreover, by comparing the primary membership degree of the training data with the calculated final distribution, we compute the level of noise in the training data. Finally, we give a numerical example to illustrate the efficiency of the proposed method and to compare its results with those of the Lin and Wang approach.
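
As a rough illustration of the distance-to-center idea attributed above to Lin and Wang, the sketch below assigns each point of one class a membership that falls off linearly with its distance from the class mean. The exact formula is not given in the abstract, so the linear form, the function names, the constant `delta`, and the toy data are all assumptions made for illustration. The sketch does not implement the hyperplane-based method proposed in the paper.

```python
import math


def class_center(points):
    """Return the mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def center_distance_membership(points, delta=1e-6):
    """Membership that decreases linearly with distance from the class center.

    delta is an assumed small positive constant; it keeps the membership of
    the farthest point strictly above zero.
    """
    center = class_center(points)
    dists = [math.dist(p, center) for p in points]
    radius = max(dists)
    return [1.0 - d / (radius + delta) for d in dists]


# Hypothetical training points of one class; the last one is a likely outlier.
positive = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [4.0, 4.0]]
memberships = center_distance_membership(positive)
```

Under this scheme the far-away point receives a membership close to zero, so it would contribute little to the training error of a fuzzy SVM.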

Finding Frequent Patterns in Holy Quran Using Text Mining

Akram Aslani, Mahdi Esmaeili,
Volume 15, Issue 3 (12-2018)

Abstract

The text of the Quran differs from other texts in its exceptional concepts, ideas, and subjects. Recognizing valuable implicit patterns in vast amounts of data has lately captured the attention of many researchers. Text mining provides the means to extract information from texts and can help us reach our objective in this regard. In recent years, text mining of the Quran and the extraction of implicit knowledge from Quranic words have been a focus of researchers. In arguments among Quranic experts, the different sides commonly present varied intellectual and logical evidence, along with some minor, non-integrated evidence, to prove their own theories. More often than not, each side rejects the other’s hypothesis, and in the end they cannot reach consensus on the matter, because they do not share a common basis for their arguments and do not use scientific, logical methods to support their theories. Therefore, applying modern technology to Quranic arguments could help resolve many of the current discrepancies among Quranic researchers that are caused by human error. It can help provide a common ground for their arguments in order to reach a comprehensive understanding.
The method used in this research applies frequent pattern mining algorithms, covering single frequent patterns as well as double and triple frequent patterns, to analyze the Quranic text. In addition, association rules have been evaluated.
The 54226 association rules extracted for Quranic words were evaluated using criteria such as the confidence coefficient, the support coefficient, lift, and Co-efficient criteria. The top 10 rules for each criterion were analyzed and reviewed in the project.
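
The support, confidence, and lift criteria named above have standard definitions, sketched here for a single rule. Treating each verse as a transaction (a set of words) is an assumption, and the English tokens are a made-up toy example, not data from the study.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    transactions is a list of sets; antecedent and consequent are sets of items.
    """
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)
    support = count_ac / n
    confidence = count_ac / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Hypothetical transactions: one set of words per verse.
verses = [
    {"mercy", "lord"},
    {"mercy", "lord", "earth"},
    {"earth", "heaven"},
    {"lord", "heaven"},
]
support, confidence, lift = rule_metrics(verses, {"mercy"}, {"lord"})
```

A lift above 1 indicates that the two words co-occur more often than they would if they were independent.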

Representing a method to identify and contrast with the fraud which is created by robots for developing websites’ traffic ranking

Zahra Abdi, Mojtaba Mazoochi, Mohammadali Pourmina,
Volume 18, Issue 4 (3-2022)

Abstract

With the expansion of the Internet and the Web, communication and information gathering between individuals have shifted from their traditional forms to websites. The World Wide Web also offers businesses a great opportunity to improve their relationships with clients and expand their marketplace in the online world. Businesses use a criterion called traffic ranking to determine their site's popularity and visibility. Traffic ranking measures the number of visitors to a site and, based on these statistics, assigns a ranking to the site. One of the most important challenges in ranking is fake traffic generated by applications called robots. Robots are malicious software components that are used to generate spam, mount distributed denial-of-service attacks, and carry out phishing, identity theft, removal of information, and other illegal activities. There are already several ways to identify and discover robots. According to Doran et al., the identification methods are divided into two categories: offline and real-time. The offline detection methods are divided into three categories: syntactical log analysis, traffic pattern analysis, and analytical learning techniques. The real-time method is performed by a Turing test system. In this research, robots are identified through the offline method, by analyzing and processing web server access logs and using data mining techniques. In this method, the features of each session are first extracted; the sessions are then labeled, using three conditions, into two categories, human and robot. Finally, web robots are detected using a data mining tool. In all previous studies, the features were extracted from each session. For example, in the first studies, Tan and Kumar extracted 25 session features. Later, Bomhardt et al. used 34 features to identify robots. In 2009, Stassopoulou et al. used 6 features extracted from sessions, and so on.
In this research, by contrast, features are extracted from the sessions of a unique user. Experimental results show that the proposed method, by discovering new features and introducing a new condition in session labeling, improves the accuracy of robot identification and, moreover, improves the web traffic ranking compared with previous work.

Detecting Suspicious Card Transactions in unlabeled data of bank Using Outlier Detection Techniques

Seyed Morteza Seyed Rezaie, Ghorban Kheradmandian, Seyed Javad Kazemitabar Amirkolaie,
Volume 19, Issue 3 (12-2022)

Abstract

With the advancement of technology, the use of ATM and credit cards has increased. Cyber fraud and theft are among the threats that result from using these technologies. It is therefore essential to use fraud detection algorithms to prevent fraudulent use of bank cards. Credit card fraud can be thought of as a form of identity theft that consists of unauthorized access to another person's card information for the purpose of charging purchases to the account or removing funds from it. Credit card fraud schemes are divided into two categories: application fraud and account takeover. Application fraud occurs when a credit card account is opened without someone’s permission. Account takeover, on the other hand, occurs when an existing credit card account is hijacked and the criminal obtains enough personal information to modify the account's information. The criminal then reports the card lost or stolen in order to obtain a new card and make unauthorized purchases with it. Data mining, as a technique capable of identifying useful patterns in large amounts of data, is an effective method for detecting fraud in this regard. The main purpose of this paper is to present a new method for detecting outliers in unlabeled data with high accuracy and recall. The method presented in this study is based on a combination of NMF, hierarchical k-means, k-means, and k-nearest neighbors techniques. To evaluate the proposed outlier detection method, several experiments were performed on standard data, comparing it in terms of accuracy and recall with Isolation Forest, k-nearest neighbors, Median kNN, and Average kNN. The dataset used in this paper was provided by a European bank after anonymization and was used in a 2016 Kaggle competition. The results corroborate that the proposed method has higher accuracy and recall than the other algorithms.
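
For orientation, one of the baselines named above, Average kNN, can be sketched in a few lines: each record is scored by its mean distance to its k nearest neighbors, and the highest scores are treated as the most suspicious. This is a minimal sketch of the baseline only, not of the proposed NMF and k-means combination; the feature values and the choice k=2 are hypothetical.

```python
import math


def average_knn_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical transactions as [amount, hour-of-day feature]; the last is unusual.
transactions = [[10.0, 1.0], [11.0, 1.2], [9.5, 0.9], [10.5, 1.1], [250.0, 9.0]]
scores = average_knn_scores(transactions, k=2)
most_suspicious = max(range(len(scores)), key=scores.__getitem__)
```

Because no labels are used, a threshold or a top-n cut on the scores would still have to be chosen separately.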

Verification of unemployment benefits’ claims using Classifier Combination method

Dr Rahim Dehklharghani, Dr Hojat Emami,
Volume 19, Issue 4 (3-2023)

Abstract

Unemployment insurance is one of the most popular insurance types in the modern world. The Social Security Organization is responsible for checking the unemployment benefits of individuals covered by unemployment insurance. Manual evaluation of unemployment claims requires a great deal of time and money. Data mining and machine learning, as two efficient tools for data analysis, can assist the Social Security Organization in automating this process. In this research work, a hybrid supervised learning method is proposed to verify the eligibility of applicants for unemployment benefits. The proposed method takes as input the information of insured individuals and assigns a numeric score to each applicant by analyzing the input data. Claimants are then classified into two groups according to those scores: "Qualified" and "Unqualified". The proposed method includes two hybrid strategies: BSA-SVM and a combination of confidence values. In the BSA-SVM method, the backtracking search algorithm (BSA) is used to estimate the parameters of support vector machines (SVM) and improve classification performance. In the second approach, confidence values extracted from individual classifiers are combined to better classify the input data. Empirical evaluation shows an accuracy of 87% for BSA-SVM and 86% for the second approach.

Combination of Ensemble Data Mining Methods for Detecting Credit Card Fraud Transactions

Dr. Saeid Bakhtiari, Mrs. Zahra Nasiri, Mr. Seyed Mohammad Sadegh Hejazi,
Volume 19, Issue 4 (3-2023)

Abstract

As we know, credit cards speed up everyday life and make it easier for citizens and bank customers. Cardholders can use them anytime and anywhere according to their personal needs, quickly and without hassle, without carrying a lot of cash and with more security than holding cash. Together, these factors make credit cards one of the most popular forms of online banking. This has led to widespread and increasing use for easy payment for purchases made through mobile phones, the Internet, ATMs, and so on. Despite the popularity and ease of payment with credit cards, there are various security problems, which increase day by day. One of the most important and persistent challenges in this field is credit card fraud all around the world. As security issues in credit cards increase, fraudsters are also updating their methods. In general, as a field grows in popularity, more fraudsters are attracted to it, and this is where credit card security comes into play. Naturally, this worries banks and their customers around the world. Meanwhile, financial information acts as the main factor in financial market transactions. For this reason, many researchers have prioritized solutions for detecting, predicting, and preventing credit card fraud in their work and have offered essential suggestions that have met with significant success. One practical and successful family of methods is data mining and machine learning. In these methods, one of the most critical parameters in fraud prediction and detection is the accuracy of fraudulent transaction detection. This research examines gradient boosting methods, which are a subset of ensemble learning and machine learning methods. By combining these methods, we can identify credit card fraud, reduce error rates, and improve the detection process, which in turn increases efficiency and accuracy.
This study compared the two algorithms LightGBM and XGBoost, merged them using simple and weighted averaging techniques, and then evaluated the models using AUC, Recall, F1-score, Precision, and Accuracy. After applying feature engineering and using the weighted average approach, the proposed model achieved 95.08, 90.57, 89.35, 88.28, and 99.27, respectively, for these evaluation metrics. As a result, feature engineering and weighted averaging significantly improved prediction and detection accuracy.
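
The simple and weighted averaging step can be pictured as blending the fraud probabilities produced by the two models. The sketch below assumes those probabilities are already available as plain lists; the probability values, the weight of 0.25, and the 0.5 decision threshold are hypothetical and are not taken from the study.

```python
def weighted_average(prob_a, prob_b, weight_a=0.5):
    """Blend two lists of predicted probabilities.

    weight_a = 0.5 gives the simple average; other values weight model A
    more or less heavily than model B.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]


# Hypothetical fraud probabilities from two already-trained models.
lightgbm_probs = [0.10, 0.80, 0.40]
xgboost_probs = [0.20, 0.60, 0.70]

blended = weighted_average(lightgbm_probs, xgboost_probs, weight_a=0.25)
labels = [int(p >= 0.5) for p in blended]  # 1 = flagged as fraud
```

In practice the weight would be tuned on validation data rather than fixed by hand.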

Outlier Detection on Data Streams Using a QLattice-based Model and Online Learning

Sahar Fardin, Mahdi Hashemzadeh,
Volume 20, Issue 2 (9-2023)

Abstract

With the advancement of computer science, the dramatic developments in data mining, and their increasing applications, the identification of outlier or anomalous data has become one of the most important research topics. In most applications, outlier data contain beneficial information that can be used to gain useful knowledge. Today, a large number of applications operate on data streams, and in the vast majority of them the discovery of outlier/anomalous data is very important and in some cases vital. Anomaly detection is an important means of detecting fraud, network intrusions, abnormal behaviors in monitoring systems, and other rare events that are of great importance but often difficult to identify. Most existing efficient outlier detection algorithms have been designed for static data. Outlier detection is more challenging in data streams, where data are generated continuously and have special properties such as infinity and transience. In this research, we introduce an approach based on the QLattice classification model, which works based on quantum computing and performs better in the intended application than other classification methods. Given that the data distribution may change over time in streaming data, an online incremental learning scheme is also applied in the proposed method. Considering the unlimited data flow and limited processing memory, detection is applied to a window of data that is constantly updated with data sampled from previous windows. A function that uses random sampling is also designed to address the problem of data imbalance. The results of experiments on benchmark datasets show that the proposed approach performs better than other methods.
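
One way to picture a window that is "updated with data sampled from previous windows" is sketched below: a fixed-size window keeps a random sample of the previous window and fills the rest with the newest stream items. The abstract does not specify the sampling rule, so the `keep_fraction` parameter, the function name, and the integer stream are assumptions for illustration only.

```python
import random


def update_window(previous_window, new_batch, window_size, keep_fraction=0.5, seed=0):
    """Build the next window from a sample of the old one plus the newest items.

    keep_fraction is an assumed parameter: the share of the window reserved
    for items sampled at random from the previous window.
    """
    rng = random.Random(seed)
    n_keep = min(len(previous_window), int(window_size * keep_fraction))
    kept = rng.sample(previous_window, n_keep)
    n_new = window_size - n_keep
    newest = list(new_batch)[-n_new:] if n_new > 0 else []
    return kept + newest


# Hypothetical stream items represented by integers.
window = list(range(10))
batch = list(range(10, 20))
updated = update_window(window, batch, window_size=10, keep_fraction=0.5)
```

Keeping part of the old window bounds memory use while still letting the detector see some older behavior alongside the most recent data.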

Predicting employee turnover using tree-based ensemble learning algorithms

Eng Seyede Mahboobe Mazarei, Dr Jafar Pouramini,
Volume 20, Issue 3 (12-2023)

Abstract

Turnover of key employees is one of the most important concerns of human resource managers (HRMs), because an organization that loses valuable staff also loses the skills and experience gained over the years. Predicting employee turnover therefore helps HRMs hire and retain permanent employees. One effective tool in this regard is the use of data mining methods, and many researchers have worked in this field. This study reviews recently published articles based on machine learning models that use Kaggle human resource (HR) databases [1-5], in order to compare them with the proposed models. In [9], the authors selected 11 of the most important features by collecting common features from previous articles and filtering them using feature review and selection algorithms. After converting non-numerical variables to numerical ones and normalizing the data to the range [0,1], their attrition prediction approach, based on machine, deep, and ensemble learning models, was tested on a large and a medium-sized simulated HR dataset and then on a small real dataset of 450 responses. That approach achieved higher accuracy than previous solutions (0.96, 0.98, and 0.99, respectively, for the three datasets). In 2021, the authors of [3] examined the relationship between features using the Pearson correlation coefficient and selected the 11 features with the highest correlation coefficients. They then used six machine learning algorithms, including Random Forest (RF), Logistic Regression (LR), …, to predict employee turnover; the highest accuracy they obtained was 0.85, for RF. In [1], the authors used two IBM datasets and a database containing HR information from a regional bank in the USA to predict employee turnover.
After cleaning and preprocessing the data, the performance of 10 machine learning algorithms, such as Decision Tree (DT), RF, LR, Neural Network, …, was evaluated using ROC criteria on 10 small, medium, and large subsets of randomly selected, unassigned primary datasets. The average accuracy of the algorithms was 0.83 on small datasets, 0.81 on medium datasets, and 0.86 on large datasets. The authors of [4] ran three main experiments on IBM Watson simulated datasets to predict employee turnover. The first experiment involved training on the original class-imbalanced dataset with the following machine learning models: support vector machine with several kernel functions, random forest, and K-nearest neighbour (KNN). The second experiment used the adaptive synthetic (ADASYN) approach to overcome class imbalance and then retrained the abovementioned models on the new dataset. As a result, training KNN (K = 3) on an ADASYN-balanced dataset achieved the highest performance, with an F1-score of 0.93. The turnover prediction approach in this study is based on tree-based ensemble learning models and is tested on a large standard simulated HR dataset (hr_data), with 15,000 samples and 10 features, and a medium-sized dataset (IBM), with 1470 samples and 34 features. The employee turnover rate is 16.1% in the IBM dataset and 23.8% in hr_data, so the datasets are unbalanced. To balance the data, the random under-sampling technique and its combination with random over-sampling were used, with a ratio of 0.5965 for the IBM dataset and 0.6558 for hr_data. In the preprocessing stage, features with zero variance and samples containing missing values were also removed. Categorical (non-numeric) values were then converted to binary fields, and all features were scaled to [0,1] using data normalization.
To reduce the feature dimensions in the IBM dataset, we used the "Non-negative Matrix Factorization" (NMF) technique (n_components=17, max_iter=500); for initialization, the non-negative singular value analysis method with zeros filled with X value was used. After reviewing and cleaning the data, in the processing stage, six classification algorithms, including KNN (k=1), RF (number of trees = 1500), DT, ExtraTreesClassifier (number of trees = 1000), and Support Vector Classifier, were trained on 70% of the data. The optimal hyperparameter values for the algorithms were set using the RandomizedSearchCV and GridSearchCV techniques. To investigate the effect of balancing and dimensionality reduction on model performance, experiments were performed in 3 stages (before balancing, after balancing but before dimensionality reduction, and after balancing and dimensionality reduction) on the remaining 30% of the data. The results shown in Table (2-4) indicate that the proposed model, which uses optimized tree-based ensemble learning algorithms with data balancing and the NMF dimensionality reduction method, increases the F1-score of turnover prediction. On the hr_data dataset, the best F1-score, 99.52%, was obtained with the RandomForest algorithm, and on the IBM HR dataset, the best F1-score, 95.82%, was obtained with the ExtraTreesClassifier algorithm, which is higher than in previous research. Table 5 compares the results of previous research with this research. Because predicting employee attrition is not enough without finding the characteristics that affect it, after building the models and evaluating their performance, the most effective features were selected using a combined feature selection method that averages the results of the single-variable feature selection method "SelectKBest" and the wrapper feature selection method "Recursive feature elimination" (RFE) with four learning algorithms: RF, DT, ExtraTreesClassifier, and AdaBoost.
SelectKBest combines the chi2 univariate statistical test with the selection of K features based on the statistical result between the features and the target variable. In the RFE method, machine learning algorithms are used to remove the least important features through recursive training until the number of features reaches the set number (17 features in this article). The performance results of the models based on the selected features are shown in Table 6. The most effective characteristics are "age", "daily rate", "over time", "NumCompaniesWorked", and "monthly income".
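
Two of the preprocessing steps described in this abstract, converting categorical values to binary fields and scaling features to [0,1], can be sketched without any library. The helper names and the three-row toy data are hypothetical. Returning zeros for a constant column is a choice made for this sketch; the abstract states that zero-variance features were removed instead.

```python
def min_max_scale(values):
    """Scale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return zeros here (the study drops such features).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Convert a categorical column into one binary field per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical rows for two columns of an HR dataset.
monthly_income = [2000, 5000, 8000]
over_time = ["Yes", "No", "Yes"]

scaled_income = min_max_scale(monthly_income)
over_time_fields = one_hot(over_time)
```

Scaling to [0,1] also keeps every value non-negative, which both NMF and the chi2 test require.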

A location recommender in social networks based on location based on deep learning

Mr Mohammad Rastgoo, Phd Hamid Reza Ghaffari,
Volume 21, Issue 4 (3-2025)

Abstract

The potential of social networks to extract valuable insights into user behavior has become a focal point of research. With the proliferation of social media platforms, people are increasingly sharing their experiences online. This wealth of user-generated data provides unique opportunities to understand movement patterns and predict future behavior. Location-based social networks like Foursquare exemplify this, allowing users to check in at various locations and enabling researchers to analyze these data points. By analyzing the data collected from these platforms, we can uncover patterns in user behavior, such as frequently visited locations and the factors influencing these choices. This information can be invaluable for businesses and urban planners. To improve the accuracy of predicting a user's next location, this study focuses on identifying the most influential friends or individuals in a user's social network. Factors such as the strength of these relationships, historical visit data, and temporal-spatial characteristics are considered. Additionally, the study emphasizes the importance of data quality, focusing on locations that have been visited more than 100 times to ensure reliability.

A key aspect of this research is understanding the influence of social connections on individual behavior. By analyzing the overlap in visited locations between friends, the study aims to identify the most influential friends for each user. These influential friends are then used to predict the user's next location.
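
A simple way to quantify the overlap in visited locations between a user and each friend is the Jaccard index, sketched below. The abstract does not say which overlap measure the study uses, so the Jaccard index, the function names, and the example users are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of location identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_influential_friends(user_visits, friend_visits, top_n=1):
    """Rank friends by the overlap of their visited locations with the user's."""
    ranked = sorted(
        friend_visits,
        key=lambda friend: jaccard(user_visits, friend_visits[friend]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical check-in histories.
user = {"cafe", "gym", "library"}
friends = {
    "ali": {"cafe", "gym"},
    "sara": {"museum"},
    "reza": {"cafe", "park", "mall"},
}
top = most_influential_friends(user, friends, top_n=2)
```

The check-in sequences of the top-ranked friends could then be supplied as extra input to the sequence models.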

The proposed method employs machine learning techniques, specifically RandomForest and recurrent neural networks (LSTM, RNN, and GRU), to predict user behavior. RandomForest is used to analyze the data and identify the most significant features, while recurrent neural networks are employed to model the sequential nature of user behavior. Among these, LSTM achieved the highest accuracy of 71% in predicting users' next locations. This research demonstrates that combining artificial intelligence with spatial-temporal data can provide profound insights into human behavior in urban and digital environments. By understanding these patterns, businesses can tailor their offerings to individual customers, and urban planners can design more efficient and user-friendly cities.

A Novel Privacy-Preserving Distributed Data Publishing Protocol Based on Probabilistic Models

Mr. Elyas Mosayebi, Dr. Reza Ebrahimi Atani,
Volume 22, Issue 3 (12-2025)

Abstract

In the era of digital transformation, government agencies and corporations increasingly rely on electronic services, generating vast volumes of sensitive data stored in distributed databases. While these records hold immense potential for knowledge discovery through data mining, their publication or sharing raises critical privacy concerns, particularly when sensitive individual information is at risk. Traditional Privacy-Preserving Distributed Data Publishing (PPDDP) methods rely heavily on Trusted Third-Party (TTP) intermediaries and Secure Multi-Party Computation (SMC), which introduce systemic vulnerabilities such as communication bottlenecks, synchronization failures, insider attacks, and inherent distrust in centralized entities. In healthcare analytics, hospitals leverage patient data to enhance diagnostic precision, optimize clinical workflows, and advance preventive and precision medicine. Yet, reliance on siloed datasets from individual institutions often restricts model generalizability and impedes comprehensive insights into health outcomes. Patient health is a multidimensional construct influenced not only by genetic and biological factors but also by behavioral patterns and socio-environmental determinants. Cross-institutional collaboration integrating diverse datasets from geographically distributed sources is essential to develop robust analytical models. However, such collaboration raises critical privacy concerns, as centralized aggregation of sensitive data risks exposure to breaches or misuse. Our probabilistic framework for privacy-preserving distributed data publishing directly addresses this challenge. By eliminating dependencies on trusted third parties and secure multi-party computation, our approach enables secure, decentralized integration of heterogeneous healthcare data. 
Through uncertainty-aware probabilistic anonymization and adaptive noise injection, the framework ensures compliance with stringent privacy regulations (e.g., GDPR, CPRA, HIPAA) while preserving the analytical utility required for accurate, actionable health outcome predictions. This balance of utility and privacy empowers researchers to harness the full potential of distributed datasets without compromising individual confidentiality, ultimately fostering innovation in precision medicine and population health management. This paper introduces a novel probabilistic framework for privacy preservation in distributed environments, eliminating dependencies on TTP and SMC. Unlike existing approaches, this method leverages uncertainty-aware probabilistic models to dynamically anonymize and perturb data across distributed nodes while preserving global data utility. First, a survey of privacy-preserving data publishing methods is presented, and the pros and cons of the techniques are discussed. We then present the model and its implementation details. The results of the security evaluations show that the presented method achieves a better balance between privacy and the accuracy of distributed data, using the probabilistic model without needing a Trusted Third Party or Secure Multi-Party Computation.

Predicting glioma brain tumor grades using ensemble machine learning

Dr. Hojjat Emami, Dr. Babak Azarnavid, Dr. Mohsen Abdolhosseinzadeh,
Volume 22, Issue 4 (3-2026)

Abstract

Gliomas, which are aggressive and progressive brain tumors, greatly complicate the diagnosis and treatment of patients. While recent machine learning models have provided encouraging results in glioma diagnosis and grading, the topic remains open, and more effort is needed. Existing models, despite encouraging results, often fall short of the ideal diagnostic state, highlighting the need for further research to develop robust and high-performing predictive models.
This study introduces an optimized ensemble machine learning (EML) model designed to maximize classification (grading) performance and mitigate the pervasive issue of overfitting in glioma grading. Our approach employs a two-layer architecture that synergistically combines diverse weak and base learners. In the first layer, a diverse set of learners, including support vector machine (SVM), categorical boosting (CatBoost), extremely randomized trees (ERT), and random forest (RF), is integrated. This initial ensemble aims to capture a broad spectrum of grading patterns and enhance the overall accuracy by leveraging the complementary strengths of each base model. The outputs from this first layer, representing diversified classification probabilities, are then fed into a second-layer logistic regression (LR) model. This layer refines the predictions, performing the ultimate classification while explicitly addressing and eliminating the overfitting problem, thereby promoting better generalization to unseen data.
To rigorously evaluate the performance of the proposed EML model, a comprehensive comparison was conducted against its constituent base learners and counterpart machine learning models. All models were assessed using a standard, publicly available glioma dataset. To prevent overfitting, examine the robustness of the models, and evaluate them fairly, a 5-fold cross-validation strategy was used in the experiments. The effectiveness of the models was measured using four performance metrics: accuracy, recall, precision, and F1-score.

The experimental results demonstrate the superior performance of the proposed EML model. Across all evaluated metrics, our model consistently outperformed the individual base learners and other benchmarked algorithms, securing the top rank in terms of accuracy. Specifically, the LR model operating on the first-layer ensemble predictions proved highly effective in both enhancing accuracy and preventing overfitting. Following our proposed model, the standalone LR and RF models demonstrated commendable performance, ranking second and third, respectively.
The findings of this study underscore the significant potential of an optimized EML model for advancing the field of glioma tumor grading. The proposed model generated promising results and mitigated overfitting by integrating diverse base learners and using an LR model as a meta-model. The results reveal that the proposed model is a reliable and robust tool that can aid clinical specialists in effectively diagnosing and classifying gliomas, ultimately paving the way for improved patient satisfaction.
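
The two-layer idea described in this abstract, in which first-layer class probabilities are fed to a logistic regression meta-model, can be sketched in plain Python. The sketch assumes the base learners are already trained and represents them only by hypothetical probability outputs; the learning rate, epoch count, and data are illustrative. A fuller stacking setup would typically build the meta-features from out-of-fold predictions, which this sketch omits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression meta-model with batch gradient descent."""
    n, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the meta-model's probability is at least 0.5, else 0."""
    return int(sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5)


# Hypothetical first-layer outputs: each row holds the probability of the
# higher grade from four base learners (for example SVM, CatBoost, ERT, RF).
first_layer_probs = [
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.9, 0.6, 0.7],
    [0.2, 0.1, 0.3, 0.2],
    [0.1, 0.3, 0.2, 0.1],
]
grades = [1, 1, 0, 0]

weights, bias = fit_logistic_regression(first_layer_probs, grades)
predictions = [predict(weights, bias, row) for row in first_layer_probs]
```

The learned weights show how much the meta-model relies on each base learner, which is one reason a simple linear model is a common choice for the second layer.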
