Presenting a new method for mixed data clustering based on the number of similar features

Daneshpour, Negin

doi:10.61186/jsdp.21.1.39

Volume 21, Issue 1 (6-2024) JSDP 2024, 21(1): 39-52

Daneshpour N. Presenting a new method for mixed data clustering based on the number of similar features. JSDP 2024; 21 (1) : 4
URL: http://jsdp.rcisp.ac.ir/article-1-1329-en.html

Negin Daneshpour ^*

Abstract:

Clustering groups a set of data samples according to their degree of similarity. The data to be clustered may be purely numerical or a mixture of numerical and non-numerical (nominal) values. Finding similarities and measuring distances is one of the challenges of mixed data clustering. Related work determines the degree of similarity from the distance value alone and selects the cluster based on that value. Clustering in this way has not produced very accurate results, especially for mixed data.
In this paper, we take the parameter "number of similar features" into account when calculating the degree of similarity and determining the distance. When a sample is assigned to a cluster and the distances to several cluster centers are equal or close, the number of features the sample has in common with each center determines the appropriate cluster. That is, the "number of similar features" is considered in addition to the distance when selecting the cluster. The underlying idea is that when several cluster centers are at similar distances from a data object, it is better to choose the cluster center that has more features similar to the data object. Logically, and in line with the proposed algorithm, similarity should hold across a larger number of features rather than being concentrated in a few highly similar features.
The "number of similar features" parameter has a specific definition and is obtained using a suitable threshold: if the distance between two feature values is less than the threshold, the two are counted as similar features.
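As a minimal sketch of this definition, the Python snippet below counts similar features from a list of per-feature distances. The function name `count_similar_features`, the sample distances, and the threshold value 0.2 are illustrative assumptions; the abstract does not state the threshold used in the paper.

```python
from typing import Sequence


def count_similar_features(feature_distances: Sequence[float], threshold: float) -> int:
    """Count the features whose per-feature distance is strictly below the threshold."""
    return sum(1 for d in feature_distances if d < threshold)


# Hypothetical per-feature distances between a sample and a cluster center.
example_distances = [0.05, 0.40, 0.0, 1.0]
print(count_similar_features(example_distances, threshold=0.2))
```

With these assumed values, only the first and third features fall below the threshold and are counted as similar.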
To calculate distances, the algorithm uses the normalized numerical difference for numerical features and the Hamming distance for non-numerical features. As in many methods, the initial cluster centers are chosen randomly, and in subsequent iterations of the algorithm more appropriate samples are selected as the cluster centers. The algorithm is compared with 5 other algorithms on 5 datasets.
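The distance computation and the cluster-selection rule described in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: summing the per-feature distances into a total, the `tolerance` that decides when two distances count as "close", the `threshold` value, and all function names are hypothetical choices that the abstract does not specify.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str]


def feature_distances(a: Sequence[Value], b: Sequence[Value],
                      ranges: Sequence[Optional[float]]) -> List[float]:
    """Per-feature distances between two samples.

    ranges[i] is the value range of numeric feature i, or None for a nominal
    feature. Numeric features use the normalized absolute difference; nominal
    features use the Hamming distance (0 if equal, 1 otherwise).
    """
    out = []
    for x, y, r in zip(a, b, ranges):
        if r is None:
            out.append(0.0 if x == y else 1.0)
        else:
            out.append(abs(x - y) / r if r > 0 else 0.0)
    return out


def assign_cluster(sample: Sequence[Value], centers: Sequence[Sequence[Value]],
                   ranges: Sequence[Optional[float]],
                   threshold: float = 0.2, tolerance: float = 0.1) -> int:
    """Return the index of the chosen center.

    Centers whose total distance is within `tolerance` of the smallest one are
    treated as near-ties (an assumed rule); among them, the center with the
    most similar features wins, and any remaining tie goes to the smaller distance.
    """
    scored = []
    for idx, center in enumerate(centers):
        d = feature_distances(sample, center, ranges)
        total = sum(d)  # assumed aggregation of per-feature distances
        similar = sum(1 for v in d if v < threshold)
        scored.append((total, similar, idx))
    best = min(s[0] for s in scored)
    candidates = [s for s in scored if s[0] - best <= tolerance]
    return max(candidates, key=lambda s: (s[1], -s[0]))[2]


# Hypothetical mixed sample: (numeric, nominal, nominal, numeric).
ranges = (10.0, None, None, 40.0)
sample = (5.0, "red", "small", 20.0)
centers = [(5.0, "blue", "small", 20.0), (7.5, "red", "small", 48.0)]
print(assign_cluster(sample, centers, ranges))
```

In this made-up example the second center is slightly closer in total distance, but the first center matches the sample on three features instead of two, so the first center is selected. Setting `tolerance=0.0` reduces the rule to a plain nearest-center assignment.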
Three criteria are used to evaluate the results: Accuracy, RI, and F-Measure. According to the test results, on the mixed and integer datasets the algorithm performs at least two percent better than two of the algorithms and one percent better than another. On another dataset, the proposed algorithm achieved accuracy equal to, or close to one percent better than, that of the best competing algorithm. On the last dataset, the proposed algorithm ranked second among 5 algorithms. Overall, the proposed algorithm ranked first in most of the results and second in the remaining cases among the five tested algorithms.
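To illustrate one of the evaluation criteria, the Python sketch below computes RI, assumed here to mean the Rand index: the fraction of sample pairs on which the predicted clustering and the reference labels agree (both place the pair together, or both place it apart). The function name and the example labels are hypothetical and do not come from the paper's experiments.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Fraction of sample pairs on which two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("at least two samples are required")
    agreements = sum(
        1 for i, j in pairs
        if (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
    )
    return agreements / len(pairs)


# Made-up labels: the second clustering wrongly merges one sample into cluster 0.
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))
```

Because the index compares pairs rather than label names, relabeling the clusters (for example, swapping 0 and 1) does not change the score.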

Article number: 4

Keywords: Clustering, Mixed data, Distance of values, Similarity of values, Cluster Center.

Type of Study: Research | Subject: Paper
Received: 2022/08/3 | Accepted: 2024/02/25 | Published: 2024/08/3 | ePublished: 2024/08/3

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
