Volume 20, Issue 1 (6-2023) | JSDP 2023, 20(1): 39-58



Safari S, Afsari F. Ensembling semi-supervised p-spectral clustering for high dimensional data. JSDP 2023; 20(1): 3
URL: http://jsdp.rcisp.ac.ir/article-1-1198-en.html
Abstract:
Given the ever-increasing volume of data and the need for its detailed analysis, clustering problems, which detect the hidden patterns lying in the data, remain of great importance. At the same time, traditional methods face many limitations when clustering high-dimensional data. In this study, a semi-supervised ensemble clustering method is proposed for high-dimensional medical data. The method assumes that only a small amount of prior knowledge is available, in the form of pairwise similarity and dissimilarity constraints.

First, the pairwise constraints are generalized to the rest of the data using the transitive property. The feature space is then randomly divided into a number of sub-spaces with unequal numbers of features, and semi-supervised spectral clustering is performed on each sub-space independently. To increase the accuracy of this step, the spectral clustering is based on the graph p-Laplacian, a nonlinear generalization of the graph Laplacian. Each resulting clustering solution is compared against the pairwise constraints, and a degree of confidence is assigned to it according to the level of agreement. Based on these degrees of confidence, an ensemble adjacency matrix is formed that aggregates the clustering solutions of all sub-spaces; this matrix is fed to a final spectral clustering algorithm to obtain the clustering solution for the whole data.

Since the sub-spaces are generated randomly with unequal numbers of features, the clustering results are strongly influenced by these initial choices, and it is therefore necessary to find an optimal set of sub-spaces. To this end, a search algorithm is designed. The search is initialized by forming several sets, each called an environment, consisting of a number of sub-spaces; an optimal environment is one that yields the best clustering results. The search algorithm uses three search operators that explore the environments, and consequently their sub-spaces, both locally and globally: these operators combine two environments and/or replace an environment with a newly generated one, each trying to find the best possible environment either in the entire search space or in a local region.
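To make the pipeline concrete, the following is a minimal sketch under stated assumptions: scikit-learn's standard spectral clustering (the p = 2 case) stands in for the p-Laplacian solver, constraints are kept as explicit must-link/cannot-link index pairs, and all names (`transitive_closure`, `confidence`, `ensemble_p_spectral`) and parameter choices are illustrative rather than the authors' implementation.

```python
# Minimal sketch of the described pipeline; NOT the authors' code. Standard
# spectral clustering (p = 2) stands in here for the p-Laplacian solver of [16].
import numpy as np
from sklearn.cluster import SpectralClustering

def transitive_closure(must_links, n):
    """Generalize must-link constraints via the transitive property (union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in must_links:
        parent[find(i)] = find(j)
    comp = [find(i) for i in range(n)]
    # every pair inside a connected component becomes an implied must-link
    return [(i, j) for i in range(n) for j in range(i + 1, n) if comp[i] == comp[j]]

def confidence(labels, must_links, cannot_links):
    """Degree of confidence = fraction of pairwise constraints that are satisfied."""
    ok = sum(labels[i] == labels[j] for i, j in must_links)
    ok += sum(labels[i] != labels[j] for i, j in cannot_links)
    return ok / max(1, len(must_links) + len(cannot_links))

def ensemble_p_spectral(X, must_links, cannot_links, k, n_subspaces=20, rate=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    must_links = transitive_closure(must_links, n)
    A = np.zeros((n, n))                               # ensemble adjacency matrix
    for _ in range(n_subspaces):
        # random sub-space with an unequal (randomly drawn) number of features
        m = int(rng.integers(1, max(2, int(d * rate) + 1)))
        cols = rng.choice(d, size=m, replace=False)
        labels = SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                    random_state=0).fit_predict(X[:, cols])
        w = confidence(labels, must_links, cannot_links)
        A += w * (labels[:, None] == labels[None, :])  # confidence-weighted co-association
    # final spectral clustering on the ensemble adjacency matrix
    final = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=0)
    return final.fit_predict(A / n_subspaces)
```

The co-association term `labels[:, None] == labels[None, :]` is 1 exactly when two samples are co-clustered, so each sub-space solution votes on the ensemble adjacency matrix in proportion to its confidence, which mirrors the aggregation described above.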
We evaluate the performance of the proposed clustering scheme on 20 cancer gene datasets, using the normalized mutual information (NMI) criterion and the adjusted Rand index (ARI). We first examine the effect of the number of pairwise constraints. As expected, performance improves as the number of pairwise constraints increases; for example, on the Khan-2001 dataset the NMI value rises from 0.6 to 0.9 when the number of pairwise constraints grows from 20 to 100. More pairwise constraints mean more available information, which helps the clustering algorithm. We then examine the effect of the number of random sub-spaces: increasing it has a positive effect on the NMI value, and in most datasets the performance stabilizes and changes little once the number of sub-spaces reaches 20. Examining the sampling rate used for random sub-space generation shows that the proposed method performs best on most cancer datasets, such as Armstrong-2002-v3 and Bredel-2005, at a rate of 0.5, with performance decreasing as the rate deviates from 0.5.

Next, the proposed method is compared, in terms of ARI, with the method of reference [21]; ours performs better on 12 of the 20 datasets. Finally, the proposed method is compared with several metric-learning approaches in terms of NMI. It obtains the best result on 11 of the 20 datasets and the second-best result on another 6. For example, on the Bredel-2005 dataset the NMI of the proposed method is 0.1042 higher than that of reference [21], 0.1846 higher than RCA, 0.4 higher than ITML, and 0.468 higher than DCA.
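Both evaluation criteria are standard and implemented in scikit-learn; a small usage example with made-up label vectors:

```python
# The two criteria used in the evaluation, via their standard scikit-learn
# implementations. The label vectors below are illustrative, not real results.
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth classes (made up)
y_pred = [0, 0, 1, 2, 2, 2]   # clustering result (made up)

print(normalized_mutual_info_score(y_true, y_pred))  # NMI in [0, 1]
print(adjusted_rand_score(y_true, y_pred))           # ARI in [-1, 1], chance-corrected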
Utilizing ensemble clustering together with the confidence factor improves the ability of the proposed algorithm to achieve better results. The transitive extension of the constraints, as well as the selection of random sub-spaces of unequal sizes, also plays an important role in its performance. Using p-Laplacian spectral clustering produces clusters that are better balanced in volume than those of standard spectral clustering. Another contributor to the performance of the proposed method is the use of search operators to find the best sub-space set, which leads to better results; a rough sketch of this search loop is given below.
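As a rough illustration of the environment search, the loop below keeps a small population of environments (each a set of sub-spaces), scores each by the quality of its ensemble clustering, replaces the worst environment with a freshly generated one (global search), and recombines the best environment with another (local search). This is only a hedged sketch: the population size, the scoring function `score_env`, and the acceptance rule are assumptions, not the paper's exact operators.

```python
# Hedged sketch of the environment search; every structural detail here
# (population size, scoring, acceptance) is an illustrative assumption.
import random

def search_environments(make_subspace, score_env, n_envs=5, env_size=10, iters=50):
    # an environment = a list of sub-spaces, each a tuple of feature indices
    envs = [[make_subspace() for _ in range(env_size)] for _ in range(n_envs)]
    for _ in range(iters):
        scores = [score_env(e) for e in envs]
        best = scores.index(max(scores))
        worst = scores.index(min(scores))
        # global operator: replace the worst environment with a new random one
        envs[worst] = [make_subspace() for _ in range(env_size)]
        # combination operator: mix the best environment with a random other one
        other = random.randrange(n_envs)
        combined = random.sample(envs[best] + envs[other], env_size)
        if score_env(combined) > scores[best]:
            envs[best] = combined        # keep the combination only if it improves
    return max(envs, key=score_env)
```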
 
Article number: 3
Full-Text [PDF 2069 kb]
Type of Study: Research | Subject: Paper
Received: 2020/12/16 | Accepted: 2022/10/8 | Published: 2023/08/13 | ePublished: 2023/08/13

References
[1] C. Chrysouli and A. Tefas, "Spectral clustering and semi-supervised learning using evolving similarity graphs," Applied Soft Computing, vol. 34, pp. 625-637, 2015. [DOI:10.1016/j.asoc.2015.05.026]
[2] W. Hu, C. Chen, F. Ye, Z. Zheng, and G. Ling, "Nonnegative Spectral Clustering for Large-Scale Semi-supervised Learning," in International Conference on Database Systems for Advanced Applications, Springer, 2019, pp. 287-291. [DOI:10.1007/978-3-030-18590-9_30]
[3] E. Hancer, B. Xue, and M. Zhang, "A survey on feature selection approaches for clustering," Artificial Intelligence Review, pp. 1-27, 2020.
[4] G. Chao, S. Sun, and J. Bi, "A survey on multi-view clustering," arXiv preprint arXiv:1712.06246, 2017.
[5] X. He, S. Zhang, and Y. Liu, "An adaptive spectral clustering algorithm based on the importance of shared nearest neighbors," Algorithms, vol. 8, no. 2, pp. 177-189, 2015. [DOI:10.3390/a8020177]
[6] H. Jia, S. Ding, H. Zhu, F. Wu, and L. Bao, "A Feature Weighted Spectral Clustering Algorithm Based on Knowledge Entropy," JSW, vol. 8, no. 5, pp. 1101-1108, 2013. [DOI:10.4304/jsw.8.5.1101-1108]
[7] Z. Yu et al., "Probabilistic cluster structure ensemble," Information Sciences, vol. 267, pp. 16-34, 2014. [DOI:10.1016/j.ins.2014.01.030]
[8] J. E. van Engelen and H. H. Hoos, "A survey on semi-supervised learning," Machine Learning, vol. 109, no. 2, pp. 373-440, 2020. [DOI:10.1007/s10994-019-05855-6]
[9] S. Ding, B. Qi, H. Jia, H. Zhu, and L. Zhang, "Research of semi-supervised spectral clustering based on constraints expansion," Neural Computing and Applications, vol. 22, no. 1, pp. 405-410, 2013. [DOI:10.1007/s00521-012-0911-8]
[10] Y. Jia, S. Kwong, and J. Hou, "Semi-supervised spectral clustering with structured sparsity regularization," IEEE Signal Processing Letters, vol. 25, no. 3, pp. 403-407, 2018. [DOI:10.1109/LSP.2018.2791606]
[11] Y. Jia, S. Kwong, J. Hou, and W. Wu, "Semi-supervised non-negative matrix factorization with dissimilarity and similarity regularization," IEEE Transactions on Neural Networks and Learning Systems, 2019. [DOI:10.1109/TNNLS.2019.2933223]
[12] M. S. Baghshah, F. Afsari, S. B. Shouraki, and E. Eslami, "Scalable semi-supervised clustering by spectral kernel learning," Pattern Recognition Letters, vol. 45, pp. 161-171, 2014. [DOI:10.1016/j.patrec.2014.02.014]
[13] R. Sheikhpour, M. A. Sarram, S. Gharaghani, and M. A. Z. Chahooki, "A survey on semi-supervised feature selection methods," Pattern Recognition, vol. 64, pp. 141-158, 2017. [DOI:10.1016/j.patcog.2016.11.003]
[14] S. Faußer and F. Schwenker, "Semi-supervised clustering of large data sets with kernel methods," Pattern Recognition Letters, vol. 37, pp. 78-84, 2014. [DOI:10.1016/j.patrec.2013.01.007]
[15] M. Sugiyama, G. Niu, M. Yamada, M. Kimura, and H. Hachiya, "Information-maximization clustering based on squared-loss mutual information," Neural Computation, vol. 26, no. 1, pp. 84-131, 2014. [DOI:10.1162/NECO_a_00534]
[16] T. Bühler and M. Hein, "Spectral clustering based on the graph p-Laplacian," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 81-88.
[17] J. Jost, R. Mulas, and D. Zhang, "p-Laplace Operators for Chemical Hypergraphs," arXiv preprint arXiv:2007.00325, 2020.
[18] S. Saito, D. P. Mandic, and H. Suzuki, "Hypergraph p-Laplacian: A Differential Geometry View," arXiv preprint arXiv:1711.08171, 2017. [DOI:10.1609/aaai.v32i1.11823]
[19] S. Ding, H. Jia, M. Du, and Q. Hu, "p-Spectral Clustering Based on Neighborhood Attribute Granulation," in International Conference on Intelligent Information Processing, Springer, 2016, pp. 50-58. [DOI:10.1007/978-3-319-48390-0_6]
[20] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, "A survey on ensemble learning," Frontiers of Computer Science, pp. 1-18, 2020.
[21] H. Niu, N. Khozouie, H. Parvin, H. Alinejad-Rokny, A. Beheshti, and M. R. Mahmoudi, "An Ensemble of Locally Reliable Cluster Solutions," Applied Sciences, vol. 10, no. 5, p. 1891, 2020. [DOI:10.3390/app10051891]
[22] Z. Yu, Z. Kuang, J. Liu, H. Chen, J. Zhang, J. You, H. S. Wong, and G. Han, "Adaptive ensembling of semi-supervised clustering solutions," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1577-1590, 2017. [DOI:10.1109/TKDE.2017.2695615]
[23] Z. Yu, P. Luo, J. You, H. S. Wong, H. Leung, S. Wu, J. Zhang, and G. Han, "Incremental semi-supervised clustering ensemble for high dimensional data clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 701-714, 2015. [DOI:10.1109/TKDE.2015.2499200]
[24] M. Galar, A. Fernández, E. Barrenechea, and F. Herrera, "EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling," Pattern Recognition, vol. 46, no. 12, pp. 3460-3471, 2013. [DOI:10.1016/j.patcog.2013.05.006]
[25] S. Safari and F. Afsari, "Ensemble P-spectral Semi-supervised Clustering," in 2020 International Conference on Machine Vision and Image Processing (MVIP), IEEE, 2020, pp. 1-5. [DOI:10.1109/MVIP49855.2020.9116885]
[26] M. C. de Souto, I. G. Costa, D. S. de Araujo, T. B. Ludermir, and A. Schliep, "Clustering cancer gene expression data: a comparative study," BMC Bioinformatics, vol. 9, no. 1, p. 497, 2008. [DOI:10.1186/1471-2105-9-497]
[27] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance," The Journal of Machine Learning Research, vol. 11, pp. 2837-2854, 2010.
[28] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193-218, 1985. [DOI:10.1007/BF01908075]
[29] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning a Mahalanobis metric from equivalence constraints," Journal of Machine Learning Research, vol. 6, pp. 937-965, 2005.
[30] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 209-216. [DOI:10.1145/1273496.1273523]
[31] S. C. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma, "Learning distance metrics with contextual constraints for image retrieval," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2, IEEE, 2006, pp. 2072-2078.

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
