Volume 17, Issue 2 (9-2020) | JSDP 2020, 17(2): 85-100


Vahidi Ferdosi S, Amirkhani H. Weighted Ensemble Clustering for Increasing the Accuracy of the Final Clustering. JSDP 2020; 17(2): 85-100
URL: http://jsdp.rcisp.ac.ir/article-1-816-en.html
Affiliation: University of Qom
Abstract:
Clustering algorithms are highly dependent on factors such as the number of clusters, the specific clustering algorithm, and the distance measure used. Inspired by ensemble classification, one approach to reducing the effect of these factors on the final clustering is ensemble clustering. Since weighting the base classifiers has been a successful idea in ensemble classification, in this paper we propose a method for using weighting in the ensemble clustering problem.

The accuracies of the base clusterings are estimated using an algorithm from the crowdsourcing literature called the agreement/disagreement (AD) method. This method exploits the agreements and disagreements between different labelers to estimate their accuracies. It assumes that different labelers have labeled a set of samples, so each pair of labelers has an agreement ratio over their labeled samples. Under some independence assumptions, there is a closed-form formula for the agreement ratio between two labelers in terms of their accuracies. The AD method estimates the labelers' accuracies by minimizing the difference between the parametric agreement ratio given by the closed-form formula and the agreement ratio observed in the labels the labelers provide.

To adapt the AD method to the clustering problem, an agreement between two clusterings is defined as having the same opinion about a pair of samples: if two clusterings agree that two samples should be in the same cluster, or agree that they should be in different clusters, this is counted as an agreement. An optimization problem is then solved to obtain the base clusterings' accuracies such that the difference between their observed agreement ratios and the expected agreement ratios implied by their accuracies is minimized.

To generate the base clusterings, we use four different settings: different clustering algorithms, different distance measures, distributed features, and different numbers of clusters. The clustering algorithms used are mean shift, k-means, mini-batch k-means, affinity propagation, DBSCAN, spectral clustering, BIRCH, and agglomerative clustering with average and Ward linkage. For the distance measures, we use the correlation, city-block, cosine, and Euclidean measures. In the distributed-features setting, the k-means algorithm is run on 40%, 50%, ..., and 100% of randomly selected features. Finally, for different numbers of clusters, we run the k-means algorithm with k equal to 2 and to 50%, 75%, 100%, 150%, and 200% of the true number of clusters.

We add the weights estimated by the AD algorithm to two well-known ensemble clustering methods: the Cluster-based Similarity Partitioning Algorithm (CSPA) and the HyperGraph Partitioning Algorithm (HGPA). In CSPA, the similarity matrix is computed as a weighted average of the opinions of the different clusterings. In HGPA, we propose to weight the hyperedges by quantities such as the estimated clustering accuracies, the sizes of the clusters, and the silhouettes of the clusterings.

The experiments are performed on 13 real and artificial datasets. The reported evaluation measures include the adjusted Rand index, the Fowlkes-Mallows index, mutual information, adjusted mutual information, normalized mutual information, homogeneity, completeness, V-measure, and purity. The results show that in the majority of cases the proposed weighting-based method outperforms unweighted ensemble clustering; moreover, the weighting is more effective in improving the HGPA algorithm than CSPA.
Among the different weighting methods proposed for the HGPA algorithm, the best average results are obtained when the accuracies estimated by the AD method are used to weight the hyperedges, and the worst results are obtained when the normalized silhouette measure is used for weighting. Finally, among the different methods for generating base clusterings, the best results for weighted HGPA are obtained when different clustering algorithms are used to produce the base clusterings.
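
As an illustration of the AD-based accuracy estimation described in the abstract, the following Python sketch computes pairwise agreement ratios between base clusterings and fits their accuracies. It assumes the standard binary-label closed form p_i*p_j + (1 - p_i)*(1 - p_j) for the expected agreement between two clusterings with accuracies p_i and p_j; the paper's exact formulation may differ.

import numpy as np
from scipy.optimize import minimize

def pairwise_agreement(labels_a, labels_b):
    # Fraction of sample pairs on which two clusterings agree, i.e. both
    # say "same cluster" or both say "different clusters".
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    iu = np.triu_indices(len(a), k=1)                 # all unordered pairs
    same_a = (a[:, None] == a[None, :])[iu]
    same_b = (b[:, None] == b[None, :])[iu]
    return float(np.mean(same_a == same_b))

def estimate_accuracies(base_labelings):
    # Fit per-clustering accuracies p so that the parametric agreement
    # p_i*p_j + (1-p_i)*(1-p_j) matches the observed agreement ratios.
    m = len(base_labelings)
    obs = np.array([[pairwise_agreement(base_labelings[i], base_labelings[j])
                     for j in range(m)] for i in range(m)])
    iu = np.triu_indices(m, k=1)

    def loss(p):
        expected = np.outer(p, p) + np.outer(1 - p, 1 - p)
        return np.sum((obs[iu] - expected[iu]) ** 2)

    result = minimize(loss, x0=np.full(m, 0.7), bounds=[(0.5, 1.0)] * m)
    return result.x                                   # estimated accuracies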
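
A sketch of two of the four base-clustering settings (different algorithms, and distributed features), using scikit-learn; the parameter values and random-subset mechanics are illustrative assumptions, not the paper's exact configuration.

from sklearn.cluster import (MeanShift, KMeans, MiniBatchKMeans,
                             AffinityPropagation, DBSCAN, SpectralClustering,
                             Birch, AgglomerativeClustering)

def generate_base_clusterings(X, k, seed=0):
    # Setting: different clustering algorithms on the full data.
    algorithms = [
        MeanShift(), KMeans(n_clusters=k), MiniBatchKMeans(n_clusters=k),
        AffinityPropagation(), DBSCAN(), SpectralClustering(n_clusters=k),
        Birch(n_clusters=k),
        AgglomerativeClustering(n_clusters=k, linkage="average"),
        AgglomerativeClustering(n_clusters=k, linkage="ward"),
    ]
    labelings = [algo.fit_predict(X) for algo in algorithms]

    # Setting: distributed features, i.e. k-means on 40%, 50%, ..., 100%
    # of randomly selected features.
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    for frac in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
        cols = rng.choice(n_feat, size=max(1, int(frac * n_feat)),
                          replace=False)
        labelings.append(KMeans(n_clusters=k).fit_predict(X[:, cols]))
    return labelings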
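
A sketch of the weighted CSPA step: the similarity matrix is the weighted average of the base clusterings' co-membership opinions. The original CSPA partitions the induced similarity graph with METIS; here spectral clustering on the precomputed affinity stands in for that final step.

def weighted_cspa(base_labelings, weights, n_clusters):
    # Similarity matrix as a weighted average of the clusterings' opinions:
    # entry (i, j) is the weighted fraction of base clusterings that put
    # samples i and j in the same cluster.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(base_labelings[0])
    S = np.zeros((n, n))
    for labels, wr in zip(base_labelings, w):
        labels = np.asarray(labels)
        S += wr * (labels[:, None] == labels[None, :])
    # Consensus partition from the weighted similarity matrix.
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(S)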
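
A sketch of the three hyperedge-weighting schemes for weighted HGPA. Each cluster of each base clustering becomes one hyperedge; cutting the resulting weighted hypergraph (hMETIS in the original HGPA) is outside the scope of this sketch, and shifting the silhouette to [0, 1] is an assumption about the "normalized silhouette" weighting.

from sklearn.metrics import silhouette_score

def weighted_hyperedges(X, base_labelings, accuracies, scheme="accuracy"):
    edges, weights = [], []
    for labels, acc in zip(base_labelings, accuracies):
        labels = np.asarray(labels)
        # Silhouette of the whole clustering, shifted to [0, 1] so that
        # hyperedge weights stay positive (a normalization assumption).
        sil = (silhouette_score(X, labels) + 1.0) / 2.0
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)   # vertices of the hyperedge
            edges.append(members)
            if scheme == "accuracy":
                weights.append(acc)                 # AD-estimated accuracy
            elif scheme == "size":
                weights.append(len(members))        # cluster size
            else:
                weights.append(sil)                 # clustering silhouette
    return edges, weights
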
Full-Text [PDF 5109 kb]
Type of Study: Research | Subject: Paper
Received: 2017/11/30 | Accepted: 2020/06/08 | Published: 2020/09/14 | ePublished: 2020/09/14


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.