Volume 14, Issue 4 (3-2018)   JSDP 2018, 14(4): 31-42




Ferdowsi University of Mashhad
Abstract:

Clustering has been one of the main building blocks in the fields of machine learning and computer vision. Given a pairwise distance measure, it is challenging to find a proper way to identify a subset of representative exemplars and their associated cluster structures. The recent trend toward big-data analysis places a more demanding requirement on new clustering algorithms: they must be both scalable and accurate. A recent advance in graph-based clustering extends its reach to millions of data points through heavy engineering effort and parallel optimization. However, most other existing clustering algorithms, though promising in theory, suffer from limited scalability.
In this paper, a novel clustering method is proposed that is both accurate and scalable. Based on a simple criterion, "key" items that are representative of the whole data set are iteratively selected and thus form the associated cluster structures. Taking a pairwise distance measure between data instances as input, the proposed method searches for cluster centers by identifying data items that are far from the already selected keys yet representative of the unselected items. Inspired by hierarchical clustering, small clusters are iteratively merged until the desired number of clusters is obtained. To solve the scalability problem, a novel tracking-table technique is designed to reduce the time complexity, making it possible to cluster millions of data points within a few minutes.
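The abstract gives no pseudocode, so the sketch below is only a minimal, hypothetical NumPy illustration of the general scheme described above: greedy key selection from a pairwise distance matrix, nearest-key assignment, and hierarchical merging of the resulting small clusters. The farthest-point seeding, the average-linkage merge rule, and all function names are assumptions for illustration; the proposed tracking-table speedup is omitted, and this is not the authors' exact algorithm.

```python
import numpy as np

def select_keys(D, n_keys):
    """Greedily select exemplar "keys" from a pairwise distance matrix D (n x n).

    Illustrative farthest-point heuristic: start from the most central item and
    repeatedly add the item farthest from all keys chosen so far.
    """
    keys = [int(np.argmin(D.sum(axis=1)))]      # most central item as the first key
    min_dist = D[keys[0]].copy()                # distance of each item to its nearest key
    while len(keys) < n_keys:
        new_key = int(np.argmax(min_dist))      # item farthest from every selected key
        keys.append(new_key)
        min_dist = np.minimum(min_dist, D[new_key])
    return keys

def cluster_by_keys(D, n_keys, n_clusters):
    """Assign items to their nearest key, then merge small clusters hierarchically."""
    keys = select_keys(D, n_keys)
    assignment = np.argmin(D[:, keys], axis=1)  # nearest-key assignment
    groups = [np.where(assignment == k)[0] for k in range(n_keys)]
    groups = [g for g in groups if len(g) > 0]
    while len(groups) > n_clusters:
        # merge the two clusters with the smallest average inter-cluster distance
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = D[np.ix_(groups[i], groups[j])].mean()
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = np.concatenate([groups[i], groups[j]])
        groups.pop(j)
    labels = np.empty(D.shape[0], dtype=int)
    for c, members in enumerate(groups):
        labels[members] = c
    return labels
```

The naive pairwise merge loop shown here is quadratic in the number of small clusters; the tracking-table technique described in the paper is what reduces this cost so that millions of points become tractable.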
To assess the performance of the proposed method, several experiments are conducted. The first experiment tests the ability of the algorithm on different manifold structures and various numbers of clusters. It is observed that the proposed clustering algorithm outperforms existing alternatives in capturing different shapes of data distributions. In the second experiment, the scalability of the algorithm to large-scale data is assessed by clustering up to one million data points with dimensions of up to 100. It is shown that, even with one million data points, the proposed method takes only a few minutes to perform clustering. The third experiment is conducted on the ORL database, which consists of 400 face images of 40 individuals. The proposed clustering method outperforms the compared alternatives in this experiment as well. In the final experiment, shape clustering is performed on the MPEG-7 dataset, which contains 1400 silhouette images from 70 classes, with 20 shapes per class. The goal here is to cluster the data items (here, the binary shapes) into 70 clusters, so that each cluster includes only shapes that belong to one class. The proposed method outperforms the alternative clustering algorithms on this dataset as well.
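The abstract does not state which quality measure is used in these comparisons. As a hedged illustration only, external clustering indices such as normalized mutual information (NMI) and the adjusted Rand index (ARI) are a common way to score a predicted partition against known classes (e.g., the 40 identities in ORL or the 70 shape classes in MPEG-7); the scikit-learn sketch below is an assumption about the evaluation setup, not the paper's protocol.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def score_clustering(true_labels, pred_labels):
    """Compare a predicted clustering with ground-truth class labels."""
    return {
        "NMI": normalized_mutual_info_score(true_labels, pred_labels),
        "ARI": adjusted_rand_score(true_labels, pred_labels),
    }

# Toy usage: 40 classes of 10 items each (an ORL-like setting) scored against a random clustering.
rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(40), 10)
pred_labels = rng.integers(0, 40, size=true_labels.size)
print(score_clustering(true_labels, pred_labels))
```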
Extensive empirical experiments demonstrate the superiority of the proposed method over existing alternatives, in terms of both effectiveness and efficiency. Furthermore, the algorithm is capable of large-scale clustering, grouping millions of data points within a few minutes.
 

Full-Text [PDF 5077 kb]
Type of Study: Applicable | Subject: Paper
Received: 2016/06/5 | Accepted: 2017/07/8 | Published: 2018/03/13 | ePublished: 2018/03/13


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.