A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Daneshpour, Negin; Barzegari, Ali

doi:10.52547/jsdp.18.4.3

Volume 18, Issue 4 (3-2022) JSDP 2022, 18(4): 3-22



Daneshpour N, Barzegari A. A New Method for Duplicate Detection Using Hierarchical Clustering of Records. JSDP 2022; 18 (4) : 1
URL: http://jsdp.rcisp.ac.ir/article-1-1039-en.html


Negin Daneshpour*, Ali Barzegari

Shahid Rajaee Teacher Training University

Abstract:

Accurate and valid data are prerequisites for the correct operation of any software system. Errors can always occur in data because of human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. A data source should contain only one record per entity, but for reasons such as the aggregation of data sources and human errors in data entry, several copies of an entity may appear in a data source. This problem causes errors in the operations or output of a system and is costly for the organization or business concerned. Therefore, data cleaning, and duplicate record detection in particular, has become one of the most important areas of computer science in recent years. Many solutions have been presented for detecting duplicates in different situations, but almost all of them are time-consuming. Moreover, the volume of data is growing every day, so previous methods no longer perform adequately. Incorrectly identifying two different records as duplicates is another problem that recent work faces. This matters because duplicates are usually deleted, so some correct data will be lost. New methods therefore appear to be necessary.
This paper proposes a method that reduces the required amount of processing by using hierarchical clustering with appropriate features. In this method, the similarity between records is estimated at several levels, and a different feature is used at each level. As a result, the clusters created at the last level contain very similar records. Records are then compared within these clusters to detect duplicates. The paper also proposes a relative similarity function for comparing records, which determines similarity with high precision. The evaluation results show that the proposed method detects 90% of duplicate records with 97% accuracy in less time, with improved results.
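The general idea described above (narrow the records level by level, then compare only within the final clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the sample records, the per-level grouping features (`city`, first letter of `name`), the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` in place of the paper's relative similarity function are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; the field names and values are illustrative only.
RECORDS = [
    {"id": 1, "name": "Golden Dragon", "city": "Los Angeles", "phone": "310-555-0101"},
    {"id": 2, "name": "Golden Dragon Restaurant", "city": "Los Angeles", "phone": "310-555-0101"},
    {"id": 3, "name": "Blue Moon Cafe", "city": "Los Angeles", "phone": "310-555-0199"},
    {"id": 4, "name": "Golden Gate Grill", "city": "San Francisco", "phone": "415-555-0123"},
]

# One grouping feature per level (assumed here, not taken from the paper).
LEVELS = [
    lambda r: r["city"].lower(),      # level 1: same city
    lambda r: r["name"][:1].lower(),  # level 2: same first letter of the name
]


def split_by_levels(records, levels):
    """Refine the clusters level by level, using a different feature each time."""
    clusters = [records]
    for key in levels:
        refined = []
        for cluster in clusters:
            groups = {}
            for rec in cluster:
                groups.setdefault(key(rec), []).append(rec)
            refined.extend(groups.values())
        clusters = refined
    return clusters


def similarity(a, b, fields=("name", "phone")):
    """Stand-in similarity: mean character-level ratio over the chosen fields."""
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields]
    return sum(scores) / len(scores)


def find_duplicates(records, levels, threshold=0.8):
    """Compare records pairwise, but only inside each last-level cluster."""
    pairs = []
    for cluster in split_by_levels(records, levels):
        for a, b in combinations(cluster, 2):
            if similarity(a, b) >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


print(find_duplicates(RECORDS, LEVELS))
```

With these four sample records, the two levels leave three clusters, so only one pair is compared instead of the six pairs an exhaustive comparison would need. The same grouping can also miss a true duplicate whose grouping feature differs (for example, a typo in the city), which is why the choice of features at each level matters.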

Article number: 1

Keywords: Duplicate Record Detection, Data Cleaning, Hierarchical Clustering, Similarity Function, Feature Selection


Type of Study: Research | Subject: Paper
Received: 2019/06/18 | Accepted: 2020/08/18 | Published: 2022/03/21 | ePublished: 2022/03/21



Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
