A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Daneshpour, Negin; Barzegari, Ali

doi:10.52547/jsdp.18.4.3

Volume 18, Issue 4 (Esfand 1400, Solar Hijri calendar), pages 3-22



Daneshpour N, Barzegari A. A New Method for Duplicate Detection Using Hierarchical Clustering of Records. JSDP 2022; 18 (4) : 1
URL: http://jsdp.rcisp.ac.ir/article-1-1039-fa.html



Negin Daneshpour*, Ali Barzegari

Shahid Rajaee Teacher Training University

Abstract:

Because data quality strongly affects the performance of software systems, data cleaning, and duplicate record detection in particular, has become one of the most important areas of computer science in recent years. This paper presents a method for detecting duplicate records that estimates the similarity between records by clustering them hierarchically, using suitable features at each level. As a result, the clusters obtained at the last level contain records that are highly similar to one another. To find duplicates, comparisons are then carried out only among records within the same last-level cluster. The paper also proposes a relative similarity function, based on the edit distance function, for comparing records; the authors report that it is highly accurate. The evaluation results show that the proposed method detects 90% of the existing duplicates with 97% precision in less time, an improvement over the methods compared.
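The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the clustering criterion, the feature selection step, or the exact similarity formula. The sketch assumes that each level simply groups records by a key computed from one attribute, that the relative similarity is the edit distance divided by the length of the longer string, and that a threshold of 0.8 marks a duplicate. All function names, the sample records, and the level keys are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def relative_similarity(a: str, b: str) -> float:
    """Assumed form: 1 - distance / longer length; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def hierarchical_clusters(records, level_keys):
    """Refine the clusters once per level; exact key grouping stands in
    for whatever clustering step the paper actually uses."""
    clusters = [list(records)]
    for key in level_keys:
        refined = []
        for cluster in clusters:
            groups = defaultdict(list)
            for rec in cluster:
                groups[key(rec)].append(rec)
            refined.extend(groups.values())
        clusters = refined
    return clusters


def find_duplicates(records, level_keys, field, threshold=0.8):
    """Compare only pairs of records that share a last-level cluster."""
    pairs = []
    for cluster in hierarchical_clusters(records, level_keys):
        for r1, r2 in combinations(cluster, 2):
            if relative_similarity(r1[field], r2[field]) >= threshold:
                pairs.append((r1["id"], r2["id"]))
    return pairs


# Hypothetical sample data: level 1 groups by city, level 2 by the first
# letter of the name.
records = [
    {"id": 1, "city": "tehran", "name": "negin cafe"},
    {"id": 2, "city": "tehran", "name": "negin caffe"},
    {"id": 3, "city": "tehran", "name": "saba books"},
    {"id": 4, "city": "shiraz", "name": "negin cafe"},
]
level_keys = [lambda r: r["city"], lambda r: r["name"][0]]
duplicates = find_duplicates(records, level_keys, field="name")
print(duplicates)
```

In this sketch, records 1 and 2 land in the same last-level cluster and are compared, while record 4 is never compared with record 1 because the first level already separated them by city. This shows both the saving in comparisons and the risk that a poorly chosen level feature hides true duplicates, which is presumably why the abstract stresses choosing suitable features at each level.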

Article number: 1

Keywords: duplicate detection, data cleaning, hierarchical clustering, similarity function, feature selection


Type of study: Research | Subject: Digital data processing papers
Received: 1398/3/28 | Accepted: 1399/5/28 | Published: 1401/1/1 | Published online: 1401/1/1 (Solar Hijri dates)



This article may be redistributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.
