1. [1] E. Rahm and H.H. Do, "Data cleaning: Problems and current approaches", IEEE Data Eng. Bull., vol. 23(4), pp. 3-13, 2000.
2. [2] L. Bradji and M. Boufaida, "Knowledge based data cleaning for data warehouse quality", in Digital Information Processing and Communications, Springer, pp. 373-384, 2011. [DOI:10.1007/978-3-642-22410-2_33]
3. [3] D.K. Koshley and R. Halder, "Data cleaning: An abstraction-based approach", in 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2015. [DOI:10.1109/ICACCI.2015.7275695]
4. [4] M. Alian, A. Awajan, and B. Ramadan, "Unsupervised learning blocking keys technique for indexing Arabic entity resolution", International Journal of Speech Technology, pp. 1-8, 2018. [DOI:10.1007/s10772-018-9489-6]
5. [5] Y. Li, H. Wang, and H. Gao, "Efficient entity resolution based on sequence rules", in Advanced Research on Computer Science and Information Engineering, Springer, pp. 381-388, 2011. [DOI:10.1007/978-3-642-21402-8_61]
6. [6] https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/.
7. [7] Y. Altowim, D.V. Kalashnikov, and S. Mehrotra, "ProgressER: Adaptive Progressive Approach to Relational Entity Resolution", ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 12(3), pp. 33, 2018. [DOI:10.1145/3154410]
8. [8] J.H. Martin and D. Jurafsky, "Speech and language processing", International Edition, vol. 710, pp. 25, 2000.
9. [9] B. Hussain, et al., "An evaluation of clustering algorithms in duplicate detection", Technical Report CSRG-620, University of Toronto, Department of Computer Science, 2013.
10. [10] M.A. Hernández and S.J. Stolfo, "Real-world data is dirty: Data cleansing and the merge/purge problem", Data Mining and Knowledge Discovery, vol. 2(1), pp. 9-37, 1998. [DOI:10.1023/A:1009761603038]
11. [11] L. He, et al., "An efficient data cleaning algorithm based on attributes selection", in 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT), IEEE, 2011.
12. [12] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences", J. Molecular Biology, vol. 147, pp. 195-197, 1981. [DOI:10.1016/0022-2836(81)90087-5]
13. [13] M. Li, Q. Xie, and Q. Ding, "An Improved Data Cleaning Algorithm Based on SNM", in Cloud Computing and Security, Springer, pp. 259-269, 2015. [DOI:10.1007/978-3-319-27051-7_22]
14. [14] T. Wang, et al., "SIER: An Efficient Entity Resolution Mechanism Combining SNM and Iteration", in 2014 11th Web Information System and Application Conference (WISA), IEEE, 2014. [DOI:10.1109/WISA.2014.50]
15. [15] L. Alami, I. Hafidi, and A. Metrane, "Entity Resolution in NoSQL Data Warehouse", in International Conference on Information Technology and Communication Systems, Springer, 2017. [DOI:10.1007/978-3-319-64719-7_5]
16. [16] M. Bilenko and R.J. Mooney, "Adaptive duplicate detection using learnable string similarity measures", in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003. [DOI:10.1145/956750.956759]
17. [17] B. Kenig and A. Gal, "MFIBlocks: An effective blocking algorithm for entity resolution", Information Systems, vol. 38(6), pp. 908-926, 2013. [DOI:10.1016/j.is.2012.11.008]
18. [18] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases", in ACM SIGMOD Record, ACM, 1993. [DOI:10.1145/170035.170072]
19. [19] S. Chaudhuri, V. Ganti, and R. Motwani, "Robust identification of fuzzy duplicates", in Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), IEEE, 2005.
20. [20] A. Saeedi, E. Peukert, and E. Rahm, "Comparative evaluation of distributed clustering schemes for multi-source entity resolution", in Advances in Databases and Information Systems, Springer, 2017. [DOI:10.1007/978-3-319-66917-5_19]
21. [21] P. Christen, "Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection", Springer Science & Business Media, 2012.
22. [22] T. Papenbrock, A. Heise, and F. Naumann, "Progressive duplicate detection", IEEE Transactions on Knowledge and Data Engineering, vol. 27(5), pp. 1316-1329, 2015. [DOI:10.1109/TKDE.2014.2359666]
23. [23] S.E. Whang, D. Marmaros, and H. Garcia-Molina, "Pay-as-you-go entity resolution", IEEE Transactions on Knowledge and Data Engineering, vol. 25(5), pp. 1111-1124, 2013. [DOI:10.1109/TKDE.2012.43]
24. [24] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality", in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ACM, 1998. [DOI:10.1145/276698.276876]
25. [25] I. Van Dam, et al., "Duplicate detection in web shops using LSH to reduce the number of computations", in Proceedings of the 31st Annual ACM Symposium on Applied Computing, ACM, 2016. [DOI:10.1145/2851613.2851861]
26. [26] R. van Bezu, et al., "Multi-component similarity method for web product duplicate detection", in Proceedings of the 30th Annual ACM Symposium on Applied Computing, ACM, 2015. [DOI:10.1145/2695664.2695818]
27. [27] D. Vatsalan and P. Christen, "Sorted nearest neighborhood clustering for efficient private blocking", in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2013. [DOI:10.1007/978-3-642-37456-2_29]
28. [28] Z. Yuhang, W. Yue, and Y. Wei, "Research on Data Cleaning in Text Clustering", in 2010 International Forum on Information Technology and Applications (IFITA), IEEE, 2010. [DOI:10.1109/IFITA.2010.73]
29. [29] S. Thampi and D. Loganathan, "Progressive of Duplicate Detection Using Adaptive Window Technique".
30. [30] M. Dash and H. Liu, "Feature selection for classification", Intelligent Data Analysis, vol. 1(1-4), pp. 131-156, 1997. [DOI:10.1016/S1088-467X(97)00008-5]
31. [31] M. Mohri, A. Rostamizadeh, and A. Talwalkar, "Foundations of machine learning", MIT Press, 2012.
32. [32] P.E. Greenwood and M.S. Nikulin, "A guide to chi-squared testing", vol. 280, 1996.
33. [33] https://www13.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/projekte/dude/restaurant.csv.
34. [34] https://www13.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/projekte/dude/cd.csv.
35. [35] https://www13.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/projekte/dude/CORA.xml.
36. [36] https://data.wa.gov/api/views/y3ds-rkew/rows.csv?accessType=DOWNLOAD.
37. [37] J.C. Dunn, "Well-separated clusters and optimal fuzzy partitions", Journal of Cybernetics, vol. 4(1), pp. 95-104, 1974. [DOI:10.1080/01969727408546059]
38. [38] https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/t-test/.
39. [39] M. Keyvanpour, "A Divisive Hierarchical Clustering-based Method for Indexing Image Information", JSDP, vol. 11(2), pp. 91-109, 2015.