A New Privacy Preserving Data Publishing Technique Conserving Accuracy of Classification on Anonymized Data

Ebrahimi Atani, Reza; Sadeghpour, Mehdi

doi:10.29252/jsdp.15.3.31

Volume 15, Issue 3 (12-2018) JSDP 2018, 15(3): 31-46

Reza Ebrahimi Atani*, Mehdi Sadeghpour

University of Guilan

Abstract:

The growth of electronic services has made data collection and storage easier, and has led public and private organizations to record vast amounts of personal information in their databases. These records often include sensitive personal information (such as income and diseases) and must be protected from access by others. In some cases, however, mining the data and extracting knowledge from these valuable sources creates the need to share them with other organizations, which raises security challenges for users’ privacy. Privacy is described as the sharing of information in a controlled way; in other words, it determines what type of personal information should be shared and which group or person can access and use it. “Privacy preserving data publishing” is a solution for ensuring the secrecy of sensitive information in a data set after it is published in a hostile environment. This process aims to hide sensitive information while keeping the published data suitable for knowledge discovery techniques. Grouping data set records is a broad approach to data anonymization. This technique prevents access to the sensitive attributes of a specific record by eliminating the distinction between a number of data set records. A large number of data publishing models and techniques have been proposed so far, but their utility is of concern when a high privacy requirement is needed. The main goal of this paper is to present a technique that improves the privacy and performance of data publishing techniques. We first review previous techniques for privacy preserving data publishing and then present an efficient anonymization method whose goal is to conserve the accuracy of classification on anonymized data. In the attack model of this work, the confidence with which an adversary can infer a sensitive value from the published data set should be no higher than that of an inference based on public knowledge.
Our privacy model and technique use a decision tree to withhold information whose removal provides privacy and has little effect on the utility of the output data. The idea presented in this paper is an extension of the work presented in [20]. Experimental results show that classifiers trained on the transformed data set achieve accuracy similar to that of classifiers trained on the original data set.
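
The grouping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the toy records, the attribute names (age, zip, disease) and the helper functions are all hypothetical, and a whole column is suppressed for simplicity, whereas the paper selects what to suppress with a decision tree. The sketch only shows how suppression removes the distinction between records so that each record hides in a group of at least k indistinguishable ones.

```python
from collections import Counter

# Hypothetical toy records; attribute names and values are illustrative only.
RECORDS = [
    {"age": "30-39", "zip": "4161", "disease": "flu"},
    {"age": "30-39", "zip": "4162", "disease": "cancer"},
    {"age": "30-39", "zip": "4161", "disease": "asthma"},
    {"age": "40-49", "zip": "4163", "disease": "flu"},
    {"age": "40-49", "zip": "4164", "disease": "cancer"},
]
# Attributes an adversary is assumed to be able to link to a person.
QUASI_IDENTIFIERS = ("age", "zip")


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every group of indistinguishable records has at least k members."""
    return all(size >= k for size in group_sizes(records, quasi_identifiers).values())


def suppress(records, attributes):
    """Return copies of the records with the given attributes replaced by '*'."""
    return [
        {name: ("*" if name in attributes else value) for name, value in r.items()}
        for r in records
    ]


if __name__ == "__main__":
    # Before suppression some records are unique on (age, zip).
    print(is_k_anonymous(RECORDS, QUASI_IDENTIFIERS, 2))
    # Suppressing zip merges the records into two groups keyed by age alone.
    published = suppress(RECORDS, {"zip"})
    print(is_k_anonymous(published, QUASI_IDENTIFIERS, 2))
    print(dict(group_sizes(published, QUASI_IDENTIFIERS)))
```

Suppressing an entire attribute is the bluntest possible choice and discards more information than necessary. A utility-aware method would suppress only the values whose removal costs a classifier the least, which is the trade-off the abstract addresses.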

Keywords: Privacy preservation, Data sharing, Anonymization, Classification, Decision tree, Suppression

Type of Study: Research | Subject: Paper
Received: 2017/12/12 | Accepted: 2018/07/25 | Published: 2018/12/19 | ePublished: 2018/12/19

References

[1] B. C. M. Fung, K. Wang, A. Wai-Chee Fu and P. S. Yu, (2010), Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques, Chapman and Hall/CRC. [DOI:10.1201/9781420091502]

[2] J. Bennett and S. Lanning, (2007), "The Netflix Prize", Proceedings of the KDD Cup Workshop, pp. 3-6.

[3] D. Nettleton, (2014), "Data Privacy and Privacy-Preserving Data Publishing", in Commercial Data Mining: Processing, Analysis and Modeling for Predictive Analytics Projects, Morgan Kaufmann, pp. 266-277. [DOI:10.1016/B978-0-12-416602-8.00018-2]

[4] B. Fung, K. Wang and P. Yu, (2010), "Privacy-Preserving Data Publishing: A Survey of Recent Developments", ACM Computing Surveys, vol. 42, no. 4. [DOI:10.1145/1749603.1749605]

[5] L. Sweeney, (2002), "k-Anonymity: A Model for Protecting Privacy", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570. [DOI:10.1142/S0218488502001648]

[6] K. S. Babu, (2013), Utility-Based Privacy Preserving Data Publishing, PhD thesis, National Institute of Technology Rourkela.

[7] N. Mohammed, B. C. M. Fung, P. C. K. Hung and C. K. Lee, (2009), "Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service", Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285-1294. [DOI:10.1145/1557019.1557157]

[8] D. Dou and S. Coulondre, (2012), "Detecting Privacy Violations in Multiple Views Publishing", in Database and Expert Systems Applications, Springer-Verlag Berlin Heidelberg, pp. 506-513. [DOI:10.1007/978-3-642-32597-7_46]

[9] A. Anjum and G. Raschia, (2013), "Anonymizing Sequential Releases under Arbitrary Updates", Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 145-154. [DOI:10.1145/2457317.2457342]

[10] B. Fung, K. Wang and P. Yu, (2007), "Anonymizing Classification Data for Privacy Preservation", IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 711-725. [DOI:10.1109/TKDE.2007.1015]

[11] V. S. Susan and T. Christopher, (2014), "A Survey on Privacy Preservation in Data Publishing", International Journal of Computer Science and Mobile Computing, vol. 3, no. 3, pp. 188-193.

[12] T. Dalenius, (1977), "Towards a Methodology for Statistical Disclosure Control", Statistik Tidskrift, vol. 15, pp. 429-444.

[13] C. Dwork, (2006), "Differential Privacy", in Automata, Languages and Programming, Springer Berlin Heidelberg, pp. 1-12. [DOI:10.1007/11787006_1]

[14] K. Wang and B. C. M. Fung, (2006), "Anonymizing Sequential Releases", Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 414-423. [DOI:10.1145/1150402.1150449]

[15] A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam, (2007), "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, article 3. [DOI:10.1145/1217299.1217302]

[16] N. Li, T. Li and S. Venkatasubramanian, (2007), "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity", IEEE 23rd International Conference on Data Engineering, pp. 106-115. [DOI:10.1109/ICDE.2007.367856]

[17] Y. Rubner, C. Tomasi and L. J. Guibas, (2000), "The Earth Mover's Distance as a Metric for Image Retrieval", International Journal of Computer Vision, vol. 40, no. 2, pp. 99-121. [DOI:10.1023/A:1026543900054]

[18] N. Li, T. Li and S. Venkatasubramanian, (2010), "Closeness: A New Privacy Measure for Data Publishing", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 7, pp. 943-956. [DOI:10.1109/TKDE.2009.139]

[19] M. E. Nergiz, M. Atzori and C. Clifton, (2007), "Hiding the Presence of Individuals from Shared Databases", Proc. of the ACM International Conference on Management of Data, pp. 665-676. [DOI:10.1145/1247480.1247554]

[20] A. S. Sattar, J. Li, X. Ding, J. Liu and M. Vincent, (2013), "A General Framework for Privacy Preserving Data Publishing", Knowledge-Based Systems, vol. 54, pp. 276-287. [DOI:10.1016/j.knosys.2013.09.022]

[21] K. Wang, P. Yu and S. Chakraborty, (2004), "Bottom-Up Generalization: A Data Mining Solution to Privacy Protection", Fourth IEEE International Conference on Data Mining, pp. 249-256. [DOI:10.1109/ICDM.2004.10110]

[22] B. Fung, K. Wang and P. S. Yu, (2005), "Top-Down Specialization for Information and Privacy Preservation", Proc. 21st International Conference on Data Engineering, pp. 205-216. [DOI:10.1109/ICDE.2005.143]

[23] S. Kisilevich, L. Rokach, Y. Elovici and B. Shapira, (2010), "Efficient Multidimensional Suppression for K-Anonymity", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 3, pp. 334-347. [DOI:10.1109/TKDE.2009.91]

[24] A. Hussien, N. Hamza and A. Hefny, (2013), "Attacks on Anonymization-Based Privacy-Preserving: A Survey for Data Mining and Data Publishing", Journal of Information Security, vol. 4, no. 2, pp. 101-112. [DOI:10.4236/jis.2013.42012]

[25] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, (2009), "The WEKA Data Mining Software: An Update", ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18. [DOI:10.1145/1656274.1656278]

[26] "Taxonomy trees of the Adult data set", [Online]. Available: http://ddm.cs.sfu.ca/dmsoft/Privacy/products/adultHierarchy.txt. [Accessed 8 May 2016].

[27] "UCI Machine Learning Repository: Adult Data Set", [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Adult. [Accessed 8 May 2016].

[28] M. Nergiz, C. Clifton and A. Nergiz, (2009), "Multirelational k-Anonymity", IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 8, pp. 1104-1117. [DOI:10.1109/TKDE.2008.210]

[29] M. Sadeghpour, (2015), "Privacy Preserving Data Publishing using Group Based Anonymization", MSc thesis in Software Engineering, University of Guilan.

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.