Detecting Concept Drift in Data Stream Using Semi-Supervised Classification

Hasan Nezhad Namaghi, Hossein; Mashayekhi, Hoda; Zahedi, Morteza

doi:10.52547/jsdp.18.4.153

Volume 18, Issue 4 (3-2022) JSDP 2022, 18(4): 153-164

Hasan Nezhad Namaghi H, Mashayekhi H, Zahedi M. Detecting Concept Drift in Data Stream Using Semi-Supervised Classification. JSDP 2022; 18 (4) : 9
URL: http://jsdp.rcisp.ac.ir/article-1-1031-en.html

Hossein Hasan Nezhad Namaghi, Hoda Mashayekhi*, Morteza Zahedi

Shahrood University of Technology

Abstract:

A data stream is a sequence of data generated at high speed and in high volume by various information sources. Classifying data streams faces three challenges: unlimited length, online processing, and concept drift. To cope with unlimited stream length, related work commonly divides the stream into fixed-size windows or applies gradual forgetting. Concept drift refers to changes in the statistical properties of the data, and is divided into four categories: sudden, gradual, incremental, and recurring. It is generally handled by periodically updating the classifier, or by employing an explicit change detector to determine when to update. These approaches assume that true labels are available for all data samples. However, because labeling instances is costly, access to only partial labeling is more realistic. In a number of studies that use semi-supervised learning, labels are requested from the user to update the models, in the form of active learning. The purpose of this study is to classify samples in an unlimited data stream in the presence of concept drift, using only a limited set of initial labeled data. To this end, a semi-supervised ensemble learning algorithm for data streams is proposed, which uses entropy variation to detect concept drift and is applicable to sudden and gradual drifts. The proposed model is trained on a limited initial labeled set. When concept drift occurs, unlabeled data is used to update the ensemble model, without requesting labels from the user. In contrast to many current studies, the proposed algorithm uses an ensemble of K-NN classifiers. It constructs a group of clustering-based classification models, each trained on a batch of data. When a new sample arrives, the algorithm first determines whether it is an outlier. If the sample falls within a cluster, its class is determined by majority voting.
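
The outlier check and majority vote described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how clusters are represented, so the `Cluster` type (centroid, radius, label), the Euclidean distance test, and the `classify` function are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Sequence


@dataclass
class Cluster:
    """Hypothetical cluster summary: a centroid, a radius, and a class label."""
    centroid: Sequence[float]
    radius: float
    label: str


def classify(sample: Sequence[float], ensemble: List[List[Cluster]]) -> Optional[str]:
    """Return the majority label, or None if the sample is an outlier.

    Assumption: a model votes only if its nearest cluster contains the sample,
    and a sample covered by no model at all is treated as an outlier.
    """
    votes = Counter()
    for model in ensemble:
        if not model:
            continue  # an empty model cannot vote
        nearest = min(model, key=lambda c: dist(sample, c.centroid))
        if dist(sample, nearest.centroid) <= nearest.radius:
            votes[nearest.label] += 1
    if not votes:
        return None  # outside every cluster of every model
    return votes.most_common(1)[0][0]


# Three toy models, each standing in for a model trained on one batch.
ensemble = [
    [Cluster((0.0, 0.0), 1.0, "a"), Cluster((5.0, 5.0), 1.0, "b")],
    [Cluster((0.1, 0.0), 1.0, "a")],
    [Cluster((0.0, 0.1), 1.0, "b")],
]
print(classify((0.2, 0.1), ensemble))    # two votes for "a", one for "b"
print(classify((10.0, 10.0), ensemble))  # covered by no cluster
```

In this sketch an outlier is reported as `None`, so the caller decides what to do with it; how the proposed method treats outliers is not stated in the abstract.
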
When a window of the stream has been received, the possibility of concept drift is examined based on entropy variation, and the classifier is updated by a semi-supervised approach if necessary. The model itself determines the required data labels. The proposed method is thus capable of detecting concept drift and of improving its accuracy by updating the learning model with appropriate samples received from the stream, so it requires only a small initial labeled set. Experiments are performed on five real and synthetic datasets, and the model's performance is compared with three other approaches. The results show that the proposed method outperforms these approaches in terms of precision, recall, and F1 score.
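
As a rough illustration of a window-level entropy check, the following Python sketch compares the Shannon entropy of the labels predicted in two windows. The abstract does not say which quantity the entropy is computed over or how the variation is thresholded, so the use of predicted labels, the function names, and the `threshold` value are assumptions, not the authors' method.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def shannon_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def drift_suspected(reference: Sequence[Hashable],
                    window: Sequence[Hashable],
                    threshold: float = 0.3) -> bool:
    """Flag possible drift when the entropy changes by more than `threshold`.

    `threshold` is a hypothetical tuning parameter chosen for this example.
    """
    return abs(shannon_entropy(window) - shannon_entropy(reference)) > threshold


reference_window = ["a", "a", "b", "b"]  # balanced predictions: 1 bit
current_window = ["a", "a", "a", "a"]    # a single class: 0 bits
print(drift_suspected(reference_window, current_window))
```

A detector of this kind reacts only to changes in the label mix, which is one reason a real system would pair it with a model update step such as the semi-supervised update described in the abstract.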

Article number: 9

Keywords: data stream, ensemble learning, concept drift, entropy, semi-supervised classification

Type of Study: Research | Subject: Paper
Received: 2019/06/8 | Accepted: 2021/03/1 | Published: 2022/03/21 | ePublished: 2022/03/21

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
