Volume 17, Issue 1 (6-2020)                   JSDP 2020, 17(1): 117-130


Asheghi Dizaji Z, Asghari Aghjehdizaj S, Soleimanian Gharehchopogh F. An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification. JSDP 2020; 17 (1) :117-130
URL: http://jsdp.rcisp.ac.ir/article-1-871-en.html
Department of Computer Engineering, Urmia Branch, Islamic Azad University, Urmia, Iran.
Abstract:
Due to the exponential growth of electronic texts, organizing and managing them requires tools that can deliver the information and data users are searching for in the shortest possible time. Classification methods have therefore become very important in recent years.
In natural language processing, and especially in text processing, one of the most basic tasks is automatic text classification. Moreover, text classification is one of the most important parts of data mining and machine learning. Classification can be considered the most important supervised technique: it partitions the input space into k groups based on similarity and difference, such that targets in the same group are similar and targets in different groups are different. Text classification systems are widely used in many fields, such as spam filtering, news classification, web page detection, bioinformatics, machine translation, automatic response systems, and applications involving the automatic organization of documents.
The key to an efficient text classification method is the extraction and selection of the texts' key features. It has been shown that only about 33% of the words and features of a text are useful for extracting information; most of the remaining words serve to express the purpose of the text and are sometimes repeated. Feature selection is known as a good solution to the high dimensionality of the feature space. An excessive number of features not only increases computation time but also degrades classification accuracy. In general, the purpose of extracting and selecting text features is to reduce data volume, training time, and computational cost, and to increase the speed of the methods proposed for text classification. Feature extraction refers to the process of generating a small set of new features by combining or transforming the original ones, while in feature selection the dimension of the space is reduced by selecting the most prominent features.
In this paper, a method for improving the support vector machine (SVM) algorithm using the imperialist competitive algorithm (ICA) is proposed. In the proposed method, the imperialist competitive algorithm is used for feature selection and the support vector machine is used for text classification.
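To make the two-stage structure concrete, the sketch below arranges feature weighting, feature selection, and SVM classification as a single scikit-learn pipeline. SelectKBest is only a placeholder for the ICA-based selector described later, and the parameter values (min_df, k, kernel) are illustrative assumptions rather than the paper's settings.

```python
# A minimal sketch of the two-stage structure, assuming scikit-learn.
# SelectKBest stands in for the ICA-based feature selector; it is NOT the
# paper's selector, only a placeholder marking where selection happens.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

pipeline = Pipeline([
    ("tf", CountVectorizer(min_df=2)),     # TF weighting with pruning of rare terms
    ("select", SelectKBest(chi2, k=100)),  # placeholder for ICA feature selection
    ("svm", SVC(kernel="rbf")),            # classify the reduced feature vectors
])
# Usage (assuming train_texts / train_labels exist and the vocabulary has at
# least k terms):  pipeline.fit(train_texts, train_labels)
```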
At the feature-extraction stage, weighting schemes such as NORMTF, LOGTF, ITF, SPARCK, and TF are used to allocate a weight to each extracted word, in order to determine the role the words play as keywords of the texts. The weight of each word indicates the extent of its effect on the main topic of the text compared to the other words used in the same text. In the proposed method, the TF weighting scheme is used for assigning weights to the words; in this scheme, the feature weights are a function of the distribution of the different features in each of the documents.
Moreover, at this stage, pruning is applied: low-frequency features, i.e., words that occur fewer than two times in the text, are removed. Pruning essentially filters out the low-frequency features of a text [18].
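As a small illustration of this stage, the snippet below builds TF-weighted vectors for a toy corpus and prunes terms that occur fewer than two times. The corpus and the corpus-level interpretation of the frequency threshold are assumptions made for the example.

```python
from collections import Counter

# Toy corpus standing in for the preprocessed documents (assumption:
# tokenization and stop-word removal have already been applied).
docs = [
    "oil prices rise as oil demand grows",
    "new web page detection improves spam filtering",
    "oil exports fall while demand for oil grows",
]
tokenized = [d.split() for d in docs]

# Pruning: drop terms that occur fewer than two times over the whole corpus.
corpus_counts = Counter(t for doc in tokenized for t in doc)
vocab = sorted(t for t, c in corpus_counts.items() if c >= 2)

# TF weighting: the weight of term t in document d is its raw frequency in d
# (other schemes such as NORMTF would divide by the document length).
tf_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    tf_matrix.append([counts.get(t, 0) for t in vocab])

print(vocab)
for row in tf_matrix:
    print(row)
```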
In order to reduce the dimensionality of the feature space and decrease computational complexity, the imperialist competitive algorithm (ICA) is employed in the proposed method. The main goal of using ICA here is to reduce the dimensions of the feature space as much as possible while minimizing the loss of information in the texts.
Since the imperialist competitive algorithm is used for feature selection in the proposed method, a mapping must be created between the parameters of ICA and the proposed method. Accordingly, when ICA is used to select the key features, the search space consists of the dimensions of the feature space, and a fixed percentage of all the extracted features is assigned to each country. Since this mapping is carried out randomly, a country may also contain repeated features. Next, following the general procedure of the imperialist competitive algorithm, the more powerful countries are considered imperialists, while the other countries become colonies. Once the countries are formed, the optimization process can begin. Each country is defined as a 1×N_var array of variable values, as in Equations 2 and 3.
(2) Country = [p_1, p_2, ..., p_(N_var)]
(3) Cost = f(Country) = f(p_1, p_2, ..., p_(N_var))
The variables assigned to each country can be structural features, lexical features, semantic features, the weights of individual words, and so on. Accordingly, the power of each country for identifying the class of a text increases or decreases depending on its variables.
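The following sketch shows one plausible way to encode a country and its cost in this feature-selection setting: a country is a candidate subset of feature indices, and its cost combines the cross-validated classification error of an SVM on those features with a small penalty on the subset size. The encoding, the penalty weight alpha, and the use of cross_val_score are assumptions for illustration, not the paper's exact cost function.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def random_country(n_features, subset_size, rng):
    # A country = one candidate solution; here, a random subset of feature
    # indices playing the role of the variables p_1, ..., p_(N_var).
    return rng.choice(n_features, size=subset_size, replace=False)

def cost(country, X, y, alpha=0.01):
    # Cost = cross-validated classification error on the selected features
    # plus a small penalty on the number of features (alpha is an assumption).
    error = 1.0 - cross_val_score(SVC(kernel="rbf"), X[:, country], y, cv=3).mean()
    return error + alpha * len(country) / X.shape[1]

# Toy data standing in for the TF matrix of a real corpus.
rng = np.random.default_rng(0)
X = rng.random((60, 40))
y = rng.integers(0, 2, 60)
countries = [random_country(X.shape[1], 10, rng) for _ in range(8)]
costs = [cost(c, X, y) for c in countries]
```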
One of the most important phases of the imperialist competitive algorithm is the imperialistic competition phase. In this phase, all the imperialists try to increase the number of colonies they own: the more powerful empires try to seize the colonies of the weakest empire to increase their own power. In the proposed method, the empires with the highest classification error and the largest number of features are considered the weakest empires.
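A minimal sketch of this competition step is given below, following the standard ICA move in which the weakest empire loses its weakest colony to a rival chosen with probability proportional to normalized power. The dictionary-based data layout and the coefficient xi are assumptions made for the example.

```python
import numpy as np

def imperialistic_competition(empires, rng, xi=0.1):
    # One ICA competition step (sketch): the weakest empire gives up its
    # weakest colony to a rival picked with probability proportional to power.
    # `empires` is a list of dicts with "imperialist_cost" and "colony_costs";
    # this data layout is an assumption for the example.
    if len(empires) < 2:
        return
    totals = np.array([
        e["imperialist_cost"] + xi * (np.mean(e["colony_costs"]) if e["colony_costs"] else 0.0)
        for e in empires
    ])
    weakest = int(np.argmax(totals))          # highest total cost = weakest empire
    if not empires[weakest]["colony_costs"]:
        return
    # Hand over the weakest (highest-cost) colony of the weakest empire.
    idx = int(np.argmax(empires[weakest]["colony_costs"]))
    colony = empires[weakest]["colony_costs"].pop(idx)
    power = totals.max() - totals             # normalized power of each empire
    power[weakest] = 0.0
    if power.sum() == 0.0:                    # degenerate case: share uniformly
        power = np.ones(len(empires))
        power[weakest] = 0.0
    winner = rng.choice(len(empires), p=power / power.sum())
    empires[winner]["colony_costs"].append(colony)

rng = np.random.default_rng(1)
empires = [
    {"imperialist_cost": 0.10, "colony_costs": [0.30, 0.25]},
    {"imperialist_cost": 0.40, "colony_costs": [0.55]},
]
imperialistic_competition(empires, rng)
```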
Based on trial and error, and considering the objective function of the proposed method, the number of key features relevant to the main topic of the texts is set to a fixed fraction of the total extracted features, and using only this subset of key features together with a classifier such as the support vector machine (SVM) or k-nearest neighbors, the class of a text can be determined in the proposed method.
Since text classification is a nonlinear problem, the problem must first be mapped into a linear one in order to classify the texts. In this paper, the RBF kernel function is used for this mapping.
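A small sketch of this classification stage is shown below: an RBF-kernel SVM trained on vectors restricted to the selected key features. The random data and the gamma and C values are assumptions, since the abstract does not state the kernel parameter values it uses.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.random((100, 20))    # stands in for TF vectors restricted to the selected key features
y = rng.integers(0, 2, 100)  # toy class labels

# The RBF kernel implicitly maps the nonlinearly separable text vectors into a
# space where a linear separator exists; gamma and C here are assumed values.
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X, y)
predicted = clf.predict(X[:5])
```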
The hybrid algorithm is applied to the Reuters-21578, WebKB, and Cade12 data sets to evaluate the accuracy of the proposed method. The simulation results indicate that the proposed hybrid algorithm outperforms the basic support vector machine in terms of precision, recall, and F-measure.
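For reference, the evaluation criteria mentioned here can be computed as in the snippet below. The toy labels are placeholders; on multi-class corpora such as Reuters-21578, a macro or micro average would be chosen.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy predictions standing in for the classifier's output on a held-out split.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
# For multi-class corpora, pass average="macro" (or "micro") to each metric.
```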
 
Full-Text [PDF 4403 kb]
Type of Study: Research | Subject: Paper
Received: 2018/06/3 | Accepted: 2019/07/10 | Published: 2020/06/21 | ePublished: 2020/06/21

References
[1] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2007. [DOI:10.1017/CBO9780511546914]
[2] D. Chiang, H. Keh, H. Huang, and D. Chyr, "The Chinese text categorization system with association rule and category priority", Expert Systems with Applications, vol. 35, no. 1-2, pp. 102-110, 2008. [DOI:10.1016/j.eswa.2007.06.019]
[3] L. Khreisat, "A machine learning approach for Arabic text classification using N-gram frequency statistics", Journal of Informetrics, vol. 3, no. 1, pp. 72-77, 2009. [DOI:10.1016/j.joi.2008.11.005]
[4] A. An, B. Dauletbakov, and E. Levner, "Multi-attribute Classification of Text Documents as a Tool for Ranking and Categorization of Educational Innovation Projects", Lecture Notes in Computer Science, vol. 8404, pp. 404-416, 2014. [DOI:10.1007/978-3-642-54903-8_34]
[5] A. K. Uysal, "An improved global feature selection scheme for text classification", Expert Systems with Applications, vol. 43, pp. 82-92, 2016. [DOI:10.1016/j.eswa.2015.08.050]
[6] C. H. Wan, L. H. Lee, R. Rajkumar, and D. Isa, "A hybrid text classification approach with low dependency on parameter by integrating K-nearest neighbor and support vector machine", Expert Systems with Applications, vol. 39, no. 15, pp. 11880-11888, 2013. [DOI:10.1016/j.eswa.2012.02.068]
[7] B. Ramesh and J. G. R. Sathiaseelan, "An advanced Multi Class instance selection based Support Vector Machine for Text Classification", Procedia Computer Science, vol. 57, pp. 1124-1130, 2015. [DOI:10.1016/j.procs.2015.07.400]
[8] Y. Ko and J. Seo, "Text classification from unlabeled documents with bootstrapping and feature projection techniques", Information Processing & Management, vol. 45, no. 1, pp. 70-83, 2009. [DOI:10.1016/j.ipm.2008.07.004]
[9] N. Shafiabady, L. H. Lee, R. Rajkumar, V. P. Kallimani, N. A. Akram, and D. Isa, "Using unsupervised clustering approach to train the Support Vector Machine for text classification", Neurocomputing, vol. 211, pp. 4-10, 2016. [DOI:10.1016/j.neucom.2015.10.137]
[10] L. H. Lee, D. Isa, W. O. Choo, and W. Y. Chue, "High Relevance Keyword Extraction facility for Bayesian text classification on different domains of varying characteristic", Expert Systems with Applications, vol. 39, no. 1, pp. 1147-115, 2013. [DOI:10.1016/j.eswa.2011.07.116]
[11] J. He, A. H. Tan, and C. L. Tan, "On Machine Learning methods for Chinese document categorization", Applied Intelligence, vol. 18, no. 3, pp. 311-322, 2003. [DOI:10.1023/A:1023202221875]
[12] D. Isa, L. H. Lee, V. P. Kallimani, and R. Rajkumar, "Text document pre-processing with the Bayes formula for classification using the support vector machine", IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1264-1272, 2008. [DOI:10.1109/TKDE.2008.76]
[13] D. S. Guru and M. Suhil, "A Novel Term_Class Relevance Measure for Text Categorization", Procedia Computer Science, vol. 45, pp. 13-22, 2015. [DOI:10.1016/j.procs.2015.03.074]
[14] A. Onan, S. Korukoğlu, and H. Bulut, "Ensemble of keyword extraction methods and classifiers in text classification", Expert Systems with Applications, vol. 57, pp. 232-247, 2016. [DOI:10.1016/j.eswa.2016.03.045]
[15] Y. Ko, J. Park, and J. Seo, "Automatic Text Categorization using the Importance of Sentences", in Proceedings of the 19th International Conference on Computational Linguistics, Association for Computational Linguistics, vol. 1, pp. 1-7, 2002. [DOI:10.3115/1072228.1072331]
[16] M. Sivakumar, C. Karthika, and P. Renuga, "A Hybrid Text Classification Approach Using KNN and SVM", International Journal of Innovative Research in Science, Engineering and Technology, vol. 3, Special Issue 3, pp. 1987-1991, 2014.
[17] G. Feng, J. Guo, B. Y. Jing, and T. Sun, "Feature Subset Selection Using Naive Bayes for Text Classification", Pattern Recognition Letters, vol. 65, pp. 109-115, 2015. [DOI:10.1016/j.patrec.2015.07.028]
[18] H. Uguz, "A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm", Knowledge-Based Systems, vol. 24, no. 7, pp. 1024-1032, 2011. [DOI:10.1016/j.knosys.2011.04.014]
[19] E. H. S. Han, G. Karypis, and V. Kumar, "Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification", in Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 53-65, Springer Berlin Heidelberg, 2001. [DOI:10.1007/3-540-45357-1_9]
[20] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, vol. 39, no. 2, pp. 103-134, Kluwer Academic Publishers, 2000.
[21] R. Habibpour and K. Khalilpour, "A New Hybrid K-means and K-Nearest-Neighbor Algorithms for Text Document Clustering", International Journal of Academic Research, vol. 6, no. 3, pp. 79-84, 2014. [DOI:10.7813/2075-4124.2014/6-3/A.12]
[22] S. Kashef and H. Nezamabadi-pour, "An advanced ACO algorithm for feature subset selection", Neurocomputing, vol. 147, pp. 271-279, 2015. [DOI:10.1016/j.neucom.2014.06.067]
[23] A. S. Ghareb, A. A. Bakar, and A. R. Hamdan, "Hybrid feature selection based on enhanced genetic algorithm for text categorization", Expert Systems with Applications, vol. 49, pp. 31-47, 2016. [DOI:10.1016/j.eswa.2015.12.004]
[24] Y. Lu, M. Liang, Z. Ye, and L. Cao, "Improved particle swarm optimization algorithm and its application in text feature selection", Applied Soft Computing, vol. 35, pp. 629-636, 2015. [DOI:10.1016/j.asoc.2015.07.005]
[25] H. Wang and B. Niu, "A novel bacterial algorithm with randomness control for feature selection in classification", Neurocomputing, vol. 228, pp. 176-186, 2017. [DOI:10.1016/j.neucom.2016.09.078]
[26] V. Rezaie, M. Mohammadpour, H. Parvin, and S. Nejatian, "An Approach for Extraction of Keywords and Weighting Words for Improvement Farsi Documents Classification", JSDP, vol. 14, no. 4, pp. 55-78, 2018. [DOI:10.29252/jsdp.14.4.55]
[27] F. Rad, H. Parvin, A. Dehbashi, and B. Minaee, "Improved Clustering Persian Text Based on Keyword Using Linguistic and Thesaurus Knowledge", JSDP, vol. 13, no. 1, pp. 87-100, 2016.
[28] F. Hoseinkhani and B. Nasersharif, "Two Feature Transformation Methods Based on Genetic Algorithm for Reducing Support Vector Machine Classification Error", JSDP, vol. 12, no. 2, pp. 23-39, 2015. [DOI:10.1109/PRIA.2015.7161625]
[29] E. Atashpaz-Gargari and C. Lucas, "Imperialist competitive algorithm: An algorithm for optimization inspired by imperialistic competition", in IEEE Congress on Evolutionary Computation, pp. 4661-4667, 2007. [DOI:10.1109/CEC.2007.4425083]
[30] C. Lucas, Z. Nasiri-Gheidari, and F. Tootoonchian, "Application of an imperialist competitive algorithm to the design of a linear induction motor", Energy Conversion and Management, vol. 51, no. 7, pp. 1407-1411, 2010. [DOI:10.1016/j.enconman.2010.01.014]
[31] E. Atashpaz-Gargari, F. Hashemzadeh, R. Rajabioun, and C. Lucas, "Colonial Competitive Algorithm, a novel approach for PID controller design in MIMO distillation column process", International Journal of Intelligent Computing and Cybernetics, vol. 1, no. 3, pp. 337-355, 2008. [DOI:10.1108/17563780810893446]
[32] T. Mitchell, K. Nigam, D. Freitag, and M. Craven, "Learning to extract symbolic knowledge from the world wide web", DTIC Document, 1998.
[33] A. Asuncion and D. J. Newman, UCI Machine Learning Repository, Irvine, CA: University of California, Department of Information and Computer Science, 2007.
[34] http://archive.ics.uci.edu/ml/datasets/Reuters21778+Text+Categorization+Collection [Last Access: 12-19-2112].
[35] http://ana.cachopo.org/datasets-for-single-label-text-categorization [Last Access: 12-19-2112].
[36] J. J. Rocchio, "Document Retrieval Systems - Optimization and Evaluation", PhD thesis, Harvard, 1966.
[37] K. M. Elhadad, Kh. M. Badran, and G. I. Salama, "A Novel Approach for Ontology-based Dimensionality Reduction for Web Text Document Classification", IEEE Computer Society, pp. 373-378, 2017. [DOI:10.1109/ICIS.2017.7960021]

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
