Volume 16, Issue 3 (12-2019)                   JSDP 2019, 16(3): 88-79

Baradaran R, Golpar-Raboki E. Feature Extraction and Efficiency Comparison Using Dimension Reduction Methods in Sentiment Analysis Context. JSDP. 2019; 16 (3) :88-79
URL: http://jsdp.rcisp.ac.ir/article-1-698-en.html
Qom University
Abstract:

With widespread access to the Internet, and to social networks in particular, users can readily share their ideas and opinions. Analyzing these opinions and sentiments can play a significant role in the decision making of organizations and producers. Hence, sentiment analysis, or opinion mining, is an important field in natural language processing (NLP). Machine learning methods, which build a model that maps features to the desired output, are among the most common ways to solve such problems. One challenge of using machine learning in NLP is selecting and extracting features from a large initial feature set so that the resulting models achieve high accuracy. In fact, a large number of features not only causes computational and time problems but can also degrade model accuracy.
Previous studies have used a variety of methods for feature extraction and selection. Some select the important features from the feature set, such as methods based on Principal Component Analysis (PCA). Others, such as neural networks, map the original features to new ones with fewer dimensions while preserving their semantic relations; for example, sparse feature vectors can be converted to dense embedding vectors using neural network-based methods. Still others cluster the feature set and extract a lower-dimensional feature set, as NMF-based methods do. In this paper, we compare the performance of three methods, one from each of these classes, across different dataset sizes.
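As an illustration of the first class of methods, the sketch below projects a small term-count matrix onto its top singular directions, which is the computation underlying PCA- and SVD-style reduction. It is a minimal sketch in pure Python, assuming power iteration with Gram-Schmidt deflation; the function names and the toy matrix are hypothetical and are not taken from the paper, and a real experiment would use a numerical library.

```python
import math
import random

def matvec(a, x):
    """Multiply matrix `a` (list of rows) by vector `x`."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def project_top_k(a, k, iterations=200, seed=0):
    """Project the rows of `a` onto its top-k right singular vectors.

    Uses power iteration on A^T A, removing directions already found
    (Gram-Schmidt deflation). Assumes `a` has rank of at least k.
    """
    rng = random.Random(seed)
    a_t = [list(col) for col in zip(*a)]
    basis = []
    for _ in range(k):
        v = [rng.random() for _ in range(len(a_t))]
        for _ in range(iterations):
            v = matvec(a_t, matvec(a, v))  # one step of A^T A v
            for b in basis:  # stay orthogonal to earlier directions
                dot = sum(x * y for x, y in zip(v, b))
                v = [x - dot * y for x, y in zip(v, b)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        basis.append(v)
    # Each document (row) becomes a k-dimensional dense vector.
    return [[sum(x * y for x, y in zip(row, b)) for b in basis] for row in a]

# Hypothetical toy matrix: 4 documents (rows) x 6 terms (columns).
counts = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
]
reduced = project_top_k(counts, k=2)
```

Here each six-dimensional count vector is replaced by a two-dimensional one, the same kind of compression the paper applies at a much larger scale.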
In this study, we use two matrix-based compression methods: Singular Value Decomposition (SVD), which is based on selecting the more important attributes, and Non-negative Matrix Factorization (NMF), which is based on clustering the initial features. We also use one auto-encoder-based method, which converts the initial features to a new feature set with the same semantic relations. We compare how well these methods extract fewer, more effective features for a sentiment analysis task on a Persian dataset. We also evaluate the impact of the compression level and the dataset size on model accuracy. The experiments show that compression not only reduces computational and time costs but can also increase the accuracy of the model.
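To make the NMF idea concrete, the following sketch factors a small non-negative document-term matrix V into W and H using the standard multiplicative update rules of Lee and Seung. This is an assumption about the general technique, not the paper's implementation; the helper names, the toy matrix, and the iteration count are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, the quantity the updates reduce."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate V (n x m) by W (n x k) times H (k x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

# Hypothetical toy matrix: 4 documents (rows) x 6 terms (columns).
counts = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
]
w, h = nmf(counts, k=2)
```

In this sketch the rows of `w` serve as the compressed two-dimensional features of each document, while the rows of `h` describe how the original terms group into the two components.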
For the experimental analysis, we use the Sentipers dataset, which contains more than 19,000 samples of user opinions about digital products; each sample is represented as a bag-of-words vector. These feature vectors are very large because their size equals the vocabulary size. We set up our experiments with four sub-datasets of different sizes and show how each compression method performs at various compression levels (feature counts) depending on the dataset size.
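The following minimal sketch shows why a bag-of-words vector is as long as the vocabulary. The English toy reviews and the function names are hypothetical; the actual corpus is Persian, and its tokenization is not described in this abstract.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def bag_of_words(document, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

# Hypothetical toy reviews standing in for real user opinions.
reviews = [
    "battery life is great",
    "screen is dim and battery is weak",
    "great screen great price",
]
vocabulary = build_vocabulary(reviews)
vectors = [bag_of_words(review, vocabulary) for review in reviews]
```

Every vector has one slot per vocabulary word, so most entries are zero; this sparsity and size are what the compression methods are meant to address.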
According to the results of classification with an SVM, compressing the features with the neural network from 7700 to 2000 not only speeds up processing and reduces storage costs but also increases the accuracy of the model from 77.05% to 77.85% on the largest dataset, which contains about 19,000 samples. On the small dataset, the SVD approach gives better results: with 2000 features extracted from the 7700 original features, it reaches 63.92% accuracy compared with the initial 63.57%.
Furthermore, the results indicate that, on the large dataset with low-dimensional feature sets, neural network-based compression is much better than the other approaches: with only 100 features extracted by the neural network-based auto-encoder, the system achieves an acceptable accuracy of 74.46%, against 67.15% for SVD, 64.09% for NMF, and 77.05% for the base model with 7700 features.
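To put the reported figures in perspective, the short calculation below derives the fraction of features kept and the accuracy change relative to the 7700-feature baseline on the largest dataset. It uses only numbers quoted in this abstract; the variable and function names are illustrative.

```python
ORIGINAL_FEATURES = 7700
BASELINE_ACCURACY = 77.05  # percent, base model on the largest dataset

# (method, features kept, accuracy in percent) as quoted in the abstract
reported = [
    ("auto-encoder", 2000, 77.85),
    ("auto-encoder", 100, 74.46),
    ("SVD", 100, 67.15),
    ("NMF", 100, 64.09),
]

def summarize(method, kept, accuracy):
    """Relate one reported result to the uncompressed baseline."""
    return {
        "method": method,
        "features": kept,
        "kept_fraction": kept / ORIGINAL_FEATURES,
        "accuracy_delta": accuracy - BASELINE_ACCURACY,
    }

summary = [summarize(*row) for row in reported]
for row in summary:
    print(
        f"{row['method']:>12} {row['features']:>5} features "
        f"({row['kept_fraction']:.1%} of original): "
        f"{row['accuracy_delta']:+.2f} points vs. baseline"
    )
```

Read this way, the auto-encoder at 100 features keeps roughly one feature in 77 while losing under three accuracy points, whereas SVD and NMF at the same size lose about ten and thirteen points respectively.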
 

Type of Study: Research | Subject: Paper
Received: 2017/07/22 | Accepted: 2019/06/19 | Published: 2020/01/7 | ePublished: 2020/01/7



© 2015 All Rights Reserved | Signal and Data Processing