Volume 19, Issue 4 (3-2023) | JSDP 2023, 19(4): 137-148


Rahimi Z, Homayounpour M M. A New Document Embedding Method for News Classification. JSDP 2023; 19(4): 10
URL: http://jsdp.rcisp.ac.ir/article-1-1159-en.html
Amirkabir University of Technology
Abstract:
Text classification is one of the main tasks of natural language processing (NLP): documents are assigned to pre-defined categories. A vast amount of news is published on the web, and a text classifier can categorize it automatically, facilitating and accelerating access to the news. The first step in text classification is to represent documents in a form that a classifier can distinguish. The literature offers an abundance of document representation methods, which can be divided into bag-of-words models, graph-based methods, word-embedding pooling, neural-network-based methods, and topic-modeling-based methods. Most of these methods use only local word co-occurrences to generate document embeddings. Local co-occurrences miss the overall view of a document and the topical information that can be very useful for classifying news articles.
In this paper, we propose a method that utilizes the term-document and document-topic matrices to generate richer representations for documents. The term-document matrix represents a document in a specific way, where every word plays a role in representing the document; the generalization power of this representation for text classification and information retrieval is limited. This matrix is built from global (document-level) co-occurrences, which are more suitable for text classification than local co-occurrences. The document-topic matrix represents a document in an abstract way, built from higher-level co-occurrences; this representation generalizes well for text classification, but it is so high-level that it misses rare words, features that can be very useful for text classification.
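To make the two views concrete, here is a minimal sketch assuming a scikit-learn pipeline; the toy corpus, vocabulary settings, and topic count are illustrative assumptions, not the paper's configuration:

```python
# Minimal sketch of the two document views described above (assumed
# scikit-learn pipeline; corpus and hyperparameters are toy choices).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks fell sharply after the quarterly earnings report",
    "the home team won the championship game last night",
    "a new vaccine shows promising results in clinical trials",
]

# Term-document view: every vocabulary word is a feature, built from
# global (document-level) co-occurrences; specific but a weak generalizer.
term_doc = TfidfVectorizer().fit_transform(docs)       # (n_docs, n_terms)

# Document-topic view: LDA over raw counts captures abstract, higher-level
# co-occurrences; generalizes well but drops rare, discriminative words.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topic = lda.fit_transform(counts)                  # (n_docs, n_topics)
```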
The proposed approach is an unsupervised document-embedding model that exploits both the document-topic and term-document matrices to generate a richer representation for documents. It constructs a tensor from these two matrices and applies tensor factorization to reveal hidden aspects of the data. The method is evaluated on text classification with the 20-Newsgroups and R8 datasets, which are benchmark datasets for news classification. The results show the superiority of the proposed model over baseline methods: text classification accuracy improves by 3%.
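The abstract does not spell out the tensor construction, so the sketch below is only a plausible reading of the idea, assuming TensorLy and an outer-product coupling of the two matrices; the random stand-in matrices, the coupling, and the rank are all hypothetical choices:

```python
# Hypothetical sketch: couple the two views into a third-order
# (document x term x topic) tensor and factorize it with CP/PARAFAC.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
term_doc = rng.random((100, 500))    # stand-in for a TF-IDF matrix
doc_topic = rng.random((100, 20))    # stand-in for an LDA matrix

# Assumed coupling: T[d, w, k] = term_doc[d, w] * doc_topic[d, k].
tensor = tl.tensor(np.einsum("dw,dk->dwk", term_doc, doc_topic))

# CP decomposition; the document-mode factor matrix serves as the dense
# document embedding handed to a downstream classifier.
weights, factors = parafac(tensor, rank=16, n_iter_max=200)
doc_embedding = factors[0]           # (n_docs, rank)
```

Under these assumptions, the rows of doc_embedding would then feed an off-the-shelf classifier (e.g., logistic regression) for the 20-Newsgroups and R8 evaluations.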
Article number: 10
Full-Text [PDF 577 kb]
Type of Study: Applied | Subject: Paper
Received: 2020/08/01 | Accepted: 2021/03/08 | Published: 2023/03/20 | ePublished: 2023/03/20


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
