Volume 16, Issue 1 (5-2019)                   JSDP 2019, 16(1): 143-157





Sajedi H, Taslimi M. Author gender identification from text using Bayesian Random Forest. JSDP 2019; 16(1): 143-157
URL: http://jsdp.rcisp.ac.ir/article-1-429-en.html
University of Tehran
Abstract:

The widespread use of virtual environments and the connections users form through social networks such as Facebook, Instagram, and Twitter make it increasingly important to discover the shared subjects in these environments. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications span a wide range of fields, from personalized advertising to law enforcement to reputation management. Text posts represent a large portion of user-generated content and contain information that can be relevant to discovering undisclosed user attributes or to investigating the honesty of self-reported age and gender. Because most information is exchanged in text format, identifying author attributes such as age, gender, and political and religious opinions from this content becomes all the more important. Gender identification, which could be useful in security and marketing, answers the following question: given a short text document, can we identify whether the author is male or female? This question is motivated by recent events in which people faked their gender on the Internet. In this paper, author gender identification in blog data is investigated. To this end, four groups of features are employed: syntactic features, word-based features, character-based features, and function words. In addition, character n-gram features are used to improve classification accuracy. To evaluate the proposed method, 3212 texts were collected from Technorati.com and blogger.com. Experimental results demonstrate that these types of features are practical. Furthermore, a new classification method called "Bayesian Random Forest" is introduced. Each tree in a Bayesian Random Forest is a Bayes tree.
The experimental results show that this method compares favorably with other classification algorithms such as Naïve Bayes, Naïve Bayes Tree, and Random Forest, and that it increases the accuracy of gender identification to 89.5%.
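
As a rough illustration of the feature groups named in the abstract, the following Python sketch computes a few stand-in features per group plus character n-gram counts. It is not the authors' implementation: the abstract does not list the individual features, so the function names (`char_ngrams`, `extract_features`), the short `FUNCTION_WORDS` list, the use of punctuation rates as a proxy for syntactic features, and the choice of trigrams are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, very short function-word list for illustration only;
# the list actually used in the paper is not given in the abstract.
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "i", "you", "but", "with", "for")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def extract_features(text, n=3, top_k=5):
    """Return a flat feature dict with a few assumed features per group."""
    words = re.findall(r"[a-z']+", text.lower())
    n_chars = max(len(text), 1)   # guard against empty input
    n_words = max(len(words), 1)

    features = {
        # Character-based (assumed examples): letter and digit ratios.
        "char_letter_ratio": sum(c.isalpha() for c in text) / n_chars,
        "char_digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        # Word-based (assumed examples): average length, vocabulary richness.
        "word_avg_length": sum(len(w) for w in words) / n_words,
        "word_type_token_ratio": len(set(words)) / n_words,
        # Syntactic (assumed proxy): punctuation rates per character.
        "syn_comma_rate": text.count(",") / n_chars,
        "syn_question_rate": text.count("?") / n_chars,
    }

    # Function words: relative frequency of each listed word.
    word_counts = Counter(words)
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = word_counts[fw] / n_words

    # Character n-grams: raw counts of the top_k most frequent n-grams.
    for gram, count in char_ngrams(text, n).most_common(top_k):
        features["ng_" + gram] = count

    return features


if __name__ == "__main__":
    sample = "The cat and the dog went to the park, but I stayed in."
    for name, value in sorted(extract_features(sample).items()):
        print(name, value)
```

In a real pipeline the n-gram vocabulary would be fixed on the training set rather than chosen per document, so that every text maps to the same feature vector before it is passed to a classifier.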
 

Type of Study: Research | Subject: Paper
Received: 2017/09/27 | Accepted: 2019/01/26 | Published: 2019/06/10 | ePublished: 2019/06/10



Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2015 All Rights Reserved | Signal and Data Processing