Volume 21, Issue 1 (6-2024)                   JSDP 2024, 21(1): 15-26


pirgazi J, Ghanbari sorkhi A, Iranpour Mobarakeh M. Extracting and combination efficient feature from protein sequence for classify protein based on rotation forest. JSDP 2024; 21 (1) : 2
URL: http://jsdp.rcisp.ac.ir/article-1-1387-en.html
Faculty of Electrical and Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
Abstract
Protein function prediction is one of the main challenges in bioinformatics and has many applications. In recent years, much of the research in this field has used machine learning methods. In these methods, different features are first extracted from the protein sequence, and classification is then performed on the extracted features. Feature extraction methods are based on the physical and chemical properties of the protein sequence, so extracting suitable features improves the performance of machine learning methods. This paper proposes a new set of features based on the Position-Specific Scoring Matrix (PSSM), the Pseudo-Position-Specific Scoring Matrix (PsePSSM), K-gram, Amino Acid Composition (AAC), and the Term Frequency and Category Relevancy Factor (TFCRF) method, which has not been used in this application so far.
The PSSM method uses the scoring matrix employed in protein BLAST searches, in which amino acid substitution scores are given separately for each position in a multiple sequence alignment of proteins. The PsePSSM feature considers correlation factors of different ranks along a protein sequence in order to preserve information about the order of the amino acids. The AAC method calculates the normalized occurrence frequency of each amino acid type in the protein. A K-gram is a set of K successive items, here amino acids, in a protein sequence.
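The sequence-based features described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `aac`, `kgram`, and `pse_pssm` are hypothetical, the PSSM is assumed to be an already computed and normalized L x 20 matrix (for example, parsed from PSI-BLAST output), and the PsePSSM part follows the commonly used formulation of column means plus the mean squared difference between rows that are a fixed lag apart.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Amino Acid Composition: normalized frequency of each standard residue."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def kgram(sequence, k=2):
    """Normalized frequency of every possible run of k successive residues."""
    grams = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Ignore windows that contain non-standard symbols such as 'X'.
    counts = Counter(g for g in grams if all(c in AMINO_ACIDS for c in g))
    total = sum(counts.values())
    keys = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[key] / total if total else 0.0 for key in keys]


def pse_pssm(pssm, max_lag=1):
    """PsePSSM sketch: column means followed by lagged mean squared differences.

    `pssm` is assumed to be a list of L rows, one per sequence position,
    each holding the (normalized) scores for that position.
    """
    length = len(pssm)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    n_cols = len(pssm[0])
    # Part 1: average score of each column over the whole sequence.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    # Part 2: for each lag, the mean squared difference between positions
    # i and i + lag, which keeps some sequence-order information.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            diffs = [(pssm[i][j] - pssm[i + lag][j]) ** 2
                     for i in range(length - lag)]
            features.append(sum(diffs) / (length - lag))
    return features
```

For a protein of any length, `aac` yields 20 values, `kgram` with k = 2 yields 400 values, and `pse_pssm` yields 20 × (1 + max_lag) values for a 20-column matrix, so the outputs can be concatenated into one fixed-length feature vector per protein.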
The TFCRF weighting method considers not only how the terms are distributed across different sequences but also how they are distributed across different classes. The features extracted with this method give machine learning models good discriminating power between the data in different classes. In the next step, classification is performed on the extracted features using the rotation forest method. This classifier is a successful ensemble method for a wide range of data mining applications. In this method, the feature space is transformed through Principal Component Analysis (PCA), which increases the power of the classifier. The proposed method has been compared with different classifiers. The results show that the efficiency of the proposed method is much better than that of other state-of-the-art methods in this application.
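The PCA-based transformation of the feature space in rotation forest can be sketched as follows. This is a simplified illustration with NumPy, not the authors' code: the function name `rotation_matrix` and its parameters are hypothetical, the 75% bootstrap fraction is an assumed default, and the random removal of classes used in the original rotation forest algorithm is omitted. Only the rotation step is shown; training the base classifier is left out.

```python
import numpy as np


def rotation_matrix(X, n_subsets=2, sample_fraction=0.75, rng=None):
    """Build one rotation matrix for the feature matrix X (samples x features).

    Features are randomly split into disjoint subsets, PCA is run on a
    bootstrap sample of each subset, and all principal components are kept
    and placed into a sparse matrix aligned with the original feature order.
    """
    rng = np.random.default_rng(rng)
    n_samples, n_features = X.shape
    if not 1 <= n_subsets <= n_features:
        raise ValueError("n_subsets must be between 1 and the number of features")
    # Randomly partition the feature indices into disjoint subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for cols in subsets:
        # Bootstrap sample of the rows, restricted to this feature subset.
        n_rows = max(2, int(sample_fraction * n_samples))
        rows = rng.choice(n_samples, size=n_rows, replace=True)
        block = X[np.ix_(rows, cols)]
        # PCA via eigen-decomposition of the covariance matrix;
        # every component is retained, so no information is discarded.
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        _, vecs = np.linalg.eigh(cov)
        R[np.ix_(cols, cols)] = vecs[:, ::-1]  # largest-variance component first
    return R
```

Each member of the ensemble would draw its own matrix `R`, train a base classifier (typically a decision tree) on `X @ R`, and apply the same `R` to new samples before prediction; the ensemble output is then obtained by combining the members' predictions. Because every subset keeps all of its components, `R` is orthogonal, so the data are rotated rather than compressed.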
Article number: 2
Type of Study: Research | Subject: Paper
Received: 2023/07/1 | Accepted: 2024/02/25 | Published: 2024/08/3 | ePublished: 2024/08/3


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2015 All Rights Reserved | Signal and Data Processing