Noor-stem v.1 A Benchmark Dataset for Evaluating the Arabic Stemmers

Al-Aswad, Azal; Minaei-Bidgoli, Behrouz; Shenassa, Mohammad-Ebrahim; Hossayni, Sayyed-Ali; Seryani, Habib

doi:10.61186/jsdp.21.1.101

Volume 21, Issue 1 (6-2024) JSDP 2024, 21(1): 101-112

Al-Aswad A, Minaei-Bidgoli B, Shenassa M, Hossayni S, Seryani H. Noor-stem v.1 A Benchmark Dataset for Evaluating the Arabic Stemmers. JSDP 2024; 21 (1) : 8
URL: http://jsdp.rcisp.ac.ir/article-1-1346-en.html

Azal Al-Aswad, Behrouz Minaei-Bidgoli ^*, Mohammad-Ebrahim Shenassa, Sayyed-Ali Hossayni, Habib Seryani

Abstract:

The main task of tokenization is to divide the sentences of a text into their constituent units and to remove punctuation marks (periods, commas, etc.). Each unit is a continuous lexical or grammatical string of characters that forms an independent semantic unit. Tokenization operates at the word level, and the extracted units can be used as input to other components such as a stemmer. Stemming is a central step in several processing tasks such as text mining, information retrieval, and natural language processing. Arabic stemmers face many challenges, mostly caused by the complex nature of Arabic words and their different writing styles. To our knowledge, there is no gold-standard stemming dataset that contains a wide variety of stemming challenges and thereby confronts stemmers with the many different difficulties found in real-world words. We therefore find it valuable to develop a dataset for evaluating the robustness of stemmers across such a variety of challenging situations.

In this paper, we introduce Noor-Stem, a benchmark dataset with various writing styles for the evaluation of Arabic stemmers. We use two thousand Arabic words in this dataset. We choose the words from different sources, such as the Holy Quran and Arabic websites, and assign them to two groups of human experts who determine the correct stem for each word. The first collection consists of the unique words of the Quran, selected according to their morphological structure. This collection, with more than 16,000 words, follows Quranic usage throughout, and only the stem of each word is labeled. The need for morphological analysis of Quranic text, taken as a representative example of classical Arabic texts, motivated this evaluation. The second collection includes 10,000 words drawn from the unique words of general classical Arabic texts. Because the dataset is meant to be a gold standard and each stem must be labeled and verified by a couple of experts, 10,000 words are chosen out of more than 2,600,000 unique words, with attention to comprehensive and distinct patterns so that the range of patterns is fully covered. This variety of patterns confronts each stemmer with a serious challenge in demonstrating its performance across different cases. We evaluate the performance of three Arabic stemmers (Light 10, NLTK and Tashaphyne) on this dataset. The results show that Tashaphyne achieves a higher F-measure than the other stemmers, which again confirms its superiority on this type of problem.
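The pipeline described in the abstract (tokenize, stem, compare against expert-assigned stems, report an F-measure) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four gold entries are hypothetical and are not taken from Noor-Stem, `toy_prefix_stemmer` is a stand-in rather than Light 10, NLTK, or Tashaphyne, and the precision/recall definition (exact match per word, with a stemmer allowed to return `None` when it produces no stem) is one plausible choice, not necessarily the scoring used in the paper.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical gold entries (word -> expert-assigned stem), for illustration only.
GOLD: Dict[str, str] = {
    "الكتاب": "كتاب",
    "والقلم": "قلم",
    "بالعلم": "علم",
    "يكتبون": "كتب",
}

# Arabic letters and diacritics (U+0621 to U+0652); punctuation such as
# the Arabic comma (U+060C) falls outside this range and is dropped.
ARABIC_TOKEN = re.compile(r"[\u0621-\u0652]+")


def tokenize(text: str) -> List[str]:
    """Split text into word-level units and discard punctuation."""
    return ARABIC_TOKEN.findall(text)


# A deliberately naive stand-in stemmer: strip one common prefix if at
# least three letters remain. It does not handle suffixes or infixes.
PREFIXES = ("وال", "بال", "ال", "و")


def toy_prefix_stemmer(word: str) -> Optional[str]:
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word


def evaluate(stem_fn: Callable[[str], Optional[str]],
             gold: Dict[str, str]) -> Dict[str, float]:
    """Score a stemmer by exact match against gold stems (assumed metric)."""
    attempted = 0
    correct = 0
    for word, gold_stem in gold.items():
        predicted = stem_fn(word)
        if predicted is None:  # stemmer declined to produce a stem
            continue
        attempted += 1
        if predicted == gold_stem:
            correct += 1
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


if __name__ == "__main__":
    print(tokenize("الكتاب، والقلم."))
    print(evaluate(toy_prefix_stemmer, GOLD))
```

Any real stemmer can be scored the same way by wrapping it in a function that takes a word and returns its stem, then passing that function as `stem_fn`. The last gold entry shows why prefix stripping alone falls short: the word carries both a prefix and a suffix, so the toy stemmer leaves it unchanged and scores it as an error.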

Article number: 8

Keywords: Benchmark Dataset, Stemmer, Noor-Stem, Infix, Information Retrieval

Type of Study: Research | Subject: Paper
Received: 2022/11/2 | Accepted: 2023/12/11 | Published: 2024/08/3 | ePublished: 2024/08/3

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
