Volume 20, Issue 4 (3-2024)                   JSDP 2024, 20(4): 107-120 | Back to browse issues page





ForutanRad J, HourAli M, KeyvanRad M. Farsi Question and Answer Dataset (FarsiQuAD). JSDP 2024; 20 (4) : 7
URL: http://jsdp.rcisp.ac.ir/article-1-1337-en.html
Abstract:
A fast and accurate response to questions posed in natural language is a fundamental objective in the advancement of question and answer systems. These systems involve computers comprehending textual content and questions, and subsequently, delivering precise answers to users. Despite significant advancements in this field, there remains room for improvement, particularly when dealing with languages other than English, such as Persian.
In this article, we present FarsiQuAD, a question and answer dataset for the Persian language. The dataset was carefully crafted by human annotators drawing from Persian Wikipedia articles. FarsiQuAD is available in two versions: Version 1 comprises over 10,000 questions and answers, while Version 2 offers an extensive collection of over 145,000 rows. The dataset follows the SQuAD format, so it integrates seamlessly with the English SQuAD and with datasets in other languages that adhere to the same standard, and it is open to the public. These data serve as valuable resources for developing deep-learning-based artificial intelligence models and for improving Persian question and answer systems.
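The SQuAD compatibility described above refers to SQuAD's nested JSON layout (data → paragraphs → qas → answers). As a minimal sketch, a file in that format can be traversed as below; the sample record and the helper name `iter_qa_pairs` are illustrative, not taken from the FarsiQuAD release.

```python
import json

# Minimal sketch of reading a SQuAD-format record. The schema
# (data -> paragraphs -> qas -> answers) is the standard SQuAD v1.1
# layout that FarsiQuAD follows; the sample record below is invented.
SAMPLE = json.loads("""
{
  "data": [{
    "title": "Example",
    "paragraphs": [{
      "context": "Tehran is the capital of Iran.",
      "qas": [{
        "id": "1",
        "question": "What is the capital of Iran?",
        "answers": [{"text": "Tehran", "answer_start": 0}]
      }]
    }]
  }]
}
""")

def iter_qa_pairs(squad_dict):
    """Yield (question, answer text, answer start offset) triples."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield qa["question"], answer["text"], answer["answer_start"]

for question, text, start in iter_qa_pairs(SAMPLE):
    print(f"{question} -> {text} (char {start})")
```

Because every dataset in this family shares the same schema, the same loader works unchanged for SQuAD, FQuAD, GermanQuAD, KorQuAD, JaQuAD, or FarsiQuAD files.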
The research findings reveal that models trained on the FarsiQuAD dataset can answer questions posed in natural Persian with an exact match (EM) accuracy of 78% and an F1 score of 87%. However, there is still room for improvement toward even higher accuracy.
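Exact match and F1 here are the standard SQuAD evaluation metrics: EM checks whether a predicted answer span equals the gold answer exactly, while F1 measures token overlap between the two. A simplified sketch follows; note that the official SQuAD evaluation additionally normalizes answers (lowercasing, stripping punctuation and articles), which is omitted here.

```python
import collections

def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the predicted span equals the gold answer exactly, else 0."""
    return int(prediction.strip() == ground_truth.strip())

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between predicted and gold answer spans."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    common = collections.Counter(pred_tokens) & collections.Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Tehran", "Tehran"))            # identical spans
print(f1_score("the capital Tehran", "Tehran"))   # partial overlap
```

Corpus-level EM and F1, such as the 78% / 87% reported above, are averages of these per-question scores over the evaluation set (taking the maximum over gold answers when a question has several).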
This project arises from the critical need for non-English languages to have access to more data for training deep learning models, especially in the domain of factoid questions. Hence, the primary objective of this article is to introduce the newly created dataset. Prior to this effort, well-known datasets like SQuAD predominantly focused on English, and similar datasets have been developed in other languages, including French, German, Korean, and Japanese. Nevertheless, the dearth of question datasets in the Persian language was evident. The quality and diversity of questions are pivotal aspects, and as this dataset continues to grow, it will contribute to the broader landscape of research in this domain, allowing for valuable cross-linguistic comparisons and integration with research conducted in other languages.
 
Article number: 7
Full-Text [PDF 1290 kb]
Type of Study: Fundamental | Subject: Paper
Received: 2022/09/1 | Accepted: 2023/12/11 | Published: 2024/04/25 | ePublished: 2024/04/25

References
[1]. Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016. [DOI:10.18653/v1/D16-1264]
[2]. Li, Y., & Zhang, Y. Question answering on SQuAD 2.0 dataset. Stanford University, 2018.
[3]. d'Hoffschmidt, M., Belblidia, W., Brendlé, T., Heinrich, Q., & Vidal, M. FQuAD: French question answering dataset. arXiv preprint arXiv:2002.06071, 2020. [DOI:10.18653/v1/2020.findings-emnlp.107]
[4]. Möller, T., Risch, J., & Pietsch, M. GermanQuAD and GermanDPR: Improving non-English question answering and passage retrieval. arXiv preprint arXiv:2104.12741, 2021. [DOI:10.18653/v1/2021.mrqa-1.4]
[5]. Lim, S., Kim, M., & Lee, J. KorQuAD: A Korean question answering dataset for machine reading comprehension. Proceedings of the Korean Institute of Information Scientists and Engineers (KIISE) Conference, 539-541, 2018.
[6]. Kim, Y., Lim, S., Lee, H., Park, S., & Kim, M. KorQuAD 2.0: A Korean question answering dataset for machine reading comprehension of web documents. Journal of KIISE, 47(6), 577-586, 2020. [DOI:10.5626/JOK.2020.47.6.577]
[7]. So, B., Byun, K., Kang, K., & Cho, S. JaQuAD: Japanese question answering dataset for machine reading comprehension. arXiv preprint arXiv:2202.01764, 2022.
[8]. Ayoubi, S., & Davoodeh, M. Y. PersianQA: A dataset for Persian question answering. https://github.com/SajjjadAyobi/PersianQA, 2021.
[9]. Mozafari, J., Fatemi, A., & Nematbakhsh, M. A. BAS: An answer selection method using BERT language model. arXiv preprint arXiv:1911.01528, 2019.
[10]. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[11]. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12]. Farahani, M., Gharachorloo, M., Farahani, M., & Manthouri, M. ParsBERT: Transformer-based model for Persian language understanding. Neural Processing Letters, 53(6), 3831-3847, 2021. [DOI:10.1007/s11063-021-10528-4]
[13]. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[14]. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[15]. Lample, G., & Conneau, A. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
[16]. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019. [DOI:10.18653/v1/2020.acl-main.747]
[17]. Persian Wikipedia Dataset. Available from: https://github.com/miladfa7/Persian-Wikipedia-Dataset

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2015 All Rights Reserved | Signal and Data Processing