Javad Forutanrad, Maryam Hourali, Mohammadali Keyvanrad
Volume 20, Issue 4 (3-2024)
Abstract

Responding quickly and accurately to questions posed in natural language is a fundamental objective of question answering systems. Such a system must comprehend both a text and a question about it, and then deliver a precise answer to the user. Despite significant advances in this field, there remains room for improvement, particularly for languages other than English, such as Persian.
In this article, we present a Persian question and answer dataset called FarsiQuAD. The dataset was created by human annotators from Persian Wikipedia articles. FarsiQuAD is available in two versions: Version 1 comprises over 10,000 questions and answers, while Version 2 offers an extensive collection of over 145,000 rows. The dataset is designed to be compatible with the English version of SQuAD and with datasets in other languages that adhere to the same standard, and it is open to the public. These data are a resource for developing deep learning models and for improving Persian question answering systems.
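Because the dataset is described as following the SQuAD standard, a reader could plausibly load it with the same code used for the English SQuAD. The sketch below flattens a SQuAD-style JSON structure into one record per question. The function name `iter_squad_examples` and the sample record are hypothetical illustrations, not taken from FarsiQuAD, and the field names assume the public SQuAD v1.1 layout.

```python
from typing import Iterator


def iter_squad_examples(squad: dict) -> Iterator[dict]:
    """Yield one flat record per question from a SQuAD-style dictionary.

    Assumes the SQuAD v1.1 layout: data -> paragraphs -> qas -> answers.
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # Each answer is a text span plus its character offset.
                    "answers": [
                        (a["text"], a["answer_start"])
                        for a in qa.get("answers", [])
                    ],
                }


# Hypothetical sample in SQuAD layout (not real FarsiQuAD data).
sample = {
    "version": "illustrative",
    "data": [
        {
            "title": "Tehran",
            "paragraphs": [
                {
                    "context": "Tehran is the capital of Iran.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of Iran?",
                            "answers": [{"text": "Tehran", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}

examples = list(iter_squad_examples(sample))
```

For a file on disk, the same function could be applied to the result of `json.load`; the actual file names of the FarsiQuAD release are not stated here.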
The results show that models trained on the FarsiQuAD dataset can answer questions posed in natural Persian with an exact match score of 78% and an F1 score of 87%. However, there is still room to achieve higher accuracy.
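The two reported metrics are usually computed per question and then averaged. As a minimal sketch, assuming SQuAD-style scoring with a simplified normalization (case folding and whitespace tokenization only, which is an assumption and not necessarily the authors' procedure for Persian text), exact match and token-level F1 can be written as:

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Simplified normalization (assumption): case-fold, split on whitespace.
    return text.casefold().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference answer plus extra words gets an exact match of 0 but a nonzero F1, which is why F1 is typically the higher of the two scores.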
This project arises from the critical need of non-English languages for more data to train deep learning models, especially in the domain of factoid questions. Hence, the primary objective of this article is to introduce the newly created dataset. Prior to this effort, well-known datasets like SQuAD predominantly focused on English, and similar datasets have been developed in other languages, including French, German, Korean, and Japanese. Nevertheless, the dearth of question datasets in Persian was evident. The quality and diversity of questions are pivotal aspects, and as this dataset continues to grow, it will contribute to the broader landscape of research in this domain, allowing for valuable cross-linguistic comparisons and integration with research conducted in other languages.
 



© 2015 All Rights Reserved | Signal and Data Processing