HQA-Data: A historical question-answer generation dataset from previous multi-perspective conversations

This data article presents a question answering (QA) dataset for training chatbots and chat analysis models. The dataset targets NLP tasks in which a model must deliver a satisfactory response to a user's query. To construct it, we drew on the well-known Ubuntu Dialogue Corpus, which consists of about one million multi-turn conversations containing around seven million utterances and one hundred million words. From these lengthy conversations we derived a context for each dialogueID and then generated questions and answers based on those contexts; every answer is contained within its context. The dataset comprises 9364 contexts and 36,438 question-answer pairs. Beyond academic research, it can support tasks such as building a comparable QA dataset for another language, deep learning, language understanding, reading comprehension, and open-domain question answering. We present the data in raw format; it is open sourced and publicly available at https://data.mendeley.com/datasets/p85z3v45xk.



Value of the Data
• This question answering dataset is created from users' chat logs and conversation histories; to our knowledge, it is the only QA dataset built from user dialogues.
• The dataset is a valuable resource for training and testing chatbots and chat analysis algorithms in NLP research and related fields, as well as for improving the performance of conversational AI systems in business settings.
• Students can use the dataset to learn and practice NLP approaches, and anyone interested in NLP and AI can use it to explore various techniques.
• With modification, the dataset can support the development of language models in other languages, such as Bengali, and in other domains.
• The dataset can also be used for reading comprehension tasks: understanding the contexts and finding answers to the questions.

Objective
Our dataset is derived from a popular dataset, the Ubuntu Dialogue Corpus, which is composed of about one million conversations extracted from the Ubuntu chat logs, where users sought technical help for a variety of Ubuntu-related issues [1]. Although this established dataset contains users' chat histories, no QA-formatted version of it was available. We therefore built an entirely new question answering dataset by transforming each user chat log into a context and generating questions and answers from the conversation history.

Data Description
A chatbot is a program that responds to customer questions with relevant answers. Chat analysis is the process of evaluating a conversation, extracting relevant information, and monitoring user behaviour or sentiment. Developing a chatbot or a chat analysis model requires a high-quality question-answer dataset. With this in mind, we derived a QA dataset from a primary dataset, the Ubuntu Dialogue Corpus, which contains only multi-turn conversations between people [1]. We manually reviewed the generated question-answer pairs and kept only those that were relevant to their context, which improves the overall precision of the QA system. The dataset is publicly available to encourage further research into automated QA.
Our dataset is available in a Mendeley Data repository in a folder called Dataset, which contains four files: test_data.csv, test_data_json_file.json, train_data.csv, and train_data_json_file.json.
As Fig. 1 shows, the dataset is provided in two formats: CSV and JSON. The Train split contains 29,150 question-answer pairs over 7323 contexts; the Test split contains 7288 question-answer pairs over 2041 contexts (Table 1).
Table 1 lists the columns of the CSV files with brief descriptions of their contents. The columns are dialogueID, the unique identifier of the conversation; Context, the overall conversation contributed by multiple users; QuestionID, the unique identifier linking a Question and its Answer; Question and Answer, the generated question-answer pair; and Answer Start and Answer End, the starting and ending indexes of the Answer within the Context.
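The column layout above can be checked programmatically: for a well-formed row, slicing the context with the Answer Start and Answer End indexes should reproduce the answer string. The sketch below assumes the CSV headers match the column names in Table 1 exactly; the data values are invented for illustration.

```python
import csv
import io

def load_csv_rows(text):
    """Parse a CSV export of the dataset into a list of row dicts.

    Column names follow Table 1 of the article (dialogueID, Context,
    QuestionID, Question, Answer, Answer Start, Answer End); the exact
    header strings in the released files are an assumption here.
    """
    return list(csv.DictReader(io.StringIO(text)))

# A tiny mock of train_data.csv, for illustration only.
sample_csv = (
    "dialogueID,Context,QuestionID,Question,Answer,Answer Start,Answer End\n"
    "d1,try sudo apt update first,q1,What should I try first?,sudo apt update,4,19\n"
)

rows = load_csv_rows(sample_csv)
row = rows[0]
# The span [Answer Start, Answer End) of the Context is the Answer.
span = row["Context"][int(row["Answer Start"]):int(row["Answer End"])]
print(span)  # sudo apt update
```

In practice the same check can be run over the real train_data.csv and test_data.csv files to validate every annotated span.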

Table 1
Names of the columns of the CSV files, with descriptions.

dialogueID: The string representation of the unique identifier for each context.
Context: The text representation, which contains the whole conversation.
QuestionID: The string representation of the unique identifier for each generated question.
Question: A string containing a question generated from the context.
Answer: A string containing the answer to the generated question.
Answer Start: An integer giving the starting index of the answer within the context.
Answer End: An integer giving the ending index of the answer within the context.

Fig. 2 shows the structural representation of the JSON files. Each record contains the following attributes: dialogueID, the unique identifier of the context; Context, the users' conversation; and qas, a list of objects whose elements are QuestionID, Question, Answer, Answer Start, and Answer End.
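The JSON structure described for Fig. 2 can be sketched as a single record. Only the attribute names below come from the article; the field values are invented for illustration, and the exact spelling of the keys in the released JSON files is an assumption.

```python
import json

# One record following the structure described for Fig. 2: a dialogueID,
# the Context, and a "qas" list of question-answer objects.
record = {
    "dialogueID": "1.tsv",
    "Context": "try sudo apt update first",
    "qas": [
        {
            "QuestionID": "q1",
            "Question": "What should I try first?",
            "Answer": "sudo apt update",
            "Answer Start": 4,
            "Answer End": 19,
        }
    ],
}

# Round-trip through JSON and check that the span indices are consistent
# with the answer text.
loaded = json.loads(json.dumps(record))
qa = loaded["qas"][0]
span = loaded["Context"][qa["Answer Start"]:qa["Answer End"]]
print(span)  # sudo apt update
```

Because each context carries a list of qas objects, one conversation can hold several question-answer pairs without duplicating the context text.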

Experimental Design, Materials and Methods
The source of our raw data is the Ubuntu Dialogue Corpus, which consists of almost one million multi-party conversations extracted from the Ubuntu chat logs, where users sought technical support for various Ubuntu-related problems [1]. The data come as CSV files with five columns: dialogueID, date, from, to, and text. dialogueID is the unique identifier of a conversation on a given topic; each chat room, or channel, has a particular topic, and we separate conversations by their dialogueID. Using the dialogueID we extract each chat log into a single context. Many contexts in the corpus are very large, exceeding approximately 10,000 words, which makes reading them and generating questions from them difficult and complex. To generate question-answer pairs we use a T5-based question-answer generation model [3]; this pre-trained model generates questions and their accompanying answers from a given context or passage. Passing the contexts extracted from the raw dataset to the T5 model produced more than 900,000 question-answer pairs. However, because the model is not well pre-trained for this data, it also generates many garbage and unrelated questions. We therefore read the question-answer pairs and kept only those that are meaningful and related to their context, leaving 36,438 question-answer pairs from 9364 contexts. We follow the format of the SQuAD v2 reading comprehension dataset [2]: each record has a context, a question, an answer, and the starting and ending positions of the answer within the context. Fig. 3 illustrates the process by which we generate the QA pairs.
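The span-annotation step implied by the SQuAD v2 format can be sketched as follows. This is not the authors' exact pipeline; it is a minimal illustration, assuming the answer string is located in the context by substring search, with all names and values invented.

```python
def make_squad_entry(dialogue_id, context, question, answer, qid):
    """Build one SQuAD-v2-style entry by locating the answer in the context.

    A sketch of the span-annotation step: the article states only that
    each record carries start/end indices of the answer into the context.
    """
    start = context.find(answer)
    if start == -1:
        # The generated answer is not contained in the context; such
        # pairs would be discarded during the manual filtering step.
        return None
    return {
        "dialogueID": dialogue_id,
        "Context": context,
        "QuestionID": qid,
        "Question": question,
        "Answer": answer,
        "Answer Start": start,
        "Answer End": start + len(answer),
    }

entry = make_squad_entry(
    "d1",
    "run sudo apt update to refresh the package lists",
    "How do I refresh the package lists?",
    "sudo apt update",
    "q1",
)
print(entry["Answer Start"], entry["Answer End"])  # 4 19
```

Using exclusive end indices (start + length of the answer) keeps the convention `context[start:end] == answer` consistent with Python slicing.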

Efficiency Test of this Dataset with some Algorithms
We tested our dataset by fine-tuning pre-trained Hugging Face Transformers models with PyTorch, using two models: Bidirectional Encoder Representations from Transformers (BERT) [4] and DistilBERT, a small, fast, cheap, and light Transformer model trained by distilling the BERT base model [5]. With a question answering head, these models can extract the answer to a question from a given context. To evaluate the models we use several metrics: exact match, F1 score, ROUGE, and BLEU. Table 2 shows the models' scores.
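Two of the metrics mentioned above, exact match and token-level F1, are simple enough to sketch directly. The functions below follow the usual SQuAD-style definitions (lowercased, whitespace-tokenized); they are an illustration, not the authors' exact evaluation code.

```python
def exact_match(prediction, reference):
    """1.0 if the normalized answer strings match exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """SQuAD-style token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Count overlapping tokens, respecting multiplicity.
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("sudo apt update", "Sudo apt update"))   # 1.0
print(token_f1("run sudo apt update", "sudo apt update"))  # ~0.857
```

Exact match rewards only verbatim answers, while token F1 gives partial credit when the predicted span overlaps the reference, which is why both are usually reported together for extractive QA.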
© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Data identification number: 10.17632/p85z3v45xk.1
Direct URL to data: https://data.mendeley.com/datasets/p85z3v45xk

Table 2
Evaluation Metrics of BERT and DistilBERT.