BTSD: A curated transformation of sentence dataset for text classification in Bangla language

The Bangla Transformation of Sentence Classification dataset addresses the resource gap in natural language processing (NLP) for the Bangla language by providing a curated resource for Bangla sentence classification. With 3,793 annotated sentences, the dataset focuses on categorizing Bangla sentences into Simple, Complex, and Compound classes. It serves as a benchmark for evaluating NLP models on Bangla sentence classification, promoting linguistic diversity and inclusive language models. Collected from publicly accessible Facebook pages, the dataset ensures balanced representation across the categories. Preprocessing steps, including anonymization and duplicate removal, were applied. Three native Bangla speakers independently assessed the Transformation of Sentence labels, enhancing the dataset's reliability. The dataset empowers researchers, practitioners, and developers to build accurate and robust NLP models tailored to the Bangla language. It offers insights into Bangla syntax and structure, benefiting linguistic research. The dataset can be used to train models, uncover patterns in Bangla language usage, and develop effective NLP applications across domains.


© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Value of the Data
• The Bangla Transformation of Sentence Classification dataset fills a crucial gap in resources for the Bangla language in the field of natural language processing, specifically for sentence classification. It offers a carefully annotated and categorized dataset containing 3,793 Bangla sentences, enabling the development and training of NLP models tailored to the unique characteristics of the Bangla language.
• The dataset's diverse representation of sentence types and source domains allows for advancements in understanding Bangla syntax and structure, making it a valuable resource for linguistic research.
• Researchers, practitioners, developers, and data scientists in the field of natural language processing can benefit from the Bangla Transformation of Sentence Classification dataset, as it provides valuable resources for building more accurate and robust NLP models tailored to the Bangla language. Linguists and language enthusiasts can leverage this dataset to gain insights into Bangla syntax and structure, promoting a better understanding of the language.
• The dataset can be used to train and evaluate NLP models for sentence classification in the Bangla language, leading to the development of more accurate and effective applications. Researchers can analyze the dataset to uncover patterns and trends in Bangla language usage across different domains, such as literature, news articles, and social media.

Objective
The objective of the Bangla Transformation of Sentence Classification dataset is to provide a curated resource for NLP researchers and practitioners working on Bangla sentence classification. It aims to facilitate the development of NLP models tailored to the Bangla language by addressing the resource gap [1]. The dataset focuses on classifying Bangla sentences into three categories (Simple, Complex, and Compound), promoting linguistic diversity and inclusive language models [2]. It serves as a benchmark for evaluating NLP model performance on Bangla sentence classification, making it possible to identify which approaches are effective. The ultimate objective is to advance the understanding and processing of Bangla text, leading to more accurate and robust sentence classification models that benefit the Bangla-speaking population [3].

Data Description
The cornerstone of our research is the Bangla Transformation of Sentence Dataset (BTSD), a curated collection of sentences assembled specifically for this study. The dataset, available in the repository as the raw data file "Bangla Transformation of Sentence Dataset(BTSD).xlsx", consists of 3,793 sentences sourced from publicly accessible Facebook pages. The dataset was carefully curated to ensure its reliability and suitability for our research objectives. One crucial aspect of this curation was maintaining an equal distribution of sentences across three distinct categories: Simple, Complex, and Compound. This balanced representation helps models learn and generalize across varied linguistic structures. Fig. 1 illustrates the distribution of sentence categories within the dataset.

We acknowledge the significance of the Bengali language in our research context. Bengali belongs to the Indo-Aryan branch of the Indo-European language family and is closely related to languages such as Assamese and Odia. It is the primary language of Bangladesh and of the Indian states of West Bengal and Tripura, and it is widely spoken in Assam. Bengali is also spoken by diaspora communities worldwide. As the official language of Bangladesh and one of the 22 scheduled languages of India, Bengali has a substantial global speaker population, estimated at approximately 228 million [4].

Table 1 provides a detailed description of the variables present in the dataset. Table 2 presents a list of the 20 most frequently occurring words in the dataset, along with their corresponding frequencies. This list has a limitation: we did not remove stopwords from the dataset. Stopwords are commonly used words that carry little meaning on their own and are typically excluded from text analysis, so the list may not fully represent the most significant terms in the dataset. Nevertheless, analyzing the most common words still provides valuable insight into the common vocabulary of the text samples. It helps identify significant linguistic features and patterns within the dataset, guiding the development of language models and algorithms. By focusing on the prevalent words, more accurate predictions and classifications can be achieved. Furthermore, the list of most common words assists in data preprocessing tasks such as stopword removal and feature selection, contributing to a more refined and effective dataset for training NLP models.

Table 1
Dataset columns and their descriptions.

Variable name: Raw Sentence In Bangla Language
Description: The string representation of the original text in the Bengali language; the original Bangla sentence obtained from Facebook pages. Example: (Birds return home in the evening) (Dusk falls and the birds return home) (When the evening comes, the birds return home)

Variable name: Labels of Transformation Sentence
Description: The string representation of the label assigned to each sentence; the category of the sentence, classified as Simple, Complex, or Compound.
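
The class distribution shown in Fig. 1 and a word-frequency list like Table 2 can be computed along the following lines. This is a minimal sketch, not the authors' script: the sample rows are hypothetical sentences written for illustration (they are not taken from BTSD), tokenization is plain whitespace splitting, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical rows in the shape of Table 1: (raw sentence, label).
# These sentences are illustrative only and are not drawn from BTSD.
rows = [
    ("আমি বাংলা ভালোবাসি", "Simple"),
    ("আমি ভাত খাই", "Simple"),
    ("আমি পড়ি এবং সে লেখে", "Compound"),
    ("যখন বৃষ্টি হয় তখন আমি ঘরে থাকি", "Complex"),
]

def class_distribution(rows):
    """Count how many sentences carry each label."""
    return Counter(label for _, label in rows)

def top_words(rows, k=20, stopwords=frozenset()):
    """Return the k most frequent whitespace tokens, skipping any stopwords."""
    counts = Counter(
        token
        for sentence, _ in rows
        for token in sentence.split()
        if token not in stopwords
    )
    return counts.most_common(k)

print(class_distribution(rows))
print(top_words(rows, k=5))
# Passing a stopword set shows how the ranking changes once stopwords are removed.
print(top_words(rows, k=5, stopwords={"আমি"}))
```

Because the published list keeps stopwords, calling `top_words` with an empty stopword set corresponds to the setting described above; supplying a stopword set is the variation the text notes as a limitation.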

Experimental Design, Materials and Methods
The dataset creation workflow follows a systematic three-stage process. First, posts from Facebook were manually extracted, and their content was compiled into an Excel file. Second, the aggregated dataset underwent several preprocessing steps, including anonymization, duplicate removal, and filtering out profane language. Third, three native Bangla speakers carefully assessed the dataset's Transformation of Sentence labels. Each assessor independently assigned one of three distinct categories: Simple, Complex, or Compound.
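
The preprocessing stage can be sketched as follows. The article does not specify how anonymization or profanity filtering was implemented, so everything here is an assumption made for illustration: the regular expressions, the placeholder tokens `<URL>`, `<EMAIL>`, and `<USER>`, the blocklist, and the function names.

```python
import re

# Assumed patterns for personal identifiers; the authors' actual rules are not stated.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MENTION_RE = re.compile(r"@\w+")

def anonymize(text):
    """Mask URLs, e-mail addresses, and @mentions with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)  # before mentions, so the '@' in e-mails is handled first
    return MENTION_RE.sub("<USER>", text)

def normalize(text):
    """Collapse runs of whitespace so near-identical copies compare equal."""
    return " ".join(text.split())

def preprocess(sentences, blocklist=frozenset()):
    """Anonymize, drop duplicates (keeping first occurrences), and drop blocklisted sentences."""
    seen, cleaned = set(), []
    for raw in sentences:
        sentence = normalize(anonymize(raw))
        if not sentence or sentence in seen:
            continue
        if any(token in blocklist for token in sentence.split()):
            continue
        seen.add(sentence)
        cleaned.append(sentence)
    return cleaned

raw = [
    "আমি বাংলা ভালোবাসি",
    "আমি  বাংলা ভালোবাসি ",          # duplicate after whitespace normalization
    "যোগাযোগ: someone@example.com",
    "এটা BADWORD বাক্য",              # BADWORD is a hypothetical blocklist entry
    "দেখুন https://example.com/page",
]
print(preprocess(raw, blocklist={"BADWORD"}))
```

Deduplicating after anonymization and whitespace normalization means that two posts differing only in spacing or in a masked identifier collapse into one entry, which is one reasonable reading of "duplicate removal".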
The categorization of sentences into simple, complex, and compound is a widely recognized classification scheme used in linguistic analysis to examine sentence structure across languages, including Bengali. Although these classes are not exclusive to Bengali linguistics, they serve as fundamental tools in language analysis. They are defined as follows [5]:
I. Simple Sentence: A simple sentence comprises a single independent clause that conveys a complete thought or idea. It consists of a subject and a predicate. For instance, the Bengali sentence meaning "I love Bengali" is a simple sentence.
II. Complex Sentence: A complex sentence contains an independent clause and one or more dependent clauses. Dependent clauses contribute supplementary information or context to the independent clause. Consider the Bengali sentence meaning "When I study Bengali, I feel good". Here, the dependent clause ("When I study Bengali") provides additional information to the independent clause ("I feel good").
III. Compound Sentence: A compound sentence consists of two or more independent clauses connected by coordinating conjunctions or appropriate punctuation marks. Each independent clause can stand alone as a separate sentence. For example, the Bengali sentence meaning "I study Bengali, and my friend writes in Bengali" is a compound sentence. Here, the independent clauses ("I study Bengali") and ("My friend writes in Bengali") are connected by the coordinating conjunction meaning "and".
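
To make the three definitions concrete, the sketch below guesses a category from surface markers. It is a deliberately crude heuristic for illustration, not the annotation procedure or any model from this article: the marker lists are short assumed samples, the example sentences are hypothetical, and a conjunction that merely joins two noun phrases would be misclassified.

```python
# Assumed, non-exhaustive marker lists for illustration only.
SUBORDINATORS = {"যখন", "যদি", "যদিও", "যেহেতু", "যে"}  # when, if, although, since, that
COORDINATORS = {"এবং", "কিন্তু", "অথবা"}                 # and, but, or
PUNCTUATION = "।,;?!\"'"

def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation, including the danda."""
    tokens = (token.strip(PUNCTUATION) for token in sentence.split())
    return [token for token in tokens if token]

def guess_category(sentence):
    """Guess Simple / Complex / Compound from the presence of marker words."""
    tokens = set(tokenize(sentence))
    if tokens & SUBORDINATORS:   # a dependent clause marker suggests a complex sentence
        return "Complex"
    if tokens & COORDINATORS:    # a coordinating conjunction suggests a compound sentence
        return "Compound"
    return "Simple"              # otherwise assume a single independent clause

for example in [
    "আমি বাংলা ভালোবাসি।",
    "যখন বৃষ্টি হয়, তখন আমি ঘরে থাকি।",
    "আমি পড়ি এবং সে লেখে।",
]:
    print(guess_category(example), "<-", example)
```

A rule set like this is useful mainly as a sanity check or a weak baseline; the human annotation and the trained classifiers described in this section handle the many cases that surface markers cannot.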
The data was annotated by skilled native Bangla speakers following a comprehensive protocol that included inter-annotator agreement (IAA) measures. A subset of the data was randomly selected and annotated independently by multiple annotators. The annotations were then compared and analyzed for agreement using standard IAA metrics, such as Cohen's kappa coefficient or percentage agreement. The level of agreement between annotators was a crucial factor in ensuring the reliability and validity of the annotated dataset. Table 3 shows the annotation protocol as methodological pseudocode.
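
The agreement metrics named above can be computed as in the following sketch. It is not a reproduction of Table 3: the annotator labels are hypothetical, the function names are assumptions, and because Cohen's kappa is defined for two raters, three annotators would be compared pairwise. The majority-vote helper is likewise an assumed way of resolving a final label.

```python
from collections import Counter

def percentage_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = percentage_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(a) | set(b)
    )
    if expected == 1:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

def majority_label(votes):
    """Label chosen by most annotators, or None when there is no majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

# Hypothetical labels from two annotators on six sentences.
ann_1 = ["Simple", "Simple", "Complex", "Complex", "Compound", "Compound"]
ann_2 = ["Simple", "Simple", "Complex", "Compound", "Compound", "Compound"]

print(percentage_agreement(ann_1, ann_2))
print(cohen_kappa(ann_1, ann_2))
print(majority_label(["Simple", "Simple", "Complex"]))
```

Returning `None` when all three annotators disagree leaves such sentences to be flagged for adjudication rather than silently assigned a label.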
We assessed the accuracy of four state-of-the-art neural network-based deep learning models in classifying text from our dataset into the three classes. All models were trained for 50 epochs, where each epoch is one complete pass through the training data. The batch size was set to 64, meaning that the model updates its weights after processing each batch of 64 samples. A comparative analysis was conducted to evaluate the performance of LSTM, bi-LSTM, Conv1D, and combined Conv1D-LSTM-based models, as outlined in Table 4. The highest accuracy, 91.17%, was achieved by the Conv1D-LSTM-based model.
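
The training schedule and the accuracy metric can be illustrated with a little arithmetic. The train/test split is not stated in the article, so the sketch below uses all 3,793 sentences purely as an illustrative sample count; the function names are assumptions.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass; the final batch may be smaller."""
    return math.ceil(n_samples / batch_size)

def total_updates(n_samples, batch_size, epochs):
    """Number of weight updates over the whole training run."""
    return steps_per_epoch(n_samples, batch_size) * epochs

def iter_batches(items, batch_size):
    """Yield consecutive mini-batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative only: assumes every sentence is used for training.
n_samples, batch_size, epochs = 3793, 64, 50
print(steps_per_epoch(n_samples, batch_size))        # 60 updates per epoch
print(total_updates(n_samples, batch_size, epochs))  # 3000 updates in total
print(len(list(iter_batches(list(range(n_samples)), batch_size))[-1]))  # 17 samples in the last batch
```

With a held-out test set the per-epoch step count would be lower, since only the training portion is batched.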
This thorough assessment supports the dataset's reliability and accuracy, enhancing its value for research purposes. The dataset presented in this article serves as a foundation for research in sentence classification and also opens avenues for exploration in other areas of Bangla language processing. It provides a valuable resource for researchers studying broader aspects of Bangla, contributing to advances in natural language processing and to a deeper understanding of the intricacies of the language.

Ethics Statements
No human or animal studies were conducted in this research. We anonymized all content from social media pages, and no records of personal information were kept. We adhered to Facebook's redistribution policies [6,7], and no permission was required for using content from public Facebook pages.

Fig. 1. The class distribution of each label. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 2 depicts the distribution of text length within the dataset, broken down by the three categories: Simple, Complex, and Compound. The graph provides insights into the varying lengths of sentences across these categories, highlighting potential differences in sentence structure and complexity. This information is crucial for developing a comprehensive dataset, as it helps in understanding the distribution patterns and ensures a balanced representation of text lengths in the training data. It aids in creating models that can effectively handle sentences of different lengths, enhancing the dataset's usability for various natural language processing tasks.

Fig. 2. Distribution of text length (Simple, Complex, Compound). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
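
A per-class length summary of the kind plotted in Fig. 2 can be produced along these lines. This is a sketch under stated assumptions: the article does not say whether length is measured in words or characters, so word count is assumed here, and the sample rows are hypothetical sentences that are not taken from BTSD.

```python
from statistics import mean

# Hypothetical (sentence, label) rows for illustration only.
rows = [
    ("আমি বাংলা ভালোবাসি", "Simple"),
    ("আমি ভাত খাই", "Simple"),
    ("আমি পড়ি এবং সে লেখে", "Compound"),
    ("যখন বৃষ্টি হয় তখন আমি ঘরে থাকি", "Complex"),
]

def lengths_by_class(rows):
    """Group sentence lengths (in whitespace tokens, an assumed unit) by label."""
    grouped = {}
    for sentence, label in rows:
        grouped.setdefault(label, []).append(len(sentence.split()))
    return grouped

def summarize(grouped):
    """Report count, minimum, mean, and maximum length for each label."""
    return {
        label: {"n": len(v), "min": min(v), "mean": mean(v), "max": max(v)}
        for label, v in grouped.items()
    }

for label, stats in summarize(lengths_by_class(rows)).items():
    print(label, stats)
```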

Table 2
The 20 most common words and their frequencies.

Table 4
Performance of neural network-based deep learning models on our BTSD dataset.