Unicode-8 based linguistics data set of annotated Sindhi text

The Sindhi Unicode-8 based linguistics data set is a multi-class and multi-featured data set. It was developed to address natural language processing (NLP) and linguistics problems of the Sindhi language. The data set presents information on the grammatical and morphological structure of Sindhi text as well as the sentiment polarity of Sindhi lexicons. The data set may therefore be used for information retrieval, machine translation, lexicon analysis, language modeling, grammatical and morphological analysis, and semantic and sentiment analysis.

http://www.thekawish.com/beta/
The corpus is processed for NLP operations such as sentiment and morphological analysis, UPOS and SPOS tagging, and lemma and stem identification.

Data format
The data are in CSV format

Experimental features
Unigram-based analysis, token analysis, tagging with UPOS and SPOS, sentiment classification, morphological classification and analysis, lemma and stem identification

Data source location
Karachi, Sindh, Pakistan

Data accessibility
The data set may be downloaded from http://www.sindhinlp.com/ and GitHub

Value of the data
The data set is built from the results of an online Sindhi natural language processing (NLP) tool for parsing, tagging, morphological and sentiment analysis, stemming and lemmatization of Sindhi text.
The data set is valuable for understanding the grammatical, sentimental, syntactic and morphological structure of Sindhi text.
The data set is a significant source for machine learning and NLP analysis, including information retrieval, language modeling, machine translation, sentiment analysis and computational linguistics operations.

Data
Most NLP research has been done on the English language [1]; as a result, many NLP resources are available for English, but they are not suitable for other languages such as Sindhi. Languages written from right to left are also important for NLP applications and for machine and deep learning processes. Sindhi is written from right to left in an Arabic-Persian writing style [2]. A good number of websites, blogs and social media pages are available on the World Wide Web (WWW), so a large amount of data is available for computational linguistics, NLP, machine translation, information retrieval and machine learning.
Sindhi NLP tools are used to annotate the Sindhi corpus for purposes such as tagging, sentiment analysis, and lemma and stem identification. Fig. 1 shows the annotation process (polarity, UPOS annotation, SPOS annotation, lemma and stemming) for the Sindhi text (Waddan jo ahtaraam karann hik sutho amal aahay aen asaan te farz be aahay). A further figure shows the sentiment analysis [3] of the same Sindhi text.
The dataset consists of 19 attributes and 6841 records. The target classes of the dataset are categorical; therefore, it may be well suited to supervised analysis. Table 1 shows the statistical analysis of the class attributes of the dataset.
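As a rough illustration of how the size and class distribution described above could be checked, the following sketch reads a CSV stream and reports the number of records, the number of attributes and the frequency of each class value. The column names (`Token`, `UPOS`, `Gender`) and the sample rows are assumptions made for the example; they are placeholders and are not taken from the published file.

```python
import csv
import io
from collections import Counter


def summarize(stream, class_column):
    """Return (record count, attribute count, class frequencies) for a CSV stream."""
    reader = csv.DictReader(stream)
    rows = list(reader)
    # fieldnames is populated from the header row once reading has started.
    n_attributes = len(reader.fieldnames or [])
    class_counts = Counter(row[class_column] for row in rows)
    return len(rows), n_attributes, class_counts


# Toy stand-in for the real file: invented placeholder rows, not dataset records.
SAMPLE = "Token,UPOS,Gender\ntok1,NOUN,1\ntok2,VERB,0\ntok3,NOUN,2\n"

n_records, n_attributes, upos_counts = summarize(io.StringIO(SAMPLE), "UPOS")
print(n_records, n_attributes, dict(upos_counts))
```

For the real data, the same function could be given a file opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded CSV was saved; if the file matches the description above, the counts should come out as 6841 records and 19 attributes.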

Experimental design, materials and methods
Sindhi corpus documents are processed separately for annotation and for sentiment analysis in the Sindhi NLP tool. The results of annotation and sentiment analysis are combined to build the dataset. A unigram model is used to find the probability of each lexicon in the corpus. The dataset is normalized and analyzed statistically. No missing values were found in the dataset. A brief introduction to the attributes is given below.
1. UPOS: The Universal Part of Speech (UPOS) tag set [4,5] is used to annotate the Sindhi tokens. UPOS is a class attribute consisting of 18 categories. Sindhi tokens are tagged with the appropriate UPOS tags. For example, the Sindhi sentence (Mango is good fruit.) may be tagged with UPOS and Sindhi part of speech (SPOS) tags as shown in Table 2.
The frequencies of the UPOS tags differ from one another in the dataset, which shows the diversity of Sindhi lexicons. Fig. 1 shows the frequency of the UPOS tags annotated to Sindhi tokens. Fig. 4 shows a high number of nouns and a low number of subordinating conjunctions.
2. SPOS: The Sindhi part of speech (SPOS) tag set differs from the UPOS tag set in the UPOS tag PART and the SPOS tags Adverb and Preposition. The UPOS tag PART annotates both Sindhi negation and possessive lexicons, whereas in SPOS the Adverb tag annotates Sindhi negation lexicons and the Preposition tag annotates the possessive markers of the Sindhi language. Therefore, the SPOS tags Adverb and Preposition are used in place of the UPOS tag PART. The Sindhi treebank is a novel contribution to NLP because a Sindhi treebank has not previously been put to proper use in computational linguistics operations.
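The relation between the UPOS tag PART and the two SPOS tags can be written as a small lookup. This is only a sketch of the rule as stated in the text; the labels `"negation"` and `"possessive"` and the function name are hypothetical and do not come from the dataset or the Sindhi NLP tool.

```python
# Hypothetical labels for the two uses of UPOS PART named in the text.
PART_USE_TO_SPOS = {
    "negation": "Adverb",         # negation lexicons are tagged Adverb in SPOS
    "possessive": "Preposition",  # possessive markers are tagged Preposition in SPOS
}


def spos_for_part(part_use):
    """Return the SPOS tag used in place of UPOS PART for the given use."""
    try:
        return PART_USE_TO_SPOS[part_use]
    except KeyError:
        raise ValueError(f"unknown PART use: {part_use!r}") from None


print(spos_for_part("negation"))    # Adverb
print(spos_for_part("possessive"))  # Preposition
```

Only the PART case is covered because it is the only UPOS/SPOS difference the text describes.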
The frequency of the SPOS tags differs from that of the UPOS tags because of the difference between the UPOS tag PART and the SPOS tag Adverb. Table 3 shows the annotation of Sindhi text with SPOS tags.
3. Gender: According to Sindhi grammar [6,7], there are two genders: masculine, called (Jins Muzkar) in Sindhi, and feminine, called (Jins Moans) in Sindhi. Nouns, adjectives and diacritics change the gender of a lexicon from masculine to feminine and vice versa. Gender is a class attribute and records the proper gender of each lexicon. Table 4 shows examples of Sindhi lexicons and their gender. The dataset shows masculine gender with digit 1 and feminine gender with digit 2, whereas digit 0 shows no gender and is used for periods, punctuation and symbols. Fig. 6 shows the frequencies of the gender types.
4. Number: According to Sindhi grammar [6,7], number is of two types: singular and plural. This attribute shows the singular or plural status of a lexicon. Diacritics and word extensions form the plural number of nouns, pronouns and adjectives. Table 5 shows the singular and plural status of Sindhi text.
The singular number of a noun, pronoun or adjective is shown with digit 1 and the plural number with digit 2, whereas digit 0 shows no number and is used for periods, punctuation and symbols. Fig. 7 shows the frequencies of the singular and plural number types.
Fig. 8 shows the frequencies of the polarity types.
Table 4. Mapping of genders to Sindhi lexicons.
The bound form may be called the secondary form, which is divided into three categories [7]: complex, compound and reduplicated. The free form is shown with digit 1 and the bound or secondary form with digit 2, whereas digit 0 is used for punctuation, symbols and periods. Fig. 9 shows the frequencies of the morphology forms.
Reduplicated words are a feature of Sindhi text that makes it complex to process for NLP and computational linguistics operations. The dataset marks reduplicated words with digit 1 and non-reduplicated words with digit 0.
7. Lemmatization: Lemmatization is the process of identifying the original lexicon by removing an affix or suffix to reach a form that retains grammatical and morphological structure, whereas stemming removes the inflections, diacritics, affixes and suffixes of a lexicon to derive the root word. For example, the Sindhi word (To come) is a complex word; therefore, its lemma is (come). The word (Achu) conveys complete meaning with proper grammatical and morphological structure. Digit 1 marks a lexicon that is a lemma and digit 0 marks one that is not. Fig. 11 shows the frequencies of lemma values in the dataset.
8. Diacritic: A diacritic changes the meaning of a lexicon by attaching a glyph to a word or letter. For example, the Sindhi lexicon (was) is changed to another Sindhi lexicon (he) by attaching a glyph to the Sindhi letter (ha). The diacritic changes both the meaning and the grammatical structure of the lexicon (was): the first lexicon (was) is a verb that shows an action that happened in the past, and the second lexicon (he) is a determiner.
This attribute marks a Sindhi lexicon that has the diacritic feature with digit 1 and one that does not with digit 0. Fig. 12 shows the frequencies of diacritic words.
9. Infinitive: Sindhi linguists give importance to infinitive verbs, called (Massdar) in Sindhi. Sindhi infinitive verbs are formed by attaching a suffix to stem or lemma words. Table 6 shows examples of Sindhi infinitive verbs.
Unigram probability is computed by applying a statistical language model to show the impact of Sindhi tokens in the dataset.
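A minimal sketch of a unigram probability calculation is given below. It assumes the plain maximum-likelihood estimate, in which the probability of a token is its count divided by the total number of tokens; the text does not state whether the authors applied smoothing or any other adjustment, and the token list here is an invented placeholder.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Maximum-likelihood unigram estimate: P(w) = count(w) / total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # no tokens, so no probabilities to report
    return {token: count / total for token, count in counts.items()}


# Placeholder tokens standing in for Sindhi lexicons.
tokens = ["a", "b", "a", "c"]
probs = unigram_probabilities(tokens)
print(probs)
```

Because every token is counted once, the probabilities over the vocabulary sum to 1, which is a quick sanity check when the function is run on the token column of the real dataset.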