Identification of offensive language in Urdu using semantic and embedding models

Automatic identification of offensive or abusive language is necessary to curb unwanted behavior online. However, solutions are hard to generalize because each language has its own grammatical structure and vocabulary. Most prior work targeted Western languages, and only one study targeted Urdu, a low-resource language. That study used basic linguistic features and a small dataset. This study builds a new dataset of 7,500 posts, collected from popular Pakistani Facebook pages, for offensive language detection in Urdu. The proposed methodology uses four types of feature engineering models: three are frequency-based and the fourth is an embedding model. The frequency-based features are term frequency-inverse document frequency (TF-IDF), bag-of-words, and word n-gram feature vectors. The embedding features are generated by a word2vec model trained for Urdu on a corpus of 196,226 Facebook posts. The experiments demonstrate that, among standalone feature models, word2vec with the stacking-based ensemble model performs best, achieving 88.27% accuracy. In addition, the wrapper-based feature selection method further improves performance. The hybrid combination of TF-IDF, bag-of-words, and word2vec feature models achieved 90% accuracy and 97% AUC. The proposed model also outperformed the baseline, with an improvement of 3.55% in accuracy, 3.68% in recall, 3.60% in f1-measure, 3.67% in precision, and 2.71% in AUC. The findings of this research have practical implications for commercial applications and future research.


INTRODUCTION
The advancement in communication technologies has brought geographically scattered people closer to each other, thus forming new virtual societies (Torkey et al., 2021). Popular social websites such as Facebook, Twitter, YouTube, and Instagram have provided new forms of social interaction among people. These websites are so popular that the number of active users is expected to reach 3.43 billion by 2023 (Statista, 2021). Among them, Facebook was the first to surpass 1 billion monthly active users and ranks first among popular social websites (Statista, 2021). The majority of people use these platforms to express their feelings and share their thoughts. However, some users exploit the anonymity provided by these platforms by posting offensive posts and comments.
PeerJ Comput. Sci. reviewing PDF | (CS-2022:07:75716:1:0:NEW 10 Oct 2022) Manuscript to be reviewed
The use of impolite words on any form of social media is called offensive language [3]. It is commonly used to insult people on the basis of religion, race, ethnicity, disability, or gender [4]. Offensive language in the form of cyberbullying [5], hate speech [6], and harassment [7] has become a serious problem affecting many internet users. Recent world events show that social websites are easy tools for propagating offensive language, which is harmful to our societies. There is a strong connection between offensive language and actual hate crimes against various communities. The ethnically motivated violence against Muslims in Myanmar and the Pittsburgh synagogue shooting are examples of such incidents. Therefore, it is essential to spot offensive language early so that preventive measures can be taken. Furthermore, this detection must be automatic, because the task cannot be done manually at such a scale. Automatic detection helps to ensure the safety of online communities and individuals.
Urdu language usage is rapidly increasing because social media websites are providing localization facilities to their users. The statistics (alphapro.pk) show that there are approximately 35 million active users of Facebook in Pakistan, and this number increases at a rate of 17% annually. The Urdu language draws its vocabulary and grammatical structure from the Arabic, Persian, Turkish, and Sanskrit languages. It also derives its Unicode characters from these languages, so special care is needed to distinguish these characters [8]. In addition, it is written from right to left and has more phonic sounds than all of the above-mentioned languages. There is also a lack of standardization of writing rules. The most common styles in the Urdu language are Nasakh and Nastalique [9], and each has its own rules.
In Urdu, a character can take four different shapes depending on its position in a connected sequence: initial, middle, final, and isolated. For example, Urdu letter has four shapes ( , , , and ). The lack of a single standard rule creates many difficulties in text tokenization and in n-gram language modeling (unigram, bigram, or trigram). In addition, Urdu compound words consist of two or three meaningful words. The literature also reports challenges in stemming, such as stemming infixes, ambiguous affixes, and plurals, as well as stemming errors. In addition, Urdu has 40 distinct letters. Due to this complex morphological and grammatical structure, few prior studies have worked on Urdu, and a lack of available datasets has been reported.
Most prior work on offensive language detection is in English, but some studies have addressed this issue in other languages, like Danish (Sigurbergsson & Derczynski, 2019), German (Wiegand, Siegel & Ruppenhofer, 2018), and Italian (Bosco et al., 2018). Recently, Akhter et al. (2020) proposed an approach for offensive language detection in Urdu and Roman Urdu using YouTube comments. They used word n-gram and character n-gram features but ignored other effective feature extraction approaches, like bag-of-words, TF-IDF, and embedding or contextual representations. In addition, their dataset is very small (2,151 instances). Therefore, new features need to be explored on a comparatively large dataset so that the findings can be generalized.
To overcome these limitations, we gathered a larger collection of Urdu posts and comments from public Facebook pages. These pages belong to different Pakistani media newsgroups, such as religious groups, political party groups, and popular bloggers, and thus cover many categories. Offensive language can take various forms; therefore, it is necessary to separate offensive posts and comments from others. After filtering, the posts were annotated by five experts following a set of guidelines (see Appendix A). The resulting dataset contains 7,500 posts in total: 3,750 offensive and 3,750 not offensive. After that, frequency-based and word-embedding features are extracted, followed by building a binary classification model using five popular machine-learning algorithms. Word n-gram, bag-of-words, TF-IDF, and word2vec feature extraction methods are explored.
To develop an effective identification model, we address the following research questions in this study:
RQ1: How can offensive language in Urdu be detected on Pakistani social media platforms?
RQ2: Which frequency-based and word-embedding features contribute most to offensive language detection, both as standalone models and in hybrid combinations?
In summary, the main highlights of the paper are given below:
1. To the best of our knowledge, we present the first offensive language detection dataset in Urdu extracted from popular Pakistani Facebook pages, consisting of 7,500 instances annotated by domain experts following a given set of guidelines.
2. This article presents an ensemble model-based offensive language detection framework for the Urdu language.
3. To the best of our knowledge, word2vec embeddings for the Urdu language are built for the first time for offensive language detection, using a corpus of 196,226 Facebook posts.
4. The comparison of ML techniques reveals that voting based ensemble model demonstrated the best performance.
5. The proposed model outperformed the baseline with an improvement of 3.55% in accuracy, 3.68% in recall, 3.60% in f1-measure, 3.67% in precision, and 2.71% in AUC.
6. The wrapper feature selection method further improves the performance significantly, reaching 90% accuracy and 97% AUC.
7. The comparison between features reveals that word2vec as a standalone model demonstrated the best performance for offensive language detection.
8. The proposed model could be helpful for real-time applications in the Urdu language, and its findings could benefit social media users and owners.
The rest of the article is organized as follows: Section 2 describes prior works in offensive language detection and the research gap. Section 3 explains the steps of the proposed pipeline in detail. Section 4 presents various experiments and results. Section 5 discusses the results and their implications. Section 6 presents the conclusion and future directions.

RELATED WORK
Offensive language is the verbal expression of hatred, ranging from simple profanity to much more severe types. Uninhibited behavior in computer-mediated communication has been a concern since the early days of the internet. In 1992, Collins (1992) explored the concept of flaming in computer-mediated communication. In recent years, the computational linguistics community has started to give attention to offensive language detection in online social media, due to its popularity and large usage. Most of the prior studies used Twitter for corpus creation, while some studies have also used Facebook and YouTube as data sources. Since one of our goals is to create an annotated Urdu language corpus for offensive language, we provide a brief overview of studies about corpus collection and annotation, with some review of classification methods for offensive language detection.
In 1997, Spertus (1997) used abusive/hostile messages or flame terms for offensive language identification and applied data-driven methods to automatically detect these messages. Spertus combined syntactic and semantic features at the sentence level to create 47-element feature vectors using 720 messages. For classification, a decision-tree generator was used that correctly categorized 64% of flame messages and 98% of non-flame messages. People may personally attack each other using hostile or abusive language when writing emails or in newsgroups. Later, in 2002, Martin (2002) hypothesized that flames are easy to recognize because of their extreme nature and developed an annotated corpus of 1,140 messages collected from the Usenet newsgroup. Razavi et al. (2010) then used Martin's dataset and the natural semantic module (NSM) organization log files dataset to create an automatic flame detection procedure. They extracted features at different levels and used multilevel classification for flame detection using an Insulting and Abusing Language Dictionary. As detecting online harassment is a challenging task, Yin et al. (2009) developed an abusive language detection model to find online harassment by extracting TF-IDF, n-gram, sentiment, and contextual features. In online communication, verbal abuse is a serious problem, and detecting and removing blacklisted words is very important. To address it, Yoon, Park & Cho (2010) proposed a profanity filtering system for the Korean language that filters phoneme-modified profane words using phoneme-based string alignment. They used a lexicon of 9,300 prototype vulgar words for experiments.
Cyberbullying, in turn, is the use of technology to bully someone. Reynolds, Kontostathis & Edwards (2011) used a machine-learning approach to detect language patterns used by bullies and their victims and developed rules for automatically detecting cyberbullying. They collected data from the website 'Formspring.me', which was labeled using Amazon's Mechanical Turk, and their model achieved 78.5% accuracy. Language on social media is highly unstructured, informal, and often misspelled, which makes it hard for detection models to identify offensive language accurately. Chen et al. (2012) used lexical syntactic features to detect offensive language and identified offensive users with enhanced accuracy. They achieved 98.24% precision and 94.34% recall at the sentence level and 77.9% precision and 77.8% recall at the user level.
Until 2013, most researchers used textual features to detect online cyberbullying, ignoring contextual features. Dadvar et al. (2013) were the first to use contextual features (profile information and user characteristics) to improve the performance of cyberbullying detection. Their dataset consisted of 4,226 manually labeled comments from 3,858 distinct YouTube users. They reported that the inclusion of user profile information improved the precision and recall to 77% and 55%, respectively. The use of curse words in online communication is very common. Using this concept, Wang et al. (2014) studied people's cursing behavior on Twitter using 51 million tweets from 14 million users. They found that curse words occurred at the rate of 1.15% on Twitter and that 7.73% of all the tweets in their dataset contained curse words. They concluded that cursing on Twitter is closely related to two negative emotions: sadness and anger.
Hate speech is a special type of offensive language, targeted toward a specific person or group. Gitari et al. (2015) presented a multi-step approach for hate speech classification by creating a lexicon, using hate speech-related semantic and subjectivity features. They concluded that semantic, hate, and theme-based features improve both precision and recall. Later, Silva et al. (2016) conducted the first large-scale study to find hate speech targets on Whisper and Twitter datasets. They used syntactic structures to find hate targets in the posts. Their results showed that on the Twitter and Whisper platforms, race, behavior, and physical individuality are the top hate categories. Then Davidson et al. (2017) separated hate speech from instances of offensive language. They used crowdsourcing to label tweets. Their model achieved a precision of 91%, a recall of 90%, and an f1-score of 0.90 using unigram, bigram, and trigram features with TF-IDF, part-of-speech (POS), and sentiment features.
Among recent studies, Saha et al. (2021) provided an exhaustive exploration of different transformer models in three low-resource languages (Tamil, Kannada, and Malayalam) and presented a genetic algorithm technique to ensemble different models. Husain & Uzuner (2021) investigated the effect of transfer learning across different Arabic datasets and concluded that transfer learning has a limited effect on the performance of the classifier, particularly for highly dialectal comments. Similarly, Vargas et al. (2021) provided a new approach for offensive and hate speech detection by incorporating an offensive lexicon for the Brazilian Portuguese language, validating their approach for both offensive and swearing linguistic expressions.
We summarize the prior literature on offensive language detection in Table 1. By looking at the language column, we can observe that most of the prior works in offensive language detection are in resource-rich languages, i.e., English and other European languages, and a few others like Arabic, Indonesian, and Amharic. In contrast, Urdu is a resource-poor language, and there is only one work in the literature on offensive language in Urdu (Akhter et al., 2020). This work used a small dataset collected from a popular news (ARY Digital) YouTube page. The dataset contains 2,151 instances in total. Moreover, the features used to detect offensive language were very basic, i.e., word n-grams, char n-grams, and their combinations. We observe the following gaps in the literature:
• Most of the datasets used by prior studies are not publicly available.
• There is only one study on offensive language detection in Urdu (Akhter et al., 2020).
• The Urdu dataset used by Akhter et al. (2020) is very small consisting of only 2,151 instances.
• Lack of appropriate feature engineering: Most of the studies used character n-gram and word n-gram lexical features; Akhter et al. (2020) also used these two.
• Lack of comparison between ML models: Most of the studies used one or two basic machine learning techniques and did not compare simple and ensemble ML models to select the best model for this task.
Therefore, our research contributes in these directions by developing a comparatively large dataset in Urdu, comparing the performance of lexical and embedding features, and comparing basic machine learning and ensemble models to assess their performance.

Framework methodology
Social media text is usually unstructured and has a wide variety of visualization formats depending upon the specific platform. It is very hard to detect offensive language in Urdu using only char and word n-gram feature models. Therefore, we use TF-IDF, bag-of-words, and word2vec feature extraction models in comparison with word n-gram and char n-gram feature models. The pipeline of the proposed offensive language detection model for the Urdu language is presented in Fig. 1. It takes the annotated dataset (posts/comments from Pakistani public pages of Facebook) as input and preprocesses it by removing punctuation marks, white spaces, accents, and inconsistencies from the Urdu text. It then tokenizes the Urdu text to prepare it for feature extraction. After tokenization and stop word removal, features are extracted using word unigram, bag-of-words, TF-IDF, and word2vec extraction methods. Then state-of-the-art classifiers, evaluation metrics, and a 10-fold cross-validation method are used for experimentation. The outcome of the framework is the binary label (offensive or not offensive) of the post.
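The preprocessing and frequency-based feature steps of this pipeline can be illustrated with a minimal, standard-library sketch. It is not the implementation used in this study: the helper names (`preprocess`, `tokenize`, `bag_of_words`, `tfidf`), the one-word stop list, and the plain tf × log(N/df) weighting are assumptions made for illustration, and library implementations of TF-IDF typically add smoothing and normalization.

```python
import math
import re
import unicodedata
from collections import Counter

# Hypothetical stop-word list; the list used in the study is not given here.
STOP_WORDS = {"ہے"}

URL_RE = re.compile(r"https?://\S+")


def preprocess(text):
    """Remove URLs, diacritics, punctuation, digits and extra whitespace."""
    text = URL_RE.sub(" ", text)
    # Combining marks (Unicode category Mn) include the Urdu diacritics.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Keep only letters and whitespace; everything else becomes a space.
    text = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())


def tokenize(text):
    """Whitespace tokenization followed by stop-word removal."""
    return [tok for tok in preprocess(text).split() if tok not in STOP_WORDS]


def bag_of_words(tokens):
    """Raw term counts for one post."""
    return Counter(tokens)


def tfidf(corpus_tokens):
    """One sparse TF-IDF dict per post, using tf * log(N / df)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty posts
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    # Two harmless example sentences, used only to exercise the functions.
    posts = ["یہ اچھا ہے!", "یہ برا ہے۔ http://example.com"]
    tokenized = [tokenize(p) for p in posts]
    print([bag_of_words(t) for t in tokenized])
    print(tfidf(tokenized))
```

A term that occurs in every post receives a weight of zero under this weighting, which is why TF-IDF emphasizes words that separate posts from one another.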

Problem formulation
Offensive language detection is the binary classification problem, formally described as follows.
Suppose we have a collection of Facebook posts $(P_1, P_2, \ldots, P_n)$ and their corresponding pairs $(X_1, Z_1), (X_2, Z_2), \ldots, (X_n, Z_n)$. The variable $n$ represents the total number of posts; $X_i$ is the feature vector related to the post $P_i$ ($X_i \in \mathbb{R}^T$, where $T$ refers to the total number of features) and $Z_i \in \{\text{offensive}, \text{not offensive}\}$. To classify whether a Facebook post $P_i$ is offensive or not, a predictive function is defined as follows.
(1) Predictive function: $f: \mathbb{R}^T \rightarrow \{\text{offensive}, \text{not offensive}\}$, with predicted label $\hat{Z}_i = f(X_i)$.
(2) Objective function: Our goal is to learn a predictive function that predicts whether a post is offensive or not, so that future instances can be classified correctly.
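To make the formulation concrete, the sketch below learns a predictive function from labeled pairs $(X_i, Z_i)$ and applies it to a new feature vector. The nearest-centroid rule is chosen only for brevity and is an assumption for illustration, not one of the classifiers evaluated in this study; the names `fit_centroids` and `predict` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
LABELS = ("offensive", "not offensive")


def fit_centroids(X: List[Vector], Z: List[str]) -> Dict[str, List[float]]:
    """Learn one mean feature vector per label from the pairs (X_i, Z_i)."""
    centroids = {}
    for label in LABELS:
        rows = [x for x, z in zip(X, Z) if z == label]
        if not rows:
            raise ValueError(f"no training example for label {label!r}")
        # Column-wise mean over all feature vectors with this label.
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids: Dict[str, List[float]], x: Vector) -> str:
    """Predictive function f: R^T -> {offensive, not offensive}."""
    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Return the label whose centroid is closest to the feature vector.
    return min(centroids, key=lambda label: squared_distance(centroids[label], x))


if __name__ == "__main__":
    # Toy two-dimensional feature vectors (T = 2), invented for illustration.
    X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    Z = ["offensive", "offensive", "not offensive", "not offensive"]
    model = fit_centroids(X, Z)
    print(predict(model, [0.8, 0.2]))
```

Any classifier with the same input and output types, including an ensemble, fits the same formulation.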

Dataset preparation
Here, we present the process of data collection and annotation to create the offensive language dataset in Urdu, and describe the statistics of the resulting dataset.

Domain selection & data collection
We use Facebook's Graph application programming interface (API) to collect posts/comments from Facebook pages. To build a dictionary of seed words, a manual list of offensive words in Urdu is first designed. This list is then used to search for other words and keywords used offensively in Facebook posts. After searching for posts containing these words, the Facebook posts were manually inspected, and more phrases and words were identified. This process yielded a sufficient set of offensive keywords. Some keywords contain more than one word, and some contain only one word. It is necessary to disclose that the selection of these words does not relate to their frequency of occurrence in a post. If a word appears once in a Facebook post to offend someone, it is included in the dictionary, such as:
• (Rhinoceros mouth)
These are examples of offensive keywords. Popular and diverse Facebook pages from different Pakistani newsgroups (religious groups, political party groups, and popular bloggers) are selected to build the data corpus. We targeted the most popular Facebook pages because these pages disseminate public opinions rapidly. Likewise, the diversity of source pages makes our dataset a good representative of different categories, e.g., religion, politics, etc.
The sampling criteria and metrics used to select a public page from chosen categories are given below: 1. The number of followers and likes should be greater than 30,000, allowing more active public pages to be selected from categories. 2. Only those pages are selected that employ the Urdu language most frequently for posts and comments. By employing the above criteria, we selected 19 Facebook public pages as described in Table 2. Using the seed word dictionary, we collected Facebook posts/comments containing any of these keywords for 36 months ranging from June 01, 2017, to May 30, 2020. The reason why we choose this period was the general elections held in 2018 in Pakistan and other religious activities. The process of preparation of the dataset with annotation is presented in Fig. 2. Initially, the collection of Facebook posts led to 32,480 instances containing at least one dictionary word. After that, the data was considered for cleaning using the following four steps: (1) since our focus is on the Urdu language, therefore we removed all the posts and comments which are in English, and Roman Urdu, (2) after that, we removed all the punctuation marks, null values, URLs, Emojis, special symbols and numbers from Urdu post/comments because they have no contribution in offensive language detection, (3) we performed normalization on Urdu text to convert homophone variations of Urdu writings to a common symbol e.g., characters ' 268 By employing the above criteria, we selected 19 Facebook public pages as described 269 Using the seed word dictionary, we collected Facebook posts/comments containing an 270 keywords for 36 months ranging from June 01, 2017, to May 30, 2020. The reason why 271 this period was the general elections held in 2018 in Pakistan and other religious acti 272 process of preparation of the dataset with annotation is presented in Fig. 2. Initially, the 273 of Facebook posts led to 32,480 instances containing at least one dictionary word. 
Aft 274 data was considered for cleaning using the following four steps: 1) since our focus is o 275 language, therefore we removed all the posts and comments which are in English, a 276 Urdu, 2) after that, we removed all the punctuation marks, null values, URLs, Emo 277 symbols and numbers from Urdu post/comments because they have no contribution in 278 language detection, 3) we performed normalization on Urdu text to convert homophone 279 of Urdu writings to a common symbol e.g., characters ' ' and ' ' are to be replaced by 280 removes spaces and duplication in the text, and 4) using the Urdu stop word list, we rem 281 words from our corpus. After the cleaning steps, the dataset finally led to 12,416 posts. T 282 is further considered for annotation 283 3.2.2 Data Annotation 284 A set of annotation guidelines are designed by considering religious, political, vulgarity 285 and regional contexts to rationalize why a post may be considered offensive or not off 286 took guidance from prior work [30] to make a set of annotation guidelines (Appendix A 287 For annotation of offensive language dataset, we have two available options, 1) Crow 288 and 2) Manual labeling by human experts. We employed the second option. The 12 289 dataset and guidelines are then shared with five Urdu experts; among them, two Ph.D 290 one is a master's student, and the other two are MS-qualified professionals. All ann

Manuscript to be
Computer Science ' and ' 268 By employing the above criteria, we selected 19 Facebook public pages as described in 269 Using the seed word dictionary, we collected Facebook posts/comments containing an 270 keywords for 36 months ranging from June 01, 2017, to May 30, 2020. The reason why w 271 this period was the general elections held in 2018 in Pakistan and other religious activ 272 process of preparation of the dataset with annotation is presented in Fig. 2. Initially, the 273 of Facebook posts led to 32,480 instances containing at least one dictionary word. Afte 274 data was considered for cleaning using the following four steps: 1) since our focus is on 275 language, therefore we removed all the posts and comments which are in English, an 276 Urdu, 2) after that, we removed all the punctuation marks, null values, URLs, Emoj 277 symbols and numbers from Urdu post/comments because they have no contribution in 278 language detection, 3) we performed normalization on Urdu text to convert homophone 279 of Urdu writings to a common symbol e.g., characters ' ' and ' ' are to be replaced by ' 280 removes spaces and duplication in the text, and 4) using the Urdu stop word list, we rem 281 words from our corpus. After the cleaning steps, the dataset finally led to 12,416 posts. Th 282 is further considered for annotation 283 3.2.2 Data Annotation 284 A set of annotation guidelines are designed by considering religious, political, vulgarity, 285 and regional contexts to rationalize why a post may be considered offensive or not offe 286 took guidance from prior work [30] to make a set of annotation guidelines (Appendix A 287 For annotation of offensive language dataset, we have two available options, 1) Crow 288 and 2) Manual labeling by human experts. We employed the second option. The 12, 289 dataset and guidelines are then shared with five Urdu experts; among them, two Ph.D 290 one is a master's student, and the other two are MS-qualified professionals. All anno

Manuscript to be
Computer Science ' are to be replaced by ' the above criteria, we selected 19 Facebook public pages as described in Table 2. word dictionary, we collected Facebook posts/comments containing any of these 6 months ranging from June 01, 2017, to May 30, 2020. The reason why we choose s the general elections held in 2018 in Pakistan and other religious activities. The paration of the dataset with annotation is presented in Fig. 2. Initially, the collection osts led to 32,480 instances containing at least one dictionary word. After that, the idered for cleaning using the following four steps: 1) since our focus is on the Urdu efore we removed all the posts and comments which are in English, and Roman that, we removed all the punctuation marks, null values, URLs, Emojis, special umbers from Urdu post/comments because they have no contribution in offensive tion, 3) we performed normalization on Urdu text to convert homophone variations gs to a common symbol e.g., characters ' ' and ' ' are to be replaced by ' ' and also s and duplication in the text, and 4) using the Urdu stop word list, we removed stop r corpus. After the cleaning steps, the dataset finally led to 12,416 posts. This dataset idered for annotation nnotation ation guidelines are designed by considering religious, political, vulgarity, sectarian, ontexts to rationalize why a post may be considered offensive or not offensive. We from prior work [30] to make a set of annotation guidelines (Appendix A).
of offensive language dataset, we have two available options, 1) Crowdsourcing l labeling by human experts. We employed the second option. The 12,416 posts idelines are then shared with five Urdu experts; among them, two Ph.D. students, Manuscript to be reviewed puter Science ' and also removes spaces and duplication in the text, and (4) using the Urdu stop word list, we removed stop words from our corpus. After the cleaning steps, the dataset finally led to 12,416 posts. This dataset is further considered for annotation.
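The cleaning steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the script-ratio heuristic in `is_mostly_urdu`, its 0.5 threshold, and the function names are assumptions made for illustration, and the stop word list is supplied by the caller. Normalization (step 3) and duplicate removal are not shown here.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")


def is_arabic_script_letter(ch):
    """True for letters in the basic Arabic Unicode block used to write Urdu."""
    return "\u0600" <= ch <= "\u06ff" and unicodedata.category(ch).startswith("L")


def is_mostly_urdu(text, threshold=0.5):
    """Step 1 (assumed heuristic): keep a post only if most letters are Arabic-script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    urdu = sum(is_arabic_script_letter(ch) for ch in letters)
    return urdu / len(letters) >= threshold


def clean_post(text, stop_words):
    """Steps 2 and 4: drop URLs, symbols, digits, and emojis, then stop words."""
    text = URL_RE.sub(" ", text)
    # Every character that is not an Arabic-script letter becomes a space.
    kept = "".join(ch if is_arabic_script_letter(ch) else " " for ch in text)
    # split() also collapses the repeated spaces left by the removals.
    tokens = [tok for tok in kept.split() if tok not in stop_words]
    return " ".join(tokens)
```

Posts that fail `is_mostly_urdu` would be discarded before `clean_post` is applied to the rest.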

Data annotation
A set of annotation guidelines is designed by considering religious, political, vulgarity, sectarian, and regional contexts to rationalize why a post may be considered offensive or not offensive. We took guidance from prior work (Kumar et al., 2018) to make the set of annotation guidelines (Appendix A). For annotation of the offensive language dataset, we have two available options: (1) crowdsourcing and (2) manual labeling by human experts. We employed the second option. The dataset of 12,416 posts and the guidelines are then shared with five Urdu experts: two are Ph.D. students, one is a master's student, and the other two are MS-qualified professionals. All annotators are experts in the Urdu language. Majority voting is adopted to decide the final label. As the initial dataset is imbalanced (it contains 3,750 offensive posts, and the remaining are not offensive), for experiments we randomly draw a sample of 7,500 posts, of which 3,750 are offensive and 3,750 are not offensive. Regarding data statistics, we observed 57.2%, 31.9%, and 21% annotation agreements among five, four, and three annotators, respectively.
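As a hedged illustration of the labeling and sampling procedure, the sketch below derives a final label by majority vote and draws a class-balanced sample. The function names, the label strings, and the fixed random seed are assumptions for illustration; the paper does not describe its tooling.

```python
import random
from collections import Counter


def majority_label(votes):
    """Return (label, n_agree) for the most common label among annotator votes."""
    label, n_agree = Counter(votes).most_common(1)[0]
    return label, n_agree


def balanced_sample(labeled_posts, per_class, seed=0):
    """Draw `per_class` posts at random from each class label."""
    rng = random.Random(seed)
    by_label = {}
    for post, label in labeled_posts:
        by_label.setdefault(label, []).append(post)
    sample = []
    for label, posts in sorted(by_label.items()):
        sample.extend((post, label) for post in rng.sample(posts, per_class))
    return sample
```

With five annotators and two labels a tie cannot occur, and `n_agree` (3, 4, or 5) gives the agreement level for each post.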

Data preprocessing
Pre-processing is an important task to prepare the input Urdu text for classification using several steps such as normalization of text, segmentation of Urdu words, spell correction, tokenization, and stop word removal. We performed normalization to convert homophone variations of Urdu writing to a common symbol (e.g., two variant characters are replaced by a single character). In the Urdu language, we cannot use space to specify the boundary between two words. Space omission and space insertion are the two main challenges related to Urdu word segmentation. After that, tokenization is performed; there are two methods, one based on punctuation marks and the other based on white spaces. Stop words are words that have no impact on text classification; therefore, they are removed using the Urdu stop word list to prepare the text for classification.
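Normalization and the two tokenization strategies can be sketched as follows. The character mapping is a hypothetical example (Arabic yeh and kaf mapped to the forms commonly used in Urdu); the specific characters normalized in this study are not recoverable from the text, and the punctuation set is likewise an assumption.

```python
import re

# Hypothetical normalization table: Arabic yeh -> Farsi yeh, Arabic kaf -> keheh.
NORMALIZATION_TABLE = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Urdu full stop, Arabic comma, Arabic question mark, plus common Latin marks.
PUNCT_RE = re.compile(r"[\u06d4\u060c\u061f.,!?]+")


def normalize(text):
    """Replace variant characters with a single common form."""
    return text.translate(NORMALIZATION_TABLE)


def tokenize_whitespace(text):
    """White-space tokenization: split on any run of spaces or tabs."""
    return text.split()


def split_on_punctuation(text):
    """Punctuation-based splitting: cut the text at punctuation marks."""
    return [seg.strip() for seg in PUNCT_RE.split(text) if seg.strip()]
```

White-space tokenization alone does not solve the space-omission and space-insertion problems described above; those require a dedicated word segmenter.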

Features extraction
Feature extraction is the most important step in any natural language processing (NLP) task. It has been observed that the use of irrelevant features may lead to misclassification. Therefore, to design an effective offensive language detection model for the Urdu language, we consider the following state-of-the-art features.

Word n-gram
Word n-grams (... & Knight, 1998) are also called lexical bundles or multi-word expressions (Csomay, 2013), or a set of co-occurring words within a given window. An n-gram is usually a sequence of N words in each sample of text. The sequence may consist of phonemes, syllables, letters, words, or base words. An n-gram model predicts the occurrence of a word based on the occurrence of the N-1 prior words. As our task is related to NLP, we have used the word n-gram model for Urdu offensive language detection and have extracted unigrams, bigrams, and trigrams from our text.
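To make the word n-gram idea concrete, here is a minimal sketch of extracting unigrams, bigrams, and trigrams from white-space-tokenized text. The function names are illustrative; the study itself used scikit-learn for this step.

```python
def word_ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def uni_bi_tri_grams(text):
    """Extract unigrams, bigrams, and trigrams from white-space-tokenized text."""
    tokens = text.split()
    return {n: word_ngrams(tokens, n) for n in (1, 2, 3)}
```

A text with fewer than N tokens simply yields no n-grams of that order.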

Bag-of-words
In information retrieval and NLP tasks, the bag-of-words method is commonly used as a vector representation for document classification. It transforms text into fixed-length vectors by counting how many times each word appears in the text, a process also called vectorization (Zhang, Jin & Zhou, 2010). In this method, instead of using predefined words, a vocabulary is created from the training data to capture opinion words. After building the vocabulary, the frequency of each word in the sentence is calculated, and this frequency is used as a feature for training a classifier.
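The bag-of-words representation described above can be sketched in a few lines. This is an illustration under the assumption of white-space tokenization; the helper names are not from the paper.

```python
from collections import Counter


def build_vocabulary(train_texts):
    """Map each distinct training word to a column index (sorted for stability)."""
    words = sorted({w for text in train_texts for w in text.split()})
    return {w: i for i, w in enumerate(words)}


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.split())
    vector = [0] * len(vocabulary)
    for word, idx in vocabulary.items():
        vector[idx] = counts[word]
    return vector
```

Words that never occurred in the training data have no column, so they are ignored when a new post is vectorized.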

TF-IDF
TF-IDF is often used as a weighting factor in information retrieval, text mining, and user modeling (Aizawa, 2003). We also used this method for feature generation to investigate its impact on offensive language detection. This method calculates the importance of a word in a document relative to the whole corpus. Term frequency measures how often the term occurs in the document, and inverse document frequency is the log of the total number of documents in the corpus divided by the number of documents in which the term appears. TF-IDF is the product of term frequency and inverse document frequency. To extract TF-IDF and n-gram features, we used the scikit-learn library on a labeled corpus of 7,500 posts and comments.
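The textbook form of this weighting can be written out directly. The sketch below uses relative term frequency and an unsmoothed logarithm; these are assumptions for illustration, and scikit-learn's `TfidfVectorizer` applies IDF smoothing and vector normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document receives a weight of zero, which is how TF-IDF discounts words that do not discriminate between posts.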

Word embeddings
Word embedding is one of the most popular representations of text vocabulary and can capture the context of a word in a text, such as semantic and syntactic similarity and relationships with other words. Mikolov et al. (2013) developed the word2vec method to learn word embeddings. It is an unsupervised, shallow two-layer neural network trained to generate high-quality, distributed, continuous dense vector representations of words. Word2vec supports two model architectures to produce a distributed representation of words: continuous bag-of-words and continuous skip-gram. In the continuous bag-of-words model, the current word is predicted from a window of surrounding context words, whereas the skip-gram algorithm predicts the surrounding window of context words using the current word. We used the skip-gram model and generated 100-dimensional vectors. Furthermore, to train the word2vec embedding model, we used a corpus of 196,226 Facebook posts to create embeddings for a unique vocabulary using the Gensim library of Python. To the best of our knowledge, word embedding has not been used to explore and detect Urdu offensive language in the literature. We have used the word2vec feature generation method in the Urdu language and have compared its performance with the word n-gram, bag-of-words, and TF-IDF methods.
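To show what the skip-gram objective is trained on, the sketch below generates the (current word, context word) pairs for a given window. It illustrates only the pair construction, not the neural network training itself, and the window size is an assumed value. In Gensim 4, the skip-gram architecture and the vector size are selected with the `sg=1` and `vector_size=100` arguments of `Word2Vec`.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```

The continuous bag-of-words model uses the same windows in the opposite direction: the context words jointly predict the center word.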

Classifiers and evaluation measures
In this study, five ML algorithms have been selected for the experimental setup to develop a robust model for offensive language detection in Urdu. The models are logistic regression (Logistic-Reg), random forest (RF), stochastic gradient descent (SGD), support vector machine (SVM), and an ensemble model (Fig. 3). The majority voting methodology is adopted in the ensemble model, which uses Logistic-Reg, SVM, and SGD as its three ML models. We compare the performance of the five ML models and identify the best one.
In addition, the 10-fold cross-validation method is used for model training and testing. The results have been reported using standard accuracy, precision, recall, F1-score, area under the curve (AUC), and Matthews correlation coefficient (MCC) measures.
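The two mechanisms named here, majority voting over three base classifiers and 10-fold cross-validation, can be sketched without any ML library. The function names are illustrative, and the fold assignment shown (round-robin, without shuffling or stratification) is an assumption; the paper does not state how folds were formed.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by hard voting."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

With three voters and two classes, every sample has a strict majority, so no tie-breaking rule is needed.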
The mathematical definitions of these metrics are described as follows:

Accuracy
It is measured as the ratio of the number of correctly predicted instances (positive and negative) to all predictions.

$$\text{Accuracy} = \frac{\text{TruePositive} + \text{TrueNegative}}{\text{TruePositive} + \text{TrueNegative} + \text{FalsePositive} + \text{FalseNegative}} \quad (3)$$

Precision
It is the fraction of instances that actually belong to the offensive class among all instances assigned the offensive class label.

$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}$$

Recall
It summarizes how well the offensive class is predicted and it is calculated as

$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}$$

F1-Score
F1-score is the combination of precision and recall metrics that balances both measures. It is calculated as

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The AUC
It relates the true positive rate to the false positive rate and provides an aggregate measure of performance across all possible classification thresholds.

The MCC
It is a reliable statistical rate that produces a high score only if the prediction obtained good results in all of the four confusion matrix categories, i.e., true positives, false negatives, true negatives, and false positives.
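The threshold-based metrics above can all be computed from the four confusion matrix counts, as in the following sketch. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for illustration; AUC is not included because it requires ranked scores rather than counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute metrics from confusion-matrix counts of a binary classifier."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC uses all four cells, so it is high only if every cell looks good.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, whereas the other four metrics range from 0 to 1.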

EXPERIMENTAL SETUP AND RESULTS ANALYSIS
In this section, four experiments are performed. The first experiment compared the performance of five ML methods. Also, the impact of each type of feature on offensive language detection is investigated. In the second experiment, the performance of the proposed model is compared with the baseline. In the third experiment, feature selection is performed using the wrapper method and its implications are discussed. In the last experiment, the impact of several combinations of proposed features is examined and conclusions are drawn. The necessary parameters of ML models are presented in Table 3 to replicate the results.

Experiment 1: performance comparison of ML methods and feature models
Here, experiments are conducted to meet two objectives: (1) performance comparison of five ML models using the proposed features, and (2) investigation of the impact of four types of features on offensive language detection in Urdu. The results are evaluated using accuracy, precision, recall, F1-measure, AUC, ROC curves, and MCC. The ML methods are trained and tested using the standard 10-fold cross-validation technique. The accuracy of each classifier with the word unigram, TF-IDF, bag-of-words, and word2vec methods is presented in Table 4. The ensemble model outperforms all other classifiers. In addition, the word2vec feature model demonstrated better performance than bag-of-words, TF-IDF, and word unigram with every classifier. The reason behind achieving 88% accuracy with the word2vec feature is that this model employs the semantic/contextual information related to the language of posts/comments. Overall, the word unigram presented the lowest performance among the feature models, and RF demonstrated the lowest performance among the five ML methods. In contrast, we did not obtain promising results when all features are combined. Similarly, the precision of the five ML models with the four feature models is presented in Table 5. Again, word2vec performs better than all other features and achieved 88.28% precision with the ensemble model. The accuracy and precision results are consistent: word2vec and the ensemble model are the best feature and ML models. However, predictive performance with all features is not promising using the precision measure (i.e., 86.22% but 88.26% with word2vec).
The performance of the four feature types is also compared using the recall evaluation metric for offensive language detection. 10-fold cross-validation and five ML methods are used for experimentation. The word2vec feature model performs better than the word unigram, bag-of-words, and TF-IDF feature models, as shown in Table 6. The best performance (88.27% recall) is achieved with word2vec and the ensemble model. On the other hand, word unigram achieved its lowest recall with the RF model and its best recall with the ensemble model. The superior performance of the word2vec feature model with the ensemble model remains consistent under the recall measure.
Tables 7 and 8 show the performance of five ML models with four feature models, using the AUC and MCC measures. Here again, the word2vec feature model performs better than the word n-gram, TF-IDF, and bag-of-words feature models. Moreover, the ensemble model presented better performance than the other four ML methods, as shown in Tables 7 and 8. Thus, across six evaluation metrics, the superior performance of the word2vec feature model with the ensemble model is observed to be consistent. However, we did not obtain promising results when all features are combined. The impact of the proposed features is also investigated using the ROC curve, as presented in Fig. 4. For the experimental setup, the ensemble model is used as a classifier and 10-fold cross-validation is used for training and testing purposes. Fig. 4 shows that word2vec demonstrates the best performance and covers more area under the curve. The performance of TF-IDF is comparable with word2vec features but slightly lower. The ROC results agree with the other evaluation measures. Hence, the performance of features is consistent across six metrics. This demonstrates the significance of the word2vec feature model and the ensemble model for the detection of offensive language in Urdu.

Experiment 2: comparison with baseline
The latest work on offensive language detection in the Urdu language is presented by Akhter et al. (2020). In this section, we compare our model with theirs. First, there is a difference in the nature and size of the dataset: the baseline paper created a dataset of Urdu-language comments collected from different YouTube videos. These comments were manually collected from videos by the authors themselves. In our case, we have collected actual comments, shared by users, on 19 different Urdu websites including newspapers, political parties, etc. Therefore, our dataset presents originality not only in the context of comments but also in the text of the comments. Second, our dataset is much larger: starting from more than 800,000 comments, it was reduced to almost 200,000 comments after preprocessing (Table 2), whereas the baseline paper contains 2,000 comments in the Urdu language, and its authors have not mentioned the source of Roman Urdu. Third, Akhter et al. (2020) had the comments annotated by a panel of three persons who were all students, whereas we used the services of five language expert annotators. This has strengthened the quality of our labeling. Regarding experiments, the baseline (Akhter et al., 2020) focused more on char and word n-gram feature extraction techniques and used several (more than 15) classifiers, whereas we have focused on a variety of the latest feature extraction techniques and used the classifiers most commonly used in the literature for similar tasks. In this context, we have run the experiments of Akhter et al. (2020) on our dataset and reported only the best-performing ones. Both the word n-gram-based and character n-gram-based experiments are reported in Table 9.
Our model with four feature extraction methods and the ensemble as a classifier is also presented, and it has demonstrated better results than the baseline (Akhter et al., 2020). We have used six performance metrics for the comparison of results, including AUC and MCC. The best score for each performance metric is highlighted in Table 9. As Table 9 shows, our model has outperformed the baseline on all performance metrics. The improvement is 3.55% in accuracy, 3.68% in recall, 3.60% in F1-measure, 3.67% in precision, and 2.71% in AUC. One important point in Table 9 is that the performance of the baseline technique on our dataset is lower than that reported in their paper (Akhter et al., 2020). The main reason for this is the size of our dataset. As mentioned above, their dataset does not contain sufficient variation or originality because it was generated by the authors themselves by watching different YouTube videos. On the other hand, our dataset is made by collecting genuine texts shared by thousands of different people. This brings much more originality and variation to our dataset.
The comparison of our model with the baseline is also evaluated using the ROC curve, as presented in Fig. 5. The baseline is represented by the char tri-gram in Fig. 5 because, in the baseline approach, the char tri-gram presented the best performance, as shown in the upper part of Table 9. This is the reason it was selected as the baseline method. Word2vec clearly presented better performance than the baseline and the other three types of features. Among our features, bag-of-words shows the worst performance, but it is still better than the baseline method. This demonstrates the significance and effectiveness of our approach as compared to the baseline.

Experiment 3: impact of feature selection on classification performance
To reduce the complexity of feature models and to enhance the performance of the proposed detection model for offensive language, we employed feature selection (Atlam et al., 2020). A well-known feature selection method, the wrapper method, is selected to find the best subset of features. This method employs a search strategy (forward selection) that evaluates candidate subsets of features using a machine learning algorithm. The evaluation of each subset is based on the quality of performance produced by the selected machine learning algorithm. The evaluation criterion may be any performance metric depending upon the nature of the problem. In our case, we have selected the SVM algorithm with the accuracy metric for the best subset selection. The reason behind choosing SVM is that it presented a significant feature subset when each ML model was tested in the wrapper selection method. For the word unigram feature model, we compared the performance of the selected subset with all unigram features and did not find any improvement in accuracy. Therefore, we did not consider the results of the word unigram.
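The forward-selection search used by a wrapper method can be sketched as a greedy loop around a scoring function. In the sketch below, `score_subset` is a hypothetical callback; in the setting described above it would train an SVM on the given feature columns and return its accuracy. The stopping rule (stop when no remaining feature improves the score) is an assumption, since the paper does not state one.

```python
def forward_selection(n_features, score_subset, max_features=None):
    """Greedy wrapper search: add the feature that most improves the score."""
    selected, best_score = [], float("-inf")
    remaining = set(range(n_features))
    limit = max_features if max_features is not None else n_features
    while remaining and len(selected) < limit:
        candidate, candidate_score = None, best_score
        for f in sorted(remaining):
            score = score_subset(selected + [f])
            if score > candidate_score:
                candidate, candidate_score = f, score
        if candidate is None:  # no remaining feature improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score
```

Because the classifier is retrained for every candidate subset, the cost grows quickly with the number of features, which is the usual trade-off of wrapper methods against filter methods.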
After employing the wrapper method, we found the best subset of 67 features for TF-IDF, a subset of 72 features for bag-of-words, and a subset of 72 features for the word2vec model. Regarding accuracy, Table 10 shows an improvement in performance with the selected features for each type of feature. The largest improvement is observed for the bag-of-words feature model, whereas word2vec demonstrated a small improvement. An ensemble method with 10-fold cross-validation is used to generate the results.
It is clear from Table 10 that feature selection impacts all performance metrics. In terms of precision, bag-of-words achieved the largest improvement of 3.74%, compared to improvements of 1.81% and 0.27% for TF-IDF and word2vec, respectively. Similarly, the accuracy and AUC measures also improved with feature selection. Furthermore, as with precision, the maximum improvement in accuracy and AUC is observed with bag-of-words. These results provide evidence of the usefulness of feature selection in the proposed methodology.

Experiment 4: impact of hybrid combination on classification performance
In the previous experiments, we presented the performance of each proposed feature set as a standalone model. In this experiment, we investigate the impact of various combinations of the proposed features on offensive language detection. Among the four feature sets in the prior experiments, word2vec consistently outperformed the rest (with and without feature selection), whereas word unigram performed worst. Ignoring the latter, we formed several combinations of the remaining three feature sets. The experimental setup uses an ensemble model with 10-fold cross-validation and six evaluation metrics.
The best accuracy (89.23%) is achieved when all three feature sets are combined, as shown in Table 11. This combination also gives the best performance on precision, recall, f1-measure, AUC, and MCC, making it the best result obtained by our proposed model. The second-best performance is achieved by combining bag-of-words with word2vec, and the third-best by combining TF-IDF with word2vec (Table 11). Thus, word2vec is the best feature method not only as a standalone model but also in combination with other features. Although all evaluation metrics show promising values, the AUC (96.78%) is particularly high. The combination of features therefore proves useful for offensive language detection in Urdu.
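One simple way to build such hybrid feature sets is to concatenate the per-post vectors of each feature model. The sketch below enumerates every pairwise and three-way combination in this manner. It assumes plain horizontal concatenation, which is an illustrative choice rather than a description of the exact procedure used here, and the tiny feature values and the names `features`, `combine`, and `hybrids` are invented.

```python
from itertools import combinations
from typing import Dict, List, Sequence

# Invented per-post feature vectors for two posts (rows) under each feature model.
features: Dict[str, List[List[float]]] = {
    "tfidf": [[0.5, 0.0], [0.0, 0.7]],
    "bow": [[1, 0, 2], [0, 1, 0]],
    "word2vec": [[0.1, -0.2], [0.3, 0.4]],
}


def combine(blocks: Sequence[List[List[float]]]) -> List[List[float]]:
    """Concatenate the feature vectors of each post across all given blocks."""
    return [[value for vector in rows for value in vector] for rows in zip(*blocks)]


# Build every combination of two or three feature models.
hybrids: Dict[str, List[List[float]]] = {}
for size in (2, 3):
    for names in combinations(features, size):
        hybrids["+".join(names)] = combine([features[name] for name in names])

for name, matrix in hybrids.items():
    print(name, len(matrix[0]), "features per post")
```

Each hybrid matrix can then be passed to the same classifier and cross-validation setup as the standalone feature sets, so that the results remain directly comparable.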

Examples of offensive and not-offensive posts
After an exhaustive evaluation of the proposed model, we present the predicted class labels of six randomly selected posts/comments from the test set, as shown in Table 12. The language of comments 1 and 2 suggests that the presence of offensive words led the proposed model to label these posts as offensive. Comment 3, however, does not contain any offensive words; it is the context of the comment that guided the proposed model to label it as offensive. A similar trend is observed for the comments/posts labeled not-offensive. The system labeled comments 4 and 6 as not-offensive because neither comment contains an offensive word, whereas the not-offensive label of comment 5 was decided from the context of the available text.

Table 12 (excerpt; English translations of two sample posts recovered from the extracted table)
Comment (number not recoverable): The ignorant people say that the law is for the poor, but when the lawmakers catch the rich, it is the poor who take to the streets to save the rich.
Comment 5: Why do you go after this poor man's livelihood? Kashmir will be free from his statements

DISCUSSIONS AND IMPLICATIONS
In recent years, social media has become popular in every aspect of life, and offensive language has become the dark side of this technology. Offensive language on social media fuels extremism and intolerance, which puts pressure on vulnerable groups such as religious minorities, social activists, religious scholars, and political leaders. The findings of this research help to uncover a more influential set of features that support offensive language detection in Urdu on the Facebook platform. This research offers valuable insights for online users, website owners, and law enforcement agencies seeking to identify offensive language on websites. It is highly desirable to detect this type of material on social media to reduce various crimes in society. From this perspective, we have developed a detection model for offensive language in Urdu that uses more effective semantic and word embedding features with the ensemble method. The proposed model is tested on a real-life annotated dataset of Facebook posts so that the findings are of practical use.
The results provide evidence that our detection model is effective in detecting offensive language in Urdu. We have evaluated the impact of robust features both as standalone models and as hybrid combinations with an ensemble model. The most significant contribution is a detection model that achieves 90% accuracy and improves accuracy by 5% compared to the baseline. Based on these findings, our proposed model can be used by any organization to identify offensive content in the Urdu language. In addition, our corpus has several unique aspects. First, it covers many categories, e.g., religious, political, news, regional, ethnic, and vulgar content. To the best of our knowledge, it is the first Urdu offensive language corpus that covers so many categories of offensive language using Pakistani social media platforms. Second, our annotated dataset has more offensive posts than not-offensive posts, i.e., 51% offensive instances and 49% not-offensive instances. By contrast, a related survey (Fortuna & Nunes, 2018) shows that existing datasets contain very few offensive/hate instances relative to not-offensive/not-hate instances. The inter-annotator agreement is 67%, which is comparable with other studies. Third, in our dataset, 27% of offensive posts contain vulgar words, 22% contain sectarian words, 5% contain regional offensive words, and 30% contain ethnic words.
Our research also has several practical implications. Its outcome can be used to develop a filter that lets social media platforms identify and discard offensive/unwanted material early. We also observed that offensive language is strongly related to events occurring throughout the year. We found many religious offensive words in comments during religious events such as Moharram and Eid Milat-n-Nabi. Similarly, the public pages of political parties have many comments that direct offensive language at opposing political leaders. Although we did not annotate the type of target, we found many comments and posts that direct offensive words at popular political leaders, religious scholars, and human rights personnel. Thus, a fine-grained annotation of the targets of offensive posts is a possible extension. Such enrichment may help government organizations and social media platforms to identify and remove various types of offensive language from social media.

CONCLUSION AND FUTURE WORK
In this case study, the objective was to design a binary classification model to identify offensive content in the Urdu language. To meet this challenge, a new corpus was constructed from Urdu posts and comments on various popular Pakistani Facebook pages. The corpus was annotated by five domain experts, and the final dataset contains about 7,500 instances. In contrast, the dataset used by the baseline was comparatively small. Four types of feature extraction methods are used to generate semantic and word embedding features: word n-gram, bag-of-words, TF-IDF, and word2vec-based word embeddings, whereas the baseline study used only word n-gram and char n-gram features. The experimental setup uses five popular ML methods with 10-fold cross-validation and six standard evaluation metrics. The findings of this study reveal that, as a standalone model, word2vec outperformed the other three types of features and the baseline, achieving 88.20% accuracy. To further improve the accuracy of the proposed framework, feature selection is incorporated using the wrapper method. We observed improvement in all evaluation metrics, and classification accuracy improved significantly. The ensemble model demonstrated the best performance among the ML methods. In addition, we compared the performance of different combinations of features and found that the best-performing combinations all include the word2vec method.
There are a few avenues for future work. The latest contextual feature methods and NLP techniques may be used to improve the accuracy of the proposed model. Another direction is to use a rule-based approach to the problem of offensive language identification. Regarding ML models, deep neural networks and evolutionary algorithms can be applied to develop more robust offensive language detection models. Similarly, the proposed methodology can be employed for related problems in similar domains.