CLASSIFICATION OF VULGAR COMMENTS USING CONVOLUTIONAL NEURAL NETWORK WITH GLUON NATURAL LANGUAGE PROCESSING

1. Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh.
2. Department of Computer Science and Engineering, Independent University, Dhaka, Bangladesh.

Manuscript Info
Manuscript History
Received: 10 July 2020
Final Accepted: 14 August 2020
Published: September 2020

Abstract

Internet negativity has long been a hot topic. The anonymity and sense of distance that come with a web presence have encouraged people to express themselves freely, a situation made worse by a lack of cyber security [1,2] and of proper social education. At the same time, the popularity of the internet, especially of social media such as Facebook, Twitter, WhatsApp and YouTube, has risen sharply. The negative side of social media has become a new channel for abuse, cybercrime and offensiveness. Vulgarity and spam can be any bothersome and unwanted conduct that could violate the security policies of a network. Nowadays, spam is any vulgar or fake message sent by spammers to lure legitimate users. Spammers may have various objectives for sending spam and vulgar messages, such as advertising a product, cyberbullying, harassment and threats. There are different kinds of vulgarity and spam, as described below [3]:
Toxic email: Email spam is spam sent through email. The links in such emails may mislead users to sites with phishing or malicious content. Spammers operate very cleverly in email spamming: they collect email addresses from chat rooms, various websites and so on, and also sell them to other spammers.
Vulgar comments: Comment spam is a kind of post in which a spammer posts abusive and offensive content on networking sites. In comment spam, the spammer spreads the spam as comments on blogs, forums, Wikipedia and so forth.
Instant messenger spam: Instant messengers give users access to a directory of all their users, with personal information such as name, age, sex and so on. Spammers looking for such vulnerable information gather all the details and, by signing in, send unsolicited messages which may contain viruses, malicious links and so forth.
Junk fax: Junk fax is similar to email spam. The main difference is that the spam is received as faxes through fax transmission. These are mainly used to spread advertisements.
Unsolicited text messages: Unsolicited text messages are a type of mobile spam. This spam type targets the message inbox of a user's mobile phone; however, it is less common than email spam.
Social media spam: Social networking spam is spam in which spammed content is posted on social networks. It can take the form of comments, tweets, chats, pictures and so forth. Essentially, the types explained above can all be considered part of social networking spam.
ISSN: 2320-5407 Int. J. Adv. Res. 8(09), 516-524
Nowadays, artificial intelligence is used in sectors such as medicine [4][5][6][7][8], pattern analysis [9], agriculture [10][11][12], human assistance [14], education, industry and home appliances [15], and it has therefore become a promising tool for this problem. Governments of different countries are now trying to prevent cybercrime, bullying and, most importantly, vulgarity and spam on the internet. Our proposed system, designed with inspiration from Pavel, M. I. et al. [29], classifies vulgar and spam comments from social media so that the internet can again be friendly to all kinds of users.

Related Works:-
S. Srivastava et al. [16] presented a capsule-network-based aggressive comment classifier. The proposed work was based on focal loss with a single-model capsule network, which achieved 98.46% accuracy on their first dataset. The authors also mentioned applying it to the TRAC dataset, a combined Hindi-English dataset of aggressive and non-aggressive comments, with an RNN classifier, where α and γ were 2 and 0.25 respectively.
C. Rădulescu et al. [17] took an integrated approach to identifying spam by combining machine learning, natural language processing and URL examination. The combined methodology showed better and more accurate results than each modality applied individually. M. McCord et al. [18] detected vulgar comments on Twitter using the Twitter API. The authors used a Random Forest classifier to detect vulgar comments or responses, obtaining an F1 score of 95.7 and 95.7%.
To identify spam comments, Kandasamy et al. [19] integrated three approaches (URL analysis, supervised machine learning and natural language processing) to classify vulgar and toxic comments from social media, obtaining 94% accuracy.
Carreras et al. developed a method to demonstrate that the success rate of AdaBoost in classifying vulgar comments is much higher than that of the Decision Tree and Naïve Bayes algorithms. The authors show that toxic comments and cyberbullying began to become a significant issue on the World Wide Web back in the late 1990s. Techniques for filtering indecent web content based on content analysis were also investigated to recognize vulgar pages [20].
S. Shubha et al. [21] presented a classification framework for reviewing comments using Machine Learning Bayes Sentiment Classification (MLBSC), which was proposed in their paper. The method first extracts user comments, and related comments are listed based on prior training. The class label is then evaluated using probabilistic Bayes classifiers. They obtained 88.75% accuracy applying MLBSC to 70 test comments.

Methodology:-
The proposed methodology is divided into two core parts: feature extraction using natural language processing, and a CNN [23,24,25] for text classification. Figure 1 shows the work diagram of the methodology.

Data Acquisition
We used two datasets in this research. The first, acquired from Kaggle, is a popular publicly available dataset named "Wikipedia Talk Page Comments annotated with toxicity reasons", which contains almost 160,000 manually labelled comments. The second is our own dataset, built by extracting comments from YouTube, Facebook and Twitter, and is used for testing only. The dataset contains a total of six classes, which are described below [27]:
a. Toxic: a common class for all comments that are vulgar, toxic or bullying.
In the training dataset, these six categories are individually binary labelled, so a comment or sentence can belong to more than one category, or to no category if it is not toxic or vulgar. Figure 3 shows the training dataset before preprocessing.

Data Preprocessing
In preprocessing, null data are removed first so that they do not cause trouble during processing. A blank comment row is counted as null data and is removed. After that, the large amount of data needs to be separated into vulgar and non-vulgar parts. To do this, we apply a sampling technique that checks, for every comment, whether any class has the value 1, which indicates that the comment belongs to a vulgar class. If at least one class is true, the comment is vulgar; the rest are non-vulgar.
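The two preprocessing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the field name `comment_text` and the six label names are assumptions modelled on the class names mentioned in this paper, and `preprocess` is a hypothetical helper.

```python
# Assumed label columns (hypothetical names based on the six classes in the paper).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def preprocess(rows):
    """Drop null/blank comments and attach a binary 'vulgar' flag to the rest."""
    cleaned = []
    for row in rows:
        text = row.get("comment_text")
        # A missing or blank comment counts as null data and is removed.
        if text is None or not text.strip():
            continue
        # A comment is vulgar if at least one class label equals 1.
        is_vulgar = any(row.get(label, 0) == 1 for label in LABELS)
        cleaned.append({"comment_text": text.strip(), "vulgar": int(is_vulgar)})
    return cleaned


sample_rows = [
    {"comment_text": "you are an idiot", "toxic": 1, "insult": 1},
    {"comment_text": "   ", "toxic": 0},
    {"comment_text": None},
    {"comment_text": "thanks for the edit", "toxic": 0},
]
cleaned_rows = preprocess(sample_rows)
```

With the sample rows above, the blank and missing comments are dropped and the two remaining comments receive the flags 1 and 0 respectively.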

Feature analysis
In order to differentiate between vulgar and legitimate comments, we compute nine features for each comment. The features are assigned numeric values and represent various differentiating characteristics of the two types of comments. The features are as follows:
1. URL in the replies: Harassing comments may attract more responses, even though spammers need to divert users to sites with a relatively small number of viewers in order to maximize the ranking of the website. For this reason, the number of links in the comments must be taken into account when identifying toxic comments.
2. Spaces in the comments: Vulgar comments may contain a lot of white space, because such replies or comments make a strong impression on the user who reads them.
3. Sentences in the comment: The total number of sentences in a toxic comment is lower than in a rational statement, even though the use of words and sentences is generally consistent.
4. Punctuation marks: The presence of a large number of punctuation marks, especially exclamation marks and dots, is a defining feature of vulgar comments, as punctuation characters tend to attract the attention of the reader. In contrast, legitimate comments tend to contain only sentence-delimiting punctuation marks. However, legitimate comments can sometimes contain a larger than normal number of punctuation characters to express strong feelings.
5. Word duplication: Vulgar or toxic comments tend to repeat words much more than legitimate comments, which have a contextual flow of word structure. To capture this property we take the ratio of the number of unique words to the total word count in the comment and define it as the word duplication ratio R_WD. The equation is expressed as follows:

$$R_{WD} = \frac{\text{number of unique words in the comment}}{\text{total number of words in the comment}}$$
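The five features listed above can be computed with simple string operations. The sketch below is illustrative only: the URL pattern, the word pattern and the sentence-splitting rule are assumptions, since the paper does not specify how each count is obtained, and `extract_features` is a hypothetical helper name.

```python
import re
import string

# Assumed URL pattern; the paper does not say how links are detected.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def extract_features(comment):
    """Return the five numeric features described in the text for one comment."""
    words = re.findall(r"[a-z0-9']+", comment.lower())
    # Naive sentence split on ., ! and ?; dots inside URLs would inflate this count.
    sentences = [s for s in re.split(r"[.!?]+", comment) if s.strip()]
    # Word duplication ratio R_WD = unique words / total words.
    ratio = len(set(words)) / len(words) if words else 0.0
    return {
        "url_count": len(URL_PATTERN.findall(comment)),
        "space_count": comment.count(" "),
        "sentence_count": len(sentences),
        "punctuation_count": sum(ch in string.punctuation for ch in comment),
        "word_duplication_ratio": ratio,
    }


features = extract_features("You are bad bad bad!!! Go away.")
```

A lower `word_duplication_ratio` means more repeated words; here the word "bad" appears three times among seven words, so the ratio falls below 1.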

Gluon Natural language Processing
Natural Language Processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. In practice, it is a process that enables a machine to process a natural language (such as English) and perform tasks that a human can do. Before going into the deeper concepts of NLP, a set of incomplete sentences that regularly appear in tweets and in YouTube and Facebook comments was identified. After studying the dataset, some common sentences in spam comments were found. They are 'include me at', 'take me out on the town', 'you'll giggle when you see this pic of you', 'You appear to be unique in this photograph', 'my companion sent me this pic with you in it', 'my companion demonstrated to me this pic of you', 'tail me back', 'markdown drugs', 'I discovered you in this video', and some other sensitive words that cannot be written here. If these expressions are found on Facebook, in email or on YouTube, the user is classified as spam. In this work, three elements of NLP are applied: (1) stop word removal, (2) tokenization and (3) stemming. For processing English there is no need for stop words such as I, you, in, the, we, is, was, me, our, it, etc., so each of these words is removed and only the keywords are kept. The next step is to find the root word, or stem, of each keyword. For this, stemming methods are used. A basic stemming algorithm has been used in this paper. A set of spam words that can appear in a tweet is identified, such as 'pornography', 'Viagra' and so on. The stemmed keywords are compared with the set of identified spam words. If the words match, the user is considered spam. If the user is not found to be spam at this stage, the third technique, machine learning, is used.
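The rule-based stage described above (phrase lookup, stop word removal, stemming, then comparison with a spam word list) can be sketched as follows. This is a minimal illustration under stated assumptions: the stop word set and spam lists contain only the examples quoted in the text, and `simple_stem` is a hypothetical suffix-stripping stand-in, because the paper does not say which basic stemming algorithm it uses.

```python
import re

# Only the examples quoted in the text; the full lists are not given in the paper.
STOP_WORDS = {"i", "you", "in", "the", "we", "is", "was", "me", "our", "it"}
SPAM_PHRASES = ("tail me back", "markdown drugs")
SPAM_WORDS = ("pornography", "viagra")
SUFFIXES = ("ing", "ed", "es", "s")


def simple_stem(word):
    """Hypothetical basic stemmer: strip one common suffix, keep at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


SPAM_STEMS = {simple_stem(word) for word in SPAM_WORDS}


def is_spam(comment):
    """Flag a comment if it contains a known phrase or a stemmed spam keyword."""
    text = comment.lower()
    if any(phrase in text for phrase in SPAM_PHRASES):
        return True
    tokens = re.findall(r"[a-z']+", text)
    keywords = [token for token in tokens if token not in STOP_WORDS]
    stems = {simple_stem(token) for token in keywords}
    return bool(stems & SPAM_STEMS)
```

Comments that pass this check unflagged would then move on to the machine learning stage, as the text describes.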
Gluon comes with an advanced NLP toolkit, which makes working with text simple. It also incorporates pre-trained language models, the secret sauce of transfer learning. After all, how can we expect a model to work out whether a text is toxic or not if it cannot "speak" English at all? To implement Gluon with NLP, a standalone language model, consisting of an embedding matrix followed by a standard LSTM encoder, is trained first. The LSTM output generated from the sequence of tokens is then pooled and fed to a dense layer.
Tokenization: Tokenization [28] is a vital part of natural language processing and is used for word segmentation. In the first step, we split continuous text such as "I go, you do, we have" into words and encode them as numerical vectors [29]. Each word is thus converted into its numerical representation. The tokenizer is then applied to the segmented text and returns the text as a sequence of numeric indices for the words included in the vocabulary. This brief explanation serves as the foundation for the word2vec method, which is applied to relate words that appear in similar positions in similar contexts.
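As an illustration of the tokenization step, the sketch below splits text into word tokens and maps each token to an integer index. The regular expression, the helper names and the choice to start indices at 1 and to drop unknown words are assumptions made for this example and are not taken from the paper.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(comments):
    """Assign each distinct token an integer index, starting from 1."""
    vocabulary = {}
    for comment in comments:
        for token in tokenize(comment):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


def encode(text, vocabulary):
    """Turn text into a list of indices; tokens outside the vocabulary are skipped."""
    return [vocabulary[token] for token in tokenize(text) if token in vocabulary]


vocabulary = build_vocabulary(["I go, you do, we have"])
encoded = encode("we go", vocabulary)
```

Index 0 is deliberately left unused here so that it remains available as a padding value when comments of different lengths are batched.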
Stop Word Removing: Stop word [30] removal is one of the most commonly used preprocessing techniques at present. Search engines use this technique to ignore words or characters that have little or no value when computing accuracy from the dataset. When building the index, most engines are programmed to remove certain words that carry less weight. The list of words or characters that are not added to the final dataset is known as the stop word list. This saves both space and time, as the words are removed at indexing time and ignored at search time. We remove these sets of words because they carry no information, for example pronouns, prepositions and conjunctions. In the English language there are normally more than five hundred stop words. In our methodology, we removed the stop words using the NLTK library [31] to improve classification.
Stemming: Stemming is the process of reducing a group of similar words to a root word, which is called the lemma. Stemming [32] groups words and gives a set of words the same kind of meaning. We use stemming for natural language processing. When we find a similar kind of word, we assign a particular weight to it for measurement, and when a different word is found it is treated as another kind. To get the best result, the lemma procedure is applied.

CNN for text Classification
Convolutional Neural Networks are mainly popular for image classification because of their ability to exploit two statistical properties: local stationarity and compositional structure. In this work, a CNN [8,9,10] is applied to text classification, where the dataset exhibits these two statistical properties because neighboring words in a sentence depend on each other; however, processing them is not straightforward. In image classification, pixels are integer values within a fixed range, but sentences and words must be encoded before being fed to the network [11]. To do so, we used the NLTK library [12] for Python to build a vocabulary, structured as a list containing the words that appear in the set of comment texts. Each word is then mapped to an integer between one and the size of the vocabulary. The variation in document length (the number of words in a document) must be addressed because CNNs require a constant input dimensionality. For this reason the padding technique is adopted, filling the document matrix with zeros so that every document reaches the maximum length among all documents. In the next step, the encoded documents are transformed into matrices in which each row corresponds to one word. The generated matrices pass through the embedding layer, where each word (row) is transformed into a low-dimensional representation by a dense vector. In general, word embedding methods have been trained on a large corpus of words, producing for each word a dense vector with a specific dimension and fixed values. The word2vec embedding method, for example, has been trained on 100 billion words from Google News, producing a vocabulary of 3 million words. The embedding layer matches the input words with the fixed dense vectors of the pre-trained embedding method that has been selected. The values of these vectors do not change during the training process, unless there are words not already included in the vocabulary of the embedding method, in which case they are initialized randomly.
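The padding, embedding and convolution steps described above can be illustrated with a small pure-Python sketch. It is not the architecture used in this paper, whose filter sizes and layer counts are not given here: the embedding table is random rather than pre-trained word2vec, and the helper names, the kernel shape and the ReLU with max-over-time pooling are assumptions chosen to show the data flow.

```python
import random


def pad_documents(encoded_docs, pad_value=0):
    """Right-pad every encoded document with zeros up to the longest length."""
    max_len = max(len(doc) for doc in encoded_docs)
    return [doc + [pad_value] * (max_len - len(doc)) for doc in encoded_docs]


def make_embedding_table(vocab_size, dim, seed=0):
    """Random stand-in for a pre-trained table; row 0 is the all-zero padding vector."""
    rng = random.Random(seed)
    table = [[0.0] * dim]
    table += [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
    return table


def embed(doc, table):
    """Map each word index to its dense vector, giving one matrix row per word."""
    return [table[index] for index in doc]


def conv1d_max_pool(matrix, kernel):
    """Slide a (width x dim) kernel over the rows, apply ReLU, keep the maximum."""
    width = len(kernel)
    responses = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for offset in range(width):
            row = matrix[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], row))
        responses.append(max(total, 0.0))
    # A document shorter than the kernel yields no window; return 0.0 in that case.
    return max(responses) if responses else 0.0


padded = pad_documents([[1, 2, 3], [4]])
table = make_embedding_table(vocab_size=4, dim=2)
document_matrix = embed(padded[1], table)
feature = conv1d_max_pool([[1, 0], [0, 1], [1, 1]], [[1, 1], [1, 1]])
```

In a full model, many such kernels of different widths would each contribute one pooled value, and the resulting feature vector would be passed to a dense output layer.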

Result & Analysis:-
Using the dataset from Kaggle [22] and comments extracted from YouTube, Facebook and Twitter with a Chrome comment-extraction extension, almost 159,000 records are stored in the training dataset, with the class labels shown in Figure 2. The dataset is labelled as toxic, severe toxic, obscene, threat, insult, identity hate speech and others, where 1 means the comment belongs to that class and 0 means it does not. A comment or text can belong to multiple classes. Figure 3 shows the relation between the classes of the labelled dataset.
The CNN classifier classifies comments based on the eight extracted features and on Gluon natural language processing, which discards all unnecessary words through stop word removal and tokenization, removing common sentence structures, numbers, URL delimiters and so on. After the comment analysis, a list of vulgar words is processed with NLP to classify the vulgar words from the test dataset. Figure 4 shows the predicted output of the system. The multiclass output of the classification is visualized, where the toxic label appears in more than 15,000 comments, obscene in more than 8,500 comments and so on. The architecture achieved 95.4% accuracy with CNN and NLP. Table I shows the comparison of the model with other classification algorithms, where our proposed model gives the best accuracy.
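For reference, accuracy, sensitivity and F1 score, the metrics reported in this paper, can be computed from binary predictions as in the sketch below. The function name and the toy labels are hypothetical and are not the paper's data; this is the standard definition of each metric, not the authors' evaluation script.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (recall) and F1 for binary labels (1 = vulgar)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "f1": f1}


# Toy labels for illustration only.
metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 1, 0],
)
```

Sensitivity measures the share of truly vulgar comments that the classifier catches, which is why it can differ noticeably from accuracy on an imbalanced dataset.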

Conclusion:-
We carried out an in-depth analysis of comments to shed light on various features of spam comments. We were interested in building a system that distinguishes vulgar or spam comments from non-vulgar comments with respect to the characteristics defined. The combination of Gluon NLP with tokenization, stop word removal, stemming, nine numerical features and a large dataset made the result less error-prone. In our classification experiments, we showed that our implementation of the spam detection system gives the best results by using Gluon NLP and the CNN classifier, where we obtained 95.4% accuracy, 82.32% sensitivity and a 95.4% F1 score, which are higher than those of other common algorithms. As future work, we plan to develop an advanced real-time vulgar video and post detector that removes such content instantly before it appears publicly. To do so, we are adding a large amount of data for deep learning and reinforcement modeling, combined with server-side processing for practical implementation.