A Comprehensive study on Text Classification: Application of Convolutional Neural Networks and Deep Learning methods

Text classification is an essential part in many applications, such as web searching, information filtering, language identification and sentiment analysis such as predicting the sentiment of tweets and movie reviews, as well as classifying email as spam or not. Classifying our content and products into categories help users to easily search and navigate within website or application, Deep learning methods are proving very good at text classification. Deep learning is a set of algorithms and techniques to imitate how the human brain works, called neural networks. Different Neural networks such as Convolutional Neural Networks (CNN when used with Deep learning algorithms, like Word2Vec or GloVe , obtain better vector representations for words and also improve the accuracy of classifiers trained with traditional machine learning algorithms. The authors have made a comprehensive study on Text Classification using convolutional neural network (CNN) .The authors will discuss different models and methods and the experimental results based on variety of datasets.

CNN-non-static: Same as above but the pretrained vectors are fine-tuned for each task.Results of the base model against other methods are listed in the above table and thus we can conclude the followings: 1.
The baseline model with CNN-rand does not perform well on its own.But, with the use of pre-trained vectors, the gains were high.

2.
The model with static vectors (CNNstatic) also performs well, giving almost same results as deep learning models.

3.
These results suggest that the pretrained vectors are good, 'universal' feature extractors and can be utilized across datasets.
Random vs static representations: From the results, we can say the static word embedding using pretrained Word2Vec always performs better.On the other hand, the dynamic vector representation model However, when building a baseline model or when our data set is small and context is domain specific which means that we cannot find corresponding vector from pretrained word embedding models.
(GloVe, fastText etc), in such cases BoW may work better than word embedding.[18] CNNs role as a feature extractor: Yoav

•
the base of deep learning where the algorithms are inspired by the structure of the human brain.Neural networks rely on training data to learn and improve their accuracy over time.Text classification is a method of extracting generic tags from unstructured text.These generic tags come from a set of pre-defined categories.Text classification is one of the fundamental tasks in natural language processing and is an emerging field of study such as such as Marketing, Product Management, Academia, entertainment industries and Governance that are already leveraging the process of analyzing and extracting information from textual data.[1]Therefore, it has attracted immense attention from many researchers across the globe.A key problem in text classification is feature representation and extraction, which is mainly based on the bag-of-words (BoW) model.[2] Recently, the rapid development of pre-trained word embedding and deep neural networks has brought new inspiration to various NLP tasks.[3]Word embedding is the neural representation of words for text analysis in a form of a real valued vector.With the pretrained word embeddings, neural networks describe their great performance in many NLP tasks.Recurrent Neural Network (RecurrentNN) and Convolutional Neural Network(CNN) are used widely for text or sentence classification.RNN analyzes a sentence word by word and stores the semantics of all the previous text in a fixedsized hidden layer.This could be suitable to capture semantics of long texts.But, RNN is a biased model, where later words are more dominant than earlier words.[4]Therefore, it reduce the effectiveness when it is used to capture the semantics of a whole document, because key components could appear anywhere in a document rather than at the end.To tackle the bias problem, the Convolutional Neural Network (CNN), an unbiased model is introduced to NLP proceedings, which can fairly determine discriminative phrases in a text with a max-pooling layer.Thus, the convolutional neural network may better capture the semantic of texts as compared to recursive or other networks.II.LITERATURE SURVEY In Convolutional Neural Networks for Sentence Classification by Yoon Kim, a series of experiments with Convolutional Neural Networks (CNN) built on top of word2vec was presented.The suggested model was tested against several benchmarks.In Movie Reviews (MR) and Customer Reviews (CR), the task was to detect positive/negative sentiment.In Stanford Sentiment Treebank (SST-1), there were already more classes to predict: very positive, positive, neutral, negative, very negative.In Subjectivity data set (Subj), sentences were classified into two types, subjective or objective.In TREC the goal was to classify a question into six question types, the results show that after little tuning of hyperparameters the model performs excellent with the use of pre-trained vectors.[5]The article Text Understanding from Scratch by Xiang Zhang and Yann LeCun shows that it's possible to apply deep learning to text understanding from character-level inputs all the way up to abstract text concepts with help of temporal Convolutional Networks (ConvNets) (CNN).Here, the authors assert that ConvNets can achieve excellent performance without the knowledge of words, phrases, sentences and any other syntactic or semantic structures with regards to a human language.[6]The model was tested on the DBpedia ontology classification data set with 14 classes.The results indicate both good training (99.96%) and testing (98.40 %) accuracy, with some improvement from thesaurus augmentation.In addition, the sentiment analysis test was performed on the Amazon Review data set.In this study, the researchers constructed a sentiment polarity data set with two negative and two positive labels.The result is 97.57% training accuracy and 95.07%testing accuracy.The model was also tested on Yahoo!Answers Comprehensive Questions and Answers data set with 10 classes (Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, Politics & Government) and on AG's corpus where the task was a news categorization into four categories (World, Sports, Business, Sci/Tech.).Obtained results confirm that to achieve good text understanding ConvNets require a large corpus in order to learn from scratch.Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao introduced recurrent convolutional neural networks for text classification without human-designed features in their document Recurrent Convolutional Neural Networks for Text Classification [7].The team tested their model using four data sets: 20Newsgroup (with four categories such as computers, politics, recreation, and religion), Fudan Set (a Chinese document classification set that consists of 20 classes, including art, education, and energy), ACL Anthology Network (with five languages: English, Japanese, German, Chinese, and French), and Sentiment Treebank (with Very Negative, Negative, Neutral, Positive, and Very Positive labels).After testing, the model was compared to existing text classification methods like Bag of Words, Bigrams + LR, SVM, LDA, Tree Kernels, Deep learning methods are proving very good at text classification, achieving state-of-the-art results on a suite of standard academic benchmark problems.Let's look at some proposed methods and models one by one for text classification using neural network.Word embeddings with CNN : word embedding is a type of word representation for text analysis, in the form of a real-valued vector that that allows words with similar meaning to have a similar representation.Yoav Goldberg, in his primer on deep learning for natural language processing, states that neural networks offer better performance than classical linear classifiers, especially when used with pretrained word embeddings.[9]He also comments that networks with convolutional and pooling layers are effective at text classification as they easily take out salient features (tokens) in a way that is independent to their position within the input sequences.Single Layer CNN Model: With a single layer CNN architecture, one can get good results for document classification using differently sized kernels with the filters to allow grouping of word representations at different scales.Yoon Kim in his study of the use of pretrained word vectors for Text classification tasks with Convolutional Neural Networks discovered that using pre-trained static word vectors for the task works very well.In this paper, let's discuss the model architecture proposed by Yoon Kim.Yoon Kim in his study of the classification tasks with Convolutional Neural Networks found that pre-trained word embeddings that were trained on very large text corpora, (such as the word2vec vectors) may offer good universal features for use in natural language processing.Kim in his model shows that the sentences are mapped to embedding vectors and are available as a matrix input to the model.Using differently sized kernels (2 or 3 words at a time) convolutions are performed across the input word by word.The output mappings are then processed using a max pooling layer to filter out the extracted features.Bellow is a a diagram provided by Kim that describes the sampling of the filters with differently sized kernels as different colours (red and yellow).

Fig 2 :
Fig 2 : "Convolutional Neural Networks for Sentence classification".[8] will fine-tune the parameters initialized by Word2Vec vectors to learn the meaningful representation for each task which sometimes may result in better performance than the static one but not true for all cases and can have lower accuracy.For example, good is most similar to bad in word2vec as they are almost syntactically equivalent.But for vectors in the non-static channel that were finetuned on the SST-2 dataset, this is no longer the case.Similarly, good is closer to nice and is great for expressing sentiment which is reflected in the learned vectors.Multichannel vs. Single Channel Models: Though the multichannel models can prevent overfitting (ensuring that the learned vectors do not deviate too far from the original values) the results, however are mixed.Instead of using an additional channel for the non-static portion, one could maintain a single channel but employ extra dimensions that are allowed to be modified during training.BoW vs. Word Embedding: For text related problems in the NLP field Word embeddings is indeed a huge success and outperform Bag of Words (BoW) as BoW has the limitations as large feature dimension, sparse representation etc.
variants.Such tasks include: translations, question answering etc.The convolutions and pooling operations of a CNN model lose information about the base order of words.However, one can add positional features to the input to reduce the problem.Pooling also reduces the output dimensionality.By using the max operation, one keep the information about whether or not the feature appeared in the sentence, but we lose the information about where exactly it appeared.VI.CONCLUSION and FUTURE SCOPE Any model that is built on the base of word embedding causes the model to perform extremely well.If hidden layers and the kernel sizes are increased to any neural network it is expected do better in the task.Researchers can also explore other pre-trained word embedding options such as GloVe and FastText with static and dynamic modes and then we can compare the results with Word2Vec.[18]There are several directions in which text classification can play crucial role like Ontology learning, Knowledge extraction, Storytelling, Text summarization etc.

Table 2 :
Tiyasa Chatterjee et al Int.J. Sci.Res.Comput.Sci.Eng.Inf.Technol, November-December-2021, 7 (6): 364-371 369 CNN-multichannel: A model with two sets of word vectors.Each set of vectors is treated as a 'channel' and each filter is applied to both channels, but gradients are backpropagated only through one of the channels.Hence the model is able to fine-tune one set of vectors while keeping the other static.Both channels are initialized with word2vec.Results of the model proposed by Yoon Kim against other methods are listed in the table bellow: Results of the model.[5] Volume 7, Issue 6, November-December-2021 | http://ijsrcseit.com Which deep neural network performs better when dealing with text data highly depends on how often the comprehension of global or long-range semantics is required.For tasks where length of text is the prior, it's a good practice to go with RNN