A Novel Method for Reducing Overhead of Training Sentiment Analysis Network

Abstract: Statistics-based sentiment analysis has developed rapidly with deep learning. Bidirectional attention neural networks (BANN), especially Bidirectional Encoder Representations from Transformers (BERT), have reached high accuracy. However, as networks grow deeper and corpora grow larger, the computational overhead of BANN increases geometrically. How to reduce the scale of the training corpus has therefore become an important research focus. This paper proposes a corpus-reduction method called Concept-BERT, which consists of the following steps: first, using Formal Concept Analysis (FCA), Concept-BERT mines the association rules within the corpus and reduces the corpus attributes, thereby reducing the corpus scale; second, the reduced corpus is input to BERT and the result is obtained; finally, the attention of Concept-BERT is analyzed. Concept-BERT is evaluated on CoLA, SST-2, Dianping and Blogsenti, and its accuracy reaches 81.1, 92.9, 77.9 and 86.7, respectively. Our experimental results show that the proposed method matches the accuracy of BERT while using a smaller corpus and less overhead, and that the smaller corpus does not affect the model's attention.


Introduction
Sentiment analysis plays an important role in Natural Language Processing (NLP), helping us mine the emotional tendency of text. With the advent of the big data era, sentiment analysis has found increasingly broad application in areas such as movie evaluation [1], opinion mining [2], behavior prediction [3] and social networks [4]. Generally, there are two kinds of sentiment analysis methods: rule-based and statistics-based. In the past, rule-based methods, which combine linguistics, semantics and pragmatics, were the main approach. Nowadays, with the growth of computing power and data scale, statistics-based methods are gradually replacing rule-based ones. Many models perform well on sentiment analysis, such as Neural Network Language (NNL) [5], Support Vector Machines (SVM) [6], Word2vec [7], Embedding from Language Models (ELMo) [8], Generative Pre-Training (GPT) [9] and BERT [10].
The understanding capability of these models is closely related to the scale of the corpus used in training. In general, the larger the training corpus, the higher the accuracy of the model. However, the corpus is typically unstructured and contains many interfering and invalid samples. Training therefore requires substantial computing overhead at high cost, and much of that computation is wasted. It is thus necessary to find a suitable method for dealing with unstructured corpus. Concept lattice, also known as FCA, is a natural data processing tool proposed by Wille in 1982 [11]. A concept lattice constructs the concept hierarchy according to the binary relation between attributes and objects. It embodies the relationship among object, attribute, concept extent and concept intent. Because it contains comprehensive information, it is used in many situations, such as library detection [12], online community [13] and recommendation system [14].
Concept lattice can be used in the preprocessing of sentiment analysis corpus. Analyzing the internal relevance, dependence and structure of the corpus is more important than focusing on the corpus itself. Using concept lattice association rules to reduce corpus scale, and constructing a fuzzy concept lattice based on ontology, can reflect the constraints of corpus relevance. Attribute dependence under fuzzy transformation and information structure dependence can measure corpus dependence. Therefore, this paper combines corpus analysis with concept lattice to mine corpus features and reduce corpus scale for sentiment analysis.
The contributions of this paper are listed as follows:
1. The grammar structure of the training corpus is generated by analyzing the corpus itself, so that models avoid over-learning corpus with the same structure. We extract feature information from the corpus to form a feature attribute concept lattice, and use high-frequency feature attribute clustering to reduce the concept lattice. In this way, the reduced-scale concept lattice structure and the concept lattice reduction rules are generated;
2. We use the generated concept lattice reduction rules to reduce the scale of same-structure corpus and generate the corpus concept lattice;
3. Based on the concept lattice reduction method and the bidirectional attention neural network, we propose the Concept-BERT sentiment analysis model. Concept-BERT is evaluated on CoLA, SST-2, Dianping and Blogsenti to demonstrate its validity.
The paper is organized as follows: the next section gives the basic notations of FCA and related works; Section 3 introduces the framework of Concept-BERT; Section 4 presents the experiments; finally, Section 5 concludes the paper.

Related Works
In this section, we introduce some statistics-based methods and some notations of FCA. In 2003, Yoshua Bengio proposed the Neural Network Language (NNL) model [5], the first model to apply neural networks to NLP. NNL uses one-hot encoding and predicts words by learning the position information within a sentence. The word embedding he proposed made outstanding contributions to the later development of word embedding neural networks. However, data and computing power were limited at that time. The model was not well developed and was quickly replaced by methods based on Support Vector Machines [6]. In the era of small-scale data, SVM requires only a small number of samples to achieve good results. However, its robustness to missing data is poor, and its high-dimensional kernel functions are difficult to interpret.
In 2013, the Google team led by Tomas Mikolov proposed Word2vec [7], which includes two important models: the Continuous Bag-Of-Words Model (CBOW) and the Continuous Skip-gram Model (Skip-gram) [15]. Word2vec uses context to reduce the dimensionality of word embeddings, and also achieves good results on implicit semantics [16]. In 2018, Matthew E. Peters proposed Embedding from Language Model (ELMo) [8]. ELMo uses dynamic embedding, generating different word embeddings for different contexts, to resolve word ambiguity [17]. In the same year, OpenAI proposed Generative Pre-training (GPT) [9]. GPT is the first model to replace LSTM with Transformer for sentiment analysis [18]. The Transformer is a feature extractor with strong capabilities, built as a stack of encoders and decoders. A few months later, a Google team proposed Bidirectional Encoder Representation from Transformers (BERT) [10], a general model for NLP. It adds a bidirectional Transformer on the basis of GPT, which improves the model's capability to predict specific positions. Compared with existing models, BERT has achieved good results on most tasks.
Against the background of deep learning and big data, deep networks and large-scale corpora have become the development direction of sentiment analysis. Many studies have shown that a deeper network and a larger corpus tend to improve model performance. However, the computational cost is high, and much of the corpus is unstructured, which causes the model to learn many invalid and interfering samples during training. Therefore, corpus preprocessing becomes an essential step of sentiment analysis. How to effectively reduce corpus scale and discover hidden associations in the corpus has gradually become a new research focus.
In 2016, Tao Chen, Ruifeng Xu et al. proposed a method of regional sentiment analysis [19]. This method divides the corpus into multiple regions, uses a convolutional neural network to assign regional weights, and performs sentiment analysis on the important regions. In 2017, Bryan McCann, James Bradbury et al. verified the importance of sentiment granularity for sentiment analysis [20]. They proposed that the smaller the emotion granularity, the more necessary it is to filter important emotions in context. In 2020, Qianming Xue, Wei Zhang et al. proposed a domain recognizer, which uses two feature extractors to detect domain-invariant sentiment features [21]. Tomoki Ito, Kota Tsubouchi et al. proposed an interpretable neural network with lexical initialization learning, which first judges primitive-level word emotions out of context and then judges global emotions [22]. Su-Jin Shin, Kyungwoo Song et al. proposed hierarchical clustering to reduce corpus scale [23].
Concept lattice is well developed, and its reduction methods are mature, including K-means clustering reduction [24], generalized spectral clustering reduction [25], granular computing [26], fuzzy attribute parameterization [27] and context vector reduction [28]. These methods are very effective for attribute reduction of concept lattices. In addition, concept scaling can effectively filter formal concepts [29], and the BiC2PAM algorithm can improve concept lattice generation after adding context knowledge constraints [30].
Concept lattice takes the concept as the basic unit of analysis and derives its structure from a formal context. A partial order on the concepts yields the complete concept lattice. Definition 1 (Formal Background): A set of objects $G$, a set of attributes $A$ and a binary relation $I$ between objects and attributes constitute a formal background $K$. In general, it is defined as the triplet $K = (G, A, I)$. In $I$, for any object $g \in G$ and attribute $a \in A$, if object $g$ has attribute $a$, it is recorded as $(g, a) \in I$.
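As a minimal illustration of Definition 1, the sketch below stores a formal background as a set of (object, attribute) pairs and enumerates its formal concepts (extent, intent) by brute force. The toy objects and attributes, and the helper names `common_attributes`, `common_objects` and `formal_concepts`, are illustrative assumptions and do not come from this paper; the enumeration is exponential in the number of objects and is only meant for very small contexts.

```python
from itertools import combinations

# Toy formal background K = (G, A, I); objects and attributes are hypothetical.
G = ["s1", "s2", "s3"]
A = ["funny", "great", "worst"]
I = {("s1", "funny"), ("s1", "great"), ("s2", "great"), ("s3", "worst")}


def common_attributes(objects):
    """Attributes shared by every object in `objects` (the intent)."""
    return frozenset(a for a in A if all((g, a) in I for g in objects))


def common_objects(attributes):
    """Objects that have every attribute in `attributes` (the extent)."""
    return frozenset(g for g in G if all((g, a) in I for a in attributes))


def formal_concepts():
    """Brute-force enumeration of (extent, intent) pairs over all object subsets."""
    concepts = set()
    for size in range(len(G) + 1):
        for subset in combinations(G, size):
            intent = common_attributes(subset)
            extent = common_objects(intent)  # closure of the chosen subset
            concepts.add((extent, intent))
    return concepts


concepts = formal_concepts()
```

Ordering the resulting concepts by inclusion of their extents gives the partial order that forms the lattice.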

Method and Model
In this section, we firstly explain the basic idea and framework of Concept-BERT, and then propose the method for training corpus reduction.

Framework of Concept-BERT
Sentiment analysis can be understood as mapping a given corpus to different sentiment categories. Given the corpus and tags, sentiment analysis is performed by the following operations [35]: where T represents the corpus that belongs to the sentiment, and D represents the opposite situation. Concept-BERT is composed of two parts. The first part uses a high-frequency feature clustering method to reduce the feature attributes of the corpus. The second part performs sentiment analysis by inputting the reduced corpus to BERT with one-hot encoding, as shown in Figure 1.

Training Corpus Reduction
Sentiment features are important factors that affect the accuracy of sentiment analysis. BERT is trained by randomly masking a word in a sentence with the [mask] tag and predicting the probability of each word at the masked position, so the grammar structure has a decisive influence on its effectiveness. For example, consider "This movie makes me happy" and "This movie makes me glad". These two sentences have the same grammar structure and both express positive sentiment; they differ only in the sentiment word, "happy" versus "glad". In this case, according to Definition 4, one of the two words can stand in for the other, so the two feature attributes can be reduced to one. Unstructured corpus often contains invalid and interfering sentences, so the impact of these factors needs to be reduced. Considering the probability of such negative factors, we use frequency filtering to remove them. TF-IDF is a common method for computing the importance of words, and is composed of term frequency (TF) and inverse document frequency (IDF) [36]. TF-IDF is calculated as follows:
$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i}$$
TF is calculated as follows:
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the frequency of word $t_i$ in corpus $d_j$, and $\sum_{k} n_{k,j}$ is the sum of the frequencies of all words in $d_j$.
IDF measures the general importance of a word. It is calculated as follows:
$$\mathrm{IDF}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ represents the total number of corpora, and $|\{j : t_i \in d_j\}|$ represents the number of corpora containing $t_i$.
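The TF-IDF weighting described above can be sketched in a few lines of Python. This is an illustrative implementation of the textbook formula, not the authors' code: the function name `tf_idf`, the natural logarithm, the absence of smoothing and the three toy sentences are all assumptions made here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus for illustration only.
docs = [
    "this movie makes me happy".split(),
    "this movie makes me glad".split(),
    "the ending is too hard".split(),
]
weights = tf_idf(docs)
```

In this toy corpus, "happy" occurs in one document and "this" in two, so "happy" receives the larger weight in the first sentence; words shared by every document would receive a weight of zero.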
In this way, we can not only delete the interfering and invalid corpus, but also find the high-frequency features in the corpus. We cluster the high-frequency features with K-means. Assuming that the clusters are $(C_1, C_2, \ldots, C_k)$, the goal is to minimize the squared error $E$:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2$$
where $\mu_i$ is the mean vector of $C_i$:
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
We use association vectors to calculate the semantic similarity $Sim(\omega_1, \omega_2)$ and to cluster the features, where $\omega_{1,i}$ represents position $i$ of one corpus and $\omega_{2,i}$ represents position $i$ of another corpus. Therefore, for the training corpus $D$:
Algorithm 1:
Input: $D$
Output: Reduced-$D$
Step 2: Calculate the semantic similarity of association vectors $Sim(\omega_1, \omega_2)$
Step 3: Run K-means clustering
Step 4: Reduce concept lattice by association rules and Definition 4
For example, consider the corpora "Nicholson's understated performance is wonderful." … "It's both degrading and strangely liberating to see people working so hard at leading lives of sexy." First, Concept-BERT performs feature importance analysis and screens key features to construct the formal background, as shown in Table 1.
Notes: A, B, C, D, E, F, G, H and I represent "funny", "best", "great", "performance", "fascinating", "worst", "just", "too" and "hard", respectively.
In Table 1, each column represents an attribute and each row represents an object. When an object has an attribute, the entry is recorded as 1; otherwise, it is recorded as 0. Then, the concept lattice is reduced by Algorithm 1. The reduction rules, concept lattice and reduced concept lattice are shown in Figure 2.
As shown in Figure 2, the reduction rules are "Fascinating hard=[100%]=>best", "worst=[100%]=>too" and "just=[100%]=>great". So that Concept-BERT can learn more attributes from a small-scale corpus, the reduction also calculates the weight of each object-attribute pair as an object weight in addition to the association rules, and objects with smaller weights are then reduced.
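To make the notion of a 100% association rule concrete, the following sketch searches a small binary formal background for rules of the form premise => conclusion that hold for every object matching the premise. It is a brute-force illustration under stated assumptions, not the paper's Algorithm 1: the function name `exact_rules`, the limit of two attributes per premise and the toy context are hypothetical, and the context is not the paper's Table 1.

```python
from itertools import combinations


def exact_rules(context, max_premise=2):
    """Find rules (premise, conclusion) that hold with 100% confidence.

    `context` maps each object to the set of attributes it has.
    """
    attributes = sorted(set().union(*context.values()))
    rules = []
    for size in range(1, max_premise + 1):
        for premise in combinations(attributes, size):
            matching = [attrs for attrs in context.values() if set(premise) <= attrs]
            if not matching:
                continue  # premise never occurs, so any rule would be vacuous
            for conclusion in attributes:
                if conclusion in premise:
                    continue
                if all(conclusion in attrs for attrs in matching):
                    rules.append((premise, conclusion))
    return rules


# Hypothetical toy context for illustration only.
context = {
    "o1": {"fascinating", "hard", "best"},
    "o2": {"worst", "too"},
    "o3": {"too", "hard"},
    "o4": {"best"},
}
rules = exact_rules(context)
```

In this toy context every object with "worst" also has "too", so the rule worst => too holds, while the converse does not because "o3" has "too" without "worst". An attribute that is always implied by others carries no extra information, which is the intuition behind removing it.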
From Figure 2, after reduction by association rules, the weights of the 12 objects are calculated. Of these, 5 objects are removed because they have smaller weights, with only one attribute each. Before reduction there are 20 concepts, 9 attributes and 12 objects; after reduction there are 13 concepts, 6 attributes and 7 objects. Comparing the two cases, 7 concepts, 3 attributes and 5 objects are removed by the reduction. The upper-level objects are abstract and contain few attributes; the lower-level objects are specific and contain more attributes. The object-attribute details before reduction are shown in Table 2, and those after reduction are shown in Table 3. In Table 2 and Table 3, Concept_1 to Concept_18 denote the concepts that generate the concept lattice. Each concept may contain multiple objects and attributes.

Bidirectional Encoder Representation from Transformers
Concept-BERT is built on BERT. The description of BERT is as follows [10]: BERT is a stack of bidirectional Transformers. A Transformer encoder is composed of self-attention and feed-forward sublayers. A Transformer decoder is composed of self-attention, encoder-decoder attention and feed-forward sublayers. Concept-BERT inherits from BERT: its number of layers, hidden size, number of self-attention heads and feed-forward size are 12, 768, 12 and 3072, respectively, as shown in Figure 3. As Figure 3 shows, the input of BERT is in general a one-hot-encoded word embedding composed of token embedding, segment embedding and position embedding. It is passed between layers through residual connections. The calculation of the feature matrix is as follows [18]:
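The feature-matrix formula did not survive in the text above. Assuming it is the standard scaled dot-product attention of the Transformer, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{T}/\sqrt{d_k})V$, a minimal pure-Python sketch looks as follows. The function names and the tiny 2x2 matrices are illustrative assumptions; only the four size constants come from the text.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    weights = [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


# Sizes reported in the text: layers, hidden size, attention heads, feed-forward size.
NUM_LAYERS, HIDDEN_SIZE, NUM_HEADS, FFN_SIZE = 12, 768, 12, 3072
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # dimensions handled by each attention head

# Tiny illustrative inputs: two tokens with two-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` is a probability distribution over the input tokens, and each output row is the corresponding weighted average of the rows of `V`.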

Computational Overhead of Concept-BERT
Concept-BERT uses the forward-matching regular expression form and combines it with the following loss function: This method can reduce the computational overhead of deep networks. The forward-matching regular expression is as follows: Under the optimization of the forward-matching regular expression, the state of layer $L+1$ is as follows:

Experiments and Result Analysis
We verify the effectiveness of Concept-BERT on CoLA, SST-2, Dianping and Blogsenti using a Tesla V100 [37]. Details of the experimental settings and results are given in the following sections.

Dataset
In this paper, each corpus we use contains three fields: number, comment and sentiment. CoLA is a corpus for judging grammatical correctness; we include it to test how well Concept-BERT extends to other NLP tasks. SST-2 is a movie evaluation corpus. Dianping is a Chinese corpus that records customers' evaluations of products. Blogsenti consists of comments from blog replies. Before the experiment, we randomly shuffle each corpus to ensure the repeatability of the experiment. Then, we use 60% of the corpus for training, 20% for validation and the remaining 20% for testing. We divide the reduced corpus in the same way after applying Concept-BERT. We keep the same test set in both cases to ensure the stability of the experiment. Finally, we add [CLS] and [SEP] at the beginning and end of each sentence, as the model requires. Table 4 shows the detailed information of the corpora.
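The shuffle, the 60/20/20 split and the special-token step described above can be sketched as follows. The function names, the fixed seed value and the placeholder sentences are assumptions made for illustration; the paper does not state how the shuffle was seeded.

```python
import random


def split_corpus(samples, seed=42, train=0.6, valid=0.2):
    """Shuffle with a fixed seed, then split into train, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # the remainder is the test set
    )


def add_special_tokens(sentence):
    """Wrap a sentence in the [CLS] and [SEP] markers described in the text."""
    return f"[CLS] {sentence} [SEP]"


# Placeholder corpus for illustration only.
corpus = [f"sentence {i}" for i in range(10)]
train_set, valid_set, test_set = split_corpus(corpus)
```

Using a private `random.Random(seed)` instance rather than the module-level generator means that other random calls in a program cannot change the split.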

Construct Concept Lattice
We use Concept-BERT to extract the high-frequency features of CoLA, SST-2, Dianping and Blogsenti by Algorithm 1 and Definition 4. The four corpora are reduced under the semantic similarity and K-means clustering concept reduction rules. The reduced concept lattice is shown in Figure 4. In Figure 4, each node is a concept, which contains multiple objects and attributes. The blue part indicates that the concept has attributes, and the black part indicates that it has objects. These objects and attributes have similar composition structures or association rules. The connections among concepts represent the partial order on the concepts. The objects of an upper-level concept include those of its lower-level concepts. Upper-level concepts are abstract and have fewer attributes, while lower-level concepts are specific and have more attributes. The objects are comments, and the attributes are keywords and stars. Concept-BERT mines the derivation rules among keywords through the reduction rules over comments, stars and keywords. It then reduces the attributes for keywords that are repeated, similar or deducible from others. This process reduces the number of comments and uses as few keywords as possible to carry as much sentiment information as possible, thereby reducing the scale of the training corpus.

Parameter Setting
We use the following learning rate function during training: We use a randomly selected small-scale training corpus to fit the learning rate function and find the best learning rate. The result is shown in Figure 5. In each sub-graph of Figure 5, we can clearly see how the loss changes with the learning rate on each corpus. We choose the maximum learning rate before the loss increases sharply. We set the learning rate to $e^{-3}$ on CoLA, $e^{-3}$ on SST-2, $e^{-4}$ on Dianping and $e^{-2}$ to $e^{-6}$ on Blogsenti.
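The selection rule stated above, taking the largest learning rate before the loss increases sharply, can be sketched as a small helper. This is one possible reading of that rule rather than the authors' procedure: the function name `pick_learning_rate`, the `blowup` threshold and the sweep values are all hypothetical.

```python
def pick_learning_rate(learning_rates, losses, blowup=1.5):
    """Return the largest learning rate tried before the loss rises sharply.

    `learning_rates` must be sorted in increasing order. `blowup` is a
    hypothetical threshold: the loss counts as rising sharply once it
    exceeds `blowup` times the lowest loss seen so far.
    """
    best_loss = float("inf")
    chosen = learning_rates[0]
    for lr, loss in zip(learning_rates, losses):
        if loss > blowup * best_loss:
            break  # the loss has started to increase sharply
        best_loss = min(best_loss, loss)
        chosen = lr
    return chosen


# Made-up sweep for illustration only; these are not measurements from the paper.
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
sweep_losses = [0.70, 0.68, 0.60, 0.45, 0.52, 2.30]
best_lr = pick_learning_rate(lrs, sweep_losses)
```

In practice a more conservative choice, such as the rate with the lowest loss, may train more stably; the threshold here only illustrates the idea of stopping just before divergence.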

ACC and Epoch
After setting the learning rate for each corpus, we train for 40 epochs. The experimental results are shown in Figure 6.
Figure 6. ACC-Epoch of Corpora
As Figure 6 shows, ACC gradually rises and then stabilizes as the epochs progress on all corpora. Loss decreases throughout, while Val_loss first decreases and then increases. When Loss and Val_loss decrease together, the model is learning effectively. When Loss decreases while Val_loss increases, the model performs well on the training set but not on the validation and test sets. In this case, the model is overfitting.
From Figure 6, when Concept-BERT reaches its best result, its ACC is similar to the best result of BERT. Concept-BERT reaches high accuracy when trained on the concept-reduced corpus.

Evaluation Metrics
To rigorously analyze the performance of Concept-BERT, we adopt multiple measures, including precision (P), recall (R), F1-score (F1) and accuracy (ACC).
Precision, recall, F1-score and accuracy are a set of metrics widely used in model evaluation. TP indicates that the sample is positive and the model predicts it as positive. TN indicates that the sample is negative and the model predicts it as negative. FP indicates that the sample is negative but the model predicts it as positive. FN indicates that the sample is positive but the model predicts it as negative.
The precision is the ratio of correctly classified positive samples to all samples classified as positive.
$$P = \frac{TP}{TP + FP}$$
The recall is the ratio of correctly classified positive samples to all actual positive samples.
$$R = \frac{TP}{TP + FN}$$
F1-score is the harmonic mean of precision and recall.
$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$
ACC is the ratio of correctly classified test samples to the total number of test samples. It measures the capability of a model to correctly classify new data.
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
Table 5 records the results of Concept-BERT on each corpus. As Table 5 shows, Concept-BERT achieves a higher score than the benchmark when using concept reduction in training, and the computational overhead is reduced. We compare the results with BERT in Table 6. Table 6 shows that Concept-BERT achieves the same ACC as BERT. In terms of training corpus scale, Concept-BERT uses about half as much corpus as BERT. On a Tesla V100, training time is reduced by about 46.6%. The computing power of the Tesla V100 is 7.5 TFLOPS. On this basis, the computational overhead of Concept-BERT and BERT training on each corpus is shown in Table 7.
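The four evaluation metrics defined in this section follow directly from the TP, TN, FP and FN counts, as in the sketch below. The function name `classification_metrics`, the zero-division fallback of 0.0 and the example counts are assumptions for illustration; the counts are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score and accuracy from confusion counts."""
    # Fall back to 0.0 when a denominator is zero (an assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Because F1 is a harmonic mean, it always lies between precision and recall and sits closer to the smaller of the two.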

Attention Analysis
Attention is the core of the bidirectional attention model, so the attention of Concept-BERT is an important basis for judging its effectiveness. To examine it, we analyze the trained model on "I like this movie. I hate the ending of this movie". The analysis results are shown in Figure 7. , not the next word. Therefore, attention to the next word only applies within a sentence. Figure 7 (b) shows most attention focusing on the previous word. For [SEP], attention is directed to [SEP], not the previous word. Therefore, attention to the previous word also applies only within a sentence. Figure 7 (c) shows most attention focusing on related words, including the etymology. This attention goes beyond a single sentence and is applied across the two different sentences.

Conclusion
This paper proposes a method called Concept-BERT for sentiment analysis with a small-scale training corpus. It analyzes the structure of a large-scale corpus and uses a high-frequency clustering method to extract sentiment features. Then, it uses association rules to reduce the corpus scale and construct the reduced concept lattice. The reduction rules are generated while constructing the concept lattice. These rules reduce the impact of invalid and interfering corpus, thereby reducing the corpus scale. We run experiments on CoLA, SST-2, Dianping and Blogsenti on a Tesla V100 with 7.5 TFLOPS of computing power. The experimental results show that Concept-BERT trained on the small-scale corpus achieves the same score as BERT trained on the large-scale corpus. Concept-BERT reduces computational overhead while preserving the quality of sentiment analysis. Across the four corpora, its computational overhead is reduced by about 46.6%.
Several problems remain unsolved. Among the interesting ones are how to improve the attention and feature extraction capabilities of Concept-BERT with respect to the relevance of different word embeddings, and how to apply Concept-BERT to spatio-temporal data and service recommendation [38][39][40][41][42][43][44][45]. In the future, we will focus on solving these problems.