Comment information extraction based on LSTM and Neural Networks

With the advent of the big data era, the volume of data has grown geometrically, and people's ability to obtain useful information from it has gradually declined. At present, most e-commerce platforms focus only on sentiment analysis that labels reviews as positive or negative, so it is difficult for users and businesses to extract specific opinions and views from massive review data. Using the product review data of a hard disk, this article trains a sentiment classification model with an LSTM network. A neural network is then used to find the keywords of the comment data, and a word cloud is used to display the analysis results. The study finds that the LSTM sentiment classifier classifies comments with high accuracy, and that words closely related to the sentiment tendency of a comment can be found from the weights of the neural network.


Introduction
We live in an era of data explosion. In the face of massive amounts of data, manually extracting useful information is time-consuming and inefficient. Using natural language processing to mine massive data programmatically, and so help users with information acquisition, is currently an important direction in artificial intelligence. Online shopping has become the main way people shop, and reviews are a main factor influencing purchasing decisions. For a large number of reviews, automated information extraction can significantly improve users' shopping efficiency and save their time. At present, text analysis technology is still developing; although it is widely studied, it has not yet reached a mature level. Machines can currently recognize text only through vectors and still cannot understand words the way humans do.
Text sentiment analysis is a difficult problem in natural language analysis. At present, the most common approach is to categorize comments with a sentiment classification model. Few studies analyze the text in detail, so users can obtain only the general attitude towards a product and cannot directly obtain other users' specific views on it. People's emotional tendency towards a product alone does not reflect the full value of comment data. Adding keyword extraction to text sentiment analysis can therefore increase the value of review information and enable users to quickly understand people's general views [1]. For users, gaining some understanding of a product from its reviews can ease the problem of product selection to a certain extent. For merchants, being able to extract the user's point of view from product reviews, and to look at products from that point of view, helps improve the quality of their goods and services. This improves both the user's buying experience and the merchant's product quality, so that a win-win situation between users and merchants is realized.
Badger J et al. found that random forest and logistic regression perform better on text after keywords are added [2]. Zaghloul W et al. found that neural networks are no worse than support vector machines for text classification [3]. VM Kreňáková et al. compared popular transformer models with the traditional TF-IDF algorithm, pre-trained word embeddings and other methods to evaluate their impact on text classification performance, and found that the BERT-BASE uncased model performs best [4]. Hammami Linda et al. found that a rule-based method can effectively classify cancer forms from Italian pathology reports [5]. Hemmatian F et al. surveyed various methods for opinion mining and analyzed their advantages and disadvantages [6]. Kastrati Zenun used a sentiment dictionary to analyze the emotional tendency of teachers' comments on students, realizing comment-based sentiment analysis [7]. Dongwen Zhang et al. found that word2vec combined with SVM can effectively perform sentiment analysis on Chinese comments [8]. Alajlan A A et al. found that machine learning can effectively determine whether speech is dangerous [9]. Sarah Omar Alhumoud and Asma Ali Al Wazrah studied and analyzed the performance of RNN models in Arabic sentiment analysis [10]. Oleynik Michel et al. found that shallow methods for clinical phenotyping outperform deep learning after pre-training [11]. Miao T found that using the Skip-Gram model to extract keywords improves performance over other models [12]. Quan C et al. built a blog sentiment corpus to analyze the sentiment of Chinese blogs [13].
This article trains a sentiment classification model on product review data, with which product reviews can be classified. A neural-network-based classification model is then trained, and the key variables are selected according to the model weights, so as to extract the keywords in the reviews.

Text vectorization
After the text is cleaned, it is converted to numbers by mapping each token one-to-one to an index (subscript).
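The one-to-one mapping between tokens and indices can be illustrated with a minimal sketch. The function names `build_vocab` and `encode` and the sample reviews are hypothetical and chosen only for illustration; the article does not specify how the mapping is implemented.

```python
def build_vocab(tokenized_reviews):
    """Assign each distinct token a unique integer index, in first-seen order."""
    vocab = {}
    for tokens in tokenized_reviews:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each known token by its index; unknown tokens are skipped."""
    return [vocab[t] for t in tokens if t in vocab]


# Hypothetical, already-cleaned and tokenized reviews.
reviews = [
    ["fast", "delivery", "good", "quality"],
    ["good", "price", "fast"],
]

vocab = build_vocab(reviews)
# The inverse mapping lets an index be turned back into its word.
index_to_word = {i: w for w, i in vocab.items()}
encoded = [encode(r, vocab) for r in reviews]
```

Because the mapping is one-to-one, `index_to_word` recovers the original token for every index, which is the property the keyword extraction step later relies on.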
Since the bag-of-words model and One-Hot encoding do not capture the text information completely and therefore lose much of it, the word2vec algorithm is used for text vectorization. Word2vec uses an ordinary fully connected neural network, which can effectively preserve the relationships between contexts [14].

Word2vec model
As Figure 1 shows, word2vec consists of an input layer, a hidden layer and an output layer. The input layer is the One-Hot encoding of a text character. The output layer is the encoding of the two characters before and the two characters after that character. Each character in the model is used to fit the characters of its context, so the vector formed by the values of the intermediate hidden layer is the word vector of the character. This article builds a word2vec model and trains it on the text data. To visualize the relationships between words, the PCA algorithm is used to reduce the 100-dimensional word vectors to 2 dimensions.
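The way a character is paired with its context can be sketched as follows. This is only an illustration of how training pairs might be generated with a window of two positions on each side, as described above; the helper names `skipgram_pairs` and `one_hot` and the sample tokens are assumptions, and the network training itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with the tokens up to `window` positions before and after it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def one_hot(index, size):
    """One-Hot encoding used at the input layer: a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical tokenized comment.
tokens = ["disk", "runs", "very", "quickly"]
pairs = skipgram_pairs(tokens, window=2)
```

Each `(center, context)` pair is one training example: the One-Hot vector of the center token is the input, and the context token is the target the network tries to fit.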
This article visualizes only fifty words to observe the results of the model. The analysis shows that "pretty good", "very quickly" and "compatible" lie close together, indicating that these words play the same role [15]. This matches the actual situation, since all of these expressions appear in positive comments on the product, and it suggests that the model works well. The model is then evaluated by analyzing the similarity between words, using the word "great" as an example. The results are shown in Table 1.
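Similarity between word vectors is commonly measured with cosine similarity. The sketch below assumes that measure and uses small made-up 3-dimensional vectors in place of the trained 100-dimensional ones; the vectors and the helper `most_similar` are hypothetical and do not reproduce the values in Table 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "great": [0.9, 0.8, 0.1],
    "good": [0.8, 0.9, 0.2],
    "excellent": [0.95, 0.7, 0.05],
    "broken": [-0.7, -0.6, 0.9],
}

neighbours = most_similar("great", toy_vectors, topn=2)
```

With these toy values the two nearest neighbours of "great" are the other positive words, while "broken" has a negative similarity.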

Model training
This article uses TensorFlow to train the LSTM network. The LSTM model is suitable for processing and predicting important events separated by long intervals and delays in a time series. It can likewise be used to process comment data, which is contextual in nature. Compared with a plain RNN, LSTM largely avoids the vanishing gradient problem.
The LSTM network processes the text character by character. Each character passes through an LSTM unit. LSTM has a long-term memory [16], which is updated after each character passes through the unit. Finally, the sigmoid function is used for activation. Taking the relationships within the comment text into account when classifying can effectively improve the accuracy of the classification results [17][18].
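The memory update performed for each character can be sketched with the standard LSTM gate equations. The example below is a scalar toy version written in plain Python rather than TensorFlow; the parameter names (`wf`, `uf`, `bf`, and so on), their values, and the final output weights are hypothetical, and the article does not state the exact architecture used.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step on a scalar input, using the standard gate equations."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g   # long-term memory (cell state) is updated
    h = o * math.tanh(c)     # short-term output (hidden state)
    return h, c


def classify(sequence, p, w_out=1.0, b_out=0.0):
    """Run the sequence through the cell, then apply a sigmoid to the last hidden state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
    return sigmoid(w_out * h + b_out)


# Hypothetical parameters; a trained model would learn these values.
names = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]
params = {name: 0.5 for name in names}

probability = classify([0.2, -0.4, 0.9], params)
```

The returned value lies strictly between 0 and 1 and can be read as the predicted probability of a positive comment in this toy setting.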
The training results of the LSTM model are shown in Table 2.

Keyword extraction method
There are many methods for text vectorization, but with most of them it is difficult to recover the words that correspond to a vector. Few vectorization methods allow words to be found from vectors. The most direct one is word-frequency vectorization. In this way, we can not only quantify the text data but also find the corresponding words from the variables [19].
Given high-quality review data, this vectorization method is used to find the weights between the input feature variables and the positive and negative review outputs in the neural network, so that the main features affecting the classification result can be analyzed. The variables with larger weights are extracted and mapped back to words to obtain the keywords that drive good and bad reviews. These words are indicative of positive and negative reviews, so they can be regarded as review keywords [20].
The comment data are segmented with jieba word segmentation, then filtered with a stop-word list to remove useless words. Finally, the term frequency algorithm is used to vectorize the words.
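The filtering and term-frequency steps can be sketched as follows. To keep the example self-contained, it starts from already segmented tokens instead of calling jieba, and the stop-word list, sample comments and function names are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "a", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that carry little meaning for classification."""
    return [t for t in tokens if t not in stop_words]


def tf_vectorize(docs):
    """Return (vocabulary, matrix), where matrix[d][i] counts vocabulary[i] in document d."""
    vocabulary = sorted({t for doc in docs for t in doc})
    column = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row[column[word]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical segmented comments.
raw = [
    ["the", "disk", "is", "fast", "and", "fast"],
    ["the", "disk", "is", "noisy"],
]
cleaned = [remove_stop_words(doc) for doc in raw]
vocabulary, matrix = tf_vectorize(cleaned)
```

Since column `i` of the matrix always corresponds to `vocabulary[i]`, any feature index singled out later can be mapped straight back to its word.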

Building neural network sentiment classification model
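After the words are vectorized, all words in the comments are used as feature inputs. The feature vector corresponding to the words is $x_1, x_2, \ldots, x_n$, the hidden layer neurons are $y_1, y_2, \ldots, y_n$, and the output layer neurons are $z_1, z_2$ [21].
Connection between the hidden layer neurons and the feature vector:
$$y_j = \sum_{i=1}^{n} w_{ij}\, x_i \tag{1}$$
Connection between the hidden layer and the output layer:
$$z_k = \sum_{j=1}^{n} v_{jk}\, y_j \tag{2}$$
Finally, the Softmax function is used to predict the result. According to formulas (1) and (2), we can get formula (3):
$$z_k = \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} w_{ij}\, v_{jk} \Big) x_i = \sum_{i=1}^{n} u_{ik}\, x_i \tag{3}$$
Here $w_{ij}$ is the weight between input feature $x_i$ and hidden neuron $y_j$, $v_{jk}$ is the weight between hidden neuron $y_j$ and output neuron $z_k$, and $u_{ik} = \sum_{j} w_{ij} v_{jk}$ is the weight between the input feature and the output layer. The input feature $x_i$ corresponding to a larger $u_{ik}$ is a keyword in the comment. (The algorithm uses the PyTorch library for Python to construct the neural network for modeling and calculation.)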
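The idea of reading keywords off the weight between an input feature and the output layer can be sketched in plain Python. The sketch assumes that the hidden layer is linear, so that the input-to-hidden and hidden-to-output weights can be multiplied into a single input-to-output weight; the weight values, vocabulary and function names are hypothetical, and PyTorch is not used so that the example stays self-contained.

```python
def effective_weights(w_in, w_out):
    """Combine two linear layers into one input-to-output weight table.

    w_in[i][j]  : weight from input feature i to hidden neuron j
    w_out[j][k] : weight from hidden neuron j to output neuron k
    Returns u with u[i][k] = sum over j of w_in[i][j] * w_out[j][k].
    """
    n_hidden = len(w_out)
    n_out = len(w_out[0])
    return [
        [sum(row[j] * w_out[j][k] for j in range(n_hidden)) for k in range(n_out)]
        for row in w_in
    ]


def top_keywords(vocabulary, u, output_index, topn=2):
    """Words whose features have the largest combined weight for one output neuron."""
    ranked = sorted(
        range(len(vocabulary)), key=lambda i: u[i][output_index], reverse=True
    )
    return [vocabulary[i] for i in ranked[:topn]]


# Hypothetical vocabulary and weights; output 0 = positive, output 1 = negative.
vocabulary = ["fast", "noisy", "disk", "broken"]
w_in = [[1.0, 0.0], [-1.0, 0.5], [0.1, 0.1], [-0.5, 1.0]]
w_out = [[2.0, -2.0], [-1.0, 1.0]]

u = effective_weights(w_in, w_out)
positive_keywords = top_keywords(vocabulary, u, output_index=0, topn=1)
negative_keywords = top_keywords(vocabulary, u, output_index=1, topn=2)
```

In this toy setting "fast" carries the largest weight towards the positive output, while "noisy" and "broken" carry the largest weights towards the negative output. If the hidden layer had a nonlinear activation, the two layers could not be collapsed this way and the ranking would only be an approximation.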

Model results
This article analyzes the review data of a hard disk product. Because the platform displays only a limited number of reviews, only 2000 reviews are used for analysis. After the neural-network-based text sentiment classifier is established, the model is fitted to the comment data preprocessed by the LSTM model. The accuracy of the fitted model is 96%. From the fitted model, the weight of each connection can be obtained. Finally, the weight between each input feature and the output layer is solved from the input-to-hidden and hidden-to-output weights, so that the parameters with larger weights in each neuron can be obtained.
The result of extracting keywords from hard disk comments is shown in Figure 2.

Conclusion
With the geometric growth of information, information extraction technology plays a vital role in people's lives. Quickly obtaining key information from comment data can save people's time and improve their efficiency. This article builds an LSTM model on the comment data to classify hard disk comments. A neural network is then fitted to the sentiment classification results of the LSTM model. Finally, according to the neuron weights of the fitted model, comment keywords can be found in the positive and negative comments. Keyword extraction based on neural network weights can effectively find the words related to the sentiment tendency of the comments. However, when the review data are heavily imbalanced, the method tends to perform poorly. In the future, we hope to build keyword dictionaries for different types of goods. We expect that applying the dictionary of the corresponding commodity to the information extraction results of different comment data will significantly improve extraction.