Research on human-vehicle interaction based on natural language processing

With the rapid development of intelligent connected vehicles and driverless technology, dialogue between the driver and in-vehicle artificial intelligence has become increasingly important. Given a driver's speech converted into text, the goal of this research is to correctly identify the driver's intention and produce an appropriate response. Google released the first-generation BERT (Bidirectional Encoder Representations from Transformers) pre-trained model in 2018, opening a new direction for natural language understanding. This article analyzes current mainstream machine learning algorithms and applies them to natural language processing (NLP) to classify a text dataset, with the aim of studying which models achieve higher text-classification accuracy. The results not only lay a foundation for improving human-vehicle interaction, but also serve as a reference for the natural language processing components of future connected-vehicle and driverless technology.


Introduction
Artificial intelligence (AI) is a vital part of intelligent connected vehicles and driverless technology, and natural language processing is the core technology for human communication with AI, particularly text pre-processing methods and machine learning algorithms. At present, the most influential technology in industrial natural language processing is BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018; natural language processing technologies are expected to build on it over the next five years, making automotive AI a promising field for technological breakthroughs. This article builds on Google's BERT and compares it with currently popular natural language processing techniques, discussing model applicability and prediction accuracy under different feature selections and algorithm models. Natural language processing can be used in automotive interactive systems to map the driver's voice commands to precise functions, and can also provide appropriate feedback based on the driver's current emotional state. The purpose of this paper is to improve model fitting and prediction accuracy through experimental results and data pre-processing methods, and to lay a foundation for improving the driver's interactive experience.

Testing method
The experimental dataset contains over 650,000 English texts of social network user replies, each labeled with a sentiment classification item (Sentiment Classification). The sentiment labels are divided into 5 levels, and each text corresponds to one sentiment level. The experiment trains and compares several machine learning models on this dataset.

Programming language and integrated development environment
The programming language used in this article is Python 3.6. Experiments are conducted in the Jupyter Notebook integrated development environment, and integrated packages such as scikit-learn (sklearn) and SciPy are used to call the algorithm models. Text feature processing is divided into two approaches: N-gram word frequency and word embedding.

Text pre-processing
We divide the text into two feature groups according to Table 1 and process them separately for the different machine learning models: N-gram word-frequency statistics (TF-IDF) and word embedding (Word embedding). For both feature groups, all text in the training dataset is first structured. The dataset is extremely large and contains special symbols, abbreviations, and text in other languages, all of which would affect model fitting, so other languages are machine-translated during pre-processing to reduce the "noise" in the data. The text is then split into basic word elements (tokens) and converted to English lowercase. The text also contains a large number of stop words, conjunctions, and meaningless words (words of fewer than three letters); these appear frequently and would seriously affect model training, so all three types are removed. After these steps, clean English text can be extracted. In addition, tense changes of a single English word would cause its inflected forms to be counted separately in the term frequency-inverse document frequency statistics, so we perform stemming and lemmatization to group tokens with the same meaning. This completes the pre-processing stage.
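The pipeline above can be sketched in pure Python. The stop-word list and suffix rules below are illustrative stand-ins; a real pipeline would use a full stop-word list and a proper stemmer or lemmatizer (for example, from NLTK).

```python
import re

# Illustrative stop-word list (assumption); a real pipeline uses a full library list.
STOP_WORDS = {"the", "and", "but", "for", "with", "that", "this", "are", "was", "were"}

def preprocess(text):
    """Lowercase, tokenize, drop stop words and short tokens, then crude-stem."""
    # Keep only alphabetic tokens, lowercased; this also strips symbols and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    cleaned = []
    for tok in tokens:
        # Remove stop words and words shorter than three letters.
        if len(tok) < 3 or tok in STOP_WORDS:
            continue
        # Very crude suffix stripping as a stand-in for real stemming.
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        cleaned.append(tok)
    return cleaned

print(preprocess("The drivers were driving and braking suddenly!"))
# → ['driver', 'driv', 'brak', 'suddenly']
```

Note how the crude stemmer maps "drivers" and "driving" to different stems; a real stemmer or lemmatizer groups inflected forms more reliably, which is exactly why the experiment applies stemming and lemmatization.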

Feature extraction
After pre-processing the text content into lexical elements (tokens), features must be filtered and extracted. At present, the "bag-of-words model" is a very effective feature-construction method for text mining. Since trigrams and higher-order n-grams make up only a small proportion of English text, this experiment adds only unigrams and bigrams to the bag of words. The bag of words also contains very high- and very low-frequency words and therefore needs to be filtered: words with a document frequency above 95%, and low-frequency words below 0.001%, are removed. The resulting bag-of-words matrix has shape (650000, 36480).
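This filtering can be sketched with scikit-learn's TfidfVectorizer on a toy corpus. The corpus here is illustrative, and min_df is shown as a document count, since the 0.001% threshold only makes sense at the full dataset's scale.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus; the paper's dataset has ~650,000 texts.
corpus = [
    "traffic was heavy on the highway today",
    "the highway traffic cleared up today",
    "great music playing in the car today",
]

# Unigrams and bigrams only; max_df removes terms appearing in more than 95%
# of documents, min_df removes the rarest terms (a count here, a fraction at scale).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.95, min_df=1)
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (number of documents, vocabulary size after filtering)
print("today" in vectorizer.vocabulary_)  # appears in every document, so filtered out
```

On the full dataset the same settings produce the (650000, 36480) matrix described above.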

Model training-logistic regression applicability analysis and parameter setting
In the model training stage, a total of 5 models were selected to train on the bag-of-words matrix above: logistic regression, naive Bayes, LinearSVC, LSTM, and Text-CNN. In the logistic regression model, the model learns weights and a bias from the training set. Each weight wi multiplied by a feature xi represents the importance of that feature; the bias is also the intercept. At test time, given a new input x, the score z = w·x + b can be computed.
However, z is an unbounded real number rather than a probability in [0,1] (while the labels range over five levels). To generate a probability, z is passed to the sigmoid function σ(z), which maps the real line to [0,1] and makes it convenient to obtain the probability for a single label. In terms of parameter settings, solver='lbfgs' is set to handle the multinomial loss (Loss), and max_iter = 1000. In addition, to better evaluate model fit, a quarter of the training dataset was split off as a test set, and finally 10-fold cross-validation was performed on the training model.
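These settings can be sketched with scikit-learn, using synthetic data as a stand-in for the TF-IDF matrix (the real matrix is 650000 × 36480, with five sentiment classes).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 5-class data standing in for the TF-IDF features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

# A quarter of the data is held out as a test set, as in the experiment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# lbfgs handles the multinomial loss; max_iter is raised to 1000 for convergence.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X_train, y_train)

# 10-fold cross-validation on the training split, as described above.
scores = cross_val_score(clf, X_train, y_train, cv=10)
print(round(scores.mean(), 3))
```

The multinomial solver applies softmax over the five class scores internally, which generalizes the two-class sigmoid described above.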

Naive Bayes Applicability Analysis
The multinomial naive Bayes classifier is suitable for classification with discrete features. It considers the conditional probability of each feature dimension given a class and combines these probabilities to classify predictions, which reduces parameter dimensionality and thereby saves time and storage space. In the parameter settings, the additive (Laplace/Lidstone) smoothing parameter is set to alpha = 2; a quarter of the training dataset is then split off as a test set, and finally 10-fold cross-validation is performed.
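As a sketch, the multinomial model with alpha = 2 can be run on a tiny illustrative corpus (the real experiment additionally splits off a quarter of the data and runs 10-fold cross-validation on the full bag-of-words matrix).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the 650,000-text dataset; labels are sentiment classes.
texts = ["good great ride", "great fun drive", "good fun music",
         "bad awful trip", "awful bad noise", "bad noise trip"]
labels = [1, 1, 1, 0, 0, 0]

# Discrete term counts are exactly the feature type multinomial NB expects.
X = CountVectorizer().fit_transform(texts)

# Additive (Laplace/Lidstone) smoothing with alpha = 2, as in the experiment.
clf = MultinomialNB(alpha=2)
clf.fit(X, labels)

print(clf.predict(X))  # reproduces the labels on this cleanly separable toy corpus
```

Because the two toy classes use disjoint vocabularies, the smoothed class-conditional probabilities still separate them perfectly; on real data the smoothing mainly protects against zero counts for unseen words.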

LinearSVC applicability analysis
LinearSVC implements a "one-vs-rest" multi-class strategy to train multiple binary models. During training, the samples of one class are treated as one category and all remaining samples as the other. From the samples of k classes, k SVMs are constructed, and an unknown sample is assigned to the class whose classification function yields the maximum value.
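A sketch of this strategy on synthetic 5-class data (the feature matrix here is an illustrative stand-in for the TF-IDF features):

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 5-class data standing in for the TF-IDF features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

# LinearSVC trains one linear SVM per class (one-vs-rest): each separates
# its class from the remaining samples.
clf = LinearSVC()
clf.fit(X, y)

# decision_function returns one score per class; the prediction is the argmax.
print(clf.decision_function(X[:1]).shape)  # (1, 5): one score for each of 5 classes
```

The argmax over the five per-class scores is exactly the "maximum value of the classification function" rule described above.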

CNN applicability analysis
The neural network model uses pre-trained word embeddings as input; each word in a pre-trained document (Pretrain Embedding Document) is represented by a 300-dimensional vector. In addition, a BiLSTM model is used in this experiment to enhance model fit: BiLSTM can predict words from surrounding context, helping the model better understand the training text. Next, dropout and pooling are used to capture the important features and avoid overfitting, and the TF-IDF layer is concatenated with the input layer so that the training model fully activates this layer.
1. The first layer is the input layer, into which the word embeddings are fed. For N words with vector dimension K, the input matrix has size N*K.
2. The next layer performs convolutions on the word vectors with multiple filters; for example, each filter covers a region of (2, 3, 4) words at a time, and the results are aggregated into different feature maps.
3. The results of these feature maps (from the convolutional layer) are then merged into a one-dimensional feature vector. Finally, all layers are fully connected, with dropout and L2 regularization added, into the softmax layer.
4. Modifications to the basic Text-CNN model:
• Merge the pre-trained embedding vocabulary with the original training set.
• Add two more convolutional layers, apply batch normalization, and connect them to the softmax layer.
In terms of parameter settings, two convolutional layers are added to the CNN model. In the Convolution1D() function, the first parameter is the number of filters, the second is the kernel size, and the third is the padding mode. The first parameter, the dimensionality of the convolution's output filters, is set to 256, since the large volume of network embeddings allows more filters; after batch normalization and the "relu" activation function, the output is compressed to 128 dimensions. The second parameter, the length of the one-dimensional convolution window, is set to {3, 4, 5} as the feature-map lengths. The Dense output parameter is set to 6, i.e. (number of classes) + 1, calculated as 5 + 1 = 6.
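The modified Text-CNN described above can be sketched with Keras. The vocabulary size and sequence length below are illustrative assumptions; in the experiment the embedding layer would be initialized with the 300-dimensional pre-trained vectors.

```python
from tensorflow.keras import layers, models

VOCAB = 20000  # assumption: vocabulary size
EMB = 300      # 300-dimensional pre-trained word vectors
SEQ = 100      # assumption: padded sequence length

inp = layers.Input(shape=(SEQ,))
# In the experiment this layer is loaded with the pre-trained embedding weights.
x = layers.Embedding(VOCAB, EMB)(inp)

# Parallel convolutions with window lengths {3, 4, 5}; 256 filters each,
# batch-normalized and relu-activated, then max-pooled to one value per filter.
branches = []
for k in (3, 4, 5):
    b = layers.Conv1D(256, k, padding="valid")(x)
    b = layers.BatchNormalization()(b)
    b = layers.Activation("relu")(b)
    b = layers.GlobalMaxPooling1D()(b)
    branches.append(b)

x = layers.concatenate(branches)              # merged one-dimensional feature vector
x = layers.Dropout(0.5)(x)                    # dropout against overfitting
x = layers.Dense(128, activation="relu")(x)   # compressed to 128 dimensions
out = layers.Dense(6, activation="softmax")(x)  # Dense(6) output, as set above

model = models.Model(inp, out)
model.summary()
```

Each of the three convolution branches produces a feature map whose length depends on its window size, and global max pooling reduces each to a fixed-length vector so the branches can be concatenated.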

LSTM applicability analysis
The long short-term memory network (LSTM) is a deep learning method built on the recurrent neural network architecture. It is suited to the task of distinguishing sentences that contain the same words in different orders. In the LSTM model, the sequence, converted into one-hot vectors, is fed into the embedding layer, which learns weights from the order of words in each sentence. After the model aggregates the sequence, the output layer produces a label-length numerical vector through softmax. In the model tested in this article, both the text and the label values are converted into one-hot vectors. In terms of parameter settings, setting the first parameter of the LSTM() function to 100 and the second parameter to dropout=0.2 met the text-classification requirements.
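Under these settings, a minimal LSTM classifier can be sketched with Keras. The vocabulary size, embedding width, and sequence length are illustrative assumptions; the output width of 5 assumes one softmax unit per sentiment level.

```python
from tensorflow.keras import layers, models

VOCAB = 20000  # assumption: vocabulary size
SEQ = 100      # assumption: padded sequence length

model = models.Sequential([
    layers.Input(shape=(SEQ,)),
    # The embedding layer learns weights from the order of words in each sentence.
    layers.Embedding(VOCAB, 128),
    # First parameter 100 (units) and dropout=0.2, as set in the experiment.
    layers.LSTM(100, dropout=0.2),
    # Softmax over the 5 sentiment levels (one-hot labels).
    layers.Dense(5, activation="softmax"),
])
model.summary()
```

Because the LSTM consumes tokens sequentially, two sentences with the same bag of words but different word order produce different hidden states, which is what makes it suited to the order-sensitive task described above.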

Testing results
The experimental procedures and parameter settings in this article were all run on the same computer. Given the data volume and running time, the number of cross-validation folds was reduced; where computing performance permits, the number of validation folds should be increased to better verify model fit.

LSTM verification results
The model fit is 72.4%; the specific loss function and accuracy on the training and test sets are shown below. The accuracy of the LSTM reaches 70% on the training set, but only 62.3% on the test set. The reason for this gap between model accuracy and actual classification accuracy is excessive noise in the dataset.

CNN verification results
The CNN + Pretrain model, developed with reference to the BERT model, achieved the highest accuracy in this experiment. Its accuracy fluctuates with the pre-train embedding number (Pretrain Embedding Number), within the interval (66.06%-70.33%). When the number of pre-training embeddings is 4, its prediction accuracy is the highest (70.33%), but its training time is the longest of all models. Future training on larger text volumes would therefore impose higher hardware requirements, as shown below.

TF-IDF feature group
The actual accuracy rates of the three models (logistic regression, naive Bayes, and LinearSVC) using the TF-IDF feature group are 64.4%, 62.2%, and 65.7% respectively. Their overall performance is lower than that of the neural network models using word embeddings. The fundamental reason is that these models are single-level, with lower complexity than neural network learning. In addition, these machine learning models cannot anticipate repeated or suffix-varied vocabulary in the text, making their final prediction results lower overall.