ACR-SA: attention-based deep model through two-channel CNN and Bi-RNN for sentiment analysis

Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have been successfully applied to Natural Language Processing (NLP), especially in sentiment analysis. NLP can execute numerous functions to achieve significant results through RNN and CNN. Likewise, previous research shows that RNN achieved more meaningful results than CNN because it extracts long-term dependencies. Meanwhile, CNN has its own advantage: it can extract high-level features using its local fixed-size context at the input level. However, integrating these advantages into one network is challenging because of overfitting in training. Another problem with such models is that they weigh all features equally. To this end, we propose an attention-based sentiment analysis model using CNN and two independent bidirectional RNN networks to address the problems mentioned above and improve sentiment knowledge. Firstly, we apply a preprocessor to enhance the data quality by correcting spelling mistakes and removing noisy content. Secondly, our model utilizes CNN with max-pooling to extract contextual features and reduce feature dimensionality. Thirdly, two independent bidirectional RNNs, i.e., Long Short-Term Memory and Gated Recurrent Unit, are used to capture long-term dependencies. We also apply the attention mechanism to the RNN layer output to emphasize each word's attention level. Furthermore, Gaussian Noise and Dropout are applied as regularization to avoid the overfitting problem. Finally, we verify the model's robustness on four standard datasets. Compared with existing improvements on the most recent neural network models, the experiment results show that our model significantly outperformed the state-of-the-art models.


INTRODUCTION
Nowadays, with extensive amounts of user-generated text on social media, sentiment analysis (SA) has become essential for NLP (Naseem et al., 2020). The sentiments involved in social networks are indeed sources for modeling business strategies to achieve the business goal. Although the amount of data in social media repositories increases exponentially, the traditional algorithms often fail to extract the sentiments from such big data.
Numerous studies have been conducted for sentiment analysis using traditional classification based on manual feature engineering. Traditional methods do not present features that have massive weight. They usually focus on statistical and word indicators, such as term frequency-inverse document frequency (TF-IDF). Researchers recently started to use deep learning (DL) approaches based on distributed representations to deal with data specifications during training, feature engineering, and other problems encountered with traditional techniques. DL approaches have been implemented in multiple areas of NLP, such as family information extraction (Shi et al., 2019), behavior detection systems during the COVID-19 pandemic (Rosa et al., 2020), and emotion detection (Chen & Hao, 2020). These approaches have been shown to improve accuracy and decrease prediction time. Neural networks (NN) such as CNN and RNN accomplished comparable results for sentiment classification. Shi et al. (2019) proposed a model that improves aspect-based sentiment classification performance using CNN and can extract valuable features from text data. However, CNN needs to rely on several convolution layers to extract higher-level features for capturing long-term dependencies because of the locality of convolution and pooling (Zhang, Zhao & Lecun, 2015). RNN can overcome this issue because a single layer can handle long-term dependencies (Hassan & Mahmood, 2018). Long Short-Term Memory (LSTM) is an advanced RNN that uses three gates (input, forget, output). These gates help decide whether data in the previous state should be retained or forgotten in the current state. Therefore, LSTM can address long-term information preservation and the vanishing gradient problem. Similarly, the Gated Recurrent Unit (GRU) is a variant of LSTM that has only two gates, obtained by consolidating the input and forget gates.
Nevertheless, the basic LSTM and GRU only scan in one sequence direction.
Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU) are further developments that scan in both sequence directions and give simultaneous access to forward and backward contexts (Li et al., 2020a). Abid et al. (2019) proposed a joint architecture that first used BiLSTM and BiGRU to capture long-term dependencies and then used CNN to reduce the loss and extract local information. The method achieved reliable classification accuracy on various benchmark datasets. However, their model did not consider the differences in importance between different parts of sentences. Meanwhile, attention mechanisms have attracted researchers' interest. Studies showed that attention-based mechanisms positively impact several NLP tasks because they train the model to learn better and let the network know where to focus. Zhou et al. (2017) proposed a BiLSTM attention-based model to extract features with a crucial classification effect. In comparison, Yang et al. (2016) proposed a hierarchical RNN method to learn attention weights based on the local context. Although these neural network models have been successful in sentiment analysis, they often rely on a single attention mechanism to capture important input information. However, under different representation subspaces, the same words may express different levels of emotion in different places, so all information constitutes the overall meaning of the entire input sequence. Moreover, the input sequence is eventually reduced to a vector of a specific length no matter how long the input text is. Our previous work, an attention-based model that utilizes CNN with LSTM (ACL), addressed sentiment classification (Kamyab, Liu & Adjeisah, 2021).
In this model, CNN is used to extract contextual features and bidirectional LSTM to capture long-term dependencies. An attention mechanism is employed at the output of CNN to give different focus to the features. However, this model only works for short comments.
We propose an attention-based deep model using two-channel CNN and Bi-RNN for sentiment analysis (ACR-SA) to solve the existing issues. The model considers forward and backward contextual dependencies, extracts the most critical features, and pays attention to each token's importance. It is designed to detect the polarity of both long and short texts and to work for small and large datasets. In the proposed model, we employed GloVe (https://nlp.stanford.edu/projects/glove/) word embedding as input for the weights and lengths to feed the two parallel CNN networks. We used a zero-padding strategy to make the model suitable for both long and short input text, and Gaussian noise and Dropout were used as regularization in the input layer to combat overfitting. CNN is used to learn high-level feature context from the input representation. Two independent networks, BiLSTM and BiGRU, extract the contextual information from the features generated by the CNN layers and consider forward and backward contextual dependencies to perform the sentiment analysis task. Simultaneously, the attention mechanism is applied at the Bi-RNN output layers to pay suitable attention to different words. We selected the SD4A, Sentiment140, and US-Airline Twitter datasets for short sentences because they are challenging due to many misspellings, polysemy, and informal language, and IMDB for long sentences to show that the model is suitable for both short and long text sentiment. Social media users commonly post abbreviations and misspellings, and one spelling mistake may change the whole sentence's viewpoint. To address this problem, we apply a unique data preprocessing task that removes noise by performing spell correction, word segmentation, and tokenization. We experimented on a diverse set of accessible benchmark datasets to test our proposed model's effectiveness. The significant contributions of this article are as follows: This work proposes a novel deep model for sentiment analysis.
The model adopts the advantages of local feature extraction by CNN and the characteristics of BiLSTM and BiGRU, considering contextual information, which effectively improves sentiment knowledge and accuracy. We offer data preprocessing to acquire structured data for better outcomes. This process corrects spelling mistakes, removes noise, and performs tokenization, word lemmatization, and normalization. The attention mechanism is used to pay suitable attention to different words to improve the feature expression ability, and Gaussian Noise is applied to avoid overfitting.
We performed a comparative experiment on four public datasets to assess the proposed architecture's effectiveness on long and short sentences.
The remainder of the article is organized as follows. "Related Work" explains the related work. "Proposed Architecture" describes the architecture of the framework. "Experiments" provides the experiment setup and implementation, and "Results Analysis and Discussion" presents the results analysis and discussion. Finally, "Conclusion" presents the conclusion and future work.

RELATED WORK
With a large amount of user-generated text, SA has become an important part of NLP (Abdi et al., 2019). SA is the process of classifying text into different categories (e.g., positive or negative). SA can be categorized into three levels (Hussein, 2018; Yessenalina, Yue & Cardie, 2010): sentence level (Farra et al., 2010), document level (Yessenalina, Yue & Cardie, 2010), and aspect level (Hu & Liu, 2004; Wang et al., 2016b). Sentence-level SA aims to determine whether the opinions expressed in sentences are positive, negative, or neutral (Liu et al., 2014). Document-level SA is a text processing technique that distinguishes the sentiment polarity of a given document as a whole, assuming that the document expresses only a single topic. Supervised traditional machine learning classifiers such as NB, SVM, and maximum entropy (ME) are used for document-level sentiment classification on different features. For example, semantic features of movie reviews (Kennedy & Inkpen, 2006), unigram, bigram, POS-tag, and position information (Pang, Lee & Vaithyanathan, 2002), and discourse features for opinion polarity classification (Somasundaran et al., 2009) have been proposed. Most sentiment analysis techniques follow one of three methods (Sun, Luo & Chen, 2017): lexicon-based, machine learning, and hybrid approaches. These methods are expendable and straightforward. However, they have severe limitations, such as dependence on human labeling effort, being time-consuming, and limited effectiveness on conversational and unstructured social network text (Saeed, Ayaz Abbasi & Razzak, 2020; Saeed et al., 2019). To address the domain dependency problem, Ghiassi & Lee (2018) proposed an approach based on a transferable domain lexicon and supervised ML for Twitter data. In this method, a dynamic architecture for neural networks (DAN2) and SVM are used in a "one vs. all" approach for sentiment classification. Two ML tools were utilized to avoid domain dependency.
The method achieved accurate results and reduced the feature set to a small set of seven "meta-features" to reduce sparsity. Recently, researchers have used deep learning and achieved remarkable success in NLP and computer vision tasks such as network traffic analysis (Woźniak et al., 2021a). Woźniak et al. (2021b) developed a system to detect body movement using RNN and sensor readings. The evaluation of these methods shows strong performance, with an accuracy of 99%.

Deep models for sentiment analysis
Most techniques proposed for sentiment classification through DL are geared towards semantic modeling. CNN is generally known as a feed-forward neural network (Ju et al., 2019) that applies multiple filters through layer-to-layer convolution, which enables it to extract significant hidden features from datasets. It considers the relationship between global information and can extract these features (Liao et al., 2017).
Numerous neural networks have been widely studied for feature selection (Alayba et al., 2018; Kumar, Verma & Sharan, 2020), and Chen et al. (2019) suggested conducting further robust research using multiple neural networks to achieve better results. Rezaeinia et al. (2019) proposed Improved Word Vectors (IWV) to increase the accuracy of pre-trained word embeddings. This method combined GloVe, Part-of-Speech (POS), and lexicon-based approaches and was then tested with three CNN layers. Other variants of the CNN network for sentiment analysis applications include charCNN (Dos Santos & Gatti, 2014), Glove-DCNN (Jianqiang, Xiaolin & Xuejun, 2018), CRNN (Wang et al., 2016a), DNN (Dang, Moreno-García & De la Prieta, 2020), and many more. However, these models face vanishing gradient problems and do not consider long dependencies. The current study proposes a new DL model for polarity detection that builds on the ability of RNN and its variants to overcome this problem. Nguyen & Nguyen (2019) proposed a method based on CNN and RNN to capture local dependencies and memorize long-distance information. The authors integrated the advantages of various DL models to reduce overfitting in training. However, the accuracy of this model is not very high on larger datasets. Zang et al. (2020) utilized CNN and LSTM hybrid models to detect multidimensional features and time-series attributes, attained better results than before, and provided an idea for further correlation models. This idea had been applied by Yoon & Kim (2017) when they proposed a novel SA framework based on a new combination of CNN and BiLSTM in a statistical manner.
Similarly, Wang, Jiang & Luo (2016) combined CNN and RNN for SA, wherein the LSTM and CNN layers were arranged sequentially. The provided model outperformed previous results on the Stanford Sentiment Treebank dataset. Habimana et al. (2020) presented a novel technique that considers the contextual features of sentiment classification using a convolutional gated recurrent network (ACGRN) to prioritize the feature knowledge for improved sentiment detection. The experiments were conducted on six real-time datasets of different sizes and showed that the proposed approach performs significantly better. However, the authors suggested examining the proposed model on sequence text, which probably comprises long-term dependencies. An enriched form of RNN is LSTM, which is capable of automatically adding or removing the progressive state information (Liu, Mi & Li, 2018). LSTM enables models to control vanishing gradients when extracting features from long compound time series and sequences. Kumar, Verma & Sharan (2020) carried out an analytical study to solve aspect-based SA using a DL neural network (ATE-SPD) for sentiment polarity detection. Their proposed model solved that issue by considering the polarities of online reviews after sequence labeling. It used BiLSTM combined with a conditional random field (CRF), which is regarded as the best sequence labeling approach. The study confirmed that CRF enabled BiLSTM to improve the efficiency of the proposed model. Recently, numerous sentiment resources and linguistic information have been used in sentiment classification. SA tasks are perceived as sequential tasks that rely on a particular vector and cause text information loss if the vector length is too short. LSTM has been comprehensively studied in Alayba et al. (2018), Wang et al. (2016a), Chen & Wang (2018), Kaladevi & Thyagarajah (2019), Wang et al. (2020), and Alqushaibi et al. (2021) for sentiment polarity classification in combination with other neural networks, e.g., CNN.
However, the efficiency of the LSTM and CNN combination relies on the quantity and quality of datasets. Jin et al. (2020) proposed a multi-model named Multi-Task Learning Multi-Task Scale with CNN and LSTM (MTL-MSCNN-LSTM) for sentiment classification in a multi-tasking environment. According to the authors, the proposed model appropriately managed the vast amount of local and global features with various text scales to signify the sentiment's accuracy with F1 scores. However, the efficiency comparison with single-task learning was not worthy because multiclass sentiment utilizes the NLP process. Li et al. (2020b) proposed another method combining CNN-BiLSTM models to improve the sentiment knowledge by sentiment padding for each comment and review. It took advantage of the sentiment lexicon and integrated a parallel CNN and BiLSTM to enhance the polarity's sentiment information. Chatterjee et al. (2019) proposed a deep learning-based approach for emotion detection in textual dialogues called SS-BED. SS-BED uses two word-embedding matrices, and the output is fed into two LSTM layers to extract sentiment and semantic information for emotion recognition. However, besides their specific advantages and disadvantages, a general weakness of those models is that they cannot weigh the importance of each sentence differently. Recently, researchers have focused on attention mechanisms to solve this issue. Li et al. (2019) proposed a DNN approach with an attention mechanism for text sentiment classification. This model integrated sentiment linguistic knowledge into the DNN to learn sentiment-feature-enhanced word representations and decrease the gap between sentiment linguistic knowledge and DL methods. The CA-LSTM model (Feng et al., 2019) incorporates preceding tweets with word-level and tweet-level attention for context-aware microblog sentiment classification.
However, these models decode the input into a vector of a specific length. If the length of the defined vector is set too short or too long, information will be lost. To solve this issue, Li et al. (2020a) proposed a multichannel BiLSTM model with a self-attention mechanism (SAMF-BiLSTM). This model uses multichannel BiLSTM, and the outputs are fed into the self-attention mechanism to enhance the sentiment information. However, the authors stated that this model needs a redesigned mechanism for document-level classification, and they did not find a solution for long text. The attention-based CNN and BiLSTM model (AC-BiLSTM) (Liu & Guo, 2019) was proposed to address the high dimensionality and sparsity of text data in natural language processing. In this method, CNN is used to extract high-level phrase representations, and BiLSTM is applied to access the forward and backward context representations. An attention mechanism is employed at the output of BiLSTM to give different focus to the features. Even though their results are encouraging in terms of classification accuracy, this method did not consider overfitting problems, and the authors stated that this model needs improvement and redesign. The AttDR-2DCNN model helps address the open task of long-text document classification and discovers the dependencies among features. However, AttDR-2DCNN does not weigh the value of individual words in a sentence to enhance polarity classification. Basiri et al. (2021) proposed a model that used two LSTM and GRU layers to extract the context, with an attention mechanism applied to emphasize different words. The authors use multiple directions to affect the model's accuracy. However, neither method achieved satisfactory results for large datasets, such as Sentiment140. Our previous study proposed an attention-based CNN and BiLSTM model for sentiment analysis (Kamyab, Liu & Adjeisah, 2021).
In that model, CNN is employed to learn high-level feature context, and the attention mechanism is applied at the CNN output layers to pay suitable attention to different documents. BiLSTM extracts contextual information from the features generated by the previous layers. This model achieved significant accuracy on large datasets for short text. However, it still needs improvements. Our proposed model addresses the issues and weaknesses mentioned earlier, leveraging CNN, BiLSTM, and BiGRU.

PROPOSED ARCHITECTURE
Our proposed architecture is illustrated in Fig. 1. It comprises a distinctive data preprocessor, text representation layer, CNN and pooling layers, bidirectional RNN, an attention layer, fully-connected layer, and output layers. We apply word representation for spell check and noise removal during preprocessing. Glove word embedding represents the vector, while CNN extracts local features. Bi-RNN takes CNN feature-generated data as input and extracts the long dependencies feature as an output during training. The attention mechanism at the output layers of RNN and CNN pays suitable attention to different words. The following sections discuss each layer in detail.

Data pre-processor
The language used on social media is non-standard and informal, with many grammatical errors, acronyms, misspellings, and non-standard punctuation. Therefore, such text needs extensive processing. We designed a unique data preprocessing step to deal with these issues. It involves various steps such as deleting non-Unicode strings and non-English characters and replacing URLs and user handles. We also performed spelling correction, part-of-speech (POS) tagging, and word formation. Table 1 shows an example of data preprocessing applied to Twitter data. We employed GloVe with 300 dimensions for spell checking to correct misspelled words, as it is trained on 840 billion tokens from Common Crawl.
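As a rough illustration of the cleanup steps described above, the sketch below replaces URLs and user handles and strips non-ASCII characters and punctuation with regular expressions. The placeholder tokens `<url>` and `<user>`, the function name `clean_tweet`, and the small contraction table are illustrative assumptions, not details taken from the authors' preprocessor; spell correction and POS tagging are not shown.

```python
import re

# Hypothetical placeholder tokens; the paper does not specify them.
URL_TOKEN = "<url>"
USER_TOKEN = "<user>"

# Tiny illustrative contraction table (an assumption, not the authors' list).
CONTRACTIONS = {"didn't": "did not", "can't": "cannot", "i'm": "i am"}


def clean_tweet(text: str) -> str:
    """Lowercase, replace URLs and @handles, expand contractions, drop noise."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", URL_TOKEN, text)
    text = re.sub(r"@\w+", USER_TOKEN, text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop non-ASCII characters, then everything except letters, digits,
    # whitespace, and the angle brackets used by the placeholder tokens.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9<>\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("p.s. I didn't realize that you were sending me messages"))
```

On the sentence used as the Table 1 example, this sketch yields the same lowercased, punctuation-free form; a real pipeline would add a dictionary-based or embedding-based spell checker after this stage.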

Input layer
As stated, we used GloVe to represent words based on their similarities. The input layer receives text data as $T(t_1, t_2, \ldots, t_k)$, where $t_1, t_2, \ldots, t_k$ are the $k$ tokens and $d$ is the dimension of each input word. Hence, $\mathbb{R}^d$ is the space of each word or token; each word is converted into a word vector of dimension $d$, and the word vector matrix $T$ corresponding to a sentence of length $k$ can be obtained as:
$$T = [t_1, t_2, \ldots, t_k] \in \mathbb{R}^{k \times d} \quad (1)$$
We apply the zero-padding strategy since the input texts do not all have the same length. If the length is longer than the predefined uniform length $l$, the text is truncated to $l$; if the length is shorter than the predefined length, zero padding is added up to $l$.
Table 1 (example of data preprocessing). Original: "p.s. I didn't realize that you were sending me messages". Preprocessed: "ps i did not realize that you were sending me messages".
The input text matrix generated after unifying the length is given by Eq. (2), where the input text $T$ is passed to the GloVe embedding layer to generate the word embedding vectors from the corpus text. In addition, Gaussian Noise is applied after receiving the text representation from the embedding layer. The Gaussian Noise process $z \sim N(0, \sigma^2)$ can be used as a regularization method, making the model more robust and less prone to overfitting. Since Gaussian noise is applied directly to the word embedding, this process serves as a random data augmentation during training time.
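The padding and noise steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the helper names `pad_or_truncate` and `add_gaussian_noise` are hypothetical, padding is added on the right (the text does not say whether zeros go before or after the tokens), and the noise is applied to a single toy vector rather than a full embedding matrix.

```python
import random


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def add_gaussian_noise(vector, sigma, rng):
    """Add zero-mean noise z ~ N(0, sigma^2) to every component (training only)."""
    return [x + rng.gauss(0.0, sigma) for x in vector]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
short = pad_or_truncate([5, 9, 2], max_len=5)
long_ = pad_or_truncate([5, 9, 2, 7, 1, 8, 3], max_len=5)
noisy = add_gaussian_noise([0.1, -0.2, 0.3], sigma=0.3, rng=rng)
print(short, long_, noisy)
```

Because the noise is resampled on every call, each training pass sees a slightly different embedding for the same word, which is what makes it act as data augmentation; at inference time the noise would be switched off.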

Convolutional neural network
The two word representations are then fed to the two convolutional layers based on row representation. Assume the $i$th row of the input feature matrix corresponding to a single channel is $V_i \in \mathbb{R}^d$, where $V_i$ is the $d$-dimensional feature vector of the $i$th word. $V_{i:i+l-1} \in \mathbb{R}^{l \times d}$ represents the feature matrix composed of the $l$ words from the $i$th word to the $(i+l-1)$th word. The feature $h_i$ after extraction is then formulated as follows: where $f$ is the activation function for the non-linearity, and $h_i$ represents the $i$th feature value obtained by the convolutional operation on each sentence. $W$ represents the weight matrix, $b$ is the bias, and $z$ is the Gaussian noise. We have two channels; $h_{1:i}$ and $h_{2:i}$ are the two outputs of the convolutional layers in each channel. The features obtained from the CNN layers are stated as follows: Next, max-pooling is applied to each channel output $h_n$ to reduce the feature dimensionality, as shown in Eqs. (6) and (7).
$$p_n = \mathrm{Max}[h_n], \quad n = 1 \quad (6)$$
$$p_n = \mathrm{Max}[h_n], \quad n = 2 \quad (7)$$
where $p_n$, $n = 1$, is the feature map obtained after the max-pooling layer from the first convolutional kernel, and $p_n$, $n = 2$, is the feature map obtained after the max-pooling layer from the second convolutional kernel.
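To make the convolution and pooling step concrete, the sketch below applies one filter over windows of $l$ consecutive word vectors and then max-pools the resulting feature map. It assumes the common form $h_i = f(W \cdot V_{i:i+l-1} + b)$ with ReLU as $f$; the function names, the toy values, and the omission of the noise term $z$ are illustrative choices, not the authors' implementation.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def conv1d_feature_map(V, W, b):
    """Slide a window of len(W) rows over V; h_i = relu(sum(W * window) + b)."""
    l = len(W)
    features = []
    for i in range(len(V) - l + 1):
        total = b
        for row_w, row_v in zip(W, V[i:i + l]):
            total += sum(w * v for w, v in zip(row_w, row_v))
        features.append(relu(total))
    return features


def max_pool(features, pool_size=2):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    return [max(features[i:i + pool_size])
            for i in range(0, len(features) - pool_size + 1, pool_size)]


# Toy sentence: 5 words, embedding dimension d = 2.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, -1.0]]
W = [[1.0, 1.0], [1.0, -1.0]]  # one filter spanning l = 2 words
h = conv1d_feature_map(V, W, b=0.0)
p = max_pool(h, pool_size=2)
print(h, p)  # [0.0, 1.0, 2.0, 3.0] [1.0, 3.0]
```

A two-channel model would run this once per channel with separate filters, giving the two pooled maps $p_1$ and $p_2$.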

RNN layer
RNN is a dominant model for feature extraction in NLP tasks owing to its compound configuration. Studies show that RNN has been applied successfully to sequential data. It can use its internal memory to process variable-length sequences, making it applicable to investigating sentiment polarity (Hochreiter & Schmidhuber, 1997). To address the vanishing gradient problem, the output of the previous step is passed into two independent bidirectional RNNs.

Long short-term memory
LSTM is an enhanced RNN that can handle the vanishing gradient problem of RNN. LSTM is built on a gating mechanism that comprises three gates (input, forget, and output). The adaptive gating mechanism generates output from the current state, remembers it, and sends it to the next step as input each time. LSTM comprises a memory cell $C_t$ that maintains its state over arbitrary time intervals, and three nonlinear gates, the input gate $i_t$, the output gate $O_t$, and the forget gate $f_t$, which are responsible for the information flow to and from $C_t$. Equations (8)-(13) (Socher et al., 2013) present the function of the LSTM gates. Therein, $\sigma(\cdot)$ is the element-wise sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\odot$ is the element-wise product. Similarly, $X_t$ is the input vector while $h_t$ is the hidden vector at time $t$. Meanwhile, $U$ and $W$ are the weight matrices of the input and hidden vectors, and $b$ stands for the bias vector (Liu & Guo, 2019; Alom, Carminati & Ferrari, 2020).
The forget gate ($f_t$) is responsible for ignoring or forgetting information.
The input gate ($i_t$) decides what information should be stored in the memory cell according to Eqs. (9)-(11). The output gate, in turn, decides the output information, that is, how much and which parts of the cell state should be output, according to Eqs. (12) and (13).
In the LSTM network, these gate units help to remember important information over multiple time steps and to ignore information that is not required.
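For reference, the gate descriptions above match the standard LSTM formulation, which can be written as follows. This is a reconstruction under assumptions rather than the authors' exact notation: $U$ is taken to multiply the input $X_t$ and $W$ the previous hidden state $h_{t-1}$, and the equation numbers are inferred from the references in the surrounding text.

```latex
\begin{align}
f_t &= \sigma(U_f X_t + W_f h_{t-1} + b_f) \tag{8} \\
i_t &= \sigma(U_i X_t + W_i h_{t-1} + b_i) \tag{9} \\
\tilde{C}_t &= \tanh(U_c X_t + W_c h_{t-1} + b_c) \tag{10} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{11} \\
O_t &= \sigma(U_o X_t + W_o h_{t-1} + b_o) \tag{12} \\
h_t &= O_t \odot \tanh(C_t) \tag{13}
\end{align}
```

Read this way, Eq. (8) is the forget gate, Eqs. (9)-(11) cover the input gate, the candidate cell state, and the cell update, and Eqs. (12)-(13) give the output gate and the hidden state.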

Gated recurrent unit
GRU is a similar variant of LSTM but has only two gates, an update gate ($z_t$) and a reset gate ($r_t$). Together, these two gates control how information is updated in the state. The reset gate $r_t$ controls the contribution of the past state $h_{t-1}$ to the candidate state $\tilde{h}_t$; the smaller its value, the more of the past state is ignored. At time $t$, the reset gate works as follows.
The update gate $z_t$ determines how much of the information flow is kept or forgotten between time steps $t-1$ and $t$; the update gate is expressed as follows.
where $\sigma$ is the logistic sigmoid function, and $x_t$ and $h_{t-1}$ respectively denote the input and the previous hidden state. The GRU state at time $t$ is calculated as Eq. (16), and the candidate state $\tilde{h}_t$ is computed as Eq. (17).
As in LSTM, $W$ and $U$ are learnable weights and $b$ is the bias term, and $\odot$ denotes element-wise vector multiplication.
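The GRU update can be illustrated with a scalar cell, which keeps the arithmetic easy to follow. The sketch below is one common formulation and rests on assumptions: the parameter names in `params` are hypothetical, the state is a single number instead of a vector, and the final mix uses the convention $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (some references swap the roles of $z_t$ and $1 - z_t$).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU update for scalar input and state; p holds the weights."""
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])  # reset gate
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])  # update gate
    # Candidate state: the reset gate scales how much of h_prev is used.
    h_cand = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"])
    # Assumed convention: z_t close to 1 favours the new candidate.
    return (1.0 - z_t) * h_prev + z_t * h_cand


names = ("W_r", "U_r", "b_r", "W_z", "U_z", "b_z", "W_h", "U_h", "b_h")
params = {k: 0.5 for k in names}
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
print(h)
```

Since each new state is a convex combination of the previous state and a `tanh` output, the hidden state stays inside (-1, 1) when it starts at zero.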

Bidirectional RNN
This paper uses two independent bidirectional RNNs (LSTM and GRU) to consider both forward and backward features in parallel. The purpose of using two different RNNs is to get better results on both large and small datasets. BiGRU is applied to get richer context information, and GRU is somewhat faster and needs less data to generalize. BiLSTM, in contrast, generates final features by sequentially processing the map, leading to better results with large data. These two independent Bi-RNN networks are applied to form the feature context matrix representation from the vectors obtained from the two channels of the previous steps, with padding length or maximum feature length $m$.
We now obtain an annotation for each input vector by concatenating the forward and backward contexts, as in Eqs. (22) and (23).
where $h_t^{lstm}$ is the concatenated output of the forward ($\overrightarrow{h}_f^{lstm}$) and backward ($\overleftarrow{h}_b^{lstm}$) passes, which extract the long-dependency features from BiLSTM. Similarly, $h_t^{gru}$ is the concatenated output of the forward ($\overrightarrow{h}_f^{gru}$) and backward ($\overleftarrow{h}_b^{gru}$) passes, which extract the long-dependency features from BiGRU. The extracted features of the entire sentence are $[\overrightarrow{h}_f^{lstm}, \overleftarrow{h}_b^{lstm}]$ and $[\overrightarrow{h}_f^{gru}, \overleftarrow{h}_b^{gru}]$. In this way, the forward and backward contexts can be considered simultaneously.
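The bidirectional wrapping itself is independent of the cell type, which the following sketch tries to show. It runs the same recurrent step left to right and right to left and pairs the two states at each position. The cell `toy_step` is a deliberately simple stand-in (an assumption for illustration); in the model described here the step would be an LSTM or GRU update.

```python
import math


def run_rnn(xs, step):
    """Apply a recurrent step left to right and return every hidden state."""
    h, states = 0.0, []
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(xs, step):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(xs, step)
    backward = run_rnn(xs[::-1], step)[::-1]  # re-align to original order
    return [(f, b) for f, b in zip(forward, backward)]


def toy_step(x, h_prev):
    # Stand-in cell (an assumption): a real model would use LSTM or GRU here.
    return math.tanh(x + 0.5 * h_prev)


annotations = bidirectional([1.0, 0.0, -1.0], toy_step)
print(annotations)
```

Each tuple plays the role of the concatenated annotation for one word: its first entry summarizes the words up to that position and its second entry summarizes the words from that position to the end.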

Attention mechanism
Since different words contribute differently to the meaning of a sentence, we apply attention to each Bi-RNN generated feature to emphasize the meaning of each sentence. Each of the word annotations $h_t^{lstm}$ and $h_t^{gru}$ is passed through a one-layer perceptron to get $u_t^{lstm}$ as the hidden representation of $h_t^{lstm}$ for the $t$th input and $u_t^{gru}$ as the hidden representation of $h_t^{gru}$ for the $t$th input, formulated as follows: where $w$ represents the weight and $b$ is the bias. Then, to measure the importance of words, we normalize the hidden representation $u_t$ against a context vector $v_s$, which is randomly initialized and jointly learned during the training process, presented as: In Eqs. (26) and (27), $A_t^{lstm}$ and $A_t^{gru}$ represent the normalized weights of the LSTM and GRU layers respectively, $m$ is the number of words in the text, and $\exp(\cdot)$ is the exponential function. Then we concatenate the normalized weights $A_t$ obtained from LSTM and GRU, which can be expressed as Eq. (28). These importance weights $A_t$ are aggregated into $V$ to obtain a single vector, which can be denoted as Eq. (29).
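A small numeric sketch may help to follow the attention steps for one Bi-RNN branch. It assumes the usual form $u_t = \tanh(W h_t + b)$, a score $u_t^\top v_s$ normalized with a softmax over the $m$ words, and a weighted sum of the annotations as the pooled vector. The function name `attention_pool` and all numeric values are illustrative; in training, `W`, `b`, and the context vector `v` would be learned.

```python
import math


def attention_pool(H, W, b, v):
    """u_t = tanh(W h_t + b); score_t = u_t . v; a = softmax(score);
    returns (a, sum_t a_t * h_t)."""
    scores = []
    for h_t in H:
        u_t = [math.tanh(sum(w * x for w, x in zip(row, h_t)) + b_j)
               for row, b_j in zip(W, b)]
        scores.append(sum(u * c for u, c in zip(u_t, v)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(H[0])
    pooled = [sum(a * h_t[j] for a, h_t in zip(weights, H)) for j in range(dim)]
    return weights, pooled


H = [[0.1, 0.2], [0.9, -0.4], [0.3, 0.3]]  # three word annotations
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]  # context vector (learned in practice)
weights, pooled = attention_pool(H, W, b, v)
print(weights, pooled)
```

In the two-branch setting described above, this pooling would be applied separately to the BiLSTM and BiGRU annotations before the results are concatenated.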

Output layer
This layer performs the sentiment classification task using the merged feature layer, as illustrated in Fig. 1. The Sigmoid function is used for the binary output, while cross-entropy measures the discrepancy between the actual and predicted sentiments.
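The output computation can be sketched as a sigmoid followed by binary cross-entropy. The clipping constant `eps` and the toy logits below are assumptions made for illustration; deep learning frameworks provide equivalent, numerically safer built-ins.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y log p + (1 - y) log(1 - p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)
print(probs, loss)
```

The loss shrinks as the predicted probabilities move towards the true labels, which is the discrepancy the training procedure minimizes.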

EXPERIMENTS
This section discusses the data description, hyper-parameter settings, and baselines. The proposed model was implemented using data preprocessing, word representation, and joint deep learning networks with CNN and Bi-RNN. We test the ACR-SA model on the SD4A, Sentiment140, and US-Airline datasets as short text and on IMDB as a long text dataset to analyze our proposed model accurately. We conducted several experiments to evaluate the performance of our architecture. The model was trained for 50 epochs, and we compared our results with previous models.

Datasets
This study considers SD4A, Sentiment140, and US-Airline for short text and IMDB for long text. The acquisition of SD4A is also a significant contribution of this work. The remaining three datasets are existing benchmark datasets used in various studies, on which other models have reported substantial performance. Detailed statistics of the datasets are listed in Table 2.

Training procedure and hyper-parameters setting
The proposed model and selected baselines are implemented with the Keras deep learning API in Python. The models are executed on the Windows operating system with a 3.00 GHz processor and 16 GB RAM. We used grid search optimization to find the hyperparameters used in the experiments. For word embedding, we used GloVe with 200 dimensions. For the input layer matrix T, a vocabulary size of 90,000 words was applied to the dataset. In addition to traditional preprocessing, we used the multiprocessing API text_similarity_spellings function to predict a word in a sentence based on the context and correct spelling mistakes. Of the observations, 80% were used for training, 10% for validation, and 10% for testing, and the first 200 and 40 words were considered for long and short text, respectively, using the padding process. Gaussian Noise with a value of 0.3 and Dropout with a value of 0.4 were applied at the connection to the two-channel convolutional layer to avoid overfitting. We used 64 filters with a kernel size of 4 and ReLU as the activation function in the convolutional layers. The output of the convolutional layer is fed into max-pooling with pool size 2. Two independent Bi-RNNs (LSTM and GRU) were applied with size 256, followed by a dense layer of size 128 with ReLU as the activation function. The attention mechanism is applied to the features received from each bidirectional RNN layer, and the output is sent to the concatenation layer. After the dense layer, a Sigmoid function is used for binary classification. Finally, binary cross-entropy is used as the training loss, and Adam is used as the optimizer. Table 3 describes the hyper-parameter values in our model.
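The 80/10/10 partition described above can be sketched as follows. Shuffling before splitting and the fixed seed are assumptions, since the text does not state how the observations were ordered; the function name `split_80_10_10` is hypothetical.

```python
import random


def split_80_10_10(samples, seed=0):
    """Shuffle and split into train / validation / test (80% / 10% / 10%)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Integer arithmetic is used for the cut points so that the three parts always cover every sample exactly once, even when the dataset size is not a multiple of ten.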

Baseline methods
We evaluated on four datasets and compared our model against several recent, similar models developed for sentiment classification. The most current and significant baseline models that we compared our model with are as follows.
CRNN (Wang et al., 2016a): In the CRNN model, each sentence is considered a region, and a regional CNN is applied to the input word vectors. Max-pooling is used to reduce the dimensionality of the local features. Lastly, an LSTM layer is applied to capture long dependencies.
SS-BED (Chatterjee et al., 2019): In this model, two word-embedding matrices are used for word representation, and two LSTM layers are then applied to learn the semantic and sentiment knowledge. Finally, a fully connected layer and a hidden layer classify the emotion categories.
ARC (Wen & Li, 2018): A well-known model for word representation, in which the input is passed to an RNN layer whose results are fed into a CNN layer for applying the attention mechanism.
ABCDM (Basiri et al., 2021): This is an attention-based bidirectional CNN-RNN model. The model combines two independent bidirectional RNN layers for extracting the past and future features. CNN layers are applied to the output of the RNN layers to reduce the dimensionality.
Improved Word Vectors (IWV) (Rezaeinia et al., 2019): This model integrates three CNN layers, one max-pooling layer, and a fully connected layer for sentiment analysis. The authors combined Part-of-Speech (POS), lexicon-based approaches, and Word2Vec methods for vector improvement.
AC-BiLSTM (Liu & Guo, 2019): This model starts with a CNN layer with different filter sizes for higher-level phrase extraction. The output of the CNN layer is fed into a BiLSTM to access the preceding and succeeding context terms, followed by an attention mechanism.
Word embedding-CNN (Alharbi & de Doncker, 2019; Dang, Moreno-García & De la Prieta, 2020): In this study, the authors propose a CNN architecture that takes user behavior into account. This model consists of five parts: input, convolution, pooling, activation, and softmax layers.
Glove-RNN-CNN (Abid et al., 2019): This method was proposed for Twitter datasets by merging RNN variants and CNN. The GloVe pre-trained word embedding was employed as input, which was fed into RNN layers and afterwards connected to a CNN. Although this method has yielded significant results for small datasets, it is not suitable for large datasets and long text.
ACL (Kamyab, Liu & Adjeisah, 2021): In this model, a CNN captures contextual features, and a BiLSTM extracts long-term dependencies. An attention mechanism is employed at the output of the CNN to give different focus to the features. The GloVe pre-trained word embedding was used for word representation.

RESULTS ANALYSIS AND DISCUSSION
In this section, we compare the results of our model with those of the baseline methods. The experiment was executed on four datasets to evaluate the model's accuracy, recall, precision, and F1 value. In addition, we visualize the impact of the attention mechanism on the word weight matrix and the effect of the normalization layer on decreasing overfitting. The experimental results show that the proposed model (ACR-SA) performed better and achieved higher accuracy than the other SOTA methods on all four datasets, as can be seen from Tables 4-7, in which the bold entries indicate the highest performance.
Table 4 gives an overall view of the Sentiment140 Twitter data and shows that the given models achieved accuracies from 84.8% to 90.13%. Our proposed model achieved the highest accuracy of 90.13%. In contrast, the highest baseline accuracy, 87.06%, is achieved by ACL (Kamyab, Liu & Adjeisah, 2021). Relative to the baseline models, ACR-SA improves F1, precision, and recall by 2.02%, 1.66%, and 1.52% on the negative class and by 7.01%, 5.21%, and 5.36% on the positive class.
Table 5 lists the accuracies obtained by the different models on the US-Airline dataset. Our proposed model reaches an accuracy of 98.42% on this dataset. Among the baseline methods, ACL (Kamyab, Liu & Adjeisah, 2021) has the highest accuracy. Meanwhile, the ACR-SA model improves F1, precision, and recall by 9.41%, 9%, and 8.15% in the positive class and by 2.94%, 1.91%, and 1.93% in the negative class, respectively, over ABCDM (Basiri et al., 2021).
Table 6 presents the evaluation of our proposed model on the SD4A dataset. The ACR-SA model reached 95.39% accuracy, an improvement over all baseline models. ACL (Kamyab, Liu & Adjeisah, 2021) and Glove-BiGRU-CNN (Abid et al., 2019) achieved the highest accuracies among the selected baseline models, 94.53% and 94.5% respectively, so our proposed model improves on the best baseline by 0.86%. Similarly, the ACR-SA model improves F1, precision, and recall by 0.53%, 0.93%, and 0.8% in the positive class and by 0.77%, 0.79%, and 0.95% in the negative class, respectively, over the highest values achieved by the baselines.
Table 7 compares our proposed model with the selected baseline models on the IMDB dataset. The ACR-SA model achieved 89.98% accuracy. Meanwhile, the ABCDM (Basiri et al., 2021) method achieved an accuracy of 88.40%, which is the highest among the chosen baselines. Our proposed model improves on the ABCDM model by 1.58% and on the ACL model by 3.38%. Comparison of F1, precision, and recall shows that our model improves by 1.77%, 0.71%, and 2.97% respectively in the positive class, and by 1.78% in F1 and 3.12% in precision in the negative class. Only precision in the positive class did not improve.
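For reference, the per-class precision, recall, and F1 values compared in Tables 4-7 follow the standard definitions, which can be sketched as below. The labels and predictions are toy values chosen for illustration, not data from the paper, and the function names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1), treating `label` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy example: 1 = positive sentiment, 0 = negative sentiment
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

pos = per_class_metrics(y_true, y_pred, label=1)
neg = per_class_metrics(y_true, y_pred, label=0)
acc = accuracy(y_true, y_pred)
print("positive:", pos, "negative:", neg, "accuracy:", acc)
```

Reporting the two classes separately, as the tables do, makes it visible when a gain in accuracy comes mostly from one polarity.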
To summarize the performance of ACR-SA compared to the baselines, our proposed model outperformed them in accuracy and in the confusion-matrix metrics (F1, precision, and recall) on all four datasets. Among the baseline methods, ABCDM (Basiri et al., 2021) performs well on long text. However, this model does not work as well for short sentences because its first feature extraction layer is an RNN, which is designed to capture long dependencies. Similarly, Glove-Bi-RNN-CNN (Abid et al., 2019) performs well on small Twitter datasets, but it does not use an attention mechanism. In addition, ACL (Kamyab, Liu & Adjeisah, 2021) achieved sound performance on short text. Nevertheless, that model did not address the co-occurrence of long dependencies. The ACR-SA model solves this problem by using multiple layers, which makes it suitable for big datasets. A CNN is used as the first feature extraction layer of the model, which is suitable for short sentences. The zero-padding process and the two independent bidirectional RNNs make the model appropriate for long text.
For simplicity, Fig. 2 summarizes the accuracy of the proposed model and the aforementioned neural networks on all datasets. It shows that the proposed ACR-SA model attained the highest accuracy with our structured dataset. Compared with the baselines, our proposed model achieved an accuracy improvement of 4.41% on the US-Airline dataset and 3.07% on the Sentiment140 dataset. Similarly, on the SD4A and IMDB datasets the accuracy improved by 0.86% and 1.58%, respectively, while the overall average accuracy improvement is 2.86%.
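The per-dataset gaps can be re-derived from the accuracies quoted in the text, as in the short sketch below. Only the figures stated above are used. For US-Airline the best baseline accuracy is not quoted, so the reported gap is taken as given.

```python
# Accuracies (%) quoted in the text for ACR-SA and the best baseline per dataset
acr_sa = {"Sentiment140": 90.13, "SD4A": 95.39, "IMDB": 89.98}
best_baseline = {"Sentiment140": 87.06, "SD4A": 94.53, "IMDB": 88.40}

# Gap in percentage points, rounded to two decimals to absorb float noise
gaps = {name: round(acr_sa[name] - best_baseline[name], 2) for name in acr_sa}

# US-Airline: only the gap itself is reported, not the baseline accuracy
gaps["US-Airline"] = 4.41

mean_gap = round(sum(gaps.values()) / len(gaps), 2)
print(gaps, mean_gap)
```

The unweighted mean of these four gaps comes to 2.48 points, so the 2.86% average quoted above was presumably computed on a different basis that the text does not spell out.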
To evaluate the performance of the attention mechanism in our model, we selected a few sentences for visualization, as shown in Figs. 3A and 3B. The depth of the color represents the weight of the corresponding word in the sentence: the darker the color, the larger the weight of the given token. In Fig. 3A, the first sentence is positive. The importance weights of "stay," "safe," and "healthy" are larger than those of the other words in the same tweet. In the first tweet of Fig. 3B, the color of "attacks" is deeper than that of the other words, which indicates that it carries more negative semantics than they do.
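How such per-token weights arise can be illustrated with a minimal sketch, assuming the common formulation in which raw attention scores are normalized with a softmax and used to form a weighted sum of the hidden states. The scores and the two-dimensional hidden states below are made-up values, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, scores):
    """Return attention weights and the weighted sum of the hidden states."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


tokens = ["stay", "safe", "and", "healthy"]
scores = [2.0, 2.5, 0.1, 2.2]                               # hypothetical raw scores
hidden = [[0.1, 0.3], [0.4, 0.2], [0.0, 0.1], [0.5, 0.6]]   # hypothetical Bi-RNN outputs

weights, context = attend(hidden, scores)
for token, weight in zip(tokens, weights):
    print(f"{token:>8s}  {weight:.3f}")
```

In a heat map such as Fig. 3, each weight would set the color depth of its token, so a function word like "and" appears lighter than the sentiment-bearing words.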
We plotted the receiver operating characteristic (ROC) curves of the baseline models and the proposed model on the test set, which is 10% of the data, to interpret the predictions of the models, as shown in Figs. 4A-4D. The area under the curve (AUC) is compared to evaluate each model's performance. The results show that the proposed model performs better than all baseline models. The improvements mainly benefit from preprocessing, two-channel CNN layers for local feature extraction, two independent Bi-RNNs that consider both directions of long dependencies, the application of different normalization, and finally, attention weights assigned according to the importance of each word. As discussed in the literature (Nguyen & Nguyen, 2019; Li et al., 2020a), DL models encounter overfitting challenges that reduce the model's accuracy and performance. In this work, we address this challenge by utilizing Gaussian Noise (GN) on the different datasets. The experiments were conducted on the training and validation datasets for 50 epochs. After about 20 epochs, the training accuracy is higher than that of the validation or test dataset, while in the initial epochs the test accuracy is slightly higher. Our proposed ACR-SA model converged toward the optimal solution after the mentioned number of epochs with consistent accuracy. These analyses provide evidence that the ACR-SA network reduces the overfitting problem and attains adequate accuracy.
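As a reminder of what the AUC comparison measures, the sketch below computes it directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are toy values and the function name roc_auc is hypothetical. Library routines normally compute the same quantity more efficiently.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy example: sigmoid outputs of a hypothetical classifier on six test tweets
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]

auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.3f}")
```

Because the AUC depends only on how the scores rank the examples, it is unaffected by the 0.5 decision threshold used to report accuracy.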

CONCLUSION
Deep learning models, especially CNN and RNN, are widely used for text classification. However, previous studies suffer from low accuracy and overfitting, which need to be tackled for meaningful knowledge extraction. We proposed a novel Attention-based deep model through two-channel CNN and Bi-RNN for Sentiment Analysis (ACR-SA). The model combines data processing techniques, word representation, and DL techniques, including attention-based mechanisms. Data processing is applied to handle social media data challenges, such as spelling mistakes and noisy content, which harm model training accuracy. ACR-SA used pre-trained word embedding for word representation to create a vector representation for each sentence. Further, different DL models were combined to extract higher-level features, capture long-term dependencies, and generate sentiment analysis knowledge. Two-channel CNN layers were used to extract contextual features, with max-pooling at the CNN layers' output to decrease the dimensionality of the feature space. Gaussian Noise and Dropout were used on the input layer to overcome the overfitting issues. The two independent bidirectional RNN (LSTM and GRU) networks were utilized to extract temporal features and update the past and future sentiment representation of the CNN layers' output. The attention mechanism was applied at the end of the BiLSTM and BiGRU layers to put more or less attention on different words, which made the sentiment analysis more informative. Finally, a fully connected dense layer with ReLU activation and a sigmoid function transform the vector into a sentiment polarity classification. Experiments were conducted on three datasets of short-text English tweets and one dataset of long movie reviews to analyze the performance of the proposed model for long and short text. Comparisons with SOTA baseline methods demonstrate that ACR-SA is more effective and efficient in classifying and comprehending semantics in both short and long texts. Our proposed model achieved accuracies of 95.39%, 98.42%, 90.13%, and 89.98% on the SD4A, US-Airline, Sentiment140, and IMDB datasets, respectively. Overall, the average accuracy improvement is 2.86% compared to the results of the baseline methods. Future work mainly includes the following parts: (1) expanding the model to include other languages such as Persian; (2) using other word embedding methods to improve our model.

ADDITIONAL INFORMATION AND DECLARATIONS
Funding
This work was supported by the Innovation and Development of Shanghai Industrial Internet (Grant No. XX-GYHL-01-19-2527). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.