Deep Learning Multimodal for Unstructured and Semi-Structured Textual Documents Classification

Due to the availability of a huge number of electronic text documents from a variety of sources representing unstructured and semi-structured information, document classification has become an important area for managing data. This paper presents a document classification multimodal for categorizing textual semi-structured and unstructured documents. The multimodal implements several individual deep learning models: Deep Neural Networks (DNN), Recurrent Convolutional Neural Networks (RCNN) and Bidirectional LSTM (Bi-LSTM). A stacked-ensemble meta-model combines the results of the individual classifiers to produce better results than any of these models achieves individually. A series of textual preprocessing steps normalizes the input corpus, followed by text vectorization using Term Frequency-Inverse Document Frequency (TFIDF) or Continuous Bag of Words (CBOW) to convert the text into a numeric form that deep learning models can process. The proposed model is validated on a dataset collected from several sources with a large number of documents in every class, and the experimental results indicate that it performs effectively. For PDF document classification, the proposed model achieves accuracy up to 0.9045 and 0.959 with the TFIDF and CBOW features, respectively. For JSON document classification, it achieves accuracy up to 0.914 and 0.956 with the TFIDF and CBOW features, respectively. For XML document classification, it achieves accuracy up to 0.92 and 0.959 with the TFIDF and CBOW features, respectively.


Introduction
Due to the wide variety of document types circulating over the internet and used in a large range of applications, identifying the type of a document is a critical task for classification models, since it simplifies further operations. Textual semi-structured and unstructured documents differ in nature in many ways, including the structure of the textual representation, the degree of ambiguity, the degree of redundancy, the use of punctuation symbols and the use of idioms and metaphors [1]. Therefore, intensive preprocessing steps are required to obtain acceptable classification results with textual representation techniques.
In addition, document classification is the process of managing large volumes of documents effectively by assigning each document to a specific class from a set of predefined classes. Formally, let $D = \{d_1, d_2, \ldots, d_n\}$ be the set of all $n$ documents and $C = \{c_1, c_2, \ldots, c_m\}$ the set of $m$ predefined classes [2]. The document classification task can then be modeled as a function $f : D \to C$ that assigns each document $d_i$ to a specific class $c_j$. Furthermore, it brings together various fields, including Natural Language Processing (NLP), machine learning and information retrieval, to classify textual resources [3].
The main issue with the classification of text documents relates to the great diversity in the nature of documents, which require special kinds of processing. Although there has been a growing body of work using DL approaches to handle this issue, most of these approaches are designed for a certain type of data, while others ignore the relationships between data that affect the expressive power of the extracted features. Thus, there is a need for a generic approach to textual document classification across a wide range of data types with a variety of complex structures. Therefore, this paper aims to develop an automatic document classification model for categorizing semi-structured and unstructured textual resources using Deep Learning (DL) techniques based on various text vectorization techniques. Tokenization and various text normalization techniques are used at the preprocessing level. TF-IDF and CBOW are used at the feature level. DNN, RCNN and Bi-LSTM are used at the classification level.
The remainder of this paper is organized as follows: Section 2 highlights and summarizes the related literature. Section 3 discusses the proposed approach in detail. Section 4 presents the experimental results. Finally, Section 5 presents the conclusions.

Document Classification Approaches
Document classification has two main approaches: manual and automatic classification. The first approach is both expensive and time consuming. However, it gives the user great control over the process. The user identifies the relationships between documents and handles the classification issues. The second approach, on the other hand, yields faster and more objective classification. It applies content-based matching of one or more predefined categories to documents. In addition, automatic document classification can be accomplished using one of the following three classification models: supervised, unsupervised and rule-based classification.
First, in supervised learning classification, the model is trained on a small set of labeled input-output sample documents, in order to generalize the categorization task and derive classification rules that accurately classify newly arriving documents.
Second, in unsupervised learning classification, patterns are discovered and documents are categorized based on similar words and phrases. The most similar documents are the ones that have the most attributes in common.
Third, in rule-based classification, a set of linguistic rules that define the relationships between the input dataset and the associated categories is formulated and parsed. It is most suitable for predicting data containing a mixture of numerical and qualitative features. Moreover, it is very accurate for small document sets, where the classification results are always based on the predefined rules. However, the task of defining rules can be tedious for large document sets with many categories.

Related Work
In this sub-section, the researchers highlight previous studies in the areas of research related to the classification process, including feature representation and vectorization, individual classification and multimodal classification.

Feature Representation and Vectorization
Huang et al. [20] have presented a statistical feature representation method that extracts the most descriptive terms in a document. It also assesses the importance of the word through counting the number of times it occurs in each document and assigning it to the feature space. This method ignores the semantic values of the words and word relationships in each sentence. Therefore, it leads to poor similarity results.
In addition, Melamud et al. [21] have presented context2vec neural architecture which uses word2vec's CBOW architecture with a major enhancement achieved through implementing bidirectional LSTM instead of its native context modeling. This model is an unsupervised approach that handles embedding procedures based on large corpora and produces high quality word representation to learn a generic embedding function for variable length contexts.
Yang et al. [22] have also improved feature representation by capturing the semantic and syntactic relations among words and providing rich dictionary resources that can cover all aspects of NLP tasks. This model generates both definitions and example sentences of target words. The experimental results show that the model achieves high performance on both the definition modeling and usage modeling tasks. Nevertheless, it still needs further enhancement to generate more meaningful example sentences.

Individual Deep Learning Classifiers
Yao et al. [23] have proposed a Graph Convolutional Network (GCN) method for text classification. It is used to achieve strong classification performance with a small proportion of labeled documents, interpretable words and document node embeddings. This model consists of a knowledge graph, where each node refers to an object category and the input is represented as word embeddings of nodes for predicting the class. It also uses a single GCN layer with a larger neighborhood, which includes both one-hop and multi-hop nodes in the graph, to overcome oversmoothing. However, this method is weak with regard to learning representations from large-scale unlabeled text data.
Moreover, Naqvi et al. [24] have developed a Roman Urdu news headline classifier based on different individual machine learning techniques, namely Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), to classify news into relevant categories on which further analysis and modeling can be done. First, the news dataset is collected using scraping tools. Then, a phonetic algorithm is used to control lexical variation and test news from different websites. The experimental results show that the MNB classifier achieves the best accuracy among the mentioned classifiers.
Yoon [25] has proposed a convolutional neural network model for sentence classification. This model uses a single convolution layer after extracting word embeddings for the tokens in the input sequence. It has achieved acceptable results on multiple benchmarks using several variants of hyperparameter tuning and static vectors, compared to other DL models that utilize complex pooling schemes.
Furthermore, Zhang et al. [26] have implemented character-level convolutional networks (ConvNets) for text classification. This model encodes characters using a one-hot encoding scheme, which converts each categorical entry in the dataset into columns of zeros and ones based on the number of categories. These encoded characters are fed as inputs to a deep learning architecture with multiple convolution layers. This model shows that character-level convolutional networks achieve competitive results on large-scale datasets.

Multimodal Deep Learning Classifiers
Zulqarnain et al. [27] have proposed a classification model based on a combination of Gated Recurrent Unit (GRU) and Support Vector Machine (SVM). They have replaced the Softmax activation function in the output layer of the GRU with an SVM. This model has achieved remarkable results, particularly when storage is limited. It has also overcome the issues of vanishing and exploding gradients.
Haralabopoulos et al. [28] have proposed an automated sentiment classification model used to categorize human-generated content. This model consists of several multi-label DNN classification architectures and two ensembles. The first architecture is a simple CNN with fully connected layers. The second architecture integrates a Gated Recurrent Unit (GRU) with a convolution layer. The third architecture implements TFIDF and a DNN with three fully connected layers. This model makes the best use of these architectures to improve classification results without hyperparameter tuning or data overfitting.
Kowsari et al. [29] have also proposed a classification model called Random Multimodal Deep Learning (RMDL) that combines standard DL architectures in order to develop robust and accurate architectures for classification tasks. Their model is based on three architectures: CNN, RNN and DNN. The output is generated by a majority vote over the outputs of these architectures. The results demonstrate the effectiveness of this model.
Moreover, Ding et al. [30] have proposed a model with a multi-layer RNN called Densely Connected Bidirectional LSTM (DC-Bi-LSTM) for text classification. It uses LSTM to encode the input sequence. In each layer, the hidden states are represented as a reading memory. This model improves on the traditional Bi-LSTM, achieves high performance and improves information flow in large tasks. The researchers also expect that performance may improve if a dense Bi-LSTM module is used in place of the Bi-LSTM encoder.
Furthermore, Wang et al. [31] have proposed a classification model based on a combination of the Dynamic Semantic Representation model and the Deep Neural Network model (DSRM-DNN). First, it builds a model to capture the context of words and selects semantic words dynamically, where each word's attribute is assigned a weight so that it can be quantified. Second, it feeds these features to a text classifier composed of a deep belief network and a back-propagation neural network. This model improves the speed and accuracy of text classification, taking into consideration the value of low-frequency words and new words.
In addition, Cireşan et al. [32] have proposed a multi-model neural network classifier composed of multi-column deep neural networks that combine DNN and Convolutional Neural Network (CNN) architectures. Moreover, the CNN strengthens the DNN max-pooling layer by using feed-forward networks with convolutional layers to include local and global pooling layers and, hence, improve the classification results.

The Proposed Model
The proposed supervised automatic document classification model is adopted to categorize semi-structured and unstructured textual documents using DL techniques. It is composed of three successive stages: textual data preprocessing, text vectorization and document classification. Fig. 1 shows the proposed framework.

Textual Data Preprocessing
Once the data is imported from the corpus, it is automatically preprocessed to make it suitable as input to the classification model. Textual data preprocessing involves two basic steps: text tokenization and text normalization. Algorithm 1 illustrates the tasks to be completed during preprocessing.
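The two preprocessing steps can be illustrated with a minimal Python sketch. Since Algorithm 1 is not reproduced here, the concrete normalization rules below (lowercasing, stop-word removal, dropping single-character tokens) and the tiny `STOP_WORDS` list are illustrative assumptions rather than the researchers' exact procedure.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Split raw text into tokens made of letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(tokens):
    """Lowercase tokens, then drop stop words and single-character tokens."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS and len(t) > 1]

def preprocess(text):
    """Tokenization followed by normalization."""
    return normalize(tokenize(text))

# A JSON-like fragment: punctuation and markup characters are discarded.
print(preprocess('{"title": "The Price of Oil is Rising"}'))
# -> ['title', 'price', 'oil', 'rising']
```

The same function can be applied unchanged to XML or plain text extracted from PDF, because the tokenizer keeps only alphanumeric runs.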

Text Vectorization
To convert the text into a numeric form that DL techniques can process, the TFIDF and CBOW models are used to map the raw text data to numeric vectors.

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a numerical statistic that measures the importance of a word to a textual document in a corpus (i.e., dataset) [15]. It also acts as a weighting factor in information retrieval and text mining. The higher the TF-IDF value is, the more important the word is to the document.
The TF-IDF scheme assigns a weight to each term in a document depending on both its Term Frequency (TF) and its Inverse Document Frequency (IDF). The weight is obtained by multiplying the two values, as given in Eq. (1):
\[ w_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i} \tag{1} \]
where $w_{i,j}$ is the TF-IDF value of word $i$ in document $j$. TF is the ratio of the number of times a word occurs in a document to the total number of words in the document, which can be obtained by Eq. (2):
\[ \mathrm{TF}_{i,j} = \frac{f_{i,j}}{n_j} \tag{2} \]
where $f_{i,j}$ is the frequency of word $i$ in document $j$ and $n_j$ is the total number of words in document $j$.
IDF acts as a measure of how much information the word provides. It is calculated via Eq. (3):
\[ \mathrm{IDF}_{i} = \log \frac{|D|}{|\{d \in D : i \in d\}|} \tag{3} \]
where $|D|$ is the total number of documents and $|\{d \in D : i \in d\}|$ is the number of documents containing the word $i$; if this count is zero, the denominator becomes $1 + |\{d \in D : i \in d\}|$.
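As a worked illustration of Eqs. (1)-(3), the following sketch computes TF, IDF and their product for a toy corpus. The natural logarithm and the three-document corpus are assumptions made for the example; the source does not state the logarithm base.

```python
import math
from collections import Counter

def tf(term, doc):
    """Eq. (2): occurrences of the term divided by the document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Eq. (3): log of corpus size over the number of documents containing the term."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        df = 1 + df  # zero-count case described after Eq. (3)
    return math.log(len(corpus) / df)  # natural log is an assumption

def tfidf(term, doc, corpus):
    """Eq. (1): product of TF and IDF."""
    return tf(term, doc) * idf(term, corpus)

# Hypothetical, already preprocessed documents.
corpus = [
    ["oil", "price", "rises"],
    ["oil", "exports", "fall"],
    ["json", "schema", "title"],
]

# "oil" occurs in two of three documents, "json" in only one,
# so "json" receives the higher weight in its document.
print(tfidf("oil", corpus[0], corpus))
print(tfidf("json", corpus[2], corpus))
```

A word shared by many documents gets a small IDF and therefore a small weight, which is why frequent but uninformative terms contribute little to the feature vector.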

Continuous Bag-of-Words (CBOW) Model
CBOW is a predictive DL model that maps words to vectors in order to find word embeddings that capture contextual and semantic similarities [18]. Let $W = \{w_{i-n}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+n}\}$ be a window of words centered on $w_i$. CBOW tries to predict the target word given its surrounding context words. It can be modeled as $f : X \to Y$, where $Y = w_i$ represents the target word and $X = W \setminus \{w_i\}$ represents the surrounding context words.
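The mapping from a window of words to a (context, target) training pair can be sketched as follows. The function name `cbow_pairs` and the toy sentence are hypothetical; the sketch only builds the training pairs and does not train an embedding.

```python
def cbow_pairs(tokens, n):
    """Return (context, target) pairs with up to n words on each side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - n):i]     # w_{i-n} .. w_{i-1}
        right = tokens[i + 1:i + 1 + n]    # w_{i+1} .. w_{i+n}
        pairs.append((left + right, target))
    return pairs

for context, target in cbow_pairs(["the", "oil", "price", "rises"], n=1):
    print(context, "->", target)
```

Windows at the start and end of the sequence are simply truncated here; padding them instead is an equally common choice.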

Textual Documents Categorization
This paper builds an effective document classification multimodal to categorize textual documents from a large corpus. This multimodal is a stacked ensemble combination of several individual DL techniques: DNN, RCNN and Bi-LSTM. Fig. 2 shows the structure of the proposed classification multimodal.

Deep Neural Network (DNN)
DNN architectures are feed-forward multilayer architectures. The researchers' implementation of the DNN is a discriminatively trained model that uses ReLU as its activation function. The input is a sequence of word embedding features. The output layer has as many neurons as there are classes and uses the Softmax function.
The input (500 × 50) is generated by an embedding vectorization layer and passed through five consecutive hidden levels with 512 nodes each. Each hidden level is composed of a dropout layer and a dense layer. A dense layer performs a matrix-vector multiplication with trainable parameters followed by the ReLU activation function, as given in Eq. (4):
\[ f(x) = \max(0, x) \tag{4} \]
A dropout layer randomly sets its inputs to zero with a given probability during training. Finally, an output layer of size 3 produces the multi-class classification using softmax as the activation function, as stated in Eq. (5):
\[ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \ldots, K \tag{5} \]
where $K$ is the number of classes.
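To make the roles of the dense, dropout and output layers concrete, here is a small pure-Python sketch of one hidden level followed by a three-class softmax output. The weights are hypothetical hand-picked values, the layer sizes are far smaller than the 512-node layers described above, and the rescaling used in `dropout` (inverted dropout) is an assumption about the implementation.

```python
import math
import random

def relu(x):
    """ReLU activation of Eq. (4), applied element-wise."""
    return [max(0.0, v) for v in x]

def softmax(z):
    """Softmax of Eq. (5); subtracting the maximum improves numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Matrix-vector product plus bias; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def dropout(x, rate, rng):
    """Zero each unit with probability rate and rescale the survivors (training only)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

# Hypothetical toy parameters: 2 inputs -> 2 hidden units -> 3 classes.
x = [1.0, -2.0]
hidden = relu(dense(x, [[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]))
hidden = dropout(hidden, rate=0.0, rng=random.Random(0))  # rate 0.0 keeps every unit
probs = softmax(dense(hidden, [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [0.0, 0.0, 0.0]))
print(hidden, probs)
```

At inference time the dropout step is skipped, so the prediction is simply the class with the largest softmax probability.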

Recurrent Convolutional Neural Network (RCNN)
This technique is a combination of RNN and CNN in order to capture the contextual information with the recurrent structure and to construct the representation of the text using the CNN technique.
The input (500 × 50) is generated by an embedding vectorization layer and passed to the hidden layer that combines the CNN and RNN techniques. The CNN consists of four consecutive convolution layers (4-Conv1D) with 256 filters and a kernel size of 2, using the ReLU activation function, followed by four consecutive max-pooling levels (4-MaxPooling1D). The RNN consists of four consecutive LSTM levels (4-LSTM) with 256 nodes each, whose output is passed to two dense layers using the ReLU activation function. After that, the output is generated using Eq. (5).
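The effect of the convolution and pooling levels on the sequence length can be traced with a few lines of Python. The sketch assumes that each Conv1D is immediately followed by a MaxPooling1D, that convolutions use stride 1 without padding, and that the pool size is 2; none of these settings is stated in the text, so the numbers are illustrative.

```python
def conv1d_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size

def trace_cnn_branch(length=500, blocks=4, kernel_size=2, pool_size=2, filters=256):
    """Return the (length, channels) shape after each Conv1D + MaxPooling1D block."""
    shapes = []
    for _ in range(blocks):
        length = conv1d_length(length, kernel_size)
        length = maxpool1d_length(length, pool_size)
        shapes.append((length, filters))
    return shapes

print(trace_cnn_branch())
# -> [(249, 256), (124, 256), (61, 256), (30, 256)]
```

Under these assumptions the final feature map is 30 × 256, that is 7680 values once flattened, which coincides with the 30 × 256 and 7680 figures quoted for the Bi-LSTM branch.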

Bidirectional-LSTM
Bidirectional LSTMs (Bi-LSTMs) are an extension of typical LSTMs that are intended to enhance the performance of the classification model. Bi-LSTMs train two LSTMs instead of one on the input sequence. The first processes the input sequence in forward order, while the second processes it in reverse order. The idea behind this technique is to make the forward state responsible for the positive time direction and the backward state responsible for the opposite direction.
The input (500 × 50) is generated by an embedding vectorization layer and passed to the bidirectional layer. The bidirectional layer uses 100 memory cells in parallel in both LSTMs to generate an output 30 data points wide and 256 data points high. Next, the time-distributed layer generates an output of the same shape, 30 × 256. This output is passed to the flatten layer, which produces an output of 7680 points (30 × 256) that is finally fed to the dense layer to find the closest output class.
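The forward and backward passes can be illustrated with a deliberately simplified recurrent cell. The scalar tanh cell below stands in for a full LSTM (it has no gates or cell state), and the weights are hypothetical; the point is only to show how the two directions are run, re-aligned, concatenated and flattened.

```python
import math

def run_rnn(sequence, w_in, w_rec):
    """Simplified scalar recurrent cell (tanh), used here in place of an LSTM."""
    h, states = 0.0, []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional(sequence, w_in=0.5, w_rec=0.5):
    """Pair each time step's forward state with its backward state."""
    forward = run_rnn(sequence, w_in, w_rec)
    # Run on the reversed sequence, then reverse again to re-align with the input order.
    backward = run_rnn(sequence[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))

def flatten(rows):
    """Collapse a 2D output into a single feature vector."""
    return [v for row in rows for v in row]

states = bidirectional([1.0, 0.0, -1.0])
print(states)
print(len(flatten(states)))  # 3 time steps x 2 directions
```

In the same way, flattening the 30 × 256 output described above gives 30 * 256 = 7680 values for the final dense layer.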

Stacked Ensemble Technique
This technique combines a set of previously trained models (DNN, RCNN and Bi-LSTM) and merges their outputs with the concatenation function to generate the final classification outcome [33].
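A minimal sketch of the stacking step is given below: the class probabilities produced by the three base models are concatenated into one meta-feature vector, which a meta-model then maps to the final class. The probability values and the meta-model weights are hypothetical; in a real stacked ensemble the meta-model would be trained on held-out predictions of the base models.

```python
LABELS = ["PDF", "JSON", "XML"]

def concatenate(*prob_vectors):
    """Merge the base models' outputs into a single meta-feature vector."""
    merged = []
    for probs in prob_vectors:
        merged.extend(probs)
    return merged

def meta_predict(meta_features, weights, bias, labels):
    """Linear meta-model: score each class and return the best label."""
    scores = [sum(w * x for w, x in zip(row, meta_features)) + b
              for row, b in zip(weights, bias)]
    return labels[scores.index(max(scores))]

# Hypothetical base-model outputs for one document.
dnn, rcnn, bilstm = [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1]
features = concatenate(dnn, rcnn, bilstm)

# Hypothetical meta-weights: each row sums one class's probability over the three models.
weights = [
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
]
print(meta_predict(features, weights, [0.0, 0.0, 0.0], LABELS))  # -> JSON
```

Here the DNN alone would predict PDF, but the combined evidence of the three models favors JSON, which illustrates how the ensemble can correct an individual classifier.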

Dataset Description
The training set consists of three textual classes: XML, JSON and PDF documents that are collected by web-crawling different websites. A total of 50,000 documents are randomly picked and allocated for the JSON and XML classes, taken from the following websites: https://catalog.data.gov/dataset?res_format=JSON and https://www.sba.gov/sites/default/files/data.json. For XML and JSON requests, an internal logger is used that collects 100,000 such requests. Additionally, for the PDF class, the dataset consists of 11,228 newswires from Reuters labeled over 46 topics.

Evaluation Metrics
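Multiple performance and evaluation criteria are used to assess the improvement of the proposed model in comparison with other existing models. Precision [34] acts as the Positive Predictive Value (PPV), as stated in Eq. (6):
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{6} \]
Recall is the proportion of actual positives that are correctly identified, as stated in Eq. (7):
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{7} \]
where $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives and false negatives, respectively.
F-measure [34] is calculated as the harmonic mean of precision and recall, as illustrated in Eq. (8):
\[ F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8} \]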
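The three metrics can be computed per class from the true and predicted labels, as in the following sketch. The label lists are hypothetical toy data, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F-measure for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels for five documents.
y_true = ["PDF", "PDF", "JSON", "XML", "PDF"]
y_pred = ["PDF", "JSON", "JSON", "XML", "PDF"]

for label in ("PDF", "JSON", "XML"):
    print(label, precision_recall_f1(y_true, y_pred, label))
```

For the PDF class in this toy example, two of three PDF documents are recovered with no false positives, giving a precision of 1.0, a recall of about 0.667 and an F-measure of 0.8.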

Experiments
In this section, a series of experiments is conducted to evaluate the performance of the researchers' revised individual classifiers and the results of the proposed combined document classification multimodal.

Experimental Results of DNN Model
Tabs. 1-3 illustrate the precision, recall and F-measure of the experimental results of the individual DNN model for predicting PDF, JSON and XML documents, respectively. These results are based on the researchers' suggested hyperparameters, which include the number of epochs, the learning rate, the batch size and the number of hidden layers. First, Tab. 1 illustrates the classification results for predicting PDF documents using the TFIDF and CBOW text vectorization techniques. Second, Tab. 2 demonstrates the classification results for predicting JSON documents using the TFIDF and CBOW text vectorization techniques. Finally, Tab. 3 shows the classification results for predicting XML documents using the TFIDF and CBOW text vectorization techniques.

Experimental Results of Bi-LSTM Model
Tabs. 7-9 demonstrate the precision, recall and F-measure of the experimental results of the individual Bi-LSTM model for predicting PDF, JSON and XML documents, respectively. These results are based on the researchers' suggested hyperparameters, which include different numbers of epochs, element vectors, batch size values and numbers of hidden layers. Tab. 7 illustrates the classification results for predicting PDF documents using the TFIDF and CBOW text vectorization techniques. Furthermore, Tab. 8 shows the classification results for predicting JSON documents using the TFIDF and CBOW text vectorization techniques. Finally, Tab. 9 presents the classification results for predicting XML documents using the TFIDF and CBOW text vectorization techniques.

Experimental Results of the Proposed Document Classification Multimodal
In addition, Tab. 10 illustrates the precision, recall and F-measure of the classification results of the document classification multimodal for the unstructured PDF class, the semi-structured JSON class and the semi-structured XML class using the TFIDF and CBOW text vectorization techniques. The results indicate that the proposed multimodal based on the stacked ensemble technique gives better results than any of the individual models.
The high results found by the study are attributed to the proposed technique, which combines the RNN and CNN techniques and makes use of the advantages of both. It captures the contextual information with the recurrent structure. Moreover, it constructs the representation of the text using the CNN and the bidirectional neural networks, which make the forward state responsible for the positive time direction and the backward state responsible for the opposite direction.
Finally, the researchers have used the stacked ensemble technique to combine the set of trained models. The outputs of the previously trained models are merged with the concatenation function to generate the final classification outcome. Prior to that, the researchers performed feature extraction using Word2Vec and TF-IDF Word2Vec to capture both the position of the words in the text (syntax) and the meaning of the words (semantics). Accordingly, based on the results above, Word2Vec yields the best outcomes.

Conclusion
The classification task is an important issue in machine learning, given the growing number and size of datasets that need sophisticated classification. Therefore, the researchers have proposed an automatic document classification multimodal for categorizing multi-typed textual documents. The proposed multimodal combines three individual classifiers, DNN, RCNN and Bi-LSTM, based on the stacked ensemble technique. The purpose of adopting this multimodal is to make managing and sorting textual documents easier. This is especially useful for publishers, financial institutions, insurance companies or any industry that deals with large amounts of content. Moreover, the proposed automatic document classification model achieves a significant reduction in the time spent on manual data entry, in costs and in the turnaround time for document processing. Additionally, it yields accurate, efficient and more objective classification, since it applies semantic classification based on deep learning. Furthermore, the evaluation results show that the combination of the models and the parallel learning architecture consistently results in higher accuracy than that obtained with conventional approaches and individual deep learning models.
Finally, the researchers aim in future studies to strengthen the feature extraction and representation stage by using the GloVe technique. Moreover, the researchers intend to extend the feature level by embedding multivariate analysis and a dimensionality reduction technique to specify which subspace the data approximately lies in and to find uncorrelated features. In addition, the researchers plan to develop a test data generative model for an automated testing tool and embed the proposed automatic classification model as a preliminary part of the generative model to classify different kinds of documents before generating the test data for each type.