Iterative Deep Neighborhood: A Deep Learning Model Which Involves Both Input Data Points and Their Neighbors

Deep learning models, such as deep convolutional neural networks and deep long short-term memory models, have achieved great success in many pattern classification applications over shallow machine learning models with hand-crafted features. The main reason is the ability of deep learning models to automatically extract hierarchical features from massive data by multiple layers of neurons. However, in many other situations, existing deep learning models still cannot obtain satisfactory results due to the limitation of the inputs of the models. The existing deep learning models only take the data instances of an input point and completely ignore the other data points in the dataset, which potentially provide critical insight for the classification of the given input. To fill this gap, in this paper, we show that the neighboring data points, in addition to the input data point itself, can boost the deep learning model's performance significantly, and we design a novel deep learning model which takes both the data instances of an input point and its neighbors' classification responses as inputs. In addition, we develop an iterative algorithm which alternately updates the neighbors of the data points, according to the deep representations output by the deep learning model, and the parameters of the deep learning model. The proposed algorithm, named "Iterative Deep Neighborhood (IDN)," shows its advantages over state-of-the-art deep learning models on tasks such as image classification, text sentiment analysis, and property price trend prediction.


Background.
Deep learning has been proven to be a powerful tool for pattern classification problems and sensor studies [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. A deep learning model usually has more than three layers, and by using multiple layers, the model extracts hierarchical features from the original data. In this way, abstract features can be generated by the high-level layers. The deep learning model can relieve the burden of feature engineering by learning effective features automatically.
The feature engineering process of traditional machine learning models relies heavily on the domain knowledge of feature engineers; it is time consuming, and the designed features cannot be generalized to other domains. In contrast, deep learning can automatically find the features which are relevant to the learning problem and store the features in the neural units of multiple layers. For example, (i) In the problem of face recognition, deep learning models, especially the deep convolutional neural network (CNN), have been popular for extracting high-level features of individuals [16,17]. The deep CNN model is composed of multiple convolutional layers and max-pooling layers. The low-level convolutional layers use a sliding window and a group of filters to extract simple local features from the facial images; the filters are based on simple local patterns such as circles, squares, and edges. The middle-level convolutional layers take the outputs of the low-level convolutional layers and extract the patterns of parts of faces, such as the eyes, nose, and mouth. Finally, in the high-level layers, the patterns of individual faces are generated. In this way, the key features of the faces of individuals are extracted. (ii) Meanwhile, for the problem of text categorization, deep learning models also play a key role. The most popular deep learning model is the long short-term memory (LSTM) model [18,19]. A deep LSTM model is composed of multiple LSTM layers, and each layer processes the input sequence by using a sliding neural unit. The input to the first layer is the sequence of tokens (represented by word-embedding vectors). The sliding neural unit slides over the sequence, takes both the current instance and a memory of the previous instance as inputs, and outputs both a response for the next layer and a memory vector for the next instance.
For the deep LSTM model, the low-level layers extract features from the tokens, the middle layers extract features for the phrases, the high-level layers extract features for the sentences/texts, and the final layers represent the text by semantic features.

Motivation.
The existing deep learning models have achieved great success in problems from different areas, such as computer vision, natural language processing, speech processing, signal processing, and bioinformatics. However, up to now, most of the existing deep learning models hardly achieve satisfactory performance in many other machine learning applications, due to a strong limitation on the input of the model. A traditional deep learning model only takes the sequence of instances of the input data point as input and ignores the other data points in the dataset. The assumption behind this design is that the data instances of an input data point are sufficient to predict its class label. However, in real-world applications, the input data instances themselves are often not sufficient for the prediction. For example, in the problem of sentiment analysis, given an input sentence (data point) "My paper has been accepted by the journal," it is difficult for the deep neural network to decide whether the sentiment is positive or negative only from the tokens (instances).
To solve this problem, we propose to leverage the neighbors of the input sentence and use their information to help decide the sentiment of the input sentence. As an example, we find a similar neighboring sentence in the data set, "Congratulations! Your paper is good enough to be accepted by the journal," and then use it to help the prediction for the input sentence. We design a model that takes both the input sentence and the response of the neighboring sentence to make the prediction. Since the neighboring sentence contains positive tokens such as "Congratulations" and "good", the model gives a strong response to the positive sentiment. When taking this positive sentiment of the neighboring sentence into account, the model can further decide that the input sentence has a positive sentiment. The idea of using neighbors' responses to enhance the prediction ability of the deep learning model is illustrated in Figure 1(a). As a comparison, we also show the traditional structure of a deep learning model in Figure 1(b).

Our Contributions.
In this paper, our contributions are given as three parts: (i) First, we propose a new deep learning model for the pattern classification problem. The key difference between our model and the traditional deep learning model is the input structure. The traditional deep learning model only takes the input instances of an input data point, but our model takes both the input data point and its neighboring points (specifically, the classification responses of the neighbors) as the inputs of the model. Loia et al. [23] developed an effective local learning method by merging a local weighted regression model and a fuzzy transform method (F-transform). The F-transform plays the role of a reduction method for the cardinality of the learning problem. Our local learning method is inspired by [23], but we solve a different problem of learning neighbors from a deep CNN model. (iv) We evaluate the proposed joint deep learning and neighbor learning algorithm over several benchmark data sets, and the results show its advantage over state-of-the-art deep learning algorithms. We also study different properties of the proposed algorithm experimentally over the benchmark data sets and show the stability of the proposed method.

Paper Organization.
The rest of this paper is organized as follows. In Section 2, we introduce the proposed deep learning model and its learning process. In Section 3, we evaluate the proposed algorithm by comparing it to state-of-the-art deep learning models and studying its properties experimentally. In Section 4, we conclude the paper and discuss potential future work.

The Proposed Method
In this section, we introduce the proposed deep learning model, which takes both the input instances of a data point and the classification map of its neighboring data points as input. We first introduce the inputs of the model, then the model structure, and finally the method to learn the parameters of the model.

Model Inputs.
We assume we have a training set of $n$ data points, and we deal with a multiclass classification problem of $K$ classes. The training set is denoted as $\{X_i\}_{i=1}^{n}$. To classify one data point in the training set, $X_i$, the inputs of the model include two types of data as follows:

(i) Instance sequence: the first type of input is the instances of the data point itself as follows:

$$X_i = \left(x_{i1}, \ldots, x_{i|X_i|}\right), \tag{1}$$

where $x_{il}$ is the feature vector of the $l$-th instance of the $i$-th data point, and $|X_i|$ is the length of the sequence.

(ii) Neighborhood classification map: the second type of input is the neighborhood of $X_i$ and the classification map of the neighborhood. The neighborhood dataset of $X_i$ is denoted as $N_i$. To obtain the classification map of $N_i$, we first calculate the classification responses of the data points in $N_i$ for each class, then apply a classwise max-pooling operation, and finally concatenate the $K$ maximum responses for the $K$ classes. We denote the classification response of a data point $X_j \in N_i$ regarding the $k$-th class as $p_{jk} \in [0, 1]$, and the classification map of $N_i$ is given as

$$p_i = \left[\max_{j: X_j \in N_i} p_{j1}, \ldots, \max_{j: X_j \in N_i} p_{jK}\right]^\top \in \mathbb{R}^K, \tag{2}$$

where $\max_{j: X_j \in N_i} p_{jk}$ is the max-pooling result of the classification responses over $N_i$ regarding the $k$-th class. The calculation of the classification responses $p_{jk}$ will be introduced in the following sections.
For each input data point $X_i$, according to the above description, we have two inputs as follows: $(x_{i1}, \ldots, x_{i|X_i|})$ and $p_i$.
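The classwise max-pooling that builds the neighborhood classification map can be illustrated with a minimal NumPy sketch. The function name `neighborhood_classification_map`, the toy response matrix, and the chosen neighbor indices are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def neighborhood_classification_map(responses, neighbor_idx):
    """Classwise max-pooling over the neighbors' classification responses.

    responses:    (n, K) array, responses[j, k] plays the role of p_jk in [0, 1]
    neighbor_idx: indices j of the data points that form the neighborhood N_i
    returns:      (K,) vector, the classification map p_i
    """
    responses = np.asarray(responses, dtype=float)
    # Select the rows of the neighbors, then take the maximum per class (column).
    return responses[np.asarray(neighbor_idx)].max(axis=0)

# Toy example (illustrative values): 3 data points, K = 3 classes.
responses = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])
# Assume the neighborhood of data point 0 consists of data points 1 and 2.
p_0 = neighborhood_classification_map(responses, [1, 2])
print(p_0)
```

Each entry of the result is the strongest response for that class among the neighbors, so a single confident neighbor is enough to raise the corresponding entry of the map.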

Model Structure.
The overall framework of our model is shown in Figure 2. This model is composed of a CNN model, denoted as $f$, one concatenation layer, one full-connection layer, and one softmax nonlinear transformation layer. The functions of these layers and the flow of the data in the model are introduced as follows:

(i) The input sequence of instances is first transformed to a $d$-dimensional vector, $z_i \in \mathbb{R}^d$, by the CNN model, $f$, composed of three convolutional layers and two max-pooling layers:

$$z_i = f(X_i). \tag{3}$$

(ii) Then, $z_i$ is concatenated with the neighborhood classification map vector, $p_i \in \mathbb{R}^K$, by the concatenation layer. The concatenated vector is denoted as

$$\left[z_i^\top, p_i^\top\right]^\top \in \mathbb{R}^{d+K}. \tag{4}$$

(iii) The concatenated vector is further reduced to a $K$-dimensional vector by the full-connection layer, whose connection weight matrix is $W = [w_1, \ldots, w_K] \in \mathbb{R}^{(d+K) \times K}$, where $w_k$ is its $k$-th column corresponding to the $k$-th class. The outputs of the full-connection layer are calculated as

$$w_k^\top \left[z_i^\top, p_i^\top\right]^\top, \quad k = 1, \ldots, K. \tag{5}$$

(iv) Finally, the outputs of the full-connection layer are normalized to probabilities over the $K$ classes by the softmax activation layer, and the outputs are calculated as follows:

$$p_{ik} = \frac{\exp\left(w_k^\top \left[z_i^\top, p_i^\top\right]^\top\right)}{\sum_{k'=1}^{K} \exp\left(w_{k'}^\top \left[z_i^\top, p_i^\top\right]^\top\right)}, \quad y_i = \left[p_{i1}, \ldots, p_{iK}\right]^\top, \tag{6}$$

where $p_{ik}$ is the probability of $X_i$ belonging to the $k$-th class and $y_i$ is the output vector of the model. To decide the class of the given data point, $X_i$, we choose the class with the largest probability:

$$\arg\max_{k \in \{1, \ldots, K\}} p_{ik}. \tag{7}$$

Model Parameter Learning.
Learning problem modeling: in our model, there are two groups of parameters, which are the parameters of the CNN model, $f$, and the connection weight matrix $W$. To learn the parameters to fit the training data, we build a unified learning framework. In this learning framework, we propose to measure the classification error by the cross-entropy loss function and measure the complexity of the model by the squared $\ell_2$ norm of the parameters. Moreover, we propose to minimize the classification error to improve the classification performance and to minimize the complexity of the model to reduce the overfitting risk.
The learning problem is modeled as a minimization problem, and the objective of the problem is given as follows:

$$\sum_{i=1}^{n} \ell_i + C_2 \left(\|W\|_2^2 + \|f\|_2^2\right), \tag{8}$$

where $\ell_i$ is the cross-entropy loss function for the $i$-th data point, $\|W\|_2^2$ is the squared $\ell_2$ norm of $W$, $\|f\|_2^2$ is the squared $\ell_2$ norm of the filters of the CNN model $f$, and $C_2$ is the tradeoff parameter between the classification error term and the $\ell_2$ norm regularization term. The minimization problem is given as follows to learn the optimal parameters, $W^*$ and $f^*$, over the training set:

$$\min_{W, f, \{z_i\}_{i=1}^{n}} \; \sum_{i=1}^{n} \ell_i + C_2 \left(\|W\|_2^2 + \|f\|_2^2\right), \quad \text{s.t. } z_i = f(X_i), \; i = 1, \ldots, n. \tag{10}$$

Please note that in our learning problem, we explicitly introduce a slack variable, the convolutional representation vector of the CNN model, $z_i$, for each data point and constrain it to be equal to the output of the CNN model, $f(X_i)$.

Problem optimization: it is difficult to solve the problem in (10) directly, because the classification map $p_i$ itself is a function of $W$, $f$, and $\{p_j\}_{j: X_j \in N_i}$ according to (2) and (5):

$$[p_i]_k = \max_{j: X_j \in N_i} \frac{\exp\left(w_k^\top \left[f(X_j)^\top, p_j^\top\right]^\top\right)}{\sum_{k'=1}^{K} \exp\left(w_{k'}^\top \left[f(X_j)^\top, p_j^\top\right]^\top\right)}, \quad k = 1, \ldots, K, \tag{11}$$

where $[p_i]_k$ is the $k$-th entry of $p_i$. Moreover, the parameters $W$ and $f$ are coupled. Thus, we adopt the EM algorithm to solve this problem. In an iterative algorithm, the parameters and the neighborhood classification map vector of each data point are updated alternately. In the M-step, we fix the classification map vectors $\{p_i\}_{i=1}^{n}$ and update $W$ and $f$ by minimizing the objective, while in the E-step, we fix the parameters $W$ and $f$ to update the neighborhoods and the neighborhood classification map vectors. The E-step and M-step are introduced as follows:

(i) E-step: in the $t$-th iteration, we first use the CNN model $f^{t-1}$ learned in the previous iteration to update the convolutional representation vector of $X_i$:

$$z_i^{t} = f^{t-1}(X_i), \quad i = 1, \ldots, n.$$

Then, we use the convolutional representation vectors of the data points to update the neighborhood of each data point.
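The neighborhood of each data point is collected as its $k$ nearest neighbors according to the $\ell_2$ norm distance between the convolutional representation vectors, that is, $N_i^{t}$ contains the data points $X_j$ with the $k$ smallest distances $\|z_i^{t} - z_j^{t}\|_2$. Then, we use the updated neighborhoods, $\{N_i^{t}\}_{i=1}^{n}$, the updated convolutional representation vectors, $\{z_i^{t}\}_{i=1}^{n}$, the classification map vectors of the previous iteration, $\{p_i^{t-1}\}_{i=1}^{n}$, and the connection weight matrix of the full-connection layer, $W^{t-1}$, to update the classification map vectors of the current iteration according to (11):

$$[p_i^{t}]_k = \max_{j: X_j \in N_i^{t}} \frac{\exp\left((w_k^{t-1})^\top \left[(z_j^{t})^\top, (p_j^{t-1})^\top\right]^\top\right)}{\sum_{k'=1}^{K} \exp\left((w_{k'}^{t-1})^\top \left[(z_j^{t})^\top, (p_j^{t-1})^\top\right]^\top\right)}, \quad k = 1, \ldots, K.$$

(ii) M-step: in this step, we fix the classification map vectors of the previous iteration, $\{p_i^{t-1}\}_{i=1}^{n}$, and minimize the problem of (10) to obtain the solution of $W$ and $f$ for the $t$-th iteration:

$$\left(W^{t}, f^{t}\right) = \arg\min_{W, f} \; \sum_{i=1}^{n} \ell_i + C_2 \left(\|W\|_2^2 + \|f\|_2^2\right), \quad \text{s.t. } z_i = f(X_i), \; i = 1, \ldots, n,$$

where $\ell_i$ is the cross-entropy loss of the $i$-th data point evaluated with the fixed $p_i^{t-1}$. To solve this problem, we use the ADMM method. For each constraint $z_i = f(X_i)$, we introduce a dual vector, $\theta_i \in \mathbb{R}^d$, and rewrite the problem as follows:

$$\min_{W, f, Z} \; \max_{\Theta} \; L(W, f, Z, \Theta),$$

where $Z = [z_1, \ldots, z_n]$, $\Theta = [\theta_1, \ldots, \theta_n]$, and $L(W, f, Z, \Theta)$ is the augmented Lagrangian function. According to the ADMM process, we iteratively update the primal variables $\{w_k\}_{k=1}^{K}$, $f$, and $\{z_i\}_{i=1}^{n}$ to minimize the augmented Lagrangian using gradient descent and use the gradient ascent method on the dual problem to update $\{\theta_i\}_{i=1}^{n}$:

$$w_k \leftarrow w_k - \eta \nabla L_{w_k}(w_k), \quad f \leftarrow f - \eta \nabla L_{f}(f), \quad z_i \leftarrow z_i - \eta \nabla L_{z_i}(z_i), \quad \theta_i \leftarrow \theta_i + \eta \nabla L_{\theta_i}(\theta_i),$$

where $\nabla L_x(x)$ is the subgradient function of $L$ regarding $x$ and $\eta$ is the descent/ascent step.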
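One E-step of the procedure described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the convolutional representations are replaced by random placeholder vectors instead of CNN outputs, each data point is excluded from its own neighborhood (the text does not say whether it is), and the names `softmax`, `k_nearest_neighbors`, and `e_step` are hypothetical.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def k_nearest_neighbors(Z, k):
    """Indices of the k nearest neighbors of each row of Z under the l2 distance."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                      # assumption: exclude the point itself
    return np.argsort(d2, axis=1)[:, :k]

def e_step(Z, P_prev, W, k):
    """Update neighborhoods and classification maps with the parameters fixed.

    Z:      (n, d) representations z_i^t (here placeholders for the CNN outputs)
    P_prev: (n, K) classification maps p_i^{t-1}
    W:      (d + K, K) full-connection weights W^{t-1}
    """
    neighbors = k_nearest_neighbors(Z, k)             # (n, k) index array
    responses = softmax(np.hstack([Z, P_prev]) @ W)   # (n, K), row j holds p_jk
    P_new = responses[neighbors].max(axis=1)          # classwise max over each neighborhood
    return neighbors, P_new

# Toy run with random placeholders (illustrative sizes only).
rng = np.random.default_rng(0)
n, d, K, k = 6, 4, 3, 2
Z = rng.normal(size=(n, d))
P_prev = np.full((n, K), 1.0 / K)                     # uninformative initial maps
W = rng.normal(size=(d + K, K))
neighbors, P_new = e_step(Z, P_prev, W, k)
print(neighbors.shape, P_new.shape)
```

In a full training loop, this step would alternate with the M-step, which re-estimates the weights and the CNN while the maps returned here are held fixed.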

Experiments
In this section, we evaluate the performance of the proposed deep learning method over several benchmark data sets.

Data Sets.
We used the following datasets for the evaluation of the proposed method.

SkyFinder Dataset.
This dataset is for the problem of sky detection in hazy images [24,25].
It contains about 90,000 outdoor images, which are captured by 53 cameras. The number of images captured by each camera is in the thousands. About 40% of the pixels of the images are sky, and the problem is to predict sky in the image at the pixel level. The input for the prediction of each pixel is the surrounding region of the target pixel.

Multilingual Text Data Set.
This dataset is composed of five subsets of texts [26]. Each subset corresponds to a language. The five languages are English, French, German, Italian, and Spanish. The texts are of six different classes, which are C15, CCAT, E21, ECAT, GCAT, and M11. For each class of each language, the number of texts is no more than 5,000. The number of texts of each language varies from 12,000 to 30,000, and the number of unique tokens of each language varies from 11,547 to 34,279. The number of texts of each class also varies from 11,000 to 34,000. Each text is represented by a sequence of tokens, and each token is represented by a word-embedding vector trained by the GloVe algorithm [27]. Thus, each text is a data point, and a text is a sequence of word-embedding vectors.
The problem is to predict the class label of a text from the sequence of word-embedding vectors.

FERET Face Image Data Set.
This dataset is an image data set of human faces [28]. It contains 13,539 images of 1,565 individuals. The images cover different ages, genders, and positions. Each image is of size 128 × 128 pixels. Each image is considered as a set of image patches, and thus, it is a 2-D sequence of instances. The problem for this data set is to recognize the individual from a given face image.

Property Price Data Set.
This is a data set of time series of the Nationwide Building Society housing price index (https://www.nationwide.co.uk). The data set covers the years from 1973 to 2000. To generate the data points, we move a sliding window of one year over the time series. The time series within the window is considered as the input sequence of a data point. The overall trend of the three months following the window is treated as the target of prediction. The trend is defined as "increase" if the price at the end of the three months is significantly higher than at the beginning, "decrease" if it is significantly lower, and "flat" otherwise. The problem is thus a three-class classification problem. To represent each data point, we further use a smaller sliding window to split the time series into a set of frames, and each frame is treated as an instance.
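The construction of data points from the price index series can be sketched as follows. Several details are assumptions made only for illustration, since the text does not specify them: monthly samples (so that one year is 12 values and three months are 3 values), a frame length of 3 with stride 1, the comparison of the last against the first value of the three-month horizon, and a relative threshold `tol` of 1% standing in for "significantly". The function name `make_data_points` is hypothetical.

```python
import numpy as np

def make_data_points(series, window=12, horizon=3, frame=3, tol=0.01):
    """Turn a price index series into (frames, trend label) pairs.

    window:  length of the one-year sliding window (assumes monthly samples)
    horizon: number of samples after the window used to define the trend
    frame:   length of the smaller sliding window that produces the instances
    tol:     assumed relative change that counts as "significant"
    """
    series = np.asarray(series, dtype=float)
    points, labels = [], []
    for start in range(len(series) - window - horizon + 1):
        win = series[start:start + window]
        future = series[start + window:start + window + horizon]
        # Relative change between the beginning and the end of the horizon.
        change = (future[-1] - future[0]) / future[0]
        if change > tol:
            label = "increase"
        elif change < -tol:
            label = "decrease"
        else:
            label = "flat"
        # Split the window into overlapping frames; each frame is one instance.
        frames = np.stack([win[s:s + frame] for s in range(window - frame + 1)])
        points.append(frames)
        labels.append(label)
    return points, labels

# Toy series (illustrative values): a steadily rising index.
points, labels = make_data_points(np.arange(100.0, 120.0))
print(len(points), points[0].shape, labels[0])
```

With these assumed settings, each data point is a sequence of 10 frames of 3 values, which matches the instance-sequence input format used by the model.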

Experimental Setting.
To conduct the experiments, we use the 10-fold cross-validation protocol to split the data sets into training sets and test sets. The entire data set is split into 10 equal-size subsets. Each subset is used as a test set in turn, and the other nine subsets are combined to construct a training set. The proposed algorithm is used to train the parameters of the deep learning model over the training set, and then the model is applied to the test set. To classify a single data point in the test set, we first find its nearest neighbors by comparing its convolutional representation against the convolutional representations of the data points in the training set and then use the deep neighbor classification map and its own convolutional representation to calculate the classification scores that decide its classification result. The average classification rate over the ten test sets is used as the performance measure.
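The 10-fold protocol can be sketched with plain NumPy as below. The random shuffling before splitting, the fixed seed, and the names `ten_fold_splits`, `cross_validated_rate`, and the callback `evaluate_fold` are assumptions for illustration; the callback stands in for training the model on the training indices and returning the classification rate on the test indices.

```python
import numpy as np

def ten_fold_splits(n, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)  # shuffling is an assumption
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

def cross_validated_rate(n, evaluate_fold):
    """Average the classification rates returned by a hypothetical callback."""
    rates = [evaluate_fold(train_idx, test_idx)
             for train_idx, test_idx in ten_fold_splits(n)]
    return float(np.mean(rates))

# Toy check on 100 data points (illustrative size).
splits = list(ten_fold_splits(100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because the neighbors of a test point are searched only among the training points, the index sets produced here would also determine which representations are available for the neighbor lookup at test time.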

Compared Deep Learning Methods.
Our model is the very first deep learning model which can take both the input data and the neighbor information as input. Thus, there are no existing models for direct comparison. However, the contextual deep learning models use the neighboring instances as context to enhance the feature extraction of each instance in the input data point. Note that the neighborhood information of the contextual deep learning models is at the instance level, while that of our Iterative Deep Neighborhood is at the data point level; our model thus leverages more information than the contextual deep learning models. We compared the following contextual deep learning models to our method: (i) The Multicontext Deep Learning (MCDL) model was proposed by Zhao et al. [29]. This model was proposed to solve the problem of salient object detection in a low-contrast background, and it is based on the CNN model. Both global and local contexts are employed and jointly modeled in a unified multicontext deep learning framework. (ii) The Multistage Contextual Deep Learning (MSCDL) model was proposed by Zeng et al. [30]. This model was proposed for the pedestrian detection problem, and it jointly trains multistage classifiers. Moreover, the local regions are used as contextual information to support the decision at the next stage. The deep learning model is trained in a stage-by-stage style. (iii) The Spatial Contextual Deep Learning (HCDL) model was proposed by Ma et al. [31]. This model was proposed for hyperspectral image classification, and it uses both the spectral features and the spatial contextual information. The spectral and spatial features are both learned by the deep learning framework to generate effective representations of the data.

Classification Results.
The classification results of the proposed method, Iterative Deep Neighborhood (IDN), and the compared methods are given in Figure 3. According to the results reported in Figure 3, we have the following observations: (1) On all the data sets, our method obtains the best classification rate. This is strong evidence for our claim that the neighbors help to build a more effective deep learning model for the problem of classification.

From Figure 4, we have the following observations: (1) For all the benchmark data sets, more iterations lead to better classification rates. The reason for this phenomenon is that our method updates the input neighbors together with the deep learning model parameters. Thus, more iterations give a better estimation of the neighbors, which in turn trains a better model.
(2) However, we also observe that when the iteration number is larger than 100 (for the SkyFinder and property price data sets) or 200 (for the multilingual text and FERET face data sets), the performance becomes stable. This indicates the convergence of the iterative algorithm.
Remark: we also discuss the possible parameters affecting the convergence as follows: (1) Since our optimization is based on the ADMM algorithm, the ascent/descent step size is a factor that controls the speed of ascent/descent. A larger step size usually results in faster convergence. (2) Our method's convergence is also affected by the gradient function. This method is an EM-like algorithm; thus, its convergence is also affected by the slow convergence nature of the EM algorithm, due to the gradient function. Potential solutions include using the conjugate gradient method or the modified Newton's method.

For the small property price data set, the training completes within 400 seconds. (ii) Higher-dimensional data usually consume more training time. The images, as two-dimensional data, are more costly than the sequence data, which are one-dimensional. This is natural since the CNN model for the higher-dimensional data conducts filtering over more dimensions, and the cost is exponentially higher compared to the lower-dimensional data. (iii) The overall running time of the proposed algorithm on all the data sets is acceptable. The longest running time is shorter than three hours. This computational cost is reasonable for training a good-quality model.

Conclusion
In this paper, we proposed a novel deep learning framework. Different from existing deep learning models, which only take the instances of an input data point, the proposed model takes both the instances of the input data point and the neighboring data points for the classification of the given input data point. Precisely, we estimate a classification map for each neighboring data point and apply a max-pooling operation to the classification maps of the neighbors to represent the neighborhood. Moreover, the classification maps are based on the previously trained deep CNN model. The neighbor classification maps and the CNN model parameters are updated alternately in an iterative algorithm. Thus, the proposed method is called Iterative Deep Neighborhood (IDN). Compared to traditional deep learning methods, the proposed method achieved significant improvement on the classification tasks of the benchmark data sets.

Data Availability
All the data sets used in this paper to produce the experimental results are publicly accessible online.

Conflicts of Interest
The authors declare that they have no conflicts of interest.