Image Object Recognition via Deep Feature-Based Adaptive Joint Sparse Representation

An image object recognition approach based on deep features and adaptive weighted joint sparse representation (D-AJSR) is proposed in this paper. D-AJSR is a data-lightweight classification framework that can classify and recognize objects well with few training samples. In D-AJSR, a convolutional neural network (CNN) is used to extract the deep features of the training and test samples. Then, adaptive weighted joint sparse representation is used to identify the objects, in which the feature vectors are reconstructed by calculating the contribution weight of each vector. To address the high dimensionality of deep features, principal component analysis (PCA) is used to reduce the dimensions. Lastly, combined with the joint sparse model, the public features and private features of the images are extracted from the training sample feature set to construct the joint feature dictionary. Based on the joint feature dictionary, the sparse representation-based classifier (SRC) is used to recognize the objects. Experiments on face images and remote sensing images show that D-AJSR is superior to the traditional SRC method and several other advanced methods.


Introduction
Sparse representation has unique advantages in signal processing, image processing, computer vision, pattern recognition, and so on. Image recognition based on sparse representation can be divided into two parts: sparse representation and classification. First, a dictionary is built to represent the test samples; then, the sparse representation coefficients and the classification dictionary are used to identify the objects. Wright et al. first proposed the sparse representation-based classifier (SRC) [1]. SRC uses the original training samples as the dictionary to represent a test sample linearly and then calculates the sparse representation coefficients of the test sample. It uses the sparse representation coefficients and the training samples to calculate the reconstruction residual of each class, so the test sample can be identified according to the minimum reconstruction residual.
Lai et al. proposed a tensor feature extraction method based on multilinear sparse principal component analysis (MSPCA). The key operation of MSPCA is to rewrite multilinear PCA (MPCA) into multilinear regression forms and relax it for sparse regression. Moreover, it inherits the sparsity of sparse principal component analysis (SPCA) and can iteratively learn a series of sparse projections, achieving good results in face recognition [2]. By introducing sparsity or $l_1$-norm learning, Lai et al. proposed a unified sparse learning framework, which further extends locally linear embedding-based methods to sparse cases. This method achieves good results in image recognition, especially in the case of small samples [3]. There is also a generalized robust regression (GRR) method for jointly sparse subspace learning. By incorporating an elastic factor into the loss function, GRR can enhance robustness, obtain more projections for feature selection or classification, and achieve better robustness in face recognition [4].
Convolutional neural network (CNN) is a machine learning model based on deep supervised learning. For image recognition, CNN can directly use the image data as input without manual preprocessing or additional feature extraction. Therefore, CNN has achieved good recognition performance. CNN is well suited to extracting image features, as it can extract a variety of features including texture, shape, color, and image topology.
Jiajia et al. proposed a new CNN-GRNN model for image classification and recognition. This model used a simple CNN model for image feature extraction and then used a general regression neural network (GRNN) model for classification [5]. Lu and Linghua proposed a face recognition method based on discriminant dictionary learning, which used a Gabor filter to learn a new dictionary and classified the images with sparse representation [6]. Mahoor et al. proposed a facial action combination recognition framework based on sparse representation and used the average Gabor features of action combinations to establish an overcomplete dictionary to improve the recognition accuracy of various actions [7].
To some extent, the aforementioned studies have improved recognition efficiency, but they also have their own limitations. For example, using only a CNN model for image recognition both takes a lot of time for parameter tuning and requires a large number of training samples [8]. Sometimes, it is difficult to obtain a large number of experimental samples that meet the requirements. On the other hand, the traditional dictionary-based sparse representation methods introduced above mostly use traditional features, which cannot meet the requirement of a high recognition rate in many cases. In view of these situations, we improve the traditional dictionary into an extended dictionary and use deep features as the atoms in the dictionary, yielding the proposed D-AJSR approach.
D-AJSR is a data-lightweight classification framework with a relatively high recognition rate. At the same time, compared with deep neural network methods, D-AJSR can classify and recognize objects well with few training samples.

Classification Method Based on Sparse Representation.
SRC is a classification and recognition framework for face images first proposed by Wright et al. [1], which has gradually been applied to other image classification and recognition tasks. Suppose there are $n$ training samples $X = [x_1, x_2, \ldots, x_n]$ ($x_i \in \mathbb{R}^m$, generally $m \ll n$) belonging to $k$ classes. The entire training data set can then be expressed as

$$X = [W_1, W_2, \ldots, W_k], \qquad W_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_i}],$$

where $v_{i,j}$ is the $j$th sample of the $i$th class and $n_i$ is the number of samples in the $i$th class. Based on the theory of sparse representation (SR), a new test sample $y \in \mathbb{R}^m$ from class $i$ can be linearly expressed by the training samples $W_i$ as follows [9]:

$$y = W_i \alpha_i = \alpha_{i,1} v_{i,1} + \alpha_{i,2} v_{i,2} + \cdots + \alpha_{i,n_i} v_{i,n_i}, \qquad (2)$$

where $\alpha_{i,j} \in \mathbb{R}$ is the sparse representation coefficient of $y$, $j = 1, 2, \ldots, n_i$. Without considering noise, formula (2) can be written as

$$y = Xx,$$

where $x = [0, \ldots, 0, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, 0, \ldots, 0]^T \in \mathbb{R}^n$. In order to obtain the sparsest $x$, SRC solves the following $l_1$ minimization problem:

$$x' = \arg\min_x \|x\|_1 \quad \text{subject to} \quad y = Xx,$$

where $x'$ is the coefficient vector and $\|x\|_1 = \sum_i |x_i|$ is the $l_1$ norm.
For each class, we can construct a mapping function $\delta_i(x') = [0, \ldots, 0, x_i, 0, \ldots, 0]^T$, which selects the nonzero elements of the coefficient vector $x'$ corresponding to class $i$. The test sample is reconstructed from the sparse coefficients as

$$\hat{y}_i = X \delta_i(x').$$

Then, $y$ is classified to class $i$ by using the minimum residual:

$$\operatorname{class}(y) = \arg\min_i r_i(y) = \arg\min_i \|y - X\delta_i(x')\|_2,$$

where $\|\cdot\|_2$ is the $l_2$ norm and $i = 1, 2, \ldots, k$.
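The SRC pipeline above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the exact $l_1$ minimization is approximated with ISTA (iterative soft thresholding, i.e., the lasso relaxation of the problem), and all function and variable names are our own.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def src_classify(X, labels, y, lam=0.01, n_iter=500):
    """Sparse representation-based classification (SRC) sketch.

    X: (m, n) matrix whose columns are training samples,
    labels: length-n array of class ids, y: (m,) test sample.
    The problem min ||x||_1 s.t. y = Xx is approximated by its
    lasso relaxation, solved with plain ISTA iterations.
    """
    Xn = X / np.linalg.norm(X, axis=0)        # unit-norm atoms
    L = np.linalg.norm(Xn, 2) ** 2            # step-size bound (squared spectral norm)
    x = np.zeros(Xn.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - Xn.T @ (Xn @ x - y) / L, lam / L)
    # classify by minimum class-wise reconstruction residual
    best, best_r = None, np.inf
    for c in np.unique(labels):
        delta = np.where(labels == c, x, 0.0)  # delta_i(x'): keep class-c coefficients
        r = np.linalg.norm(y - Xn @ delta)
        if r < best_r:
            best, best_r = c, r
    return best
```

With well-separated classes, the residual of the true class is near zero while the other classes leave most of the test sample unexplained, so the minimum-residual rule recovers the correct label.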

Joint Sparsity Model.
Joint sparse model (JSM) was first proposed to encode multiple related signals effectively [10]. In JSM, because of the correlation between the signals, the related signals can be grouped into one set, and each signal can be represented as a combination of public features and private features on a specific sparse basis. The public feature is the part shared by all the signals in one signal set, and the private feature is the part unique to each signal. So, the $j$th signal can be represented by the public features of a certain set and its own private features:

$$x_j = z_c + z_j,$$

where $z_c$ represents the public features and $z_j$ represents the private features of the $j$th signal.
Assuming that all images can be divided into $K$ classes with $J$ training images per class, the $j$th image in the $i$th class is denoted $y_{i,j}$. If each image is represented as a one-dimensional column vector, the images in the $i$th class can be represented as $y_i = [y_{i,1}, y_{i,2}, \ldots, y_{i,J}]^T$. According to JSM, the $j$th image in the $i$th class can be represented as

$$y_{i,j} = z_i^c + z_{i,j}^i, \qquad (7)$$

where $z_i^c$ represents the public features of the images in the $i$th class and $z_{i,j}^i$ represents its own private features [11]. If $\Psi \in \mathbb{R}^{N \times N}$ is an orthonormal basis that can sparsely represent the training images, then formula (7) can be written as

$$\theta_{i,j} = \theta_i^c + \theta_{i,j}^i, \qquad (8)$$

where $\theta_{i,j} = \Psi y_{i,j}$ is the sparse representation of $y_{i,j}$ on the transform basis $\Psi$, and $\theta_i^c = \Psi z_i^c$ and $\theta_{i,j}^i = \Psi z_{i,j}^i$ represent the sparse representations of the public part and the private part on basis $\Psi$, respectively. If both sides of formula (8) are left-multiplied by $\Psi^T$, then formula (8) becomes

$$y_{i,j} = \Psi^T \theta_i^c + \Psi^T \theta_{i,j}^i.$$

Computational Intelligence and Neuroscience
Combined with formula (7), $y_{i,j} = \Psi^T \theta_i^c + \Psi^T \theta_{i,j}^i$, so the joint representation of the images in the $i$th class can be expressed as

$$y_i = \begin{bmatrix} y_{i,1} \\ y_{i,2} \\ \vdots \\ y_{i,J} \end{bmatrix} = \begin{bmatrix} \Psi^T \\ \Psi^T \\ \vdots \\ \Psi^T \end{bmatrix} \theta_i^c + \begin{bmatrix} \Psi^T & & \\ & \ddots & \\ & & \Psi^T \end{bmatrix} \begin{bmatrix} \theta_{i,1}^i \\ \vdots \\ \theta_{i,J}^i \end{bmatrix}. \qquad (9)$$

Formula (9) can be simplified as

$$y_i = D_i W_i,$$

where $D_i = [A, B]$ is the overcomplete dictionary and consists of two parts: $A = [\Psi^T, \Psi^T, \cdots, \Psi^T]^T$ and $B = \operatorname{diag}(\Psi^T, \ldots, \Psi^T)$, the block-diagonal matrix built from $\Psi^T$. $W_i = [(\theta_i^c)^T, (\theta_{i,1}^i)^T, \ldots, (\theta_{i,J}^i)^T]^T$ preserves the discriminant information, and its sparse representation can be obtained by solving the $l_1$ minimization of the following formula:

$$W_i' = \arg\min \|W_i\|_1 \quad \text{subject to} \quad y_i = D_i W_i.$$

After obtaining $W_i'$, the public features of all images in class $i$ and the private features of each image can be obtained in the $\Psi$ domain according to the inverse transformation:

$$z_i^c = \Psi^T \theta_i^c, \qquad z_{i,j}^i = \Psi^T \theta_{i,j}^i.$$

All public features and all private features form a joint feature dictionary $D$:

$$D = [z_1^c, z_{1,1}^i, \ldots, z_{1,J}^i, \ldots, z_K^c, z_{K,1}^i, \ldots, z_{K,J}^i].$$

So, we can use the following formula (14) to identify which class the objects belong to:

$$\operatorname{class}(y) = \arg\min_i \|y - D\,\delta_i(x')\|_2, \qquad (14)$$

where $x'$ is the sparse representation of $y$ over $D$ and $\delta_i(\cdot)$ selects the coefficients associated with class $i$. As can be seen from the above, the joint sparse model represents each class of training images with only two parts (public features plus private features), which effectively reduces the required storage space.
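The dictionary construction above can be illustrated with a deliberately simplified sketch. Instead of the $l_1$-based JSM decomposition, the public feature of each class is approximated here by the class mean and the private features by the residuals around it; `build_joint_dictionary` and its interface are our own illustrative choices, not the paper's algorithm.

```python
import numpy as np

def build_joint_dictionary(class_samples):
    """Sketch of joint feature dictionary construction.

    class_samples: list of (d, J_i) arrays, one per class, with one
    feature vector per column. As a simplification of the l1-based
    JSM decomposition, the public feature is approximated by the
    class mean and the private features by the residuals around it.
    """
    atoms = []
    for S in class_samples:
        z_c = S.mean(axis=1)                # public feature of the class
        atoms.append(z_c)
        for j in range(S.shape[1]):
            atoms.append(S[:, j] - z_c)     # private feature of each image
    D = np.column_stack(atoms)
    # normalize atoms, guarding against any all-zero columns
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0
    return D / norms
```

Classification can then proceed by sparse coding a test sample over D and applying the minimum-residual rule, exactly as in SRC.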

Image Object Recognition Based on D-AJSR
The algorithm framework of D-AJSR is shown in Figure 1. Unlike JSM, which sums the public and private features directly, D-AJSR combines the public features and private features into a joint dictionary. Based on this, sparse representation is used to find the sparse solution of the test samples on the adaptive joint dictionary.

Deep Feature Extraction.
CNN can automatically extract complex global and local features from images [8].
Therefore, D-AJSR introduces the deep features extracted by CNN into sparse representation to enhance the recognition ability of sparse representation.
In this paper, VGG19 is adopted for feature extraction. In the ILSVRC-2014 image classification competition, VGG took second place with a 7.3% top-5 error rate and won the localization task [12]. VGG uses small 3 × 3 convolution kernels throughout the network and builds deep networks by stacking these small kernels. The network structure of VGG19 is shown in Figure 2. Figure 3 shows examples of extracted features: the left image is the original image, the upper row shows the features extracted from the first layer, and the lower row shows the features extracted from the second layer. By comparing the features extracted at each layer, it can be found that most texture and detail features are extracted by the shallow layers, while contour and shape features are extracted by the deeper layers. Relatively speaking, the deeper the layer, the more representative the extracted features, but the lower the resolution of the feature maps.
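The effect of the 3 × 3 kernels that VGG stacks can be seen with a minimal single-channel example (cross-correlation, as deep learning frameworks implement "convolution"). The edge kernel here is hand-picked purely for illustration; VGG learns its kernels from data.

```python
import numpy as np

def conv3x3(image, kernel):
    """Minimal 'valid' 3x3 cross-correlation followed by ReLU,
    illustrating the small-kernel feature extraction VGG stacks."""
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
    return np.maximum(out, 0.0)   # ReLU nonlinearity, as in VGG

# a vertical-edge kernel (hand-picked here; a trained network learns these)
edge_kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
img = np.zeros((8, 8))
img[:, 4:] = 1.0                  # image with one vertical step edge
fmap = conv3x3(img, edge_kernel)  # feature map responding along the edge
```

The resulting feature map is nonzero only near the vertical edge, the kind of low-level texture/detail response that the shallow layers of the network capture.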

Adaptive Weighted Reconstruction.
When constructing the joint dictionary, the object information contained in different samples is different, and samples with larger variance contain more object information. Therefore, we increase the weights of the samples with more object information in the dictionary and reduce the weights of the samples with less object information, so as to improve the discrimination ability of the feature dictionary [13].
The feature vector set $F = [F_1, F_2, \ldots, F_n]^T$ extracted by the CNN can be transformed as follows:

$$F_i' = \frac{\|F_i - \bar{F}\|_2}{\sum_{j=1}^{n} \|F_j - \bar{F}\|_2}\, F_i, \qquad \bar{F} = \frac{1}{n}(F_1 + F_2 + \cdots + F_n), \qquad (15)$$

where $F_i$ represents the extracted features of the $i$th image, $F_i'$ represents the weighted image features, and $\bar{F}$ represents the average of the features. Formula (15) adaptively performs weighted reconstruction and normalization of the feature vectors, which increases the standard deviation or variance of the feature vectors to a certain extent, helps the deep feature dictionary contain more recognition information, and improves recognition efficiency.
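A sketch of the adaptive weighting, under our assumption that each feature vector's weight is its normalized deviation from the mean feature vector, so that vectors contributing more variance receive larger weights; the function name and exact normalization are illustrative.

```python
import numpy as np

def adaptive_weight(features, eps=1e-12):
    """Sketch of adaptive weighted reconstruction (assumed form):
    scale each feature vector by its normalized deviation from the
    mean feature, so high-variance vectors get larger weights."""
    F = np.asarray(features, float)          # shape (n, d): one row per sample
    mean = F.mean(axis=0)                    # average feature vector F-bar
    dev = np.linalg.norm(F - mean, axis=1)   # deviation of each vector from the mean
    w = dev / (dev.sum() + eps)              # adaptive contribution weights, sum to ~1
    return w[:, None] * F                    # weighted feature vectors
```

The row farthest from the mean ends up with the largest weight, matching the intuition above that high-variance samples carry more object information.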

Main Steps of D-AJSR.
The main steps of D-AJSR are as follows: (1) the VGG19 network is used to extract the deep features of the training and test images; (2) the feature vectors are adaptively weighted and reconstructed according to their contribution weights; (3) PCA is used to reduce the dimensions of the high-dimensional deep features; (4) the public and private features are extracted from the training feature set with the joint sparse model to construct the joint feature dictionary; and (5) SRC is applied on the joint feature dictionary to classify and recognize the test images.
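The PCA step of the pipeline can be sketched via the SVD of the centered feature matrix; the returned variance contribution rates are the quantities the experiments tabulate when choosing the feature dimension. The function name and interface are illustrative.

```python
import numpy as np

def pca_reduce(F, n_components):
    """PCA dimension reduction sketch: project feature vectors
    (rows of F) onto the top principal components via the SVD."""
    mean = F.mean(axis=0)
    Fc = F - mean                             # center the features
    # economy SVD; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Fc, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / (S ** 2).sum()     # variance contribution rates
    return Fc @ components.T, explained[:n_components]
```

Summing the returned ratios gives the cumulative variance contribution rate of the selected dimension, which is how one can judge whether, say, 50 components are representative enough.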

Experiments and Analysis
In order to verify the validity of D-AJSR, we conduct experiments on face images and remote sensing images, respectively. The computer used in the experiments is configured with an Intel Core i5-3210M @2.5 GHz and 4 GB memory. The experimental platform is Matlab R2017a. In deep feature extraction, we can get 64 global deep feature maps from the first layer and 128 from the second layer. All the experimental results in this section are the averages of 10 runs, and D-AJSR(8) indicates that 8 deep feature maps are used.
In the experiments, we chose the features obtained from the second layer. In Table 1, bold numbers in each column indicate the highest recognition rate under the same condition. As can be seen from Table 1, D-AJSR maintains high accuracy in all dimensions, and its performance with 50 dimensions is better than that of the other methods with 75, 100, and 150 dimensions. Therefore, D-AJSR can greatly reduce the feature dimension under the same precision requirement.

Experiments on AR Data Set.
The AR data set contains more than 4,000 frontal images of 126 people, each with a size of 120 × 165. In the experiments, we use a subset of 2,600 facial images of 100 people, including 50 men and 50 women. Each person has 26 images, which are divided into two separate parts. Each part has 13 pictures: 7 show expression or illumination changes without occlusion, 3 are pictures with sunglasses, and 3 are pictures occluded by scarves, as shown in Figure 5 (the images are selected randomly). Of the two parts, we use one for training and the other for testing. The feature dimensions of the face images are also 25, 50, 75, 100, and 150, respectively. The initial dimension of 8 deep features is 30 × 41 × 8 = 9840.
In the experiments, we chose the features obtained from the second layer. In Table 2, bold numbers in each column indicate the highest recognition rate under the same condition. Because the data set contains samples with sunglasses and scarf occlusion, and the number of these two kinds of samples is small, which affects the dictionary training, the recognition rate of our method is lower than that on the YaleB data set. As can be seen from Table 2, the D-AJSR method does not achieve the best results when the dimensions are 25 and 50, although it still remains at a medium level; this is mainly because the number of principal components is relatively small and the variance contribution rate is low (less than 0.6). As the principal components become more representative, starting from dimension 75, the recognition rate of D-AJSR is better than that of the other methods.
In addition, we also compared the experimental results with the locality-constrained and label embedding dictionary learning algorithm (LCLE-DL) [22]. The average recognition rate of LCLE-DL is about 80%, while the average recognition rate of our method is 86.60%. In terms of recognition accuracy, the D-AJSR(8) method performs relatively better.

Remote Sensing Image Recognition Experiments.
In this part, remote sensing aircraft images are selected from Google Earth 7.1.8 to build data sets for the experiments. Remote sensing images in Google Earth are composed of satellite images and aerial images; the satellite images come from the QuickBird and Landsat-7 satellites, and the aerial images come from BlueSky, Sanborn, and other companies. In the experiments, images taken at different times and locations are downloaded to build the data sets. Figure 6 shows examples of the remote sensing images. The size of a remote sensing image is 170 × 170, and the initial dimension of the 64 deep features obtained from the first layer is 85 × 85 × 64 = 462400 (Figure 7). After the PCA process, the feature dimensions of the aircraft images are 25, 50, 75, and 100, respectively. In the experiments, the SRC method [1] and the adaptive weighted joint sparse representation classification method (AJRC) [13] are compared with D-AJSR. In these experiments, we choose the features obtained from the first layer. The experimental results are shown in Table 3, and the bold numbers in each column indicate the highest recognition rate under the same condition.
Due to the small number of samples in the data set, the strong interference caused by aircraft shadows, and the tire marks on the ground, the recognition rates of all three methods do not reach the levels achieved in the aforementioned experiments. At the same time, because the atoms of the same object cover only 8 directions, the recognition of the object is also affected. However, on the same data set, D-AJSR still performs better than the other methods.

Cumulative Percent of the Principal Components.
In the experiments, when using PCA to reduce the dimension of the features, the cumulative variance contribution rate of features with different lengths is shown in Table 4. In Table 4, the left column is the number of feature maps selected from the second layer of VGG19, and the results are obtained on the YaleB data set. As can be seen from Table 4, due to different image sizes and different numbers of deep features, the cumulative variance contribution rate of the same number of features is not the same. Conversely, if the same cumulative variance contribution rate is required, the feature length will differ. In order to keep the size of the dictionary consistent, we use the same number of principal components in the experiments.

Effect and Efficiency of Different Feature Map Numbers.
The recognition rate of D-AJSR varies with the number of deep feature maps. For object recognition on the YaleB data set, we use the deep features obtained from the second layer of the VGG19 network, which has 128 feature maps in total. These feature maps contain different information about the objects. We choose different numbers of feature maps in the experiments, and the recognition results are shown in Table 5. The first row in the table is the number of principal components selected by PCA, and the left column is the number of feature maps selected after deep feature extraction. The bold numbers in Table 5 are the highest recognition rates in each column. As can be seen from Table 5, when the number of feature maps is small, the recognition rate increases with the number of feature maps. When the number of feature maps increases from 1 to 4, the recognition rate increases by 4.89% on average; the largest increase, 8.69%, occurs when 25 principal components are selected. However, as the number of feature maps continues to increase, the improvement in recognition rate becomes marginal or even negative. For example, when the number of deep feature maps increases from 64 to 128, all recognition rates decrease to varying degrees. When the number of deep feature maps is too large, the original atomic column vectors become too long and the energy is difficult to concentrate, so the discriminability of the final feature vectors decreases. Therefore, in practical applications, we need to select an appropriate number of deep feature maps.
In addition to the recognition rate of D-AJSR, we also measure its time efficiency. The experiments are carried out on the remote sensing image data set and compared with SRC and AJRC. The training efficiency results of the different approaches on remote sensing images are shown in Table 6, the test efficiency results are shown in Table 7, and the unit of time is second (s). In the experiments, there are 150 training samples and 225 test samples, and the sample size is 170 × 170. The experiments adopt the feature maps of the first layer, and D-AJSR(64) indicates that 64 deep feature maps are used.
From Tables 6 and 7, it can be seen that the object recognition time of D-AJSR is longer than that of SRC.
However, when the number of feature maps is less than 32, the recognition time is shorter than that of AJRC. The feature extraction of 226 deep feature maps in D-AJSR takes about 30 seconds. Considering the results of Tables 5-7 together, D-AJSR is more advantageous than the other two methods.
From all the aforementioned experiments, we can see that D-AJSR achieves satisfactory recognition results when the data set is small. Generally speaking, in the recognition of remote sensing objects, VGG and other neural network methods often need thousands of images per class as training sets, while D-AJSR only needs a few images per class as atoms to output recognition results. In many cases, it is difficult to obtain a large amount of training data for specific tasks, such as the recognition of sensitive objects in special circumstances, the identification of unusual objects, and so on. In such cases, D-AJSR can give full play to its advantages and provide timely recognition results.

Conclusions
Aiming at the application requirements of object recognition, we introduce deep features into adaptive weighted joint sparse representation and propose D-AJSR, a data-lightweight classification framework. In order to improve the object recognition rate, the method also adaptively adjusts the atomic weights. Experimental results show that the method achieves a relatively high recognition rate. On the other hand, since deep feature extraction is more complex than conventional feature extraction, the time consumption of the method increases correspondingly. When the number of samples is too small, methods such as deep learning cannot provide reliable identification results due to inadequate training, whereas D-AJSR can provide recognition results with only a dozen or even a few samples, which offers an effective solution for object recognition without sufficient samples. In addition, after angular rotation expansion of the training samples, D-AJSR also has a certain ability to recognize rotated objects. In D-AJSR, feature extraction still takes some time.
Therefore, under the framework of sparse recognition, how to select features that are more expressive and can be extracted quickly is worth our attention in the future.

Data Availability
The YaleB and AR data sets are public data sets, which can be found in references [14, 15].

Conflicts of Interest
The authors declare that they have no conflicts of interest.