Multilinear Supervised Neighborhood Preserving Embedding Analysis of Local Descriptor Tensor

Subspace learning based pattern recognition methods have attracted considerable interest in recent years, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), and some extensions for 2D analysis. However, a disadvantage of all these approaches is that they perform subspace analysis directly on the reshaped vector or matrix of pixel-level intensities, which is usually unstable under appearance variation. In this chapter, we propose to represent an image as a local descriptor tensor, which combines the descriptors of local regions (K*K-pixel patches) in the image and is more efficient than the popular Bag-Of-Feature (BOF) model for combining local descriptors. The idea of BOF is to quantize local invariant descriptors, e.g., those obtained with an interest-point detector by Harris & Stephens (1998) and described with SIFT by Lowe (2004), into a set of visual words, as in Lazebnik et al. (2006). The frequency vector of the visual words then represents the image, and an inverted file system is used for efficient comparison of such BOFs. However, the BOF model only approximately represents each local descriptor as a predefined visual word and vectorizes the local descriptors of an image into an orderless histogram, which may lose important (discriminant) information of the local features and the spatial information held in the local regions of the image. Therefore, this paper proposes to combine the local features of an image into a descriptor tensor. Because the local descriptor tensor retains all the information of the local features, it represents an image more efficiently than the BOF model and can use a moderate number of local regions to extract the descriptor, which requires less computation time than the BOF model.
For feature representation of image regions, SIFT, proposed by Lowe (2004), has been shown by Lazebnik et al. (2006) to be a powerful local descriptor for object or scene recognition, and it is somewhat invariant to small illumination changes. However, in some benchmark databases such as the YALE and PIE face data sets by Belhumeur et al. (1997), the illumination variation is very large. In order to extract robust features that are invariant to large illumination changes, we explore an improved gradient (intensity-normalized gradient) of the image and use a histogram of orientations weighted with the improved gradient for local region representation.


Introduction
With the local descriptor tensor for image representation, we propose to use a tensor subspace analysis algorithm, called multilinear Supervised Neighborhood Preserving Embedding (MSNPE), for discriminant feature extraction, and then use it for object or scene recognition. As we know, subspace learning approaches such as PCA and LDA by Belhumeur et al. (1997) have been widely used in the computer vision research field for feature extraction or selection and have proven to be efficient for modeling or classification.

Xian-Hua Han and Yen-Wei Chen
Recently there has been considerable interest in geometrically motivated approaches to visual analysis. The most popular ones include locality preserving projection by He et al. (2005), neighborhood preserving embedding, and so on, which not only preserve the local structure between samples but also obtain acceptable recognition rates for face recognition. In real applications, all these subspace learning methods first need to reshape the multilinear data into a 1D vector for analysis, which usually suffers from overfitting. Therefore, some researchers proposed to address the curse of dimensionality with 2D subspace learning, such as 2-D PCA and 2-D LDA by ming Wang et al. (2009), which analyzes a 2D image matrix directly and has proven suitable to some extent. However, all of these conventional methods usually perform subspace analysis directly on the reshaped vector or matrix of pixel-level intensities, which is unstable under illumination and background variation. In this paper, we propose MSNPE for discriminant feature extraction on the local descriptor tensor. Unlike tensor discriminant analysis by Wang (2006), which treats all samples in the same category equally, the proposed MSNPE uses the neighbor similarity within the same category as a weight in the cost function to be minimized for Nth-order tensor analysis, which makes it possible to estimate geometrical and topological properties of the sub-manifold tensor from random points ("scattered data") lying on this unknown sub-manifold. In addition, compared with the TensorFaces method by Vasilescu & Terzopoulos (2002), which also directly analyzes multi-dimensional data, the proposed multilinear supervised neighborhood preserving embedding uses a supervised strategy and thus can extract more discriminant features for distinguishing different objects while preserving the relationships between samples of the same object, instead of performing only dimension reduction as in TensorFaces. We validate our proposed algorithm on different benchmark databases such as view-based object data sets and facial image data sets (YALE and CMU PIE) by Belhumeur et al. (1997) and Sim et al. (2001).

Related work
In this section, we first briefly introduce tensor algebra and then review subspace-based feature extraction approaches such as PCA and LPP.
Tensors are arrays of numbers which transform in certain ways under coordinate transformations.
The order of a tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$, represented by a multi-dimensional array of real numbers, is $M$. An element of $\mathcal{X}$ is denoted as $\mathcal{X}_{i_1, i_2, \cdots, i_M}$, where $1 \le i_j \le N_j$ and $1 \le j \le M$. In tensor terminology, the mode-$j$ vectors of the $M$th-order tensor $\mathcal{X}$ are the vectors in $\mathbb{R}^{N_j}$ obtained from $\mathcal{X}$ by varying the index $i_j$ while keeping the other indices fixed. For example, the column vectors in a matrix are the mode-1 vectors and the row vectors in a matrix are the mode-2 vectors.
for all index values. $\mathcal{X} \times_d U$ denotes the mode-$d$ product of the tensor $\mathcal{X}$ with the matrix $U$.
The mode product is a special case of a contraction, which is defined for any two tensors, not just for a tensor and a matrix. In this paper, we follow the definitions in Lathauwer (1997) and avoid the use of the term "contraction".
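The mode product can be illustrated with a few lines of NumPy. This is only a sketch: the function name `mode_product`, the 0-based mode index, and the convention that `U` has shape (J, N_d) and replaces the mode-d axis by an axis of length J are assumptions made for illustration, not definitions taken from this chapter.

```python
import numpy as np

def mode_product(X, U, d):
    """Mode-d product of tensor X with matrix U (d is 0-based here).

    Assumed convention: U has shape (J, N_d); the result has the same
    shape as X except that axis d has length J instead of N_d.
    """
    # Contract the second axis of U with axis d of X. tensordot places the
    # new axis first, so it is moved back to position d afterwards.
    Y = np.tensordot(U, X, axes=(1, d))
    return np.moveaxis(Y, 0, d)

X = np.arange(24, dtype=float).reshape(2, 3, 4)   # a 3rd-order tensor
U = np.ones((5, 3))                               # acts on the second mode
Y = mode_product(X, U, 1)                         # shape (2, 5, 4)
```

For a matrix, the product along the first mode under this convention reduces to the ordinary matrix product `U @ X`, which is a convenient sanity check.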
In tensor analysis, Principal Component Analysis (PCA) is used to extract the basis for each mode. The proposed MSNPE approach is based on the basic idea of Locality Preserving Projection (LPP). Therefore, we briefly introduce PCA, LPP and a 2D extension of LPP in the following.
(1) Principal component analysis extracts the principal eigenspace associated with a set (matrix
denotes the sample features in the transformed subspace. Then, the linear transformation $P$ can be obtained by solving the following minimization problem with some constraints, which will be given later:
where $W_{ij}$ evaluates the local structure of the image space. It can be simply defined as follows:
By simple algebraic manipulation, the objective function can be reduced to:
where no column $P_i$ of the LPP linear transformation matrix $P$ may be the zero vector, and a constraint is imposed as follows:
where $I$ in the constraint $P^T X D X^T P = I$, or $Y^T D Y = I$, is an identity matrix. $D$ is a diagonal matrix whose entries are the column (or row, since $W$ is symmetric) sums of $W$, $D_{ii} = \sum_j W_{ij}$. The matrix $D$ provides a natural measure on the data samples.
The bigger the value $D_{ii}$ (corresponding to $y_i$) is, the more important $y_i$ is. The constraint for the sample $y_i$ in $Y^T D Y = I$ is $D_{ii} \, y_i^T y_i = 1$, which means that the more important the sample $y_i$ is (the larger $D_{ii}$), the smaller the value of $y_i^T y_i$. Therefore, the constraint $Y^T D Y = I$ tries to place the important points (those with a dense distribution of samples around them) near the origin of the projected subspace. The dense region near the origin of the projected subspace then includes most of the samples, which makes the objective function in Eq. (2) as small as possible and, at the same time, avoids the trivial solution $\|P_i\|^2 = 0$ for the transformation matrix $P$.
Then, the linear transformation $P$ can be obtained by minimizing the objective function under the constraint $P^T X D X^T P = I$:
Finally, the minimization problem can be converted into a generalized eigenvalue problem as follows:
In face recognition applications, He et al. [8] extended the LPP method to 2D analysis, named Tensor Subspace Analysis (TSA). TSA can directly deal with 2D gray images and achieved better recognition results than conventional 1D subspace learning methods such as PCA, LDA and LPP. However, for object recognition, color information also plays an important role in distinguishing different objects. Therefore, in this paper, we extend LPP to ND tensor analysis, which can directly deal with not only 3D data but also ND data structures. At the same time, in order to obtain a stable transformation tensor basis, we add a regularization term to the proposed MSNPE objective function for object recognition, which is introduced in detail in Sec. 3.
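For reference, the LPP steps described in this section can be written out in the standard form of the method. The block below is a sketch that assumes the chapter follows the usual LPP formulation; in particular, the heat-kernel weight with parameter $t$ is one common choice and is an assumption here, not necessarily the definition used by the authors.

```latex
% Standard LPP formulation, assumed rather than copied from this chapter.
% X = [x_1, ..., x_N] holds the samples as columns and y_i = P^T x_i.
\begin{align*}
  &\min_{P} \sum_{i,j} \lVert y_i - y_j \rVert^2 \, W_{ij},
      \qquad y_i = P^T x_i, \\
  &W_{ij} =
    \begin{cases}
      \exp\!\left(-\lVert x_i - x_j \rVert^2 / t\right)
        & \text{if } x_i \text{ and } x_j \text{ are neighbors (assumed heat kernel)}, \\
      0 & \text{otherwise},
    \end{cases} \\
  &\tfrac{1}{2} \sum_{i,j} \lVert y_i - y_j \rVert^2 \, W_{ij}
      = \operatorname{tr}\!\left(P^T X (D - W) X^T P\right),
      \qquad D_{ii} = \sum_j W_{ij}, \\
  &\text{subject to } P^T X D X^T P = I, \\
  &X (D - W) X^T p = \lambda \, X D X^T p .
\end{align*}
```

The third line follows by expanding the squared norm and using the symmetry of $W$; the last line is the generalized eigenvalue problem whose eigenvectors with the smallest eigenvalues form the columns of $P$.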

Local descriptor tensor for image representation
In computer vision, local descriptors (i.e., features computed over limited spatial support) have proven to be well adapted to matching and recognition tasks, as they are robust to partial visibility and clutter. The currently most popular local descriptor is the SIFT feature proposed by Lowe (2004). With the local SIFT descriptor, there are usually two types of algorithms for object recognition. One is to match the local points with SIFT features in two images, and the other is to use the popular BOF model, which forms a frequency histogram of predefined visual words for all sampled region features by Belhumeur et al. (1997). For a matching algorithm, even several well-matched points are usually not enough to recognize the unknown image. The popular BOF model can usually achieve good recognition performance in most applications such as scene and object recognition. However, in order to achieve an acceptable recognition rate, the BOF model needs to sample a lot of points for extracting SIFT features (usually more than 1000 in an image) and to compare each extracted local SIFT feature with the predefined visual words (usually more than 1000) to obtain the visual-word occurrence histogram. Therefore, the BOF model needs a lot of computing time to extract the visual-word occurrence histogram. In addition, the BOF model only approximately represents each local region feature as a predefined visual word; it may therefore lose a lot of information and be inefficient for image representation. Therefore, in this paper, we propose to represent a color or gray image as a combined local descriptor tensor, which can use different features (such as SIFT or other descriptors) for local region representation.
In order to extract the local descriptor tensor for image representation, we first grid-segment an image into K regions with some overlap, and in each region we extract descriptors (which can be considered tensors) for local region representation. For a gray image, an M-dimensional feature vector, which can be considered a 1D tensor, is extracted from the local gray region. For a color image, an M-dimensional feature vector can be extracted from each color channel, such as the R, G and B channels. With the feature vectors of the three color channels, a combined 2D M × 3 tensor can represent the local color region. Furthermore, we combine the K 1D or 2D local tensors (M-dimensional vectors or M × 3 2D tensors) into a 2D or 3D tensor of size M × K × L (L: 1 or 3). The tensor feature extraction procedure for a color image is shown in Fig. 1(a). For feature representation of the local regions, such as the red, orange and green rectangles in Fig. 1(a), the popular SIFT proposed by Lowe (2004) has proved to be a powerful descriptor for object recognition and is somewhat invariant to small illumination changes. However, in some benchmark databases such as the YALE and CMU PIE face datasets, the illumination variation is very large. In order to extract robust features that are invariant to large illumination changes, we explore a normalized gradient (intensity-normalized gradient) of the image and use the Histogram of Orientation weighted with Normalized Gradient (NHOG) for local region representation. Therefore, for benchmark databases without large illumination variation, such as the COIL-100 dataset, or where the illumination information is also useful for recognition, such as scene datasets, we use the popular SIFT for local region representation. For benchmark databases with large illumination variation that is harmful for subject recognition, such as the YALE and CMU PIE facial datasets, we use NHOG for local region representation.
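The assembly of the M × K × L tensor described above can be sketched as follows. The function names, the patch size and grid step, and in particular `toy_descriptor` (a plain intensity histogram standing in for SIFT or NHOG) are hypothetical choices made only so that the example is self-contained.

```python
import numpy as np

def grid_regions(height, width, patch, step):
    """Top-left corners of patch x patch regions on a regular grid.
    A step smaller than the patch size gives overlapping regions."""
    rows = range(0, height - patch + 1, step)
    cols = range(0, width - patch + 1, step)
    return [(r, c) for r in rows for c in cols]

def toy_descriptor(region, bins=8):
    """Placeholder M-dimensional descriptor (intensity histogram).
    This is NOT SIFT or NHOG; it only stands in for them."""
    hist, _ = np.histogram(region, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def local_descriptor_tensor(image, patch=16, step=8, descriptor=toy_descriptor):
    """Stack per-region, per-channel descriptors into an M x K x L tensor."""
    if image.ndim == 2:                 # gray image: one channel, L = 1
        image = image[:, :, None]
    corners = grid_regions(image.shape[0], image.shape[1], patch, step)
    channels = []
    for ch in range(image.shape[2]):
        feats = [descriptor(image[r:r + patch, c:c + patch, ch])
                 for r, c in corners]
        channels.append(np.stack(feats, axis=1))    # M x K for this channel
    return np.stack(channels, axis=2)               # M x K x L

rng = np.random.default_rng(0)
color_image = rng.random((32, 32, 3))               # synthetic RGB image in [0, 1)
T = local_descriptor_tensor(color_image)            # M = 8, K = 9, L = 3
```

Passing a real descriptor function in place of `toy_descriptor` changes only M; the layout of the tensor stays the same.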
(1) SIFT: The SIFT descriptor computes a gradient orientation histogram within the support region. For each of 8 orientation planes, the gradient image is sampled over a 4 by 4 grid of locations, thus resulting in a 128-dimensional feature vector for each region. A Gaussian window function is used to assign a weight to the magnitude of each sample point. This makes the descriptor less sensitive to small changes in the position of the support region and puts more emphasis on the gradients that are near the center of the region. To obtain robustness to illumination changes, the descriptors are made invariant to illumination transformations of the form aI(x) + b by scaling the norm of each descriptor to unity [8]. To represent a local region of a color image, we extract the SIFT feature in each color component (R, G and B), and thus obtain a 128 × 3 2D tensor for each local region.
(2) Histogram of Orientation weighted with the Normalized Gradient (NHOG): Given an image $I$, we calculate the improved gradient (intensity-normalized gradient) using the following equation:
where $I_x(i, j)$ and $I_y(i, j)$ denote the horizontal and vertical gradients at pixel position $(i, j)$, respectively, and $I_{xy}(i, j)$ denotes the global gradient at pixel position $(i, j)$. The idea of the normalized gradient comes from the $\chi^2$ distance, a normalized Euclidean distance. For the x-direction, the gradient is normalized by the sum of the pixels above and below the pixel in focus; for the y-direction, the gradient is normalized by the sum of the pixels to its right and left.
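A minimal sketch of the intensity-normalized gradient and an orientation histogram weighted by it is given below. It follows the verbal description above (central difference divided by the sum of the same two neighboring pixels); the small constant `eps`, the use of a single 8-bin histogram per region, the final unit-norm scaling, and the function names are assumptions for illustration and may differ from the authors' exact NHOG.

```python
import numpy as np

def normalized_gradient(I, eps=1e-8):
    """Intensity-normalized gradients of a gray image I (2D array).

    Each central difference is divided by the sum of the two pixels it is
    computed from; eps (an assumption) guards against division by zero.
    Border pixels are left at zero.
    """
    I = np.asarray(I, dtype=float)
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[1:-1, :] = (I[2:, :] - I[:-2, :]) / (I[2:, :] + I[:-2, :] + eps)
    Iy[:, 1:-1] = (I[:, 2:] - I[:, :-2]) / (I[:, 2:] + I[:, :-2] + eps)
    Ixy = np.sqrt(Ix ** 2 + Iy ** 2)        # combined gradient magnitude
    return Ix, Iy, Ixy

def nhog(region, bins=8):
    """Orientation histogram of a region, weighted by the normalized gradient."""
    Ix, Iy, Ixy = normalized_gradient(region)
    theta = np.arctan2(Iy, Ix)              # orientation in [-pi, pi]
    hist, _ = np.histogram(theta, bins=bins, range=(-np.pi, np.pi), weights=Ixy)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

Because numerator and denominator scale together, multiplying the region by a constant illumination factor leaves the histogram essentially unchanged, which is the property the normalized gradient is meant to provide.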

Multilinear supervised neighborhood preserving embedding
In order to model N-dimensional data without rasterization, tensor representation is proposed and analyzed for feature extraction or modeling. In this section, we propose a multilinear supervised neighborhood preserving embedding, Han et al. (2011). Let $\mathcal{X}_i^c \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_L}$ be the $i$th object in the $c$th class. For a color object image tensor, $L$ is 3, $N_1$ is the row number, $N_2$ is the column number, and $N_3$ is the number of color space components ($N_3 = 3$). We can build a nearest neighbor graph $G$ to model the local geometrical structure and label information of $\mathcal{X}$. Let $W$ be the weight matrix of $G$. A possible definition of $W$ is as follows:
where $\|\mathcal{X}_i - \mathcal{X}_j\|_2$ denotes the Euclidean distance between two tensors, i.e., the square root of the sum of squared differences between all corresponding elements of $\mathcal{X}_i$ and $\mathcal{X}_j$, and $\|\cdot\|$ denotes the $l_2$ norm in this paper.
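One way to build such a supervised neighbor graph is sketched below. The heat-kernel form `exp(-dist**2 / t)`, the restriction to the `k` nearest same-class neighbors, and the symmetrization step are assumptions chosen for illustration; the chapter's own weight definition may differ in these details.

```python
import numpy as np

def supervised_weights(tensors, labels, k=3, t=1.0):
    """Weight matrix W and degree matrix D of a same-class neighbor graph.

    Assumed rule: W[i, j] = exp(-||X_i - X_j||^2 / t) if i and j share a
    label and one is among the k nearest same-class neighbors of the
    other; W[i, j] = 0 otherwise.
    """
    n = len(tensors)
    flat = np.stack([np.ravel(x) for x in tensors]).astype(float)
    labels = np.asarray(labels)
    # Squared Euclidean distances between all pairs of flattened tensors.
    sq = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        same = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
        nearest = same[np.argsort(sq[i, same])[:k]]
        W[i, nearest] = np.exp(-sq[i, nearest] / t)
    W = np.maximum(W, W.T)            # make the graph symmetric
    D = np.diag(W.sum(axis=1))        # diagonal matrix of row sums of W
    return W, D
```

Samples from different classes never receive a non-zero weight, so only within-class neighborhoods contribute to the cost function.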
Let $U_d$ be the $d$-mode transformation matrices (dimension: $N_d \times N'_d$). A reasonable transformation respecting the graph structure can be obtained by solving the following objective function:

• Solve the minimization problem with eigenspace analysis
end for
end for
output: the MSNPE tensor
Table 1. The flowchart of multilinear supervised neighborhood preserving embedding (MSNPE).

where $\mathcal{X}_i$ is the tensor representation of the $i$th sample; $\mathcal{X}_i \times_1 U_1$ denotes the mode-1 product of the tensor $\mathcal{X}_i$ with the matrix $U_1$, $\mathcal{X}_i \times_1 U_1 \times_2 U_2$ denotes the mode-2 product of the tensor $\mathcal{X}_i \times_1 U_1$ with the matrix $U_2$, and so on. The above objective function incurs a heavy penalty if neighboring points of the same class, $\mathcal{X}_i$ and $\mathcal{X}_j$, are mapped far apart. Therefore, minimizing it is an attempt to ensure that if $\mathcal{X}_i$ and $\mathcal{X}_j$ are close, their projections are close as well. In the optimization procedure of each mode, we also impose a constraint to obtain the transformation matrix (such as $U_d$ in mode $d$) as follows:
For the optimization problem over all modes, we adopt an alternating least squares (ALS) approach. In ALS, we obtain the optimal base vectors of one mode by fixing the base vectors of the other modes, and cycle through the remaining variables. The $d$-mode transformation matrix $U_d$ can be obtained by minimizing the following cost function:

In order to obtain a stable solution, we first regularize the symmetric matrix $D_d$ as $D_d = D_d + \alpha I$ ($\alpha$ is a small value and $I$ is an identity matrix of the same size as $D_d$). Then, the minimization problem for obtaining the $d$-mode matrix can be converted into a generalized eigenvalue problem as follows:
We can select the generalized eigenvectors corresponding to the $N'_d$ smallest eigenvalues in Eq. (14), which minimize the objective function in Eq. (13). However, the eigenvectors with the smallest eigenvalues are usually unstable. Therefore, we convert Eq. (14) into:
The generalized eigenvectors with the $N'_d$ smallest eigenvalues $\lambda$ in Eq. (14) are those with the $N'_d$ largest eigenvalues $\beta$ ($\beta = 1 - \lambda$) in Eq. (15). Therefore, the generalized eigenvectors with the $N'_d$ largest eigenvalues can be selected to minimize the objective function in Eq. (13). The details of the MSNPE algorithm are listed in Algorithm 1. In the MSNPE algorithm, we need to decide the number of retained generalized eigenvectors (mode dimension) for each mode. Usually, the dimension numbers in most discriminant tensor analysis methods are decided empirically or according to the application.
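The per-mode update can be sketched in NumPy as follows. This is a reconstruction under stated assumptions rather than the authors' implementation: it assumes $D_d = \sum_i D_{ii} Y_i Y_i^T$ and $S_d = \sum_{i,j} W_{ij} Y_i Y_j^T$, where $Y_i$ is the mode-$d$ unfolding of sample $i$ after projection by $U_k^T$ on all other modes, and it solves $S_d u = \beta (D_d + \alpha I) u$, which is equivalent to $(D_d + \alpha I - S_d) u = (1 - \beta)(D_d + \alpha I) u$. The function names, the deterministic initialization and the toy data are hypothetical.

```python
import numpy as np

def unfold(X, d):
    """Mode-d unfolding: axis d becomes the rows of a matrix."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

def project_except(X, Us, d):
    """Apply U_k^T along every mode k except mode d."""
    Y = X
    for k, U in enumerate(Us):
        if k != d:
            Y = np.moveaxis(np.tensordot(U.T, Y, axes=(1, k)), 0, k)
    return Y

def mode_matrices(tensors, W, Us, d):
    """D_d and S_d for mode d with all other modes fixed (assumed forms)."""
    Ys = [unfold(project_except(X, Us, d), d) for X in tensors]
    deg = W.sum(axis=1)
    n_d = Ys[0].shape[0]
    D_d = np.zeros((n_d, n_d))
    S_d = np.zeros((n_d, n_d))
    for i, Yi in enumerate(Ys):
        D_d += deg[i] * Yi @ Yi.T
        for j in np.flatnonzero(W[i]):
            S_d += W[i, j] * Yi @ Ys[j].T
    return D_d, S_d

def solve_mode(D_d, S_d, n_keep, alpha=1e-3):
    """Eigenvectors of S_d u = beta (D_d + alpha I) u with the largest beta."""
    D_reg = D_d + alpha * np.eye(D_d.shape[0])
    L_inv = np.linalg.inv(np.linalg.cholesky(D_reg))   # D_reg = L L^T
    M = L_inv @ S_d @ L_inv.T                          # symmetric standard problem
    beta, V = np.linalg.eigh((M + M.T) / 2)            # ascending eigenvalues
    order = np.argsort(beta)[::-1][:n_keep]            # keep the largest beta
    return L_inv.T @ V[:, order], beta[order]

# Toy data: six random 4 x 5 x 3 tensors in two classes (illustrative only).
rng = np.random.default_rng(0)
tensors = [rng.random((4, 5, 3)) for _ in range(6)]
labels = np.array([0, 0, 0, 1, 1, 1])
W = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(W, 0.0)
Us = [np.eye(n)[:, :2] for n in (4, 5, 3)]   # fixed start instead of a random one
for _ in range(3):                           # a few ALS sweeps
    for d in range(3):
        D_d, S_d = mode_matrices(tensors, W, Us, d)
        Us[d], beta = solve_mode(D_d, S_d, n_keep=2)
```

Whitening with the Cholesky factor keeps the example free of dependencies beyond NumPy; the returned columns satisfy $U_d^T (D_d + \alpha I) U_d = I$ by construction.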
In our experiments, we retain different dimension numbers for different modes and perform recognition of objects or scene categories. The recognition accuracy with varied dimensions in different modes is also given in the experimental part. The dimension numbers are decided empirically for the results compared with the state-of-the-art algorithms.
After obtaining the MSNPE basis of each mode, we can project each tensor object onto these MSNPE tensors. For classification, the projection coefficients represent the extracted feature vectors and can be input into any classification algorithm. In our work, besides a KNN (k=1) classifier with Euclidean distance, we also use Random Forest (RF) for recognition.
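The projection and the 1-nearest-neighbor step can be sketched as follows. The per-mode bases here are random orthonormal matrices used purely as stand-ins for learned MSNPE bases, and the function names and toy data are hypothetical; the Random Forest classifier mentioned above is not shown.

```python
import numpy as np

def project(X, Us):
    """Project tensor X onto the per-mode bases by applying U_d^T along mode d."""
    Y = X
    for d, U in enumerate(Us):
        Y = np.moveaxis(np.tensordot(U.T, Y, axes=(1, d)), 0, d)
    return Y

def nearest_neighbor_label(x_test, train_feats, train_labels):
    """KNN with k = 1 and Euclidean distance on flattened projection coefficients."""
    dists = np.linalg.norm(train_feats - x_test, axis=1)
    return train_labels[int(np.argmin(dists))]

rng = np.random.default_rng(0)
# Stand-in bases (N_d x 2 with orthonormal columns), NOT learned MSNPE bases.
Us = [np.linalg.qr(rng.standard_normal((n, 2)))[0] for n in (4, 5, 3)]
train = [rng.random((4, 5, 3)) for _ in range(6)]
train_labels = np.array([0, 0, 0, 1, 1, 1])
train_feats = np.stack([project(X, Us).ravel() for X in train])   # 6 x 8 features
predicted = nearest_neighbor_label(project(train[4], Us).ravel(),
                                   train_feats, train_labels)
```

Each 4 × 5 × 3 tensor is reduced to a 2 × 2 × 2 coefficient tensor, whose eight entries serve as the feature vector passed to the classifier.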

Database
We evaluated our proposed framework on two different types of datasets.
(i) View-based object datasets, which include two datasets. The first one is the Columbia COIL-100 image library by Nene et al. (1996). It consists of color images of 72 different views of 100 objects. The images were obtained by placing the objects on a turntable and taking a view every 5°. The objects have a wide variety of complex geometric and reflectance characteristics. Fig. 3(a) shows some sample images from COIL-100. The second one is the ETH Zurich CogVis ETH-80 dataset by Leibe & Schiele (2003a). This dataset was set up by Leibe and Schiele to explore the capabilities of different features for object class recognition. In this dataset, eight object categories including apple, pear, tomato, cow, dog, horse, cup and car have been collected. Each category contains 10 different objects spanning a large intra-class variance. Each object has 41 images from viewpoints spaced equally over the upper viewing hemisphere. On the whole, we have 3280 images: 41 images for each object and 10 objects for each category.

Methodology
The recognition task is to assign each test image to one of a number of categories or objects. The performance is measured using recognition rates.
For the view-based object databases, we use different experimental setups for the COIL-100 and ETH-80 datasets. For COIL-100, the objective is to discriminate between the 100 individual objects. In most previous experiments on object recognition using COIL-100, the number of views used as the training set for each object varied from 36 to 4. When 36 views are used for training, the recognition rate using SVM was reported to approach 100% by Pontil & Verri (1998). In practice, however, only very few views of an object are available. In our experiment, in order to compare our results with those by Wang (2006), we follow the same experimental setup, which used only 4 views of each object for training and the remaining 68 views for testing. In total, this amounts to 400 images for training and 6800 images for testing. The error rate is the overall error rate over the 100 objects. The 4 training viewpoints are sampled evenly from the 72 viewpoints, which captures enough variation across viewpoints for tensor learning. For ETH-80, the aim is to discriminate between the 8 object categories. Most previous experiments using the ETH-80 dataset adopted leave-one-object-out cross-validation. The training set consists of all views from 9 objects from each category. The testing set consists of all views from the remaining object from each category. In this setting, objects in the testing set have not appeared in the training set, but those belonging to the same category have. Classification of a test image is the process of labeling the image with one of the categories. Reported results are based on the average error rate over all 80 possible test objects by Leibe & Schiele (2003b). Similar to the above, instead of taking all possible views of each object in the training set, we take only 5 views of each object as training data. By doing so, we have decreased the amount of training data to 1/8 of that used by Leibe & Schiele (2003b), Marrr et al. (2005). The testing set consists of all the views of an object. The recognition rate with the proposed scheme is compared to those of different conventional approaches by Wang (2006) and to those obtained with MSNPE analysis directly on the pixel-level intensity tensor.
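The COIL-100 protocol above can be checked with a short calculation. The sketch assumes that "sampled evenly" means training views 18 positions (90°) apart and that different runs shift the starting view by one position; the shift scheme and the names `coil_split`, `STEP` and the other constants are assumptions for illustration.

```python
N_VIEWS, N_OBJECTS, N_TRAIN = 72, 100, 4
STEP = N_VIEWS // N_TRAIN        # 18 views, i.e. 90 degrees at 5 degrees per view

def coil_split(run):
    """Training and test view indices for one run (run = starting view offset)."""
    train = [run + STEP * k for k in range(N_TRAIN)]
    test = [v for v in range(N_VIEWS) if v not in train]
    return train, test

train_views, test_views = coil_split(0)
n_train_images = N_OBJECTS * len(train_views)   # 4 views x 100 objects
n_test_images = N_OBJECTS * len(test_views)     # 68 views x 100 objects
```

With 18 possible starting offsets, every view serves as a training view in exactly one run and as a test view in the other 17. For ETH-80, the corresponding ratio is 5 of 41 views per object, which is roughly the 1/8 stated above.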
For the facial datasets, which have large illumination variation in the images, we validate that the tensor representation with the proposed NHOG is much more efficient for face recognition than that with the popular SIFT descriptor, which is only somewhat robust to small illumination variation. In the experiments on the Yale dataset, we randomly select 2, 3, 4 and 5 facial images from each individual for training and use the remainder for testing. For the CMU PIE dataset, we randomly select 5 and 10 facial images from each individual for training and use the remainder for testing. We perform 20 runs for each training number and average the recognition rates in all experiments. The recognition results of our proposed approach are compared to those of the state-of-the-art algorithms by Cai et al. (2007a), Cai et al. (2007b).

Experimental results
(1) View-based object data sets. We investigate the performance of the proposed MSNPE tensor learning compared with conventional tensor analysis such as tensor LDA by Wang (2006), which is also used in view-based object recognition, and the efficiency of the proposed tensor representation compared to the pixel-level intensity tensor, which directly considers a whole image as a tensor, on the COIL-100 and ETH-80 datasets. In these experiments, all samples are color images, and the SIFT descriptor is used for local region representation. Therefore, the pixel-level intensity tensor is a 3rd-order tensor of dimension R1 × C1 × 3, where R1 and C1 are the row and column numbers of the image, and the local descriptor tensor is of dimension 128 × K × 3, where K is the number of segmented regions of an image (here K=128). In order to compare with the state-of-the-art work by Wang (2006), the simple KNN method (k=1 in our experiments) is also used for recognition. The experimental setup was given in Sec. 5, and we did 18 runs so that every sample is used for testing. Figure 6(a) shows the compared results of MSNPE using the pixel-level tensor and the local descriptor tensor (denoted MSNPE-PL and MSNPE with the KNN classifier, and MSNPE-RF-PL and MSNPE-RF with random forest, respectively) and traditional methods by Wang (2006). From Tables 3 and 4, it is obvious that our proposed algorithm achieves the best recognition performance in almost all cases, and the improvement in recognition rate becomes greater when the number of training samples is small, compared to the conventional subspace learning methods by Cai et al. (2007a), Cai et al. (2007b), Cai (2009) and Cai (n.d.). In addition, as we have shown in the previous section, our proposed strategy can be applied not only to the recognition of faces with small variation (such as mainly frontal face databases), but also to the recognition of generic objects with large variation. On generic object datasets with large variation, the recognition rates are also improved greatly compared with using the pixel-level tensor.

Conclusion
In this paper, we proposed to represent an image as a local descriptor tensor, which combines the descriptors of local regions (K * K-pixel patches) in the image and is more efficient than the popular Bag-Of-Feature (BOF) model for combining local descriptors. At the same time, we explored a local descriptor for region representation on databases with large illumination variation, which proved to be more efficient than the popular SIFT descriptor. Furthermore, we proposed to use Multilinear Supervised Neighborhood Preserving Embedding (MSNPE) for discriminant feature extraction from the local descriptor tensors of different images, which preserves the local sample structure in feature space. We validated our proposed algorithm on different benchmark databases such as view-based and facial datasets, and the experimental results show that the recognition rate of our method is greatly improved compared with conventional subspace analysis methods.
Fig. 1. (a) Extraction of local descriptor tensor for color image representation; (b) NHOG feature extraction from a gray region.
Tensor objects $\mathcal{X}_i^c$ from $C$ classes; $\mathcal{X}_i^c$ denotes the $i$th tensor object in the $c$th class
Graph-based weights: build the nearest neighbor graph within each class and calculate the graph weight $W$ according to Eq. 9 and $D$ from $W$
Initialize: randomly initialize $U_d^r \in \mathbb{R}^{N_d}$ for $d = 1, 2, \cdots, L$
for t = 1:T (iteration steps) or until convergence do
for d = 1:L (iteration steps) do
• Calculate $D_d$ and $S_d$ assuming
Fig. 3. Sample images from view-based object data sets.
Fig. 4. (a) The compared recognition rates on COIL-100 between the proposed framework and the state-of-the-art approaches of Wang (2006). (b) Average recognition rate with different mode dimensions using the random forest classifier.
T be the covariance matrix of the $x_i$. One solves the eigenvalue equation $\lambda u_i = C u_i$ for eigenvalues $\lambda_i \ge 0$. The principal eigenspace $U$ is spanned by the first $K$ eigenvectors with the largest eigenvalues, $U = [u_i|_{i=1}^{K}]$. If $x_t$ is a new feature vector, then it is projected to the eigenspace $U$: $y_t = U^T (x_t - m)$. The vector $y_t$ is used in place of $x_t$ for representation and classification.
(2) Locality Preserving Projection: LPP seeks a linear transformation $P$ to project high-dimensional data into a low-dimensional sub-manifold that preserves the local structure of the data. Let $X = [x_1, x_2, \cdots, x_N]$ denote the set representing features of $N$ training image samples, and $Y$
Han et al. (2011) to not only extract discriminant features but also preserve the local geometrical and topological properties within the same category for recognition. The proposed approach decomposes each mode of the tensor with an objective function that considers the neighborhood relations and class labels of the training samples. Suppose we have ND tensor objects $\mathcal{X}$ from $C$ classes. The $c$th class has $n_c$ tensor objects and the total number of tensor objects is $n$. Let

Table 4. Average recognition error rates (%) on the PIE dataset with different numbers of training samples.