Cumulative Attribute Space Regression for Head Pose Estimation and Color Constancy

Two-stage Cumulative Attribute (CA) regression has been found effective in regression problems of computer vision such as facial age and crowd density estimation. The first stage regression maps input features to cumulative attributes that encode correlations between target values. Previous works have dealt with single-output regression. In this work, we propose cumulative attribute spaces for 2- and 3-output (multivariate) regression. We show how the original CA space can be generalized to multiple outputs by the Cartesian product (CartCA). However, for target spaces with more than two outputs CartCA becomes computationally infeasible, and therefore we propose an approximate solution - multi-view CA (MvCA) - where CartCA is applied to output pairs. We experimentally verify improved performance of the CartCA and MvCA spaces in 2D and 3D face pose estimation and three-output (RGB) illuminant estimation for color constancy.


Introduction
Multiple output regression predicts several continuous variables simultaneously. One of the emerging topics within regression problems is visual regression.
A straightforward solution is to learn individual regressors for each target variable separately using traditional techniques (e.g. ridge regression, random forest regression [6] and support vector regression [7]). However, independent regressors discard the interdependence between the target variables, which can be substantial in vision problems. There are more advanced approaches for multivariate regression, such as joint learning of regressors in a multi-task fashion [8] and structured learning [9], but even these generic approaches cannot effectively model cross-target correlations of visual data and are often inferior to problem-specific methods. Most of the above methods apply the traditional single-layer regression architecture, where the multivariate output is estimated either directly from image features or by optimizing a tailored score function. During recent years there have been multiple successful attempts to replace the single-layer model with two-layer (two-stage) architectures [10,11,12]. The first layer output represents an "attribute space" where attribute features have an important semantic meaning for the regression or classification task solved by the second layer.
In this work, we focus on the concept of cumulative attribute (CA) space mapping that was proposed in our previous work [12]. The main idea behind cumulative attributes is the intuitive fact that low-level features for certain vision tasks, such as age estimation or crowd counting, are cumulative by nature.
In this work, we show that this hypothesis holds for a wider class of vision problems.
Inspired by the success of CA for scalar-valued regression [13], we extend CA to the multivariate output setting. A straightforward extension is to apply CA regression to each output variable independently. This approach is the baseline in our work - Independent Cumulative Attribute space (IndepCA). The drawback of IndepCA is its limited ability to exploit the multi-dimensional nature of the target space, thus omitting the correlations of the output variables (such as the visual similarity of faces between adjacent pitch and yaw bins in Figure 1).
To overcome this limitation we generalize CA to the 2-output case by adopting a mapping based on the Cartesian product (Figure 1) - Cartesian Cumulative Attribute space (CartCA). The CartCA divides the multi-dimensional space into disjoint regions. For a landmark point anchored in a multi-dimensional target space, i.e. a single regression label, CartCA forms uniquely different binary partitions of the training samples. CartCA is a generalization of the original CA for a two-dimensional target space. The number of binary partitions grows exponentially w.r.t. the label space dimensionality, making CartCA impractical beyond two outputs.
To avoid the combinatorial explosion, we propose an approximation by projecting training samples into various 2D sub-spaces for which CartCA is applied.
We call this approach Multi-View Cumulative Attribute (MvCA) regression. In the experimental part, we study these methods in three different multivariate visual regression problems: 2D head pose estimation, 3D head pose estimation and 3D illumination (RGB) estimation for color constancy. In all experiments, our method provides competitive performance and consistently outperforms methods that do not construct a cumulative attribute space layer for regression.
Our main contributions are summarized as follows:
• We extend scalar-valued cumulative attribute (CA) regression to 2-output cumulative regression by adopting the Cartesian product to partition output spaces (CartCA).
• We propose an approximation approach for CA with ≥ 3 outputs by partitioning output spaces into multiple 2D views - Multi-view Cumulative Attribute (MvCA). This approximation avoids the exponential growth of CartCA.
• We demonstrate the effectiveness of multi-output CA regression in several computer vision applications (2D and 3D head pose estimation and RGB illumination estimation for color constancy), where CartCA and MvCA achieve competitive accuracies compared to the state of the art.

Related Work
In this section, we provide a short survey of recent and related work in visual regression and attribute learning. Since our experiments are performed on 2D and 3D targets, we also survey related works on these applications (namely, head pose estimation and color constancy estimation).
Multivariate Regression-For standard univariate regression problems in computer vision, we seek a mapping f : R^N → R, where the input x ∈ R^N corresponds to N extracted image features and the output y ∈ R is a real-valued regression target. Traditional methods include L2-regularized (ridge) regression, L1-regularized (LASSO) regression [14], random forest regression [6] and support vector regression [7], to name a few. These regression methods can be applied to multivariate regression problems f : R^N → R^D by independently learning univariate regressors f : R^N → R for each target variable y_1, y_2, . . ., y_D separately. This approach, however, omits interdependencies between output variables, and for that purpose there are other generic approaches such as jointly learning regressors in a multi-task fashion [8] or structured learning methods [9]. For example, structured multivariate regression is applied in a number of computer vision applications [15].
Mid-layer attributes have been adopted in certain recent works [10,11,16,17,18,12]. These methods learn a D1-dimensional feature representation, which is used in a two-layer learning architecture f : R^N → R^D1 → R^D or, with concatenation of features and attributes, f : R^(N+D1) → R^D. Indeed, it has been shown in many cases that the two-layer structure improves accuracy. Inspired by the success of cumulative attributes (CAs) for scalar-valued regression [13], we generalize CA to the 2-output (D = 2) and 3-output (D = 3) settings in this work. We adopt Partial Least Squares (PLS) regression [19] and NIPALS [20] for estimating the regression score (and loading) matrices due to their simplicity (for more details see Section 3.3).
Attribute Learning-Visual attributes, which can be either manually defined according to prior knowledge [17,18] or discovered from data [16,10], have been widely applied to a number of classification problems in computer vision, e.g., image categorisation [17,11], person re-identification [18], and action and video event recognition [16]. These classification problems, however, differ from regression problems since their attributes rarely exhibit a natural cumulative correlation, such as a person's age or the number of people in a scene, and often require manual annotation. Yang et al. [21] proposed correlation analysis for two-view image reconstruction.
Recently, the concept of cumulative attributes [12] was proposed for regression problems, as classification-oriented attributes cannot be utilized directly to explore the cumulative dependency across regression labels. However, CA developed for scalar-valued regression can only be applied to multivariate regression problems at the price of missing the multi-dimensional nature of the target space (IndepCA in this work).
Head Pose Estimation-In this case, the regression target is either two-dimensional (yaw and pitch angles) or 3D (+roll). The challenges reside in feature inconsistency and label ambiguity. In particular, for the same head pose, feature variations between different persons are large due to varying facial appearance. Moreover, the pose labels are noisy as the exact ground truth is difficult to acquire. As head pose estimation is challenging due to uncertain labels, it is considered a good testbed for evaluating the robustness of the proposed attributes. Recent algorithms for head pose estimation can be categorized into two groups: classification-based [22] and regression-based [23,1,15,24]. Moreover, deep architectures have been proposed for human pose recovery [25].
If the head pose estimation problem is cast as a classification problem, the implicit assumption is that pose labels are independent, which discards the ordered dependency across the label space [22]. In view of this, regression-based algorithms have recently become more popular for both 2D [26,27,15] and 3D head pose estimation [23,24].
In [27], a partial least squares regression model was adopted to cope with the misalignment problem when estimating the head pose. [26] introduced a two-layer regression framework in a coarse-to-fine manner, which first determines the range of prediction (i.e. a coarse estimate to robustify against ambiguous labels) and then learns a regression function to estimate the final pose value. Recently, Geng et al. [1] introduced the concept of soft labelling by using adjacent labels around the true pose label in a multi-label learning fashion. This reduces the negative effect of ambiguous targets and helps to capture correlations between neighbouring targets. However, soft labelling suffers from the invalid assumption that label correlations exist only locally.
On the contrary, the goal of our CartCA and MvCA is to represent the target correlations globally across the whole pose space. Beyond multivariate label distributions, regression forests [23] and their variants [15,24] have proven their effectiveness and real-time efficiency in 2D and 3D head pose estimation.
Illumination Estimation-Another experimental case in our paper considers the estimation of the illumination of color images. This is a 3-output regression problem, where the goal is to estimate the R, G and B values of the scene illumination.
Existing algorithms for illumination estimation can be categorised into two main groups: statistics-based [28,29] and learning-based [30,31,32]. In [32], a five-layer ad-hoc CNN was designed combining feature generation and multi-channel regression to estimate illumination in an end-to-end manner. Qian et al. [4] employed implicit structured output regression on the output of the fully-connected layer of VGG-Net to discover inter-output correlation.

Methodology
This section first introduces cumulative attribute (CA) regression from [12] (Sec. 3.1). Next, a two-variate generalization of CA is proposed (CartCA), followed by multi-view CA (MvCA), which is more practical for D > 2 target outputs (Sec. 3.2). In Sec. 3.3 the two-stage regression is discussed in more detail.

Cumulative Attribute Space
Consider a standard scalar-valued visual regression problem with I training examples {x_i, y_i}, where x_i ∈ R^N are N extracted image features for the image indexed by i and y_i ∈ R is the corresponding scalar target. [12] introduces a mid-level mapping to a_i ∈ R^D1, which is termed a "cumulative attribute" vector of x_i.
The main workflow is based on two-stage regression, where the first regressor provides the attribute mapping f_1 : R^N → R^D1 and the second regressor provides the target output mapping f_2 : R^D1 → R. It is noteworthy that the best performance is achieved by concatenating the original features and the estimated attribute vector in the second stage, i.e. f_2 : (x, a) → R.
During the training stage, the mid-level attribute values a_i ∈ R^D1 are generated by thresholding the regression target y_i ∈ R using the following CA rule:

a_i^j = 1 if y_i ≥ τ_j, and a_i^j = 0 otherwise, for j = 1, . . ., D1.    (1)

As an alternative to the regression-based attribute functions in our work, any two-class (binary) classifier could be trained for the attribute assignments defined in (1). However, during our experiments we have found the real-valued outputs of regressors, i.e. soft attributes, more effective. This can be explained by the fact that no information is lost to hard binary decisions and the whole pipeline remains regression based.
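As a minimal sketch (in Python, with illustrative names), the CA rule above amounts to thresholding each scalar label against a bank of landmark values:

```python
import numpy as np

def ca_labels(y, thresholds):
    """Cumulative attribute targets: a_j = 1 if y >= tau_j, else 0."""
    y = np.asarray(y, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Broadcast every label against every landmark threshold.
    return (y[:, None] >= thresholds[None, :]).astype(float)

# Age-style example with tau_j = j for j = 1..99 (D1 = 99):
taus = np.arange(1, 100)
a = ca_labels([25.0], taus)[0]
# a is a step-shaped vector: 25 ones followed by 74 zeros
```

The resulting binary vector has the cumulative "step function" shape that the first-stage regressor is trained to reproduce (with real-valued, soft outputs).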

2-and 3-output Cumulative Attribute Spaces
We will now propose three variants of generalizing the univariate case to multivariate.
IndepCA-A straightforward multivariate (D ≥ 2) extension of CA is to treat all output dimensions as independent and use the standard CA for each output variable. We denote this straightforward extension IndepCA. If, for simplicity, we assume that all D output dimensions are similar, then their corresponding cumulative attribute spaces can be represented by D1-dimensional attribute vectors. IndepCA learns a D1-dimensional attribute mapping for each of the D dimensions of the target space y_i ∈ R^D. For the final-stage regression we concatenate the D D1-dimensional attribute vectors into a single vector of length D · D1. The second-stage regressor is a multivariate regressor, or D univariate regressors, that provides the target output y_i = (y_1, y_2, . . ., y_D). More details about the practical computation are in Section 3.3.
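The IndepCA baseline can be sketched as follows (a simplified illustration, not the exact implementation: closed-form ridge regression stands in for the first-stage regressor, and all names and the shared threshold bank are assumptions):

```python
import numpy as np

def indep_ca_features(X, Y, thresholds, alpha=1.0):
    """IndepCA sketch: for each output dimension, ridge-regress the CA
    targets from the features, then concatenate the D soft attribute
    vectors into one length D * D1 representation."""
    feats = []
    for d in range(Y.shape[1]):
        # CA targets for output dimension d (I x D1 binary matrix).
        A = (Y[:, d:d+1] >= thresholds[None, :]).astype(float)
        # Closed-form ridge: one attribute function per threshold.
        W = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ A)
        feats.append(X @ W)  # soft (real-valued) attributes, not binarized
    return np.hstack(feats)
```

Each output dimension is handled in isolation, which is exactly why IndepCA cannot capture cross-output correlations.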
For scalar-valued regression, an important advantage of CA comes from its more effective use of the 1D target space than traditional regression learning settings. In particular, with all the available training samples, each attribute function in CA is trained to output either positive (i.e. one) or negative (i.e. zero) values, and a collection of such trained attribute functions, corresponding to a range of landmark points anchored in the 1D target space (e.g. integer ages), provides strong evidence for estimation of the target output. In contrast, regressors in traditional settings are trained to give a complete range of values in the target space, while regression fidelity for any specific target value is taken care of by only a (usually small) subset of training samples. This advantage of CA is particularly important for many regression problems in computer vision, such as human age estimation and crowd density estimation, which often suffer from sparse and imbalanced training data.
The aforementioned collective evidence provided by trained attribute mapping functions, and the attribute vector representation where each entry corresponds to a "landmark" (e.g. an age) in a target space, is intuitive and easy to select manually for 1D cases. However, the multivariate setting is more complex as there is no similarly unique way to divide the output space into "zeros" and "ones". We have already defined a multivariate model based on multiple CA regressors (IndepCA), but its main weakness is that it does not exploit the multi-dimensional nature of the target space in multivariate regression, i.e. the cross-correlations and interdependencies of the output variables.
CartCA-The main problem in generalizing CA to multivariate cases is how to partition D-dimensional space such that it naturally represents the cumulative nature of attributes with their mutual dependency.As a novel solution, we propose a model termed Cartesian Cumulative Attributes (CartCA).
Assume again that we have I training samples {x_i, y_i}. Considering a D-dimensional target y_i ∈ R^D, each component y^j, j = 1, 2, . . ., D, partitions the training samples into two subsets as defined in (1). Now, if this is done for all j variables and their superpositions are combined by the Cartesian product, the entries of y_i collectively partition the training samples into 2^D subsets, which we denote {S_1, . . ., S_2^D}. These subsets of training samples suggest that we can learn 2^D different attribute functions anchored at the position y in the target space.
For k = 1, . . ., 2^D, CartCA assigns attribute labels {a_i^k} to the training samples {x_i} based on the following rule:

a_i^k = 1 if x_i ∈ S_k, and a_i^k = 0 otherwise.    (2)

Consider, for example, the particular case of two-dimensional targets, i.e., D = 2. With thresholds τ_i^(1) on the first output and τ_j^(2) on the second, the above rule for constructing the 2^D (in this case 4) attribute planes becomes

a_ij^(1) = 1, when y^1 ≥ τ_i^(1) and y^2 ≥ τ_j^(2),
a_ij^(2) = 1, when y^1 ≥ τ_i^(1) and y^2 < τ_j^(2),
a_ij^(3) = 1, when y^1 < τ_i^(1) and y^2 ≥ τ_j^(2),
a_ij^(4) = 1, when y^1 < τ_i^(1) and y^2 < τ_j^(2),    (3)

and zero otherwise. For studying the complexity of CartCA and MvCA we may assume that all D1-dimensional attribute spaces are similar. In this case, there are a total of D1^2 possible anchor points per pair of output dimensions. MvCA learns 4 attribute planes associated with each of these anchor points, and there are in total D(D - 1)/2 such dimension pairs (j_1, j_2). MvCA learns attribute functions in the same way for each of the pairs, producing a total of 2 D1^2 D(D - 1) attribute planes. For D > 2, this is significantly less than the corresponding number (2 D1)^D for CartCA.
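A minimal sketch (Python, illustrative names and synthetic thresholds) of the D = 2 CartCA labelling: at each anchor point the two thresholds split the target plane into four quadrants, each giving a binary attribute plane:

```python
import numpy as np

def cartca_labels_2d(Y, taus1, taus2):
    """CartCA sketch for D = 2: at every anchor (tau1_i, tau2_j) the four
    quadrant indicators partition the samples into 2^D = 4 subsets."""
    g1 = Y[:, 0][:, None] >= taus1[None, :]   # I x D1, first output
    g2 = Y[:, 1][:, None] >= taus2[None, :]   # I x D1, second output
    planes = []
    for s1 in (g1, ~g1):
        for s2 in (g2, ~g2):
            # I x D1 x D1 plane of 0/1 attribute labels for this quadrant.
            planes.append((s1[:, :, None] & s2[:, None, :]).astype(float))
    return np.stack(planes, axis=1)  # I x 4 x D1 x D1
```

For every sample and every anchor, exactly one of the four quadrant indicators fires, so the planes form disjoint binary partitions of the training set.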
In the case that the target space of multivariate regression is two-dimensional (a plane), i.e. D = 2, CartCA and MvCA are equivalent and give the same number of attribute features. In the case D = 1, the original CA, IndepCA, CartCA and MvCA are all equivalent. There are also recent works that could be used for dimensionality reduction [33], but these are beyond the scope of this work.
The attribute vector has the following geometric interpretation:
• Attribute functions anchored exactly at the target position ŷ in the label space ideally provide an exact indication of the target of x: the attributes given by these functions form a vector 1 ∈ R^(2^D) with all entry values of 1 (any zero-valued entry in this vector indicates y ≠ ŷ). When such a group of attribute functions is not available, attribute functions anchored at neighboring positions of ŷ form polytopes in the target space, which provide different levels of refined position information for the estimation of y.
• Based on different (and unique) binary partitions of the target space, other attribute functions provide different half-space constraints for the estimation of y.When these attributes are concatenated to the vector a CartCA , they collectively provide rich (and redundant) information for the estimation of y.
An illustration of the above geometric interpretation is presented in Figure 2. In summary, CartCA (or MvCA) encodes in the attribute vector a_CartCA (or a_MvCA) strong information about the underlying position of any test sample in the target space, which can be exploited for final label estimation.

Two-stage Regression
Given training samples {x_i, y_i} with input features x_i ∈ R^N and output target vector y_i ∈ R^D, we construct the training attribute targets a_i ∈ R^D1 based on the attribute construction rules in the previous sections.
To this end, we employ Partial Least Squares (PLS) regression [19] for its capability to cope with the multicollinearity problem, its simplicity of implementation and its computational efficiency; it has recently been applied to a number of visual regression problems [27?]. A typical solution for estimating the score (and loading) matrices is NIPALS [20], which we adopt for its low computational complexity (O(N^2)). Alternatively, other multivariate regression models can also be employed, such as multivariate ridge regression [12] and regression forests [6]. PLS learns a mapping function f : R^N → R^D1 from training data, which is used to estimate an attribute feature vector ã ∈ R^D1 for an unseen test sample x, and is the first-stage regressor in the proposed CartCA and MvCA regression methods.
To perform the second-stage target estimation, we first estimate ã_i = f(x_i) and then concatenate x_i with ã_i. The concatenated vectors are used as the training data for the second-stage multivariate regression. To learn a mapping function from the concatenated feature space to the multivariate target space, we adopt a few recent state-of-the-art methods, e.g. KPLS [27], KRF [15] and MLD [1], and compare them in our experiments. Our use of the existing methods is mainly to verify the effectiveness of our proposed CartCA and MvCA attribute features by removing contributions from other factors.

Experiments
In the following, the proposed multi-output cumulative attribute space regression methods, IndepCA, CartCA and MvCA, are evaluated in multiple vision problems: 2D head pose estimation (2 pose angles), 3D head pose estimation (3 pose angles) and illumination estimation for color constancy (3 color correction terms for the red, green and blue channels).

Datasets and Settings

Datasets-For 2D head pose estimation, we used the popular Pointing'04 benchmark dataset [34], which contains face images of 15 persons captured with varying appearance in a controlled indoor environment. For 3D head pose estimation, we used the Biwi Kinect Head Pose Estimation dataset [35], which contains depth images of 20 persons. As a visual regression problem distinct from head pose estimation, we also evaluated our model with two illumination estimation datasets [30,36] where an illuminant tri-stimulus value (Red, Green, Blue) is estimated to correct a color-biased input image. The SFU Indoor dataset [36] contains 321 images captured under 11 different controlled lighting conditions. The SFU Color Checker dataset [30] contains 568 12-bit dynamic range images which all include the Macbeth Color Checker chart as ground truth. Details of the datasets are given in Table 1.
Features-For 2D head pose estimation, after cropping the face foreground with manually annotated bounding boxes, the facial images are normalized to 32 × 32 pixels, from which we extract a 2511-dimensional histogram of oriented gradients (HoG) feature vector [37], widely employed in recent works [26,1,27,15]. Encouraged by the significant advances with Convolutional Neural Networks (CNNs) in facial recognition [38], we also extract CNN features from the "fc6" layer of the pre-trained 16-layer VGG-Net model [39].
For 3D head pose estimation, we first remove the background using the provided foreground masks by cropping a 96 × 96 facial region anchored at the center of the foreground masks. The cropped facial patches are then resized to 32 × 32 pixels. Inspired by the features used in [23,24], the depth value of each pixel in the 32 × 32 patches was used as a low-level feature, after which the non-zero pixel intensities (i.e. depth distances) were normalized into [0, 1].
Finally, for the illumination estimation problem, we used the pre-trained 19-layer VGG-Net without fine-tuning as described in [4]. For both the SFU Indoor and Color Checker datasets, we follow the settings in [4] to extract 4096-dimensional CNN "fc6" features from images resized to 224 × 224.
Settings-For the Pointing'04 dataset, two experiments were conducted according to the data split settings. In the first experiment, we followed the same training and testing partition as [26,1,27,15], i.e. five-fold cross-validation. An alternative setting, i.e. two image sequences of the same person evenly split into training and testing data, was adopted for the second experiment as in [15]. For the Biwi Kinect dataset, two experiments were conducted by 1) dividing the data into a training part containing the images of the first 18 persons and a testing part with the remaining images [23,24] and 2) adopting five-fold cross-validation [23], respectively. For the SFU Indoor and Color Checker datasets, we followed the standard 3-fold cross-validation protocol in [32,4,29,40,41,31].
Comparative Methods-We collected most of the results of competing approaches from the corresponding papers. For the ablation study with the 2D dataset we implemented several state-of-the-art methods, including linear/kernel partial least squares regression (PLS/KPLS) [27], k-cluster regression forests (KRF) [15], and multivariate label distribution learning (MLD) [1].
For 3D head pose estimation, we adopted standard regression forests (RF) [6] for the second layer multi-variate regression model owing to its strong performance in recent works [23,24].
For illumination estimation, we implemented the comparative multi-output support vector regression [4] in light of its competitive performance. The number of factors for PLS and KPLS (with RBF kernel) is 25 and 40, respectively.
For KRF, we followed the settings in [15]: the minimal size of each leaf node is 5 and we grew 20 regression trees. Following [1], MLD adopts the weighted Jeffrey's divergence and a two-dimensional Gaussian distribution with the finest head pose granularity µ = 15. The regression forests for 3D head pose estimation have a minimum sample size of 5 in each leaf node and 20 regression trees.
For illumination estimation, we used multi-output support vector regression (MSVR) [4] with the RBF kernel.Trade-off parameter C and γ of the RBF kernel were tuned by three-fold cross-validation.
We adopted the class labels to generate CartCA for 2D head pose estimation, while 3D head pose angles rounded to the nearest integers were employed to generate CartCA and MvCA. For illumination estimation, we first normalised the ground-truth illuminations into [0, 255] levels, which were quantised into 64 ordered (cumulatively changing) bins. The class label of each bin on each colour channel was adopted to generate CartCA and MvCA.
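One plausible reading of this binning step can be sketched as follows (a hedged illustration: the exact bin edges used by the authors are not specified, so uniform bins over [0, 256) are an assumption here):

```python
import numpy as np

def illum_bins(I, n_bins=64):
    """Quantise normalised illuminant values in [0, 255] into n_bins
    ordered bins per colour channel (uniform bin edges assumed)."""
    I = np.asarray(I, dtype=float)
    # Map [0, 255] -> bin index 0 .. n_bins - 1.
    return np.clip((I / 256.0 * n_bins).astype(int), 0, n_bins - 1)
```

The resulting ordered bin labels play the same role as integer pose angles: they supply the landmark positions at which the CartCA/MvCA attribute functions are anchored.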
Performance Metrics-For evaluating head pose estimation, we employed two types of performance metrics, i.e. the regression metric Mean Absolute Error (MAE) and a classification metric. Considering the different label characteristics (i.e. integer angles in the Pointing'04 dataset and scalar values in the Biwi Kinect dataset), we report the classification accuracy of predicted poses with respect to the ground truth [1] for 2D head pose estimation and the Cumulative Score (CS) defined in [42] for 3D head pose estimation, respectively. Following [30,36], for illumination estimation we measured the angular error (cosine distance) ε between the estimated illumination I ∈ R^3 and the ground truth I_gt ∈ R^3:

ε(I, I_gt) = arccos( (I · I_gt) / (‖I‖ ‖I_gt‖) ),

where ‖·‖ denotes the Euclidean norm. We report the median and mean of ε(I, I_gt) over all test samples.
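The standard recovery angular error can be computed as in the following helper (a small sketch in degrees; the clipping guards against floating-point values slightly outside [-1, 1]):

```python
import numpy as np

def angular_error(I, I_gt):
    """Angular error between estimated and ground-truth illuminants."""
    I, I_gt = np.asarray(I, float), np.asarray(I_gt, float)
    cos = np.dot(I, I_gt) / (np.linalg.norm(I) * np.linalg.norm(I_gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Because the error depends only on the direction of the illuminant vector, the estimate is invariant to its overall brightness.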

Comparative Evaluation
2D Head Pose Estimation-We compared our IndepCA, CartCA and MvCA with a number of recent methods on the Pointing'04 dataset. The results of these experiments are shown in Table 2. Among the methods, PLS [27], KPLS [27], KRF [15] and MLD [1] use identical HoG and VGG-Net features to our approach. Since our models can use any general-purpose regressor, we selected MLD as it performed well both in the original paper and in our experiments. Interestingly, our multivariate baseline IndepCA is on par with the existing methods using traditional features (HoG) and clearly superior with the deep CNN features. However, in both cases the proposed CartCA/MvCA is more accurate.
In order to further assess the significance of the feature set, we also fine-tuned the VGG-Net end-to-end in the same evaluation setting. It can be seen that in most cases the end-to-end network is inferior to the proposed approach. The network is able to predict the pitch (vertical) angle better than the alternative methods, but performs poorly on yaw angle prediction, rendering the yaw+pitch metric inferior as well. The inferior performance in horizontal angle prediction may be due to the larger number of classes in this direction (13 yaw angles vs. 7+2 pitch angles), which decreases the number of training samples per class and causes the network to overfit to the relatively small training set.
Finally, in order to assess the general suitability of a CNN for multivariate regression problems, we also considered using the original VGG-Net features with a neural network classifier. More specifically, we trained the described network architecture with frozen convolutional layers, forcing the network to use exactly the same features as the other methods. The results are discouraging, as the errors are up to three times higher than the best ones in Table 2. This is an indication that a plain dense neural network may not be ideal for multivariate regression tasks (note, however, the successful results in related tasks with e.g. an autoencoder structure [25]), and even better results could be obtained by coupling the fine-tuned convolutional pipeline with the proposed CartCA/MvCA.
3D Head Pose Estimation-Two experiments were conducted using different settings for data splitting, and the results are in Table 3.
Illumination Estimation-Our method achieves the best performance on both performance metrics on the SFU Indoor dataset, and our result is comparable to the state of the art on the SFU Color Checker. It is noteworthy that our results are always better than MSVR [4], which uses identical deep features. Again, IndepCA performed well and MvCA was the best of the three proposed methods.
Computational Cost-The additional complexity of the proposed CA models stems from the mid-layer representation, the attribute vector, for which two regressors need to be trained. In traditional visual regression there is a single regressor which maps N input variables to D output variables. The computational complexities (sizes of the attribute vectors) and the actual numbers for the three problems are shown in Table 5.

Ablation Study
CA Mapping-In order to validate the claim that the proposed Cartesian cumulative attribute multivariate regression (CartCA) and its multi-view projection based approximation (MvCA) provide an accuracy improvement over the straightforward IndepCA, we conducted an ablation study where the different CA spaces were compared using different regressors but with the same visual features. The results are shown in Table 6. In all cases the higher-dimensional CA spaces provided superior accuracy. However, this finding is most evident with more traditional regressors such as KPLS [27]. The more advanced regressors, such as KRF [15] and MLD [1], exploit output correlations more efficiently, and therefore the differences between IndepCA and CartCA/MvCA are less significant.
Concatenating with Imagery Features-During the experiments, we found that the best performance was achieved by concatenating the original imagery features and cumulative attributes for the second-stage regression. In this experiment this finding was verified with both the face pose and color constancy datasets. The results are shown in Table 7, which clearly indicates that concatenation provides a small but systematic improvement in all cases.

Conclusions
In this work, we investigated Cumulative Attribute space regression, which has been found effective in many computer vision regression problems. In particular, we studied how correlations in the target label space can be exploited for improved accuracy. To this aim, we extended CA to 2-output and 3-output regression problems.
In the experimental section we compared the proposed methodology with state-of-the-art deep neural networks. It is noteworthy that the CNN does not excel in this domain, unlike in most areas of machine learning today. This is likely due to the small training sample sizes as well as the challenges in encoding regression problems for neural networks. This highlights the key benefit of cumulative attributes: they divide the regression problem into a number of binary classification problems, which increases the amount of data available for each task by several orders of magnitude.
Our future work will address higher dimensional generalizations of CartCA and MvCA and their applications in general multivariate regression.Moreover, integrating the idea of (multivariate) cumulative attributes with state-of-the-art classifiers-deep neural networks-would bring together the best of both worlds: data-hungry but accurate deep learning and economical cumulative attribute models.

Figure 1: Cartesian Cumulative Attribute space (CartCA) for 2-output regression. CA-based regression has three processing stages: i) feature extraction; ii) mapping from the feature space to the Cumulative Attribute space (attribute learning); and iii) mapping from the CA space to a two-dimensional output space (target regression: head yaw and pitch angles).

for j = 1, 2, ..., D1. In other words, the regression problem is decomposed into D1 binary classification problems by thresholding the target at τj. The dimension of the attribute space, D1, and the corresponding thresholds are problem specific; for example, in age estimation an obvious choice is τ1 = 1, τ2 = 2, ..., τ99 = 99 with D1 = 99. The attribute mapping f1 is learned using ridge regression; that is, we learn D1 attribute functions corresponding to the D1 mid-level binary targets. Ideally the mapping should resemble a step function with the change located at the true target value, but the estimated attributes âi are real-valued vectors that are not binarized; they are used directly in the next-stage regressor f2. Thus binary values are used only during training, while at test time the real-valued cumulative attributes are fed to the final regressor.
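The single-output construction above can be sketched in a few lines; the toy feature matrix and the closed-form ridge solver below are illustrative stand-ins, not the paper's actual pipeline:

```python
import numpy as np

def cumulative_attribute_labels(y, thresholds):
    """Encode scalar targets y as cumulative binary attributes:
    a_ij = 1 if y_i > tau_j, else 0 (one binary task per threshold)."""
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    taus = np.asarray(thresholds, dtype=float).reshape(1, -1)
    return (y > taus).astype(float)

def fit_ridge(X, A, lam=1.0):
    """Closed-form ridge regression W = (X^T X + lam I)^-1 X^T A,
    learning all D1 attribute functions jointly (one column per tau_j)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ A)

# Toy age-estimation-like example with tau_1 = 1, ..., tau_99 = 99 (D1 = 99).
rng = np.random.default_rng(0)
y = rng.uniform(1, 99, size=200)
X = np.column_stack([y + rng.normal(0, 2, 200), rng.normal(0, 1, 200)])
taus = np.arange(1, 100)
A = cumulative_attribute_labels(y, taus)   # binary labels, training only
W = fit_ridge(X, A)
A_hat = X @ W                              # real-valued attributes for stage two
```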
are set similarly to the original CA and have a clear semantic meaning. For a training example, the two-dimensional output sets an anchor point that partitions the 4D attribute tensor. An illustration of the above attribute label assignment rule is shown in Figure 1, where the goal is to estimate the head pose yaw and pitch angles. MvCA-One may notice that the number of attributes in CartCA increases exponentially with the dimensionality of the target space, which makes learning CartCA impractical for high-dimensional target spaces with small amounts of data. In our experiments we found CartCA impractical for D > 2. As a remedy, we propose an approximate CartCA termed Multi-view Cumulative Attributes (MvCA). The MvCA attribute construction rule is based on the CartCA of (2), which is still practical for D = 2. More specifically, for training samples {xi, yi} in the D-dimensional target space, we first select an output dimension pair (j1, j2) with j1, j2 ∈ {1, ..., D}, j1 ≠ j2, and project all the training samples into this CartCA subspace. For a fixed anchor point yi,{j1,j2} ∈ R2 in the CartCA sub-space, its entries partition the output space into 4 subsets (like those of Figure 1), based on which MvCA uses 4 different "attribute planes" following the rules in (3).
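A sketch of the pairwise construction follows. Since the exact label rules (2) and (3) appear earlier in the paper, the code encodes one plausible reading: each threshold-grid cell receives a one-hot code over the four quadrants induced by the anchor point, and MvCA concatenates these attributes over all output-dimension pairs. All names and the quadrant encoding are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def cartca_labels(y2d, taus1, taus2):
    """2-output CartCA sketch: for each sample the anchor (y1, y2) splits the
    threshold grid into 4 quadrants; each grid cell gets a one-hot code over
    4 'attribute planes' (an assumed reading of the paper's rule (3))."""
    taus1, taus2 = np.asarray(taus1, float), np.asarray(taus2, float)
    n = len(y2d)
    A = np.zeros((n, 4, len(taus1), len(taus2)))
    for i, (y1, y2) in enumerate(np.asarray(y2d, float)):
        left = taus1 < y1            # thresholds below the anchor in dim 1
        below = taus2 < y2           # thresholds below the anchor in dim 2
        quadrant = 2 * (~left[:, None]) + (~below[None, :])   # index 0..3
        for p in range(4):
            A[i, p] = (quadrant == p)
    return A.reshape(n, -1)          # flatten the planes to one attribute vector

def mvca_labels(yD, taus_list):
    """MvCA sketch: apply the 2-output CartCA to every output-dimension pair
    and concatenate the resulting attribute vectors."""
    yD = np.asarray(yD, float)
    parts = [cartca_labels(yD[:, [j1, j2]], taus_list[j1], taus_list[j2])
             for j1, j2 in combinations(range(yD.shape[1]), 2)]
    return np.concatenate(parts, axis=1)
```

For D = 3 (e.g. yaw, pitch, roll) this yields three pairwise CartCA blocks instead of a full 3D Cartesian grid, which is the source of the claimed reduction in attribute count.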

Figure 2: Geometric intuition of the proposed Cartesian Cumulative Attributes. Attribute functions/hyperplanes (blue lines) form polytopes in the target space, which provide different levels of indicative position information on the target (dark star point) of a test sample. In the weaker form, certain attributes provide half-space constraints (red lines) on the target of the test sample.
used the VGG convolutional pipeline, with two output layers in place of the original 1000-class output layer. The parallel output layers predict the yaw and pitch angles, encoded as two independent classification problems. The network was trained using the negative log-likelihood loss with softmax activations, separately for the yaw and pitch targets. Moreover, we tested alternative network structures (the ResNet50 base network) as well as alternative target encodings. The best results were clearly obtained with the VGG-Net structure and the classification encoding (each yaw and pitch angle is one class) rather than the regression encoding (two output layers with linear activations directly predicting the yaw and pitch angles).
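The two-head classification encoding can be sketched as below. The tiny convolutional trunk and the bin counts are illustrative stand-ins (the paper uses the full VGG pipeline); only the structure of two parallel softmax/cross-entropy heads reflects the description above.

```python
import torch
import torch.nn as nn

class TwoHeadPoseNet(nn.Module):
    """Shared trunk with two parallel classification heads: one softmax
    over yaw bins, one over pitch bins (assumed bin counts for illustration)."""
    def __init__(self, n_yaw=13, n_pitch=9):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 16, 32), nn.ReLU())
        self.yaw_head = nn.Linear(32, n_yaw)
        self.pitch_head = nn.Linear(32, n_pitch)

    def forward(self, x):
        h = self.trunk(x)
        return self.yaw_head(h), self.pitch_head(h)

net = TwoHeadPoseNet()
yaw_logits, pitch_logits = net(torch.zeros(2, 3, 32, 32))
# Each target is trained with its own cross-entropy (softmax + NLL).
loss = (nn.functional.cross_entropy(yaw_logits, torch.tensor([0, 1]))
        + nn.functional.cross_entropy(pitch_logits, torch.tensor([2, 3])))
```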


Table 1: Details of the datasets used in the experiments. D(i) = range of the i-th output dimension (2D face pose: yaw, pitch; 3D face pose: + roll; color constancy: color corrections cR, cG, cB).

Table 4: Comparison with state-of-the-art on color constancy with the SFU Indoor and Color

Table 5: The CA space sizes for the proposed models. Note that only CartCA and MvCA can represent cross-correlations between the output dimensions.

Table 6: Comparison of the proposed CA spaces with various regressors for the second regression stage. Results correspond to the Yaw+Pitch MAE and classification accuracies on the Pointing'04 benchmark.

Table 7: The proposed CA spaces with (+x) vs. without the original input features concatenated in the second stage regression.