FUSION OF HYPERSPECTRAL AND VHR MULTISPECTRAL IMAGE CLASSIFICATIONS IN URBAN AREAS

An energy-based approach is proposed for classification decision fusion in urban areas using multispectral and hyperspectral imagery at distinct spatial resolutions. Hyperspectral data provides a strong ability to discriminate land-cover classes, while multispectral data, usually at a higher spatial resolution, enables a more accurate spatial delineation of the classes. Hence, the aim here is to achieve the most accurate classification maps by taking advantage of both data sources at the decision level: the spectral properties of the hyperspectral data and the geometric resolution of the multispectral images. More specifically, the proposed method takes class membership probability maps into account in order to improve the classification fusion process. Such probability maps are available from standard classification techniques such as Random Forests or Support Vector Machines. They are integrated into an energy framework in which minimizing a given energy leads to better classification maps. The energy is minimized using a graph-cut method, quadratic pseudo-boolean optimization (QPBO), with α-expansion. A first model is proposed that gives satisfactory results in terms of classification accuracy and visual interpretation. This model is compared to a standard Potts model adapted to the considered problem. Finally, the model is enhanced by integrating the spatial contrast observed in the data source of higher spatial resolution (i.e., the multispectral image). Results obtained with the proposed energy-based decision fusion process are shown on two urban multispectral/hyperspectral datasets. A 2-3% improvement is observed with respect to a Potts formulation, and a 3-8% improvement compared to a single hyperspectral-based classification.


INTRODUCTION
In land-cover mapping, optical images with a very high spatial resolution (<2 m) offer a slight increase in classification accuracy with respect to 2-10 m imagery (Khatami et al., 2016). This can be attributed to the significant amount of geometric detail present in the scenes. However, it also often comes with two main drawbacks. First, VHR data increases the spectral variability within each land-cover class and decreases it between classes. Secondly, the spectral resolution of VHR data is often limited to three or four spectral bands (red, green, blue, infrared). Consequently, it appears relevant to merge such multispectral (MS) VHR images with hyperspectral (HS) data. The latter gives a precise description of the spectral information but with a low geometric precision. Hence, HS data allows reliable land-cover classification results, while the MS image helps retrieve the geometric contours of the classes. Combining these two data sources may help reach better classification results at the highest spatial resolution of the two datasets. Combining data with different dimensionalities, spectral and spatial resolutions is a standard remote sensing problem that has been extensively investigated in the literature (Chavez, 1991). This issue has been exacerbated in recent years with the emergence of new optical sensors with various spatial and spectral configurations, none of which strictly outperforms all the others. Remote sensing is now inherently multi-modal. The possibility to acquire images of the same area with different sensors has led scientific research to focus on fusing multisensor information as a means of combining the comparative advantages of each sensor. Complementary observations can thus be exploited for land-cover mapping purposes, and combining existing observations can mitigate the limitations of any one particular sensor, in particular for land-cover issues (Gamba, 2014; Joshi et al., 2016). Fusion can be carried out at three different
levels. First, it can be achieved at the observation level. For that purpose, pan-sharpening is a well-known technique that integrates the geometric details of a high-resolution panchromatic (PAN) image and the color information of a low-resolution MS image to produce a high-resolution MS image. Pan-sharpening methods usually use the PAN image to replace the high-frequency part of the MS image (Carper, 1990). Other fusion algorithms have been proposed to merge MS and PAN images so as to combine their complementary characteristics in terms of spatial and spectral resolutions (Loncan, 2016). Secondly, data sources can be merged at the feature level: attributes are computed for each source separately but are fed into the same classifier through a unique feature set. Thirdly, decision fusion can be performed (Benediktsson and Kanellopoulos, 1999), i.e., the outputs of multiple independent classifiers are combined in order to provide a more reliable decision (Aitkenhead and Aalders, 2011; Huang and Zhang, 2012). Several types of fusion methods have been proposed, i.e., probabilistic, fuzzy and possibilistic fusion (Fauvel et al., 2006), as well as evidence theory (Tupin, 2014). This latter type of fusion is the most popular nowadays and was developed by Dempster and Shafer (Shafer, 1976). This general framework for reasoning with uncertainty relies on the use of belief functions and makes it possible to combine evidence from different observations to reach a certain degree of belief. Though efficient in some cases, it is a theoretically complex framework that does not apply easily when dealing with heterogeneous and multiple data. Another effective solution is to feed the probabilistic outputs of several mono-source classifiers as a feature set to a classifier (Ceamanos et al., 2010).
In this paper, a fusion technique at the decision level is presented. Classification results are obtained from the MS and HS images separately. The specific aim here is to use the class membership probability maps, an additional piece of information given by standard classification algorithms. This information is used within a generic energy-based framework, so as to be able to generalize the process to classifications obtained from several data sources at various spatial resolutions with complementary advantages (optical images, lidar and radar). At present, the work has focused on spatial fusion, and no advanced method (e.g., based on evidence theory) was used to manage data uncertainty. Hence, the goal was to propose a simple and adaptive method that can further be generalized to easily integrate several types of data with diverse geometric, spectral and temporal resolutions. It can thus also be used when pan-sharpening methods are not suitable. The model presented in this paper is based upon graph-cut algorithms, which are well known and have been widely used in the computer vision and image processing communities. These techniques rely on the definition of an energy composed of a data term and a regularization term, and the aim is to minimize this energy to obtain the desired result. Such techniques are popular for their simplicity of use and their flexibility, while giving satisfying results within suitable computing times in a wide range of application fields such as recognition, segmentation or 3D data reconstruction (Szeliski, 2010). They were therefore used here to integrate a usually unexploited piece of information (i.e., the class membership probability maps) in order to propose an efficient fusion of classification results obtained on any pair of MS and HS images.

METHODS
The core of the fusion method is based upon a specific piece of information: the class membership probability for each pixel of the images. Thus, this section first briefly reviews some classification techniques providing such an output. The rest of the section is dedicated to the description of the fusion method.

Classification algorithms
Most classification approaches provide probability values for each class of interest as an output, notably the Random Forests (RF) method (Breiman, 2001) and Support Vector Machines (SVM) (Schölkopf, 2002). The RF technique natively provides posterior class probabilities as an output, whereas additional steps are required for SVMs (Platt, 2000). For classification purposes, SVMs usually give slightly better results than RFs, but with longer computation times. Since processing time is not an issue here, our preference was given to SVMs, and more precisely to a Gaussian-kernel SVM. Given a set of training pixels for each class, the SVM learns a model and assigns new pixels to one of the considered classes. As mentioned before, the SVM also provides posterior class probabilities, retrieved with Platt's technique: a probability P(Ck|u) is available ∀u ∈ I, where u is a pixel of the image I and Ck is one of the k classes of interest. The SVM classification process is applied to both the VHR MS image and the HS image, as depicted in Figure 1, which presents VHR MS and HS images and their corresponding classifications (in grey levels). It must be underlined that in our experiments, VHR MS images were sometimes replaced by panchromatic ones so as to have a more challenging problem. All SVM classifications considered in this article were obtained using sets of 100 randomly selected pixels per class within the ground truth data (see Section 3.1 for more details).
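As an illustration, the lines below sketch how such per-class probability maps can be obtained with scikit-learn, whose `SVC` class exposes Platt-scaled posteriors through `predict_proba` when `probability=True`. The toy spectra, image size and band count are arbitrary stand-ins for demonstration, not the datasets used in this paper.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for labeled training pixels: rows are pixels, columns are
# spectral bands (4 bands here), with one integer class label per pixel.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(c, 0.3, size=(100, 4)) for c in range(3)])
y_train = np.repeat(np.arange(3), 100)

# Gaussian-kernel SVM; probability=True enables Platt scaling so that
# predict_proba returns posterior class probabilities P(Ck | u).
svm = SVC(kernel="rbf", probability=True, random_state=0)
svm.fit(X_train, y_train)

# Classify an (H x W) image: flatten the pixels, predict, reshape to maps.
H, W = 8, 8
image = rng.normal(1.0, 0.5, size=(H, W, 4))
proba = svm.predict_proba(image.reshape(-1, 4))   # shape (H*W, K)
proba_maps = proba.reshape(H, W, -1)              # one probability map per class
label_map = proba_maps.argmax(axis=2)             # hard classification
```

The probability maps, and not only the hard labels, are what the fusion energy below consumes.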

Basic model
In this section, we present the basic model for MS and HS image classification fusion using posterior class probabilities. It is based upon the definition of an energy that will further be minimized. The energy is composed of two terms: a data attachment term Edata and a regularization term Eregul. This kind of formulation is well known in the image processing domain (Kolmogorov and Zabih, 2004) and has been successfully used for various applications. The energy model is a probabilistic graph taking into account the posterior class probabilities PMS and PHS from the MS image IMS and the HS image IHS, respectively, as well as the MS classification CMS. In order to get probability maps with identical sizes, IHS was first upsampled to the size of IMS, meaning that each pixel of the VHR MS image has a corresponding pixel in the HS image. For a classification map C, our basic model defines the energy E as:

E(C) = Σu∈I Edata(C(u)) + λ Σ(u,v)∈N Eregul(C(u), C(v)),

where λ is the tradeoff parameter between both terms and N is the 8-connexity neighborhood. Edata is a function of the probability map PHS, since IHS is the data containing the most discriminative information for classification. If PHS(C(u)) is high, the pixel u is likely to belong to class C(u) and Edata(C(u)) will be small.
Eregul models the relationship between a pixel and its neighbors.
The more a pixel u and its neighbors v ∈ N correspond to the desired model, the smaller Eregul. The full model E(C) thus expresses how well the classification fits the probability map PHS and how well neighboring pixels follow the model defined by Eregul.
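The overall structure of this energy can be sketched as follows. The helper names and the simple Potts pairwise term used for the demonstration are illustrative choices for the sketch, not the authors' implementation; the data term anticipates the definition given in the next paragraph.

```python
import numpy as np

def total_energy(C, P_HS, regul, lam):
    """E(C) = sum_u Edata(C(u)) + lam * sum_{(u,v) in N} Eregul(C(u), C(v)).

    C     : (H, W) integer label map
    P_HS  : (H, W, K) posterior probability maps from the HS classifier
    regul : callable giving the pairwise term for one pixel pair
    lam   : tradeoff between data attachment and regularization
    """
    H, W = C.shape
    # Data attachment: Edata(C(u)) = 1 - P_HS(u, C(u)).
    rows, cols = np.indices((H, W))
    e_data = (1.0 - P_HS[rows, cols, C]).sum()

    # Regularization over the 8-connexity neighborhood (each pair counted once).
    e_regul = 0.0
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]   # half of the 8 neighbors
    for du, dv in offsets:
        for r in range(H):
            for c in range(W):
                r2, c2 = r + du, c + dv
                if 0 <= r2 < H and 0 <= c2 < W:
                    e_regul += regul((r, c), (r2, c2), C[r, c], C[r2, c2])
    return e_data + lam * e_regul

# Potts pairwise term as the simplest example of `regul`.
potts = lambda u, v, cu, cv: 0.0 if cu == cv else 1.0

P = np.full((4, 4, 2), 0.5)            # uninformative HS probabilities
uniform = np.zeros((4, 4), dtype=int)  # constant labeling
checker = np.indices((4, 4)).sum(0) % 2
```

With uninformative probabilities, a constant labeling incurs no Potts penalty while a checkerboard pays one unit per discordant neighbor pair, so the former has the lower energy.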

Data term:
Edata is the data attachment term and is defined by a function f such that:

Edata(C(u)) = f(PHS(C(u))) = 1 − PHS(C(u)).

The function f ensures that if the probability for a pixel u to belong to the class C(u) is close to 1, Edata will be close to 0 and will not impact the total energy E. Conversely, if the pixel u is not likely to belong to class C(u), PHS(C(u)) will be close to 0 and the data attachment term will be close to its maximum for a pixel, i.e., 1.

Regularization term:
The regularization term Eregul defines the interaction between a given pixel and its neighbors. For a pixel u and its neighbors v in N (the 8-connexity neighborhood), four cases have to be considered regarding the values C(u), C(v) and CMS(u); they are detailed hereafter. The regularization term has two roles. The first one is to smooth the results by favoring neighboring pixels belonging to the same class, i.e., when C(u) = C(v). This is the basic idea of the Potts model (Schindler, 2012), where the regularization term is simply defined by Eregul(C(u), C(v)) = 0 if C(u) = C(v), and 1 otherwise. The second role of Eregul is to take into account the classification CMS of the VHR MS image and, more specifically, the probability PMS(CMS(u)) associated with the most probable class CMS(u). Thus, in the first case, if a pixel u and one of its neighbors v are assigned to the same class and if this class also corresponds to the most probable class CMS(u) given by the SVM on the MS image (i.e., if C(u) = C(v) = CMS(u)), the regularization term is null. Indeed, this is the "ideal" configuration, where the smoothing criterion is satisfied and the classification C(u) matches the class CMS(u).
In the second case, C(u) = C(v) ≠ CMS(u), the smoothing criterion is satisfied but the class C(u) does not match the most probable MS class CMS(u). In this case, the regularization term is a function of PMS(CMS(u)). If PMS(CMS(u)) is high, Eregul is also high, since the likelihood of pixel u belonging to class CMS(u) is strong.
In the third case, C(u) = CMS(u) ≠ C(v), the classification C(u) matches the class CMS(u) but the smoothing criterion is not satisfied. If PMS(CMS(u)) is close to 1, which means that the SVM on the MS image is confident that pixel u belongs to class CMS(u), then Eregul is small and the configuration is favored. Conversely, if PMS(CMS(u)) is close to 0, Eregul will be high (close to 1) and the configuration is more likely to be dismissed. The last case is the one where C(v) ≠ C(u) ≠ CMS(u): the class C(u) matches neither CMS(u) nor C(v), so the smoothing criterion is not satisfied either. It is the worst case, and the regularization term is set to its maximum value, i.e., 1.
The parameter β ∈ [0, ∞[ is a tradeoff parameter between the smoothing criterion and the importance of CMS in the model. If β is high, the smoothing criterion is predominant and the model comes close to a Potts model. Conversely, if β is low, the model will tend to follow the classification given by CMS. A property that will be used later for parameter selection is that, when β → ∞, the proposed model becomes a Potts model.
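The four cases and the role of β can be made concrete in a few lines. The closed-form expressions below are one plausible instantiation chosen to match the qualitative behavior stated above (case values, monotonicity in PMS, and the β → ∞ Potts limit); they are an assumption for illustration, not the authors' published formulas.

```python
def e_regul_basic(cu, cv, cms_u, p_ms_u, beta):
    """Pairwise term Eregul for neighbors u, v in the basic model.

    cu, cv : candidate labels C(u) and C(v)
    cms_u  : most probable MS class CMS(u) at pixel u
    p_ms_u : PMS(CMS(u)), the MS posterior for that class at pixel u
    beta   : tradeoff between the smoothing criterion and trust in CMS

    Hypothetical expressions consistent with the four described cases.
    """
    if cu == cv:
        if cu == cms_u:
            return 0.0                       # case 1: ideal configuration
        return p_ms_u / (1.0 + beta)         # case 2: smooth but contradicts CMS
    if cu == cms_u:
        # case 3: matches CMS(u) but breaks smoothness; cheap when PMS is high
        return (1.0 - p_ms_u + beta) / (1.0 + beta)
    return 1.0                               # case 4: worst configuration
```

As β grows, case 2 tends to 0 and case 3 tends to 1, so the term degenerates into the Potts model exactly as stated in the text.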

Energy minimization:
The minimization of the energy is performed using a quadratic pseudo-boolean optimization (QPBO) method. This is a classical graph-cut method that builds a graph where each pixel is a node; the minimization is computed by finding the minimal cut (Kolmogorov and Rother, 2007). QPBO handles binary labeling problems; the extension to the multi-class problem is performed using an α-expansion routine (Kolmogorov and Zabih, 2004).
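The outer α-expansion loop can be sketched as follows. In the paper each binary expansion move is solved exactly with QPBO/graph cuts; to keep this sketch dependency-free, each move is instead approximated by greedy per-pixel sweeps (an ICM-style update), which still never increases the energy. A Potts pairwise term stands in for the full regularization term.

```python
import numpy as np

def alpha_expansion(unary, lam, n_iters=5):
    """Approximate multi-label minimization of
        E(C) = sum_u unary[u, C(u)] + lam * sum_{(u,v) in N} [C(u) != C(v)]
    by cycling expansion moves over the labels (8-connexity neighborhood).
    The binary subproblem of each move is approximated greedily here,
    not solved by an exact graph cut as in the paper.
    """
    H, W, K = unary.shape
    C = unary.argmin(axis=2)                 # initial labeling

    def local_cost(r, c, label):
        cost = unary[r, c, label]
        for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1),
                       (0, 1), (1, -1), (1, 0), (1, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < W and C[rr, cc] != label:
                cost += lam
        return cost

    for _ in range(n_iters):
        for alpha in range(K):               # one expansion move per label
            for r in range(H):
                for c in range(W):
                    # A pixel may switch to alpha if that lowers the energy.
                    if local_cost(r, c, alpha) < local_cost(r, c, C[r, c]):
                        C[r, c] = alpha
    return C

# Two-class toy: left half favors class 0, right half class 1, with one
# noisy pixel whose unary term contradicts its region.
unary = np.full((4, 4, 2), 0.9)
unary[:, :2, 0] = 0.1
unary[:, 2:, 1] = 0.1
unary[1, 1] = [0.9, 0.1]                     # outlier pixel
labels = alpha_expansion(unary, lam=0.5)
expected = np.repeat([[0, 0, 1, 1]], 4, axis=0)
```

With λ = 0.5, the smoothing term outweighs the outlier's unary preference and the expansion moves restore the clean two-region labeling.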

Adding contrast information
In this section, the proposed model is extended to a more general model that integrates an important visual property of the MS image: the contrast. The contrast is extracted from the VHR MS image, since it is the data source with the highest spatial resolution. Indeed, the contrast in the VHR MS image helps retrieve the precise borders between classes and thus improves the classification details. Another smoothness solution could have been implicit, e.g., by integrating larger regions of analysis such as superpixels, which can be sharply detected with standard segmentation algorithms on MS images (Achanta et al., 2012). Following (Rother et al., 2004), the squared difference ‖I(u) − I(v)‖² = Σi (Ii(u) − Ii(v))² is considered, where Ii(u) is the intensity of pixel u in the MS image IMS for dimension i, and X̄ denotes the mean of X over the image. The function V(u, v) computing the contrast value is then given by:

V(u, v) = 1 − exp(−‖I(u) − I(v)‖² / (2 (ε σ)²)), with σ² = the image-wide mean of ‖I(u) − I(v)‖²,

where the sum runs over the dim dimensions of IMS and ε is a parameter that modifies the standard deviation in the exponential term.
The general model for classification fusion is obtained by introducing the contrast V(u, v) into the regularization term of the energy E. The parameter γ is a tradeoff parameter between the basic model (led by the MS classification CMS) and the newly integrated contrast-based terms. This model integrates the idea that if the contrast between two neighboring pixels u and v is high, these two pixels are less likely to belong to the same class. Hence, for the condition C(u) = CMS(u) = C(v), if the value V(u, v) is high, the regularization term Eregul will be high and the configuration will more likely be rejected. Conversely, for the last condition C(v) ≠ C(u) ≠ CMS(u), where the classes assigned to pixels u and v are different, a high contrast will lead to a small regularization term. If β → +∞ (for γ = 0) or if ε = 0 (for γ = 1), the proposed model becomes a Potts model (Schindler, 2012), defined by Eregul(C(u), C(v)) = 0 if C(u) = C(v), and 1 otherwise. This property is used to further address the parameter selection step described in the following section.
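A minimal sketch of the contrast computation follows. The exact expression of V in the paper was not recoverable, so the exponential form below, which follows the boundary term of (Rother et al., 2004) but is inverted so that V grows with the local contrast, is an assumption; only horizontally adjacent pairs are computed for brevity.

```python
import numpy as np

def contrast_map(img, eps):
    """Contrast V(u, v) between horizontally adjacent pixels of an MS image.

    img : (H, W, dim) float array, dim = number of spectral bands of IMS
    eps : parameter modifying the standard deviation in the exponential

    Assumed form: V = 1 - exp(-|I(u)-I(v)|^2 / (2 (eps*sigma)^2)), with
    sigma^2 the image-wide mean of the squared differences, so that V is
    close to 0 in homogeneous areas and close to 1 across strong edges.
    """
    d2 = ((img[:, 1:, :] - img[:, :-1, :]) ** 2).sum(axis=2)  # |I(u)-I(v)|^2
    sigma2 = d2.mean()                                        # image-wide mean
    return 1.0 - np.exp(-d2 / (2.0 * (eps ** 2) * sigma2 + 1e-12))

# A vertical edge between a dark and a bright half yields a high V across
# the edge and a null V inside the flat regions.
img = np.zeros((4, 4, 3))
img[:, 2:, :] = 1.0
V = contrast_map(img, eps=1.0)
```

In the general model, such a map would modulate the pairwise terms so that class changes are encouraged exactly where the VHR MS image exhibits strong gradients.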

Parameter selection
Parameter selection is always an issue when dealing with energies composed of several terms. The general fusion model depends on four parameters, i.e., λ, β, ε and γ. Parameter selection is performed by cross-validation, where a limited part of the data (half of it here) is used to select the best parameter set (the one yielding the highest percentage of correct classification). This parameter set is then used to process the rest of the data. The issue here is that choosing the best parameter set among all possible ones may be very costly in computing time. In order to reduce the computing time of this step, we use the following properties of the general model proposed in Section 2.3:
• if γ = 0 and β → +∞, the general model is a Potts model;
• if γ = 1 and ε = 0, the general model is a Potts model.
The hypothesis formulated here is that the value λmax maximizing the classification result for a Potts model is the same as the one maximizing the fusion model classification result. Hence, λmax is computed using a simple Potts model. Then, with λmax fixed and γ = 0, βmax is found. Similarly, using λmax and γ = 1, the value εmax is computed. Lastly, the tradeoff parameter γmax maximizing the results of the model is chosen in the [0, 1] interval. The process can then be iterated, optimizing the parameters in the same order at each iteration.
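This sequential strategy can be sketched as a coordinate-wise search. The `score` callable is a hypothetical stand-in for running the full fusion with a candidate parameter set and scoring it on the held-out ground truth; a toy separable score replaces it here so the sketch is self-contained.

```python
import numpy as np

def sequential_selection(score, grids):
    """Coordinate-wise parameter selection, roughly 4*N evaluations instead
    of the N^4 required by an exhaustive grid search.

    score : callable mapping {"lam", "beta", "eps", "gamma"} to a
            validation accuracy (hypothetical stand-in here)
    grids : dict of candidate values per parameter
    """
    params = {"lam": grids["lam"][0], "beta": grids["beta"][0],
              "eps": grids["eps"][0], "gamma": grids["gamma"][0]}

    def optimize(name, fixed):
        best = max(grids[name], key=lambda v: score({**params, **fixed, name: v}))
        params[name] = best

    optimize("lam", {"gamma": 0.0})     # lam_max via the Potts reduction
    optimize("beta", {"gamma": 0.0})    # beta_max with lam_max fixed
    optimize("eps", {"gamma": 1.0})     # eps_max with lam_max fixed
    optimize("gamma", {})               # finally gamma_max in [0, 1]
    return params

# Toy separable score with optimum at lam=1, beta=2, eps=0.5, gamma=0.3.
target = {"lam": 1.0, "beta": 2.0, "eps": 0.5, "gamma": 0.3}
toy = lambda p: -sum((p[k] - target[k]) ** 2 for k in target)
grids = {k: np.linspace(0, 3, 31) for k in target}
best = sequential_selection(toy, grids)
```

When the score is (approximately) separable in the parameters, as the hypothesis above assumes, the coordinate-wise search recovers the optimum of the full grid search at a fraction of the cost.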

EXPERIMENTS
This section first provides a description of the considered datasets and then presents a discussion of the obtained results. Toulouse Centre. The Toulouse Centre dataset was simulated from HS images acquired at a 1.6 m GSD and composed of 405 spectral bands ranging from 400 to 2500 nm (Adeline et al., 2013). MS images composed of 5 bands (in the visible and near-infrared wavelengths) were then simulated at a 1.6 m GSD using the spectral configuration of the Pléiades satellites. HS images were downsampled to an 8 m GSD. 13 land-cover classes were considered: water, high vegetation (trees), low vegetation (bush), asphalt, tiles, bare soil, metal roof 1 and 2, gravel roof, train track, pavement, cement, and slates. Figure 4 shows the MS image of the Toulouse Centre dataset and the corresponding annotated ground truth.
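The HS downsampling step (1.6 m to 8 m, i.e., a factor of 5) can be emulated by simple block averaging, as the sketch below shows; the dataset in the paper was produced with a dedicated simulation chain, so this is only an illustrative approximation.

```python
import numpy as np

def block_average(img, factor):
    """Downsample an image by averaging non-overlapping factor x factor
    blocks, emulating a coarser ground sample distance (e.g., 1.6 m -> 8 m
    corresponds to factor = 5)."""
    H, W, B = img.shape
    H2, W2 = H // factor, W // factor
    img = img[:H2 * factor, :W2 * factor, :]   # crop to a multiple of factor
    return img.reshape(H2, factor, W2, factor, B).mean(axis=(1, 3))

# Each output pixel is the mean of the corresponding 5 x 5 input block.
hs = np.arange(10 * 10 * 3, dtype=float).reshape(10, 10, 3)
low = block_average(hs, 5)
```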

Results
Results obtained on the three datasets are now presented. Figure 6 shows a table containing the classification results obtained with several variants of the proposed energy. The results shown here were obtained using a cross-validation technique where half of the data was used for parameter selection and the rest as testing data. Parameter selection for these results followed the method described in Section 2.4. The columns VHR MS and HS correspond to a simple SVM classification applied to the VHR MS (or PAN) and HS data, without any fusion step. These are the inputs of the basic and general fusion algorithms described in the previous sections. In this figure, the results obtained using a Potts model and the general model proposed in Section 2.3 (FUSION column) are also presented.
The obtained results show the contribution of the fusion of VHR MS and HS images for urban land-cover processing. On all three datasets, classification accuracies increased markedly between the SVM classifications on HS/VHR MS data and the classification using the proposed fusion model. Moreover, the comparison with a simple Potts model shows that the fusion models presented in this article not only perform smoothing as designed, but also correctly use the probability maps to retrieve details available from the VHR MS image (while keeping to some extent the good classification properties of the HS data).
One should note that the ground truth corresponds to limited parts of the areas. Thus, even if the classification scores are better, qualitative assessment through visual evaluation of the full images remains necessary. This is illustrated in Figure 5, which shows classification results on a restricted area (for visual convenience). The difference in the level of detail between the results of the Potts model and those of the fusion models is obvious. Indeed, the shapes of buildings and roads can be distinguished in the fusion results, which was not possible with the best Potts results. Hence, beyond the numerical classification scores, the method visually gets the best of the MS and HS information, providing an efficient classification (from the HS data) while retrieving the details provided by the highest geometric resolution available (from the VHR MS data). The parameter selection strategy described in Section 2.4 also proved to be empirically efficient. Indeed, we computed the classification results using much more exhaustive sets of parameters and found that the best classification results were very close to the ones obtained using the proposed procedure. Hence, the considered hypothesis seems to be valid, while significantly reducing the computation cost of parameter selection. For example, if the parameters {ε, λ, β, γ} are selected within grids of sizes Nε = Nλ = Nβ = Nγ = 100, an exhaustive search would require 10^8 classification runs, whereas the proposed sequential procedure reduces this number to 400, which is still quite significant but hugely decreases the computation time of this step.

CONCLUSION
In this paper, we presented a method for the fusion of VHR multispectral (MS) and hyperspectral (HS) data in urban areas at the decision level. The idea was to combine the best of these two types of data, i.e., the high spatial resolution of the MS images and the discriminative properties of the HS images. The proposed fusion model relies on posterior class probabilities, available from most existing classification techniques; here, an SVM method was adopted. The probabilities were integrated through an energy minimization process, making it possible to improve the classification results of a single source and to retrieve details observable in the VHR MS data while keeping the good classification properties obtained with the HS data. The model is generic and is intended to be applied to other configurations involving VHR images and richer remote sensing data.
The perspectives are threefold. First, to some extent, the energy should be modified to rely less on posterior class probabilities in order to better manage initial misclassifications or spatial inconsistencies. Secondly, we will extend our model to a larger set of images (instead of only two here) of various spectral dimensions and geometric resolutions. The idea is to be able to process and fuse any information about a given scene to get the best classification at the most precise spatial resolution available. This can be performed using a multi-layer process where each layer is the model described in this paper for the fusion of two images. A final interesting perspective is to process MS/HS images acquired at different epochs. This opens the field of change detection in urban areas and will require more advanced knowledge fusion models.

ACKNOWLEDGMENTS
This work was supported by the French National Research Agency (ANR) through the HYEP project on Hyperspectral imagery for Environmental urban Planning (ANR-14-CE22-0016).

Figure 1 :
Figure 1: Inputs of our method. Two SVM classifications of one multispectral and one hyperspectral image. The label maps are accompanied by posterior class probabilities.

Figure 3 :
Figure 3: Pavia University dataset. Classes: meadows, gravel, trees, bare soil, painted metal sheets, bitumen, asphalt, self-blocking bricks, shadows.

Data

The proposed model was tested on three distinct urban land-cover datasets, namely Pavia Centre, Pavia University (Italy) and Toulouse Centre (France). Pavia University and Pavia Centre are well-known datasets that have been used for years in the hyperspectral community. Pavia Centre and Pavia University. The Pavia Centre and Pavia University hyperspectral images have respectively 102 and 103 spectral bands ranging from 430 to 860 nm. Pavia Centre is composed of two images of sizes 228×1096 and 569×1096 pixels. Pavia University is a 335×610 pixel image. Both have a ground sample distance (GSD) of 1.3 m. Both ground truths are composed of different sets of 9 classes. The land-cover classes associated with the Pavia Centre scene are: water, trees, asphalt, self-blocking bricks, bitumen, tiles, shadows, meadows, and bare soil. The set of classes for Pavia University is composed of: meadows, gravel, trees, painted metal sheets, bare soil, bitumen, asphalt, self-blocking bricks, and shadows. For the Pavia Centre and Pavia University datasets, a panchromatic (PAN) image at the initial geometric resolution (1.3 m) was created. In the following, these PAN images were used instead of MS images (considering PAN as an extreme case of an MS image with only one dimension). This more challenging problem was a way to highlight the properties of the proposed model by using a complex configuration where the geometrically precise MS data has very low spectral discriminative power. The considered HS images were versions of the initial HS images resampled to a lower resolution of 7.8 m. Hence, we have a set of images where the MS image (here, the PAN image) has a 1.3 m geometric resolution and the HS image has a 7.8 m geometric resolution. Figures 2 and 3 present the PAN/MS images and the ground truth for both the Pavia Centre and Pavia University datasets.