Water level prediction from social media images with a multi-task ranking approach

Floods are among the most frequent and catastrophic natural disasters and affect millions of people worldwide. It is important to create accurate flood maps to plan (offline) and conduct (real-time) flood mitigation and flood rescue operations. Arguably, images collected from social media can provide useful information for that task, which would otherwise be unavailable. We introduce a computer vision system that estimates water depth from social media images taken during flooding events, in order to build flood maps in (near) real-time. We propose a multi-task (deep) learning approach, where a model is trained using both a regression and a pairwise ranking loss. Our approach is motivated by the observation that a main bottleneck for image-based flood level estimation is training data: it is difficult and requires a lot of effort to annotate uncontrolled images with the correct water depth. We demonstrate how to efficiently learn a predictor from a small set of annotated water levels and a larger set of weaker annotations that only indicate in which of two images the water level is higher, and are much easier to obtain. Moreover, we provide a new dataset, named DeepFlood, with 8145 annotated ground-level images, and show that the proposed multi-task approach can predict the water level from a single, crowd-sourced image with ~11 cm root mean square error.


Introduction
The frequency of weather-related disasters is increasing rapidly: during the period 1995-2015, floods accounted for 47% of all weather-related disasters and affected over 2 billion people [1]. The number of floods also rose to an average of 171 per year in 2005-2014, compared to 127 per year in 1995-2004 [1]. Moreover, a change in the nature of these events has been observed, with an increase in flash floods and in acute riverine and coastal flooding. Additionally, progressive urbanisation has resulted in large flood run-offs [1].
To mitigate the damage caused by such flood events and for effective disaster response and emergency plans, the rapid analysis of data collected from the affected area is essential [2]. Observations can be gathered from various sources: stream gauge data [3,4], remote sensing data [5,6] and field data collection. The field data collection approach consists of sending people to the affected areas to survey and document data after the flood event. The information collected can then be used to prepare flood-inundation maps [7].
However, this approach is expensive and labour-intensive to implement in real time, and field data are difficult to obtain from flooded areas during, or immediately after, the flood event [8].
Data collected from stream gauges provide accurate, near real-time information on water height at the monitored locations, but gauges are sparsely distributed, leading to extremely sparse observations. Stream gauges are not installed systematically along every waterway, and much less away from the water streams. Due to these dispersed locations, the information provided is often not sufficient to map the flooded area. In addition, stream gauges are rendered useless when the water level rises beyond the limit of the gauges themselves or if they are washed away during a flood event [8].
Remotely sensed satellite imagery has been widely used to monitor disaster events like floods [8]. However, many satellites have comparatively long revisit cycles, which makes them less useful when information should be gathered in real time or when the flood duration is relatively short, as in the case of pluvial/localised flooding. In general, only the flooded area can be retrieved, whereas water depth cannot be observed directly. Moreover, satellite sensors in the optical wavelengths are affected by cloud cover, which is inherently frequent during flooding events. Aerial photography is another commonly used data source for flood mapping, but it is also dependent on weather conditions, and expensive [8].
The unprecedented global spread of low-cost sensors, especially in smartphones, together with the rise of the internet and social media, opens the possibility of community-based mapping initiatives [9]. The utility of social media for capturing real-time information during and immediately after a flood, using "citizens-as-sensors", is increasingly recognised.
In earlier work [10] we presented a model to predict flood height from images gathered from social media platforms in a fully automated way using a deep learning framework. The proposed model performed object instance segmentation and predicted the flood level whenever an instance of some specific object was detected. Although the trained model performs rather well, the effort required to build a large, pixel-accurate annotated dataset for instance segmentation of flood images is considerable. To tackle this problem, we propose in this paper a deep learning approach where we define flood estimation as a per-image regression problem and combine it with a ranking loss to further reduce the labelling load. We propose to avoid the tedious, and hardly scalable, procedure of pixel-accurate object instance labelling per image by (i) directly regressing one representative water level value per image and, more importantly, (ii) exploiting the relative ranking of the water levels in pairs of images, which is much easier to annotate.
Moving from pixel-accurate object delineation as in [10] to annotating only a single water depth per image comes at a price. While the regression task might, in principle, be easier than detailed object detection and segmentation, the supervision signal for a machine learning system is much weaker (e.g., we no longer tell the system to turn its attention to certain types of objects that reoccur with similar metric height). Furthermore, even in the presence of known objects it is often hard for a human operator to determine the water depth in individual images on an absolute scale. In contrast, it is a much simpler task to rank images via pairwise comparisons. People can, with little or no training, quickly decide which of two images shows a higher water level. In this way it becomes feasible to outsource the labelling effort to large groups of untrained annotators, for instance through an online tool. Using ranking as a complementary task can be seen as a variant of weak supervision, or alternatively the ranking information can be interpreted as a regulariser for the otherwise data-limited regression task.
The idea is that a large volume of weaker ranking labels should be able to largely compensate for the small amount of strong water depth labels, and lead to better regression performance. We make three contributions in this paper:
(i) We propose a deep learning approach that learns to estimate water level from social media images by combining water level regression with a relative ranking of image pairs. The water level regression part is fully supervised, while pairwise image ranking adds a weak supervision signal to improve overall accuracy. The general idea is that the fully supervised signal (i.e., water level regression) from a small, expensive label set is supported by a closely related, weak supervision signal (i.e., pairwise water level ranking), for which collecting large amounts of labels is cheap.
(ii) We introduce a new, large-scale dataset, DeepFlood, with >8000 images. DeepFlood comprises two sub-datasets called DF-Obj and DF-Img, which we use for our regression and ranking sub-tasks, respectively (Sec. 4). We make all data available on request via email to one of the authors of this paper.
(iii) We experimentally investigate the trade-off between an object-driven approach with pixel-accurate segmentation labels and a regression of the water level with (or without) support from weak pairwise rankings (Sec. 5).

Related Work
Using ground-level images for water depth estimation is still a relatively new idea about which only little literature exists. In contrast, the use of social media text has received more attention. Learning from rank order is also a well-known concept in machine learning, but has only recently been adopted for deep learning. In the following, we review the works most closely related to ours.

Flood estimation from images and social media
Over the past few years there has been an increased interest in using data from social media for flood detection, mapping and estimation. These data have the advantage of being available in real-time, while at the same time being inexpensive to collect.
Wang et al. [11] propose to use social media and crowd-sourcing data to complement traditional remote sensing data and witness reports. In their study, they use Twitter and the MyCoast crowd-sourcing platform to collect data. MyCoast is an app that has been used by a number of US environmental agencies, since 2013, to collect "citizen science" data about various coastal hazards and incidents [11]. For accurate location mapping they use Stanford's Named Entity Recognition (NER) [12] tool to extract location data within the tweet text. Their approach for extracting flood depth information is based only on regular expression patterns in text, while the images are used only to detect the presence of flooding.
Starkley et al. [9] demonstrate the importance of community-based observations, also known as "citizen science". The observations used in that project were in many cases either photographs or videos. It is shown that community-based data are valuable for local flash flood events. Quantitative flood metrics are extracted manually.
Fohringer et al. [13] propose a methodology that leverages social media content to support rapid inundation mapping, including inundation extent and water depth. They stress that with this procedure information is readily available, especially in densely populated, urban areas. This is important, since alternative information sources like remote sensing do not perform well there. A main limitation is that, in that system too, only the retrieval of social media content is automated: the content is manually assessed for relevance and visually inspected by experts to derive inundation depth.
Other works, such as Aulov et al. [14], Smith et al. [15] and Li et al. [8], also suggest gathering information from social media platforms that is useful for creating flood maps. The method proposed by Aulov et al., in particular, is able to determine regions free from flooding and regions which were flooded, together with a rough estimate of the flood depths, by manually inspecting street photos. In [15], the authors present a real-time modelling framework to identify areas likely to have been flooded, exclusively using information from social media platforms. They validate their results with data from Twitter during two 2012 flood events.
Li et al. [8] instead use geo-referenced social media texts and combine them with a digital elevation model to generate a flood map. Moreover, the 2019 Multimedia Satellite Task: Emergency Response for Flooding Events [16] is one track of an online challenge offered by MediaEval. Its sub-task, Multimodal Flood Level Estimation from News, asks participants to build a binary classifier that predicts whether an image contains at least one person standing in water above the knee. In contrast to our work, participants have access to additional text features from news articles and also satellite imagery [16]. The first-ranked entry in this task [17] proposes matching the water level with human pose to determine the severity of flooding.
Perhaps the closest work to ours in terms of water-level estimation is [18]. The authors propose to use smartphones and other (fixed) embedded system cameras to estimate water depth via explicit exterior orientation and detection of the water level. The method achieves centimeter-level accuracy, but relies heavily on a high-accuracy digital surface model of the scene. In our work, we do not demand any information other than the flood images, as such additional information is often not available (at least not with the required accuracy).
To the best of our knowledge, our work is the first to propose a fully automated water depth estimator that uses only single ground-level images. The present paper extends our preliminary work [10] to avoid extensive, pixel-accurate data annotation.

Learning-to-rank and weak supervision
Different ways have been developed to learn from relative ordering or ranking information, including pointwise [19] and list-based [20] approaches. The most widely used principle relies on pairwise ranking, i.e., one examines pairs of items and searches for a predictor whose outputs for the members of every such pair are in the correct order.
There are only a few works that implement learning-to-rank in the context of state-of-the-art image analysis with deep (convolutional) networks. Doughty et al. [27] use a pairwise deep ranking model to rank the skill of a person performing some task (like drawing or surgery) based on videos. Their approach employs both spatial and temporal streams, in combination with a loss function designed to discriminate between comparable and different skill levels. Further applications of learning-to-rank include image quality assessment [28], age estimation [29], and fine-grained quantification of image similarity [30].
Most related to our work is a recent method for crowd counting in images [31]. The authors exploit the idea that for any image window, cropping a sub-window will result in a new window with an equal or smaller number of people. The corresponding pairs yield a self-supervised ranking objective, such that only little direct supervision with person counts is needed. As in our method, the ranking is not a goal in itself, but serves as an auxiliary task that improves the performance of a regressor, e.g., crowd counting in [31]. In that view ranking can be interpreted as a form of weak supervision [32] that is cheaper to obtain than the "strong" supervision with ground truth regression targets, and regularises the model such that it generalises better, in spite of a small amount of "strong" training labels. Weak supervision has already proved effective for other computer vision tasks, such as segmentation [33]. Our work is the first to employ deep, pairwise ranking as supervision for flood level estimation.

Methodology
We formalise flood level estimation as a regression problem from raw images to a scalar depth value. Consequently, the supervision needed is one depth value per training image, which should be representative of the water depth in the depicted scene. Compared to our earlier work, which was based on explicit object (instance) segmentation [10], this avoids laborious pixel-accurate instance labelling. In the experiments, we show that, with the same number of training images, the naive regression approach performs worse than the object-driven one, presumably because of the much lower information content per image of the supervision signal. One solution could of course be to annotate a bigger training set of images with associated (scalar, per-image) flood depth values. This indeed works, but is still hard to scale up, because annotating large numbers of images consistently with an absolute water level is hard. Instead, we explore the possibility of adding a weaker supervision signal that is easy to annotate and can readily be crowd-sourced, namely relative pairwise flood depth. That is, the annotator is presented with two images and has to determine whether the second one has a higher or lower level than the first, which is much easier than estimating the absolute water depth.
We design a deep learning approach that combines global per-image regression and relative, pairwise ranking. The overall architecture of the proposed method is shown in Fig. 1. The backbone of our network architecture consists of a VGG16 [34] network pre-trained on the ImageNet [35] dataset, but any standard network architecture could be used here. We replace the final layers of the network to predict a single scalar water depth. Because the method does absolute water level estimation per image as well as relative ranking simultaneously, we feed two separate training sets to the model. The first part (Fig. 1
For the images that belong to the ranking training set, the procedure is slightly different. We first prepare a mini-batch of images and feed it to the network to obtain a water level prediction for each of them. For these images we cannot evaluate the regression loss, as we do not have access to the ground truth values. We can, however, assemble all possible image pairs and test whether they obey the ground truth ranking.
We jointly learn the regression sub-task and the ranking sub-task by defining the total loss function as
$$L_{total} = L_{reg} + \lambda L_{rank}, \tag{1}$$
with $L_{reg}$ the regression loss, $L_{rank}$ the ranking loss, and $\lambda$ a weighting parameter to balance the contributions of the two terms. For $L_{reg}$, we use the Mean Squared Error (MSE) function
$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - y_{gt,i}\right)^2, \tag{2}$$
where $y_i$ represents the network output for image $i$, $y_{gt,i}$ is the ground truth value of the flood level, and $N$ is the number of images in the mini-batch. The ranking loss $L_{rank}$ is computed for each image pair with
$$L_{rank} = \max\left(0,\; -y^{rank}_{gt}\,(y_1 - y_2)\right), \tag{3}$$
where $y_1$ and $y_2$ represent the network predictions for the two images in a pair, and $y^{rank}_{gt}$ represents the ground truth ranking for the pair, where +1 means the level in image 1 is higher, and −1 means the level in image 2 is higher.
From Eq. (1) it is immediately clear that the ranking loss can be interpreted as a regularisation term that avoids overfitting of the regression objective if the amount of training data with "strong" regression labels is limited. The weight λ balances the regression and ranking tasks, and must be chosen large enough to afford the regularisation, but not so high that it overpowers the regression loss and harms the prediction. We show the influence of varying λ empirically in Sec. 5.
At test time, the network only receives a single image and pushes it through the regression task to obtain a flood level. It takes approximately three seconds for the model to predict a water level per image. The ranking task is not used for testing.
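To make the interplay of the two loss terms concrete, the following is a minimal, framework-free Python sketch of the combined objective. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the ranking term is assumed to be a hinge (margin) loss with zero margin by default, and averaging the ranking loss over the pairs of a batch is likewise an assumption. The default λ = 5 is the value the text reports as best on the validation set.

```python
def mse_loss(preds, targets):
    """Mean squared error between predicted and ground-truth water levels."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def ranking_loss(y1, y2, rank_gt, margin=0.0):
    """Hinge loss for one image pair.

    rank_gt is +1 if image 1 shows the higher water level, -1 if image 2 does.
    The margin parameter is an assumption (0.0 reduces to a plain hinge).
    """
    return max(0.0, -rank_gt * (y1 - y2) + margin)


def total_loss(reg_preds, reg_targets, pairs, lam=5.0):
    """Combine both terms as L_total = L_reg + lam * L_rank.

    pairs is a list of (y1, y2, rank_gt) tuples built from the predictions
    of the ranking mini-batch; averaging over pairs is an assumption.
    """
    l_reg = mse_loss(reg_preds, reg_targets)
    l_rank = sum(ranking_loss(y1, y2, r) for y1, y2, r in pairs) / len(pairs)
    return l_reg + lam * l_rank
```

A correctly ordered pair contributes nothing to the ranking term, so only violated orderings push the shared network weights; this is why the term acts as a regulariser rather than as a second regression target.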

Datasets
We built a new dataset, DeepFlood, that contains 8145 ground-level images with water level annotations in total and extends our original dataset [10].
From that earlier work, there are 1259 images with pixel-level object annotations. Additionally, DeepFlood has 5395 flood images with only a single flood depth label per image. We note, however, that despite the lack of object annotations the regression network may to some degree have a notion of semantic objects, since its backbone is VGG16 pre-trained on ImageNet.
We apply stratified cross-validation for experiments and report average performance numbers together with standard deviations. We plot the number of images per water level for both data subsets DF-Obj and DF-Img in Fig. 2a and 2b, respectively. Naturally, the number of images for high levels such as level9 and level10 is much smaller than for other levels, because people are less likely to take pictures while standing in water deeper than 1.5 m.

Experiments
We evaluate our method (Reg+Rank) on the DeepFlood dataset and compare against [10] (Classification) and two baseline approaches (Regression and Regression++):
1. Regression: A pure regression network without additional supervision from ranked pairs, equivalent to Reg+Rank with only the regression loss, trained on DF-Obj. This regression-only approach with a small training set and no pair regularisation serves as a sanity check and lower performance bound.
2. Regression++: Uses the same network and loss function as Regression, but is trained on a combination of DF-Obj and DF-Img, using absolute water levels for all training images as supervision. This corresponds to the ideal case where strong supervision by regression targets is available for the entire training dataset, and serves as an upper bound for the possible performance of Reg+Rank.
3. Classification: This is the object-driven approach [10], where water levels are predicted via object detection and segmentation, using pixel-accurate object instance masks as supervision. Here we use a ResNet101 [37] and Feature Pyramid Network (FPN) [38] as backbone and train on the DF-Obj subset, for which the necessary ground truth masks are available.
4. Reg+Rank: We evaluate our proposed multi-task ranking approach, which combines ranking loss and regression loss, as described in Sec. 3. We train the regression loss on DF-Obj with the absolute water level labels per image, as for Regression. Our ranking loss is trained on the DF-Img data subset but, unlike Regression++, without using absolute water levels per image. Instead, each image in a pair is only labelled as having a lower, equal, or higher water level than the other image.
At inference time, only the regression task is used for prediction and performance evaluation, and it is evaluated using the root mean squared error (RMSE).
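As a small illustration of the evaluation metric, the sketch below computes the RMSE for one test fold and summarises several folds by mean and standard deviation. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
import math
import statistics


def rmse(preds, targets):
    """Root mean squared error between predicted and true water levels."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equally long")
    squared = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return math.sqrt(sum(squared) / len(squared))


def summarise_folds(fold_rmses):
    """Average RMSE over folds plus its spread.

    Population standard deviation is an assumption made for this sketch.
    """
    return statistics.mean(fold_rmses), statistics.pstdev(fold_rmses)
```

Because the errors are squared before averaging, a few badly mispredicted images (for example at the rare, very high water levels) weigh more heavily than many small deviations.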

Implementation details and parameters settings
We implemented all methods with the PyTorch [39] framework. Fig. 3 provides an overview of running our approach as described in the following.
First, the images of the mini-batch generated from the DF-Obj dataset are passed through the VGG16 network. The batch predictions are then used with the ground truth to calculate the mean squared error (MSE) for the iteration step. Next, we generate a mini-batch from the DF-Img dataset and pass it through the same VGG16 network. The resulting predictions are then used to generate distinct image pairs. Note that, for efficiency, the images need not be passed through the backbone multiple times. Rather, the pairs can be generated after feature extraction, using the single-stream Siamese network method [40]: the images are first passed through the backbone in a mini-batch, then their resulting feature encodings are combined into a set of pairs (this can be viewed as a special "pair generation layer" without trainable parameters) before calculating the loss [41]. Finally, the regression and ranking (margin) losses are combined.
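The pair generation step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: `make_pairs` is a hypothetical helper, the rank labels are assumed to be derived automatically from per-image levels, and skipping pairs with equal levels is an assumption, since the text does not say how ties enter the loss.

```python
from itertools import combinations


def make_pairs(preds, levels):
    """Build (y1, y2, rank_gt) tuples for all distinct pairs of a mini-batch.

    preds  : network predictions, one per image (a single forward pass).
    levels : water levels used only to derive the relative order.
    rank_gt is +1 if the first image is higher, -1 if the second is higher.
    Pairs with equal levels are skipped (an assumption of this sketch).
    """
    pairs = []
    for i, j in combinations(range(len(preds)), 2):
        if levels[i] == levels[j]:
            continue
        rank_gt = 1 if levels[i] > levels[j] else -1
        pairs.append((preds[i], preds[j], rank_gt))
    return pairs
```

Each image is predicted once and then reused in every pair it belongs to, which is the efficiency argument made above for generating pairs after the backbone rather than before it.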
All images were resized to 512 × 512 × 3 pixels before being fed to the model.
All models were trained for 200 epochs using the Adam optimiser [42] with an initial learning rate of $10^{-3}$. The learning rate was decreased by a factor of 10 at epochs 150 and 180 for all experiments; while Adam in theory adapts the learning rate, it empirically nevertheless makes sense to add such a gradual schedule. We use a mini-batch size of 5 for all experiments. For the ranking loss of Reg+Rank, the 5 images per batch are compared exhaustively, resulting in ten pairs per batch (recall that for n items there are n(n − 1)/2 distinct pairings).
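The step schedule described above can be written as a small function of the epoch. This is a sketch with a hypothetical function name; whether the drop takes effect exactly at epoch 150 or one epoch later is an assumption.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(150, 180), factor=0.1):
    """Step schedule: multiply base_lr by `factor` for every milestone reached.

    Applying the drop from the milestone epoch onwards (epoch >= milestone)
    is an assumption of this sketch.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```

In PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 180], gamma=0.1)`, although the text does not state which scheduler was actually used.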
We use VGG16 [34] pre-trained on ImageNet [35] as network backbone for Reg+Rank, Regression, and Regression++, as [31] found it to work well in the context of ranking-supported regression. In contrast, the object-driven classification approach is an extension of Mask R-CNN [43], hence we use a ResNet-101-FPN backbone, as suggested by the creators of Mask R-CNN.
As usual in the transfer learning setting, we remove the top layer of VGG16 for regression and replace it, in our case with a linear layer that combines all input features into a single water level prediction.
To set the λ parameter for Reg+Rank (Eq. (1)), we run experiments in which we vary λ between 1 and 30 and evaluate model performance on the validation set.

Evaluation strategy
Estimating an absolute water level from individual images is hard for humans. We thus pursue the strategy also used in our previous work [10] and look for partially submerged objects of roughly known size as "scale bars". While different objects (vehicles, bikes, etc.) are used, we discretise the depth according to the human anatomy, which provides a reasonably fine-grained set of body parts/joints that can be identified as submerged or visible.
As Classification has, theoretically, the strongest supervision signal of all four approaches, we assume that it would ultimately perform as well as Regression++, or even better, if it had access to pixel-accurate object masks for the full DF-Img data subset, too. However, it would constitute a huge effort to manually label thousands of images at that level of detail. This was a main motivation for the multi-task ranking approach Reg+Rank, and indeed, the seemingly weaker signal via pairwise relative ranking of (thousands of) image pairs does bring a marked improvement. Large-scale collection of fast, inexpensive ranked pairs appears to be a viable alternative, especially considering how much easier it is to crowd-source to untrained workers or volunteers at scale.

We further display the distribution of the water level predictions from our multi-task ranking approach on the cross-validation folds where Reg+Rank performs best (fold2, Fig. 6) and worst (fold5, Fig. 7). In general, Reg+Rank tends to overestimate low water levels and underestimate very high water levels. We point out that high water levels are in general underrepresented in the data, as people are less likely to capture and upload images in such extreme circumstances. For example, for the very high water level level9 we have only a single image in each of the two displayed folds.
We qualitatively illustrate water level predictions of all four tested models for some example test images of different cross-validation folds and water levels in Fig. 8 and Fig. 9.
1M binary pair labels already bring a substantial improvement over the regression baseline. We note that, while this may still seem like a high value, many of our automatically generated pair labels are redundant and could be derived from transitivity: whenever in a batch image A has a higher level than B and B has a higher level than C, the pair A-C need not be labelled.
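The transitivity argument can be illustrated with a short sketch that removes every "higher than" pair already implied by a chain of other pairs. It is an assumption-laden illustration rather than part of the described pipeline: the function name is hypothetical, pairs are assumed to form a consistent strict order (no ties, no contradictions), and the search is written for small batches, not for millions of pairs.

```python
def drop_transitive_pairs(higher_than):
    """Return only the pairs that cannot be inferred from the others.

    higher_than: iterable of (a, b) meaning image a has a higher level than b.
    A pair (a, c) is dropped if c is reachable from a via at least one
    intermediate image, i.e. without using the direct pair (a, c).
    """
    edges = set(higher_than)
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)

    def implied(a, c):
        # Depth-first search from a towards c that skips the direct edge.
        stack = [n for n in successors.get(a, ()) if n != c]
        seen = set(stack)
        while stack:
            node = stack.pop()
            for nxt in successors.get(node, ()):
                if nxt == c:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return {(a, c) for a, c in edges if not implied(a, c)}
```

For a batch of five images with distinct levels, the ten exhaustive pairs reduce to four adjacent comparisons under these assumptions, which gives a sense of how many of the generated labels carry no new ordering information.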

Conclusion
We have proposed a fully automated method for water level estimation in social media images of flood events. The main idea of our approach is that it is much easier for a human annotator to decide in which of two images the water level is higher than to assign an absolute water level to a single image, let alone segment pixel-accurate object instance labels. We implement pairwise ranking as a form of weak supervision that regularises the training of the regressor. The experimental comparison with a lower and upper performance bound for regression and an alternative classification scheme shows that the proposed weakly supervised method (Reg+Rank) is able to perform almost as well as fully supervised regression with a much larger training set (Regression++). Moreover, Reg+Rank also outperforms Classification [10], although the necessary training data is, arguably, much easier to obtain. Weak supervision via pairwise ranking thus provides a promising alternative to costly and time-consuming, fine-grained labelling.
We hope that our approach can help to overcome the label scarcity problem not only for water level prediction, but for many other regression tasks in the environmental and geo-sciences where collecting a sufficient amount of accurate labels is very laborious, and large datasets as needed for training deep learning are rare.

Figure 1: The architecture of our multi-task learning method with ranking loss. In the figure, MSE loss refers to mean squared error. $L_{total}$ is the total loss function for the model, $L_{reg}$ refers to the regression loss and $L_{rank}$ is the ranking loss.
.chaudhary@geod.baug.ethz.ch
Moreover, we add 1491 images from the Mapillary Vistas dataset [36]. These images have similar characteristics and scene content to our flood images. The Mapillary images are required for the network to learn what scenes from non-flooded areas look like, as our own images all come from various flood events and include no non-flooded scenes. Mapillary has pixel-level instance annotations for 37 classes; we randomly pick images from the Mapillary training set that contain at least one of the objects Person, Car, Bus, Bicycle or Building/House, which act as the basis for our water-level estimation approach. The criteria for selecting images for DeepFlood and our ground truth annotation strategy remain the same as described in [10], because we must rely on image interpretation to annotate ground truth. We can thus only assign water levels if partially submerged objects with known average height are visible in an image. Object classes were selected based on widespread availability in social media posts of floods and roughly known average heights: person, car, bus, bicycle and house/building [10]. We partition DeepFlood into two separate sub-datasets, DF-Obj and DF-Img. DF-Obj contains 1862 images (1259 with flooding from our previous database, 603 Mapillary images without flooding) that all have pixel-accurate object instance annotations and annotations of the flood level per object. DF-Img contains 6283 images (5395 with flooding, 888 without flooding) that are annotated with a single water level per image, which is zero for images without flooding. The DF-Obj subset makes it possible to compare to our earlier, object-driven work [10], for which instance-level segmentations are required during training. A summary of our datasets is given in Tab. 1. Note that DF-Obj not only has pixel-accurate object instance labels, but also flood level annotations per instance. The bigger DF-Img subset has only a single water level annotation per image. Consequently, the expected supervision signal passed to the model during training is much stronger for DF-Obj. This supervision signal explicitly shows which image regions/objects to attend to, and how their class-specific appearance changes as a function of flood depth.
We divide the DF-Obj subset into six parts such that each part contains an equal number of images for each flood level. For each fold we use four parts for training, one for validation and one for testing. To test direct regression methods, we simply divide DF-Img into a training and validation part at a ratio of 80 : 20, and apply the model trained with that split to each of the six test folds of DF-Obj.
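The stratified six-part split can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, images are dealt out round-robin per level without shuffling, part sizes are exactly equal only when the per-level counts are divisible by the number of parts, and the rule for rotating the validation and test parts across folds is assumed rather than taken from the text.

```python
from collections import defaultdict


def stratified_parts(levels, n_parts=6):
    """Split image indices into n_parts with (near-)equal counts per level.

    levels: one water-level label per image. Images of each level are dealt
    out round-robin; no shuffling is done, to keep the sketch deterministic.
    """
    by_level = defaultdict(list)
    for idx, lvl in enumerate(levels):
        by_level[lvl].append(idx)
    parts = [[] for _ in range(n_parts)]
    for lvl in sorted(by_level):
        for k, idx in enumerate(by_level[lvl]):
            parts[k % n_parts].append(idx)
    return parts


def fold_split(parts, fold):
    """Use one part for testing, the next for validation, the rest for training.

    The rotation rule (test = fold, validation = fold + 1) is an assumption.
    """
    n = len(parts)
    test_k, val_k = fold % n, (fold + 1) % n
    train = [i for k, p in enumerate(parts) if k not in (test_k, val_k) for i in p]
    return train, parts[val_k], parts[test_k]
```

Stratifying by level matters here because the high water levels are rare: a purely random split could leave some folds without any image of those levels.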

Figure 2: Number of images per water level (including Mapillary Vistas images at level 0) for (a) the DF-Obj subset and for (b) the DF-Img subset.

Figure 3: Overview of the processing steps of the multi-task approach (Reg+Rank).

Figure 4 shows the RMSE of models with different λ values. The lowest RMSE is achieved with λ = 5, which we use for all further experiments.

Figure 4: RMSE on the validation set for different λ values, used to select the best λ.

Figure 5: Water level annotation strategy for person and bicycle

As already indicated in Fig. 6 and Fig. 7, all methods underestimate very high water levels (Fig. 8a,b), which is most likely due to a lack of sufficient training data. The methods that have access to all available data of the DeepFlood dataset (Regression++ and Reg+Rank) do perform better in several cases (Fig. 8c). Strong supervision via fine-grained, pixel-accurate object instance annotations (Classification) improves training on a small dataset (DF-Obj subset) compared to weaker, per-image annotation (Regression), as can be seen in Fig. 9a. In this image, the true water level is somewhat hard to estimate for an automated method, as people at similar locations are in some cases upright in the water and in other cases seated in a boat. Regression overestimates the water level by a large margin, whereas Classification is fairly accurate, presumably because with the help of explicit object labels it could learn how to handle people in boats, a situation of which there are several examples in the dataset. For more standard, rather frequent scenes with moderate flood levels 4 (Fig. 9b) and 5 (Fig. 9c), all methods work surprisingly well.

Figure 6: Prediction plot for fold2 test images

Figure 7: Prediction plot for fold5 test images

Fig. 10 shows RMSE values (in centimeters) when increasing the maximum number of distinct image pairs from one million (1M) to six million (6M). As expected, the RMSE decreases with an increasing number of image pairs, but only slowly.


Figure 8: Examples of test images and water level predictions for all four approaches Regression, Regression++, Classification, and Reg+Rank. Predicted water levels per image are written on the right side below each method, ground truth is given at the bottom of each image in white.

Figure 9: Examples of test images and water level predictions for all four approaches Regression, Regression++, Classification, and Reg+Rank. Predicted water levels per image are written on the right side below each method, ground truth is given at the bottom of each image in white.

Figure 10: Ablation study for number of pairs.

Table 1: Overview of the DeepFlood dataset with 8145 images in total, with its two subsets DF-Obj (1862 images with pixel-accurate, object instance labels and per-image labels) and DF-Img (6283 images with per-image labels only).

Table 3: Quantitative results of experiments on the DeepFlood dataset. Regression and Classification are evaluated using only the DF-Obj data subset (using per-image labels for Regression), while Regression++ and Reg+Rank are evaluated on both data subsets DF-Obj and DF-Img (i.e., the whole DeepFlood dataset) using per-image labels. We report the average root mean square error (avgRMSE) and its standard deviation (stdDev) for 5-fold cross-validation, in centimeters and levels.

The ≈22% drop in RMSE (cm) is the benefit one gets from additional ranked pair supervision. More interestingly, the multi-task (Reg+Rank) approach performs almost on par with the upper bound Regression++.
It was expected that Regression performs worse than Regression++ and Reg+Rank. The comparison to Classification was less clear, but that method also performs worse, and only a little better than the baseline Regression approach. That is, the richer semantic segmentation labels and associated object knowledge bring a moderate improvement (≈ 6%) over the baseline, but cannot overcome the disadvantage of the smaller DF-Obj training set.

Table 4: Ablation study for number of pairs. Average RMSE increases slowly with decreasing number of pairs. Results are in all cases computed with 5-fold cross-validation. The distinct folds have slightly different performances, which results in minor stdDev value differences.