MAIA—A machine learning assisted image annotation method for environmental monitoring and exploration

Digital imaging has become one of the most important techniques in environmental monitoring and exploration. In the case of the marine environment, mobile platforms such as autonomous underwater vehicles (AUVs) are now equipped with high-resolution cameras to capture huge collections of images from the seabed. However, the timely evaluation of all these images presents a bottleneck problem as tens of thousands or more images can be collected during a single dive. This makes computational support for marine image analysis essential. Computer-aided analysis of environmental images (and marine images in particular) with machine learning algorithms is promising, but challenging and different from other imaging domains because training data and class labels cannot be collected as efficiently and comprehensively as in other areas. In this paper, we present Machine learning Assisted Image Annotation (MAIA), a new image annotation method for environmental monitoring and exploration that overcomes the obstacle of missing training data. The method uses a combination of autoencoder networks and Mask Region-based Convolutional Neural Network (Mask R-CNN), which allows human observers to annotate large image collections much faster than before. We evaluated the method with three marine image datasets featuring different types of background, imaging equipment and object classes. Using MAIA, we were able to annotate objects of interest with an average recall of 84.1% and more than twice as fast as with “traditional” annotation methods, which are purely based on software-supported direct visual inspection and manual annotation. The speed gain increases with the size of a dataset. The MAIA approach represents a substantial improvement on the path to greater efficiency in the annotation of large benthic image collections.


Introduction
With the establishment of deep learning methods, the task of image classification seems to be solved for many computer vision scenarios, sometimes even outperforming human experts on [...] To the best of our knowledge, we are the first to employ Mask R-CNN in the context of marine environmental monitoring and exploration.
As the supervised Mask R-CNN needs a sufficient amount of training data, we apply unsupervised autoencoder networks (AEN) for novelty detection to efficiently and effectively generate training proposals, which are points of potential objects of interest (OOI) in the images that could be considered for training a Mask R-CNN model. Based on the assumption that OOI are rare in the images, "background" pixels of the seabed can be regarded as common patterns and "interesting" pixels of objects are regarded as novel patterns. The concept of AEN was first presented by Baldi and Hornik in 1989 [9] and has been used in various contexts like dimensionality reduction [12,13], human pose recovery [14] or cell nuclei detection [15]. AEN have been previously used for novelty detection as well [16,17] but not in the context of FCN training data collection or environmental imaging.
The contributions of this paper can be summarized as follows: (1) We present a machine learning assisted method for image annotation that allows faster manual image annotation than methods that were used before, (2) we are the first to present the use of Mask R-CNN in the context of marine environmental monitoring and exploration, and (3) we present a detailed analysis of the manual annotation speed with three image collections featuring different types of background, imaging equipment and object classes.
In the following, MAIA is first described at the methodological level, dividing the method into four different stages to be performed one after the other (see Section 2). To assess the performance of the method regarding efficiency (i.e. time required per annotation) and accuracy (i.e. evaluation of statistics on correct and incorrect annotations), we have applied MAIA to three different datasets, referred to as JC77, SO242 and PAP, that are described in Section 3. The results of these applications are summarized in Section 4 and discussed in Section 5 regarding the dataset-specific performances and general observations. The dataset of the results can be accessed at [18]. Visual exploration of the results is possible in BIIGLE 2.0 at https://biigle.de/projects/139 using the login maia@example.com and the password MAIApaper. The manuscript ends with a short conclusion about the relevance of our results and the MAIA method for benthic image annotation.

Methods
MAIA consists of four consecutive stages (see Fig 1), all of which are described in the following sections. In Stage I, unsupervised novelty detection generates an initial set of training proposals T_p, which are image patches (i.e. regions of an image) showing patterns different from the background. These training proposals contain potential OOI that could be used in a training dataset for the instance segmentation. In Stage II, T_p is manually filtered to keep only those training proposals that actually show OOI relevant to the application context. In addition to that, the training proposals are manually refined with regard to their centroids and size, resulting in the dataset of training samples T_s ⊆ T_p. In Stage III, the set T_s is used to train a Mask R-CNN model for instance segmentation which is subsequently applied to produce a set A_c of annotation candidates for a whole image dataset. In Stage IV, these candidates are manually reviewed to remove false positives for the final set A ⊆ A_c of detected objects and their bounding boxes.

Stage I: Novelty detection with AEN
A dataset of images {I_i} from an AUV or ROV dive can feature different kinds of background patterns due to changes in sediment type, illumination or geological characteristics of the seabed (e.g. see SO242 in Fig 2). Thus, the images {I_i} from one dataset are first grouped into different clusters {U_k}, k = 1, ..., K, each one representing images with a similar background. To this end, each I_i is mapped to a feature vector v_i which is used for k-means clustering of all images. The features were determined empirically as a combination of principal component projection coordinates and an entropy measure. Details are given in the supporting information (see S1 Text). Each resulting cluster U_k contains images featuring global similarities, mostly dependent on the background sediment. For each cluster, one AEN is trained, all sharing the same architecture. Each of the K AEN consists of three fully connected layers. From the first to the second layer (the latent layer), an input image patch x ∈ ℝ^r is encoded to a lower dimensional latent representation l ∈ ℝ^s with s < r, where W is a matrix of weights, b is a vector of biases and σ(·) is the softplus transfer function (see Eqs 1 and 3). If the RGB input image patch has a size of r_e × r_e pixels, the input layer dimension is r = 3 · r_e². From the second to the third layer, the latent representation is decoded to the output patch x' ∈ ℝ^r, where W' is a matrix of weights and b' is a vector of biases (see Eq 2).
The AEN is then trained to generate x' as a reconstruction of x by minimizing the reconstruction error F through backpropagation (see Eq 4).
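Eqs 1 to 4 are referenced above but are not reproduced in this text. A form that is consistent with the definitions given in this section would be the following. The softplus definition (Eq 3) is standard. The use of σ in the decoder (Eq 2) and the mean squared error as reconstruction error F (Eq 4) are assumptions made for illustration and may differ from the original equations.

```latex
\begin{align}
l &= \sigma(Wx + b) \tag{1}\\
x' &= \sigma(W'l + b') \tag{2}\\
\sigma(z) &= \ln\left(1 + e^{z}\right) \tag{3}\\
F(x, x') &= \frac{1}{r}\sum_{j=1}^{r}\left(x_j - x'_j\right)^2 \tag{4}
\end{align}
```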
As motivated above, we consider image patches that show the pure seabed or background as common patterns and patches that show an interesting object (like a shell or a starfish) as novelties. This definition is based on the assumption that interesting objects are rare in deep-sea image datasets. Those rare objects would appear as novelties, and patches featuring such a pattern also feature a large reconstruction error F. By applying a threshold to the reconstruction error, background pixels can be separated from pixels that belong to an OOI.
To train the AEN for one cluster of images U_k, we sample 10⁴ image patches, each of the form x = (p_1^(r), p_1^(g), p_1^(b), ..., p_{r/3}^(b)) and of size r = 39 × 39 × 3 (i.e. a 39 pixel square of RGB values) randomly from U_k (see Fig 1b). The RGB values of each pixel p_j are "flattened" to the r-dimensional input x. The patch size was determined with a parameter search for good detection performance (see Section 4). We choose the dimension of the latent representation of the AEN as s = [0.1r], with [·] as rounding operation. The "compression factor" of 0.1, again, was determined with a parameter search (see Section 4). To accelerate convergence during training, we use Xavier initialization [19] of the weights W of the encoding layer and Adam optimization [20] with a fixed learning rate of 10⁻³. We train for 100 epochs with a minibatch size of 128. Training takes less than a minute per image cluster for all datasets, using TensorFlow [21] on a single NVIDIA Titan X. We refer to the trained AEN of an image cluster U_k with the term data-driven background model DBM_k.
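The dimensions and the forward pass described above can be illustrated with a minimal NumPy sketch. It is not the TensorFlow implementation used in this work and omits training. The softplus in the decoder, the Xavier initialization of the decoder weights, the zero biases and the mean squared error as reconstruction error are assumptions, and all function names are hypothetical.

```python
import numpy as np

PATCH_EDGE = 39                 # r_e, patch edge length from the text
R = 3 * PATCH_EDGE ** 2         # input dimension r = 3 * r_e^2
S = round(0.1 * R)              # latent dimension s = [0.1 r]


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def xavier_uniform(rng, fan_in, fan_out):
    # Xavier (Glorot) uniform initialization of a weight matrix.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))


def reconstruct(x, W, b, W_dec, b_dec):
    # Encode to the latent layer, then decode back to r dimensions.
    latent = softplus(W @ x + b)
    return softplus(W_dec @ latent + b_dec)


def reconstruction_error(x, x_rec):
    # Assumed form of F: mean squared error over the r components.
    return float(np.mean((x - x_rec) ** 2))


rng = np.random.default_rng(0)
W, b = xavier_uniform(rng, R, S), np.zeros(S)
W_dec, b_dec = xavier_uniform(rng, S, R), np.zeros(R)

# Stand-in RGB patch with values in [0, 1); a real patch would be sampled from U_k.
patch = rng.random((PATCH_EDGE, PATCH_EDGE, 3))
x = patch.reshape(-1)           # flattened as (p1_r, p1_g, p1_b, p2_r, ...)
error = reconstruction_error(x, reconstruct(x, W, b, W_dec, b_dec))
```

With r_e = 39 this gives r = 4563 and s = 456.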
Each of the K DBM_k is applied to the images I_i ∈ U_k, processing each possible image patch of r_e × r_e pixels (see Fig 3). This can be compared to a convolution operation. We choose a stride of two for the convolution, since it had no impact on detection performance but a higher computational efficiency when compared to a stride of one. The higher stride results in an output of lower resolution than the original image, so the output is upscaled back to the original resolution of the image I_i using bilinear interpolation. Owing to GPU memory constraints, a DBM_k is consecutively applied to horizontal image slices that are stitched together afterward. The reconstruction error F(x, x') of each image patch x is stored as the "novelty score" at the pixel coordinate of the center of the patch, resulting in the "novelty map" N_i for each image I_i. The novelty map is convolved with an all-ones kernel of the same size as the training patches, which acts as a dilation to smooth the boundaries of regions with a high novelty score (see Fig 1c).
To get a binary segmentation between background and interesting regions, a threshold has to be applied to the novelty map. We empirically determined the threshold t_k to be the mean of the 99th percentiles of the novelty maps of an image cluster U_k (see Eq 5), where P_99(N_i) is the operation that determines the 99th percentile of a novelty map N_i. Examples of thresholded novelty maps can be found in Fig 4.
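The threshold described above can be sketched in a few lines of NumPy. The function names are hypothetical, and the strict comparison in `segment` is an assumption, as the text does not state whether scores equal to the threshold count as interesting.

```python
import numpy as np


def cluster_threshold(novelty_maps):
    # t_k: mean over one image cluster of the 99th percentile of each novelty map.
    return float(np.mean([np.percentile(n, 99) for n in novelty_maps]))


def segment(novelty_map, threshold):
    # True marks "interesting" pixels, False marks background.
    return novelty_map > threshold


# Two tiny stand-in novelty maps for one cluster.
maps = [np.arange(101.0).reshape(1, 101), 2.0 * np.arange(101.0).reshape(1, 101)]
t_k = cluster_threshold(maps)
masks = [segment(n, t_k) for n in maps]
```

Because the threshold is shared by all images of a cluster, an image without any conspicuous region can yield an empty segmentation, as the first stand-in map does here.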
The binary segmentation of all novelty maps N_i is used to extract the set of training proposals T_p from all images. Each training proposal y_j ∈ T_p is a part of an image, cropped to the minimum square bounding box of an interesting region (see Fig 1d). We set the minimum edge length of the bounding box to 30 pixels to ensure a certain size of the training proposals, even if the interesting region is smaller. We define the cumulated novelty scores of the pixels of an interesting region as the novelty score η(y_j) of the corresponding training proposal.
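As an illustration of the proposal geometry and score, the following sketch derives a square bounding box with a minimum edge length of 30 pixels from the pixel bounds of one interesting region and sums the novelty scores inside the region. Centering the square on the region and the function names are assumptions. Clipping to the image borders and the extraction of connected regions are omitted.

```python
import math

import numpy as np


def square_box(x_min, y_min, x_max, y_max, min_edge=30):
    # Inclusive pixel bounds of a region -> (left, top, edge) of a square box.
    width = x_max - x_min + 1
    height = y_max - y_min + 1
    edge = max(width, height, min_edge)
    # Assumption: the square is centered on the region.
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    left = math.floor(center_x - (edge - 1) / 2)
    top = math.floor(center_y - (edge - 1) / 2)
    return left, top, edge


def novelty_score(novelty_map, region_mask):
    # Cumulated novelty score of all pixels that belong to the region.
    return float(novelty_map[region_mask].sum())


small_region = square_box(100, 100, 104, 109)   # 5 x 10 pixels, enlarged to 30 x 30
large_region = square_box(200, 300, 249, 339)   # 50 x 40 pixels, squared to 50 x 50
```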

Stage II: Manual filtering and refinement of training proposals
A considerable number of the training proposals in T_p may show patterns that do not represent anything of interest for the domain experts. Thus, we apply a quick manual filtering step, where a human observer selects only those training proposals that contain OOI. The filtering has been implemented as single patch classification as defined in the RecoMIA guidelines [3], so only the isolated image regions of the training proposals are displayed to the human observer instead of the complete images. This saves the time the human observer would need to screen the complete images for multiple regions of interest. For each training proposal, the human observer has to determine if it contains (part of) an OOI or not. To accelerate manual filtering, the label review grid overview tool Largo of BIIGLE 2.0 is used (see S1 Fig in the supporting information). Training proposals y_j are displayed in a regular grid in descending order of their novelty scores η(y_j). This allows human observers to spot OOI very quickly and to review a large number of training proposals in a very short time (see Fig 1e). The sorting by novelty score is similar to the saliency ranking described in [7]. Starting from the training proposals with the highest novelty score, a human observer can stop reviewing once a sufficient number of training proposals has been selected. In this work we define 600 as the limit for the required number of selected training proposals for each object class.
The performance of an FCN-based instance segmentation method like Mask R-CNN is crucially dependent on the quality of the training dataset. If the samples in the training dataset are of low quality, i.e. with many discrepancies between interesting and non-interesting image regions, the performance of instance segmentation may be very poor. To obtain a set T s with training samples of appropriate quality from the training proposals T p , a manual refinement step is performed after the filtering. The filtered training proposals are shown to a human observer, each with a suggested centroid and size (i.e. a circle) that marks the OOI. The observer can modify the circle position or size so it closely fits the position and size of the OOI (see Fig 1f). To further accelerate the refinement, we use the volume label review tool Volare of BIIGLE 2.0 (see S2 Fig in the supporting information). With Volare, the viewport of the annotation tool jumps directly from one circle to the next, saving the time a human observer would need to look for and zoom in to each circle on an image.

Stage III: Instance segmentation with Mask R-CNN
The filtered and refined training proposals are used to build a dataset of training samples T_s for Mask R-CNN (see Fig 1g). This training dataset differs in fundamental aspects from datasets that are typically generated to train FCNs for instance segmentation. In the ideal case, all OOI are marked in all images of the training dataset. In our case, some OOI may have been missed in the novelty detection stage, so these would be falsely labeled as background. To reduce the probability of encountering such falsely labeled OOI during training, we crop a 500 × 500 pixel region around each filtered and refined training proposal y_j and use these crops instead of the whole images as training samples χ_j ∈ T_s. Another difference is the use of circle shapes for fast manual refinement. In a typical training dataset, the annotated segmentation of an OOI is not limited to a particular shape but aims to segment the object as accurately as possible. Here, we define all pixels inside the circle to belong to the OOI, which may include some pixels that actually do not belong to the OOI.
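The cropping step can be sketched as follows. The text does not state how crops near the image border are handled; shifting the window so that it stays inside the image is an assumption made here, and the function name is hypothetical.

```python
def crop_window(center_x, center_y, image_width, image_height, size=500):
    # Returns (left, top, right, bottom) of a size x size window around the
    # proposal center. Assumption: the window is shifted to stay inside the image.
    left = min(max(center_x - size // 2, 0), max(image_width - size, 0))
    top = min(max(center_y - size // 2, 0), max(image_height - size, 0))
    right = min(left + size, image_width)
    bottom = min(top + size, image_height)
    return left, top, right, bottom


# Examples for an image of 2448 x 2048 pixels (the JC77 image size).
central = crop_window(1000, 1000, 2448, 2048)
near_border = crop_window(10, 2040, 2448, 2048)
```

With this choice every crop keeps the full 500 × 500 pixel size as long as the image is at least that large, at the cost of the proposal no longer being centered in crops near the border.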
Due to the fixed number of images that are considered to generate a training dataset, there may be only a few hundred or fewer samples of a particular class of OOI. For comparison, in datasets like MS COCO [6] there are many thousands of instances for each object class. To increase the number of object instances that are available for training, we boost the training dataset (see Fig 1h). Details on the boosting we apply can be found in the supporting information (see S2 Text).
We utilize a freely available TensorFlow implementation of Mask R-CNN [22] in the archived version of [23]. This implementation differs in a few aspects from the original paper [11]. Input images are resized to support training in minibatches, training bounding boxes are generated on the fly and the learning rate is reduced to 10⁻³ (see [23] for details). In MAIA, the training is performed with the set of boosted training samples (see Fig 1i) and the default configuration of the Mask R-CNN implementation. The model is initialized with weights from pre-training with MS COCO [6]. First, all layers except the ResNet 101 backbone [24] are trained for 20 epochs with a learning rate of 10⁻³, a batch size of 2 and 10³ batches per epoch. Then all layers including the backbone are trained for another 10 epochs with a reduced learning rate of 10⁻⁴ but the same batch size and number of batches per epoch. With this configuration, training takes about eleven hours per dataset on a single NVIDIA Titan X.
Inference is performed by padding each image with zeros so that each dimension is divisible by 64. This is done to guarantee smooth scaling in the six levels of the Feature Pyramid Network [25] and still process the image in its original size. Based on the segmentation masks produced by Mask R-CNN, we define any pixel not belonging to the background class as "interesting". We take the minimum enclosing circle for each region of connected interesting pixels to get the set A c of annotation candidates (see Fig 1j).
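The zero padding can be illustrated with NumPy. Padding only at the bottom and right edges is an assumption, as the text does not specify where the zeros are added, and the function name is hypothetical.

```python
import numpy as np


def pad_to_multiple(image, multiple=64):
    # Zero-pad height and width so that both are divisible by `multiple`.
    height, width = image.shape[:2]
    pad_h = (-height) % multiple
    pad_w = (-width) % multiple
    # Assumption: padding is added at the bottom and right edges only.
    padding = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, padding, mode="constant", constant_values=0)


padded = pad_to_multiple(np.ones((100, 130, 3)))
```

For an image of 2448 × 2048 pixels, this would pad the width to 2496 pixels and leave the height unchanged.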

Stage IV: Annotation candidate review
To eliminate the false positive detections, which mark regions of the images that show no OOI, the annotation candidates A c are manually reviewed in the last step. This is done analogously to the manual filtering of training proposals in Stage II. Each annotation candidate is shown as an image patch in a regular grid. A human observer then selects all candidates that are true positives, i.e. those that mark an OOI. This yields the final set A of annotations.

Datasets
We evaluated MAIA on three marine image datasets that were collected in different research projects. From each dataset Γ ∈ {JC77, PAP, SO242}, we extracted 500 random images as training subset T_Γ. For each subset, MAIA was performed to train a Mask R-CNN model. The detection performance of the model was evaluated on another 50 random images as validation subset V_Γ for each dataset. The images of the validation subset have been fully annotated using "traditional" methods.

JC77 dataset
The images of the first dataset were captured in the Central North Sea near the Sleipner CO2 storage site [26]. It comprises 6321 images of size 2448 × 2048 pixels (see Fig 5a). The images were captured at a depth of 77 m with a target distance to the seabed of 3.0 m. Traditional expert annotation took 166 min for a subset of 125 images, which corresponds to 79.68 s · image⁻¹. The two different object classes "shell" (see

PAP dataset
The second dataset contains images from the Porcupine Abyssal Plain (PAP), located southwest of the UK in international waters [27]. The dataset is composed of 3708 annotated images, each of which is a mosaic of ten single images. The images were captured in 4600-4900 m depth with a target distance to the seabed of 3.2 m. 57 different object classes were annotated by experts (for examples see Fig 2). Traditional expert annotation took 480 min for a subset of 243 images, which corresponds to 118.52 s · image⁻¹. To eliminate the black background and mosaicing artifacts (e.g. the "sawtooth" boundary between images and background [27]), we extracted tiles of 1000 × 1000 pixels that contain no black background from the mosaics (see Fig 5b). The result is a dataset of 36238 tiles (9.77 tiles · image⁻¹). The original mosaics contain a total of 33358 annotations. Owing to overlaps during tile extraction, the final dataset contains 41033 annotations. The 50 images of the validation subset V_PAP were chosen so that no overlapping tiles between training and validation subsets exist. The images of V_PAP contain a total of 57 annotations.

SO242 dataset
The third dataset consists of 809 images with a size of 4096 × 3072 pixels that were extracted from the SO242/1_83-1_AUV10 survey [28] (see Fig 5c). The images of the survey were captured at depths between 3420 m and 4140 m with a target distance to the seabed of 7.5 m. Traditional expert annotation took 87 min for a subset of 75 images, which corresponds to 69.6 s · image⁻¹. The 50 images of the validation subset V_SO242 contain a total of 350 annotations. Only a single "interesting" object class was used to annotate all megafauna in the images of V_SO242.

Results
Here we present the results of the evaluation of the MAIA stages I-IV. First, we show the detection performance of the unsupervised novelty detection using different AEN parameters. Next, we present the timings of the manual filtering and training proposal refinement steps. Third, we show the detection performance of the trained Mask R-CNN model on the validation subset V Γ of each dataset. Finally, we determine the average time it took to annotate an image using MAIA compared to a "traditional" annotation method which employs a sophisticated annotation tool like BIIGLE 2.0 but is purely manual.

Stage I: Novelty detection with AEN
The AEN, as well as Mask R-CNN, produces a pixel-wise segmentation between background and interesting regions for each image. We evaluated the segmentation with recall = TP_θ (TP_θ + FN_θ)⁻¹ (TP_θ: number of OOI contained in interesting regions, FN_θ: number of OOI not contained in an interesting region) and precision = TP_ρ (TP_ρ + FP_ρ)⁻¹ (TP_ρ: number of interesting regions containing an OOI, FP_ρ: number of interesting regions not containing an OOI). From these values we determined the F_2-score [29] for the detection performance (see Eq 6). We chose the F_2-score to put a higher weight on the recall as this was considered reasonable in the research context at hand.
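Eq 6 is not reproduced in this text. The following sketch uses the standard definition of the F_β-score with β = 2, which is assumed to correspond to Eq 6. Note that recall counts OOI (TP_θ, FN_θ) whereas precision counts interesting regions (TP_ρ, FP_ρ), so the two true positive counts are passed separately. The function names are hypothetical.

```python
def recall(tp_ooi, fn_ooi):
    # Share of OOI that are contained in an interesting region.
    return tp_ooi / (tp_ooi + fn_ooi)


def precision(tp_regions, fp_regions):
    # Share of interesting regions that contain an OOI.
    return tp_regions / (tp_regions + fp_regions)


def f_beta(precision_value, recall_value, beta=2.0):
    # Standard F-beta score; beta = 2 weights recall higher than precision.
    b2 = beta ** 2
    denominator = b2 * precision_value + recall_value
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision_value * recall_value / denominator


# Illustrative counts, not taken from the evaluation.
f2 = f_beta(precision(10, 10), recall(8, 0))
```

With β = 2, a segmentation with perfect recall and a precision of 0.5 still reaches a score of 5/6, which reflects the higher weight on recall.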
A parameter search was conducted for the number of clusters K ∈ {1, 5, 10, 50, 100}, input patch dimension r_e ∈ {29, 39} and latent layer compression factor s_c = s/r ∈ {0.1, 0.2} of the AEN. For each parameter triplet, we determined the interesting regions of the novelty maps {N_i} and calculated the F_2-score for the validation subset V_Γ of each image dataset. The F_2-scores are visualized in Fig 6 for all three datasets. Detailed values can be found in the supporting information (see S3 Text). On average, the scores of the parameter triplets with r_e = 39 are higher than those with r_e = 29 and a cluster number of K = 5 is a good selection for all datasets. The scores of the parameter triplets with a larger s_c do not differ much from those with a smaller s_c and the same r_e, but a smaller s_c has a lower computational cost.

Stage II: Manual filtering and refinement of training proposals
Novelty detection on the 500 images of the JC77 training subset T_JC77 produced a total of 18826 training proposals. These were manually filtered in 61.02 min and refined in 60.17 min, which resulted in 440 training samples for the "animal" class and 600 training samples for the "shell" class. Total manual processing of the images of T_JC77 took 14.54 s · image⁻¹ on average.
In the case of the 500 tiles of T_PAP, 2399 training proposals were produced by novelty detection. These were manually filtered in 23

Stage III: Instance segmentation with Mask R-CNN
The binary segmentation between background and interesting pixels produced by Mask R-CNN was evaluated in the same way as the novelty detection in Stage I. For each validation subset V_Γ, precision, recall and F_2-scores were calculated. The results are shown in Table 1. The values for V_JC77 are highest, followed by those for V_SO242. The precision for V_SO242 is only about half of the precision for V_JC77. The values for V_PAP are lowest with a precision less than half of the precision for V_JC77. Overall, the instance segmentation achieved an average recall of 84.1% and an average precision of 30.3%. Example annotation candidates for each dataset are shown in Fig 7.

Stage IV: Annotation candidate review
The trained Mask R-CNN model produced 308 to 1094 annotation candidates A_c^Γ for the 50 images of each validation subset V_Γ (see Table 2). These were reviewed in 2.35 to 16. [...] from 2.82 s · tile⁻¹ to 27.55 s · image⁻¹. The average review time and the 3.42 h it takes to manually filter and refine training proposals of 500 training images allowed us to build a function τ_MAIA(n) for the time it takes to annotate n images using MAIA (see Eq 7). Likewise, we were able to build a function τ_trad(n) for the time it takes to annotate n images using a traditional method (see Eq 8). This function is based on the average annotation time of 89.27 s · image⁻¹ that was measured for the three image datasets.
A plot of the two functions shows that image annotation takes less time with MAIA than with a traditional method when the dataset contains more than 200 images (see Fig 8). The speed-up increases with increasing dataset size.
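Eqs 7 and 8 are not reproduced in this text. Assuming that both functions are linear in n, with the 3.42 h of manual filtering and refinement as a constant offset for MAIA, they can be sketched as follows. The average review time per image is left as a parameter because its value is not given in this text; the value of 20 s used in the example is purely hypothetical and chosen only to show the calculation.

```python
SETUP_HOURS = 3.42                # manual filtering and refinement, from the text
TRADITIONAL_S_PER_IMAGE = 89.27   # average traditional annotation time, from the text


def tau_maia(n, review_s_per_image):
    # Assumed form of Eq 7: constant setup time plus review time per image (seconds).
    return SETUP_HOURS * 3600 + n * review_s_per_image


def tau_trad(n):
    # Assumed form of Eq 8: traditional annotation time for n images (seconds).
    return n * TRADITIONAL_S_PER_IMAGE


def break_even(review_s_per_image):
    # Number of images at which both methods take the same time.
    return SETUP_HOURS * 3600 / (TRADITIONAL_S_PER_IMAGE - review_s_per_image)


# Hypothetical review time of 20 s per image, for illustration only.
n_equal = break_even(20.0)
```

Under these assumptions, the setup time dominates for small datasets, and the advantage of MAIA grows with n because the review time per image is lower than the traditional annotation time per image.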

Discussion
The results are discussed in the same order as MAIA is described, starting with the AEN-based novelty detection. The results of the AEN parameter search indicate that clustering the images by similar background indeed improves the novelty detection performance. Novelty detection with K = 5 clusters yields on average considerably higher F_2-scores than with K = 1. However, a higher number of clusters does not further increase the scores. A larger latent layer size, which should enable the AEN to learn a higher number of or more accurate image patch properties, has no notable effect on the novelty detection performance. A compression factor of s_c = 0.1 seems to be sufficient to learn the visual properties of the seabed, which can be briefly described by a dominant color and varying brightness (see Fig 2). The larger input patch size of 39 × 39 pixels resulted in superior F_2-scores compared to the results of the smaller patch size. While it is possible that still larger input patch sizes may further increase the scores, we judged a more exhaustive parameter search to be out of scope of this evaluation. We based this decision on our observations in other scenarios where larger window sizes decrease the performance of machine learning based classification [30]. The novelty detection performance with the presented parameter triplet already resulted in a sufficient number and quality of training proposals to be useful in the annotation method.
Manual filtering and refinement proved to be efficient with the implementation as single patch classification and the available tools of BIIGLE 2.0. Although many thousands of training proposals may sound like an overwhelming number, they can be processed much faster with an average of 24.65 s · image⁻¹ than traditional image annotation with 89.27 s · image⁻¹. Contributing to this speed-up is the ordering of training proposals by novelty scores. Because of the ordering, manual processing of T_p^SO242 was fastest, although its total number of training proposals was highest among all three training subsets T_Γ.
The size of the set of training samples T_s^PAP is small when compared to the number of training samples collected for the other datasets. While a recall of over 77% is still obtained in the final instance segmentation, a higher number of training samples would probably yield better results. This would require more than the 500 tiles of T_PAP to generate a higher number of training proposals and thereby more training samples. Hence, the number of images in a training subset T_Γ should not be fixed in a future application of MAIA. It should rather be extensible on the fly with new images being added if not enough training samples have been collected.
The Mask R-CNN instance segmentation achieved an average recall of over 84% for the three evaluated datasets. Even with the PAP dataset, where only |T_s^PAP| = 200 training samples were collected, a recall of more than 77% was reached. This is a surprising result for a deep learning model like Mask R-CNN, especially considering the high variability of 57 object classes that are present in the dataset. The fact that a recall of over 77% is still achieved could be a result of the boosting applied to the training samples. The average precision of 30% may seem rather low when compared to other scenarios. In this case, however, a high number of false positive detections has a low cost regarding the review time of the annotation candidates A_c. False positives can be dismissed quickly with the Largo tool.
In contrast to image classification or information retrieval in other domains (such as medicine), there is no clear definition of thresholds for a precision or recall to be considered acceptable in marine imaging. However, we can give a heuristic recommendation based on experience. In our experiments, the results were inspected with the Largo tool in a grid overview of 50-100 image patches at the same time. There we considered it helpful if the display of such a set of image patches featured at least five true positives, just to keep and update the visual model(s) for the OOI. Thus, we would argue that 10% is the minimum value acceptable for precision, 10-50% may be considered good and more than 50% may be classified as very good. The recall values should be discussed more conservatively, as missed OOI cannot be compensated for as easily as false positive detections. However, how to define thresholds there depends on the actual research context of the data collection.
The per-instance class labels and bounding boxes that are also produced by Mask R-CNN are ignored which reduces the output to a mere object detection of a single "interesting" class. While this is sufficient for the image annotation method here, one might argue that Mask R-CNN is not the right tool for this task. Hence, we also evaluated a minimized variant of Mask R-CNN that is "cut off" after the region proposal network, omitting the subsequent classification, bounding box refinement and segmentation mask branches. Details can be found in the supporting information (see S4 Text). Although we suspect that the low number of training samples is not sufficient for accurate classification, an evaluation of the class label output of Mask R-CNN remains a topic for future research.
The evaluation of the annotation candidate review points out different characteristics of the image datasets. The images of SO242 have the highest resolution and, with a target distance to the seabed of 7.5 m, the largest footprint of visible objects. Hence, the same number of images can contain a larger number of potential OOI than images with a smaller resolution taken closer to the ground. Combined with the rather low precision of Mask R-CNN for SO242, this contributes to the much higher number of annotation candidates in A_c^SO242 when compared to the other two datasets. In addition to that, the review time in s · candidate⁻¹ for SO242 is highest among the three datasets. A possible reason for this, again, might be the larger distance to the seabed since smaller objects are harder to recognize.
The review time in s · candidate⁻¹ for A_c^PAP is much lower than for A_c^JC77 although the images of both datasets were taken at a comparable distance to the seabed. This can be explained by differences in object classes between the datasets. Instances of the "shell" class of JC77 (see Fig 2 first column) can sometimes hardly be distinguished from rocks or shell fragments which do not belong to this class. Furthermore, some instances of the "animal" class (see Fig 2 second column) are rather well camouflaged. This is not the case for the majority of OOI in PAP.
The heterogeneity in different properties of the three datasets provides a solid basis to evaluate MAIA. Even the highest review time in s · image⁻¹ stays well below the time that was determined for a traditional annotation method. The annotation time functions τ_MAIA and τ_trad show that the constant time it takes to prepare a training dataset for Mask R-CNN is quickly made up for by the much faster process of annotation candidate review when compared to manual scanning of whole images for OOI. For datasets containing 200 or more images, MAIA starts to be faster than a traditional annotation method. In the case of the datasets with 550 images that were used in this evaluation, MAIA is 2.19 times faster.
An interesting alternative to a visual evaluation of the Mask R-CNN results may be to classify those with another convolutional neural network. Recently, we were able to show that even the problem of strong class imbalance can be solved given the right training and data preparation methods [30].

Conclusion
We presented MAIA, a novel machine learning assisted method for image annotation in environmental monitoring and exploration. MAIA requires fewer manual interactions than traditional annotation methods. We have used BIIGLE 2.0 in this work as the interactions required by MAIA can be performed very efficiently with this system. For datasets with more than 200 images, MAIA offers a faster annotation speed than traditional methods, with an average recall of 84.1%. The speed-up increases with the size of a dataset and already reaches a factor of 2.19 with datasets of 550 images that were used in our evaluation. Based on these results, we conclude that MAIA is a promising method for image annotation in all environmental monitoring and exploration scenarios with large image collections.

Supporting information
S1 Text. Image clustering. A description of the image clustering that is done prior to the training of the AEN.