Orthophoto-Based Vegetation Patch Analyses—A New Approach to Assess Segmentation Quality

Abstract: This paper focuses on evaluating the quality of image prediction in the context of searching for plants of a single species in a given area, using Heracleum sosnowskyi Manden as an example. This process involves a simplified classification that ends with a segmentation step. Because of the particular characteristics of environmental data, such as large areas of plant occurrence, significant partitioning of the population, or the characteristics of a single individual, standard statistical measures such as Accuracy, the Jaccard Index, or the Dice Coefficient do not produce reliable results, as shown later in this study. This demonstrates the need for a new method of assessing prediction quality that is adapted to the unique characteristics of vegetation patch detection. The main aim of this study is to provide such a metric and demonstrate its usefulness in the cases discussed. Our proposed metric introduces two new coefficients, $M^+$ and $M^-$, which, respectively, reward true positive regions and penalise false positive regions, thus providing a more nuanced assessment of segmentation quality. The effectiveness of this metric has been demonstrated in different scenarios focusing on variations in the spatial distribution and fragmentation of theoretical vegetation patches, comparing the proposed method with traditional metrics. The results indicate that our metric offers a more flexible and accurate assessment of segmentation quality, especially in cases involving complex environmental data. This study aims to demonstrate the usefulness and applicability of the metric in real-world vegetation patch detection tasks.


Introduction
Semantic segmentation, i.e., the separation of semantically coherent areas (segments) in an image, is used in many fields, for example machine vision in the broad sense [1–3], medical diagnostics [4,5], and remote sensing [6,7]. The popularity of this technique is related, among other things, to the development of increasingly efficient and effective deep learning methods, which allow segmentation methods to be easily adapted to specific classes of problems.
In the case of remote sensing, segmentation is often the first step in the image analyses performed, concerning tasks as diverse as land use classification [8], object detection [9,10], change monitoring [11], discovery and inventory of historical heritage [12], or support for precision agriculture [13]. In this study, we focus on the application of deep learning and semantic segmentation to the inventory and detection of stands of a selected plant species. Depending on the scale of the analysed images, this problem can come down to the detection of individual plants [14], their local clusters [15], or their global range [16]. The choice of scale depends on the objective we want to achieve. In our work, we use images taken with UAVs; in this particular case, the resolution of the images obtained allows the identification of local clusters but does not allow the separation of a single plant. Such a scale allows a simple inventory to be made. The main advantage of this solution is that the inventory can be performed much faster than from the ground with human intervention and that it is possible to reach places that are difficult or impossible to access from the ground [17,18].
Selecting an appropriate deep learning algorithm and training a model for a specific task, e.g., searching for a specific plant species on an orthophoto, usually requires many experiments. By modifying the structure and parameters of the learning network or the characteristics of the input data (e.g., number and type of image channels, ground sample distance, etc.), we look for a model that will allow the most accurate prediction of the area of plant occurrence. To obtain such a model, we need to compare the quality of the data obtained, eliminating those with insufficient accuracy and continuing to experiment with solutions with higher accuracy, which leads to the problem of assessing the quality of the segmentation performed. This problem is not exclusive to remote sensing data but is generally related to segmentation itself. However, solutions are often specific to a particular type of data.
Many studies attempt to structure segmentation evaluation methods. A hierarchical classification of evaluation methods was proposed by Zhang [19], among others, and repeated with minor changes by Chen [20]: firstly, methods were divided into subjective, i.e., based on human evaluation, and objective methods. Within the objective methods, indirect methods, among others, were distinguished, which do not directly evaluate the algorithm itself, but focus on evaluating the results. The direct methods can be further divided into two categories: supervised methods, where we use a ground truth mask for evaluation, and unsupervised methods. Unsupervised methods are systematically developed [21], but supervised methods are by far the more commonly used. They dominate numerous overviews of general evaluation methods [22,23] and are among the methods dedicated to specific applications, such as medical [24] or remote sensing [20,25,26]. Wanga et al. [22] detail the above division, distinguishing pixel-based, region-based, and distance-based methods among supervised methods. In real-world applications, pixel-based supervised evaluation methods dominate. This trend is mainly related to the simplicity and speed of these methods. In the case of remote sensing image segmentation, the most commonly used basic measures of this type are Accuracy, Recall, Precision, and, interchangeably, the Jaccard Index or Dice Coefficient [6,27]. Among the methods based on Accuracy, there are Cohen's Kappa and Kendall's Tau coefficients [28,29]. All of these metrics, in the form of easily comparable numerical values, determine how accurately the prediction resulting from the use of a given model corresponds to the pattern (ground truth mask). This is no different in the special case of remote sensing, where the objects being segmented are plants [30–35].
The purpose of this article is to propose a new method for evaluating the quality of segmentation that does not focus on efficient pixel-to-pixel matching, but approaches detection more flexibly due to the specifics of, for example, vegetation. From the point of view of field analysis, it is of greatest importance to identify specific plant sites (regions) in a given area. Even an "incomplete" indication of a single region will be useful, as it will allow a drone or a person to be sent to such a place again for more accurate field verification. By an incomplete indication, we mean one where the prediction hits only a certain part (e.g., 10%) of the area of a coherent region. Therefore, it would be ideal to reward even small-area true positive hits highly and to give progressively smaller additional rewards as the hit area increases. Analogously, false positive hits should be penalized only when their area is large. As a result, we would like to obtain an amplification of the coefficient with an increase in the number of positively localised regions and, at the same time, reduce the impact of region-matching accuracy at the individual pixel level.
Our proposed metric introduces two new coefficients implementing the above assumptions, $M^+$ and $M^-$, which, respectively, reward true positive regions and penalize false positive regions, providing a more nuanced assessment of segmentation quality. The effectiveness of this metric is demonstrated further with detailed scenarios focusing on changes in the spatial distribution and fragmentation of theoretical vegetation patches, comparing the proposed new method with traditional metrics. The results indicate that our metric offers a more flexible and accurate assessment of segmentation quality, especially in cases involving complex environmental data with a particular focus on vegetation analysis. In this paper, we initially used the example of the analysis of the invasive species Heracleum sosnowskyi (Manden.).

Materials and Methods
In this study, we used the results obtained from our work to detect stands of Heracleum sosnowskyi (Manden.) in drone flight images. Invasive alien species (IAS) are animals and plants that are accidentally or intentionally introduced outside their natural habitat [15,36,37]. Their presence often has serious consequences for native ecosystems, including plant and animal species, leading to a loss of biodiversity and negative economic consequences. One of the most significant threats is the rapidly spreading population of invasive hogweed species (Heracleum spp.). Currently, there are four species of the genus Heracleum in Poland: H. mantegazzianum, H. sosnowskyi, H. sphondylium, and H. sphondylium subsp. sibiricum. Two of them, H. mantegazzianum Sommier and Levier, 1895, and H. sosnowskyi Manden, 1944, are of Caucasian origin and are classified as invasive alien species according to the EP Regulation 2014. Originally brought to Europe, primarily to Central European nations, for decorative and practical purposes (as animal feed), they have spread quickly and adversely impacted biodiversity by changing the features of the environment. Furthermore, the sap of these plants, which contains phytotoxic substances, poses a risk to both animals and humans [38,39]. Currently, their control requires a comprehensive approach for inventorying their occurrence and assessing their potential for control based on the size of the communities they form. Surveys were conducted at the inventory sites of the plant in areas where it has been noted to occur in large numbers. The possibility of an aerial inventory is important because Heracleum sosnowskyi Manden often occurs in wetlands and areas that are difficult to access. The identification of accurate sites using machine learning prediction will facilitate further work, including the following:

• Conducting additional drone flights with different parameters, for example, from a lower altitude (to obtain higher-resolution images) or using a different camera (e.g., multispectral instead of RGB);
• Monitoring changes in the extent of the plant over time (e.g., in connection with an attempt to eradicate it);
• Sending people to these locations to verify the presence of the plant on the ground and possibly undertake its removal.
A DJI Matrice 210 RTK v2 drone fitted with a MicaSense RedEdge-MX multispectral camera was used to capture the pictures. The drone's position during flight was determined by GPS and RTK services. Additionally, GPS and RTK services were used to locate ground control points (GCPs). The images did not include the RTK position but only the position from the GPS module built into the camera. Data from a dedicated MicaSense DLS 2 incident light sensor were used to normalize the results obtained from the camera. Version 1.6.2 of Agisoft Metashape software was used to assemble orthophotos.
In the baseline scenario represented in this study, prediction results were compared for:
• Three-channel orthophoto (red, green, blue);
• Five-channel orthophoto (red, green, blue, red edge, near-infrared).
The images for the orthophotomap were acquired from an area of 63.3 ha, located near the small lake Jan (Poland, West Pomeranian Voivodeship, 53°5′24″ N, 15°19′24″ E). The flight area included lakes, meadows, arable fields, scrubs, and forests. The locations of the study sites are shown in Figure 1.

Standard Segmentation Quality Metrics Used
We present the basic measures of supervised pixel-based evaluation using standard notation [40]. Let $T$ denote the ground truth mask, $P$ denote the prediction mask, and $\|T\|$ and $\|P\|$ be their respective areas. We use the standard terms TP (true positive), FP (false positive), TN (true negative), and FN (false negative), for which the following relationships hold (Figure 2):

$$\|T\| = TP + FN, \qquad \|P\| = TP + FP.$$

The primary metrics used to assess classification quality include Accuracy (or, more precisely, Overall Accuracy), Recall (Sensitivity), and Precision [40,41]. These metrics are based on pixel counts and ignore the existence and distribution of the connected components of the ground truth mask. In our case, they are only helpful for an initial evaluation of classification quality. Hereafter, we refer to connected components (i.e., the smallest indivisible plant sites) as regions.
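The pixel-based measures named above can be sketched in a few lines. This is a minimal illustration using the textbook definitions of Accuracy, Recall, and Precision, not the authors' code; the function names `confusion_counts` and `pixel_metrics` and the tiny example masks are hypothetical, and masks are assumed to be equally sized lists of rows containing 0 or 1.

```python
def confusion_counts(truth, pred):
    """Count TP, FP, FN, TN for two equally sized binary masks (lists of rows)."""
    tp = fp = fn = tn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1   # predicted, but absent from the ground truth
            elif t:
                fn += 1   # present in the ground truth, but not predicted
            else:
                tn += 1
    return tp, fp, fn, tn


def pixel_metrics(truth, pred):
    """Textbook Accuracy, Recall and Precision computed from pixel counts."""
    tp, fp, fn, tn = confusion_counts(truth, pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical 3 x 4 example masks.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0]]
pred = [[1, 0, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 0]]
metrics = pixel_metrics(truth, pred)
print(metrics)
```

Note that none of these quantities depends on where the hit pixels lie, which is the limitation the rest of the paper addresses.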
To evaluate the quality of the prediction more precisely, we used three metrics commonly used to evaluate the quality of image segmentation (e.g., in aerial or medical image analysis tasks [28,42–45]). Let us recall their definitions:

1. IoU (intersection over union); other terms used in the literature include the Jaccard index and the Jaccard similarity coefficient:

$$IoU = \frac{\|T \cap P\|}{\|T \cup P\|} = \frac{TP}{TP + FP + FN}.$$

The values of the coefficient are in the range [0, 1], with 0 indicating no fit and 1 indicating full fit.

2. Dice coefficient:

$$Dice = \frac{2\,\|T \cap P\|}{\|T\| + \|P\|} = \frac{2\,TP}{2\,TP + FP + FN}.$$

The values of the coefficient are in the range [0, 1], with 0 indicating no fit and 1 indicating full fit. Dice's coefficient can also be calculated using the Recall and Precision metrics mentioned previously:

$$Dice = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.$$

3. Tau coefficient; other names: Kendall rank correlation coefficient and Kendall's Tau coefficient. In this study, the Tau-b version of the coefficient was used. The Tau coefficient [28] was proposed as a simpler and more efficient version of Cohen's Kappa coefficient [44], which is also used as a measure of accuracy. Let $C$ denote the number of concordant pairs, $D$ the number of discordant pairs, $T_t$ the number of ties only in $T$, and $T_p$ the number of ties only in $P$. If a tie occurred for the same pair in both $T$ and $P$, it was not added to either $T_t$ or $T_p$. The Tau-b coefficient is given by the following equation:

$$\tau_b = \frac{C - D}{\sqrt{(C + D + T_t)(C + D + T_p)}}.$$

Although the values of the Tau coefficient are in the range [−1, 1], to make them comparable with the other coefficients, we decided to normalise them; as a result, they are in the range [0, 1] (i.e., the same as the IoU and Dice coefficients). The formula for the normalized Tau-b coefficient is given below:

$$\tau_{norm} = \frac{\tau_b + 1}{2}.$$
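As an illustration, the three coefficients can be computed directly from the four pixel counts. The sketch below is an assumption-laden example rather than the authors' implementation: the function name `overlap_metrics` is hypothetical, the pair counts follow from treating every pixel as one paired binary observation, and the normalisation $(\tau_b + 1)/2$ is assumed as the simplest mapping from [−1, 1] to [0, 1].

```python
import math


def overlap_metrics(tp, fp, fn, tn):
    """IoU, Dice, Tau-b and normalised Tau-b from pixel counts of binary masks."""
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if union else 0.0

    # Pair counts for two binary variables (one observation per pixel).
    concordant = tp * tn                 # pixel pairs ordered the same way in T and P
    discordant = fp * fn                 # pixel pairs ordered oppositely
    ties_truth = tp * fn + fp * tn       # tied in T only
    ties_pred = tp * fp + fn * tn        # tied in P only

    denom = math.sqrt((concordant + discordant + ties_truth)
                      * (concordant + discordant + ties_pred))
    tau_b = (concordant - discordant) / denom if denom else 0.0
    tau_norm = (tau_b + 1) / 2           # assumed normalisation to [0, 1]
    return iou, dice, tau_b, tau_norm


# Hypothetical counts for a small image.
iou, dice, tau_b, tau_norm = overlap_metrics(tp=2, fp=1, fn=1, tn=4)
print(iou, dice, tau_b, tau_norm)
```

All four values are functions of the counts alone, so two predictions with identical TP, FP, FN, and TN receive identical scores regardless of how the hits are distributed over regions.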

Research Preparation and Problem Identification
Two different segmentation models based on deep learning were used in this study:
• UNet powered by 3-channel data (RGB);
• UNet powered by 5-channel data (RGB, red edge, and near-infrared).
Because the main goal of this ongoing study was to determine the range of a single species (Heracleum sosnowskyi Manden.), the task was limited to binary segmentation (whether a given location in the image is occupied by the plant). The trained models were evaluated for prediction quality using a 1-bit ground truth mask. The model structure, its parameters, and the learning process are not relevant in this study. The most important task is to compare the quality of the obtained predictions (for example, after changing the learning parameters) to determine which element of the analysis improves or worsens the segmentation results.
An example of an RGB orthophoto is shown in Figure 3A. An inventory was carried out for the same area, and a ground truth mask of the occurrence of the selected plant species in the study area was created (Figure 3B).
Figure 3C,D displays the result of overlaying the prediction and ground truth masks for the two models mentioned previously. The colours in Figure 3C,D are interpreted as follows:
• Blue is a true positive (TP; an object exists in the mask and has been indicated in the prediction).
• Black is a false positive (FP; no object in the mask, but one has been indicated in the prediction).
• Red is a false negative (FN; an object exists in the mask but has not been indicated in the prediction).
• White is a true negative (TN; no object in the mask, and none indicated in the prediction).
The example shown in Figure 3 reveals several disadvantages of using the Tau, IoU, and Dice metrics to compare prediction quality. Even at first glance, one can see a significant improvement in the prediction from Figure 3D (infrared model) compared with the prediction from Figure 3C. False indications on the lake surface observed in Figure 3C have been eliminated. The standard values of the metrics given in Table 1 do not, however, reflect this. The changes in the metrics' values were minimal, despite the clear modifications to the prediction structure. This is because of the linear dependence of the coefficient values on the total area of the correctly hit pixels. As a result, standard metrics are primarily focused on the number of pixels detected but are insensitive to their distribution. A closer analysis of both predictions revealed that, in the case of the Figure 3D prediction, the positive impact of eliminating obvious false hits was, to a certain extent, offset by the lower accuracy of positive hits; the issue is not so much fewer hit regions as less accurate coverage.
Identifying the locations of plants (regions) in a given area is crucial for the research that will be carried out. Even an incomplete indication of a single region will be useful for sending a drone or a ground-level survey worker to that location. By an incomplete indication, we mean that only a part (e.g., 10%) of a region is hit. Therefore, the optimal solution would be to reward even small TP area hits highly and to give progressively smaller additional rewards as the hit area increases. Analogously, FP hits should be penalized only for large (area) misses. As a result, we would like the coefficient to increase with the number of positively localized regions while the impact of region-matching accuracy at the individual pixel level is reduced.
We conducted a comprehensive analysis of the collected data to investigate this issue further, verifying the number and size of the individual regions identified in the ground truth mask and in the predictions.
Figure 3B shows 611 regions with a total area of 5,665,854 pixels (the entire image has 372,799,248 pixels). Table 2 lists the regional parameters for both predictions. The number of regions and their total areas are included, considering both the masks themselves (ground truth and predictions) and the results in the form of TP and FP images. The number of TP region hits was slightly lower in the Figure 3D prediction than in the Figure 3C prediction (292 vs. 301). In contrast, the number and area of the indicated FP regions were significantly lower.
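Region statistics of this kind can be obtained by labelling connected components. The following is a minimal pure-Python sketch, assuming 8-connectivity (the connectivity used in the study is not stated) and masks stored as lists of rows containing 0 or 1; the names `label_regions` and `region_summary` are hypothetical. For orthophoto-sized images, an array library would be the practical choice.

```python
from collections import deque


def label_regions(mask):
    """Return the connected regions of a binary mask as lists of (row, col) pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:                       # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):          # 8-connectivity (assumed)
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions


def region_summary(truth, pred):
    """Number of ground truth regions, their total area, and how many were hit."""
    regions = label_regions(truth)
    hit = sum(1 for region in regions if any(pred[y][x] for y, x in region))
    return {"regions": len(regions),
            "area": sum(len(region) for region in regions),
            "hit_regions": hit}


# Hypothetical 4 x 5 example masks.
truth = [[1, 1, 0, 0, 0],
         [1, 1, 0, 0, 1],
         [0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0]]
pred = [[1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0]]
summary = region_summary(truth, pred)
print(summary)
```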

Proposed New Metric
For the aforementioned Tau, IoU, and Dice metrics, the values of the calculated coefficients depend linearly on the TP area. Thus, for those metrics there is no need to treat each region independently (although this is possible), and one can operate on the entire image at once. We aim to change this dependence into a nonlinear one that promotes even a small hit in a region: as the TP area in a single region increases, the increase in the coefficient value diminishes. To achieve this, we analyse each region separately. The logarithmic function used in the initial tests required a relatively complex normalization to give the same results for images with varying resolutions (at the same percentage coverage). A better solution is to use a root function (for the TP) and a power function (for the FP).
Because we are interested in detecting a given object, both the ground truth and prediction masks are binary matrices of the same dimension.The proposed metric M comprises the following coefficients:

• $M^+$: positive region coefficient, calculated for the TP regions and taking values from 0 to 1;
• $M^-$: negative region coefficient, calculated for the FP regions and taking values from 0 to 1.
To better understand how the formulas work, consider an example illustration in which a ground truth mask and a prediction mask have been layered on top of each other. Let us denote the ground truth mask as $T$ and the prediction mask as $P$. Regions T1, T2, T3, T4, and T5 were separated in the ground truth mask, and regions P1, P2, P3, P4, P5, P6, P7, and P8 were separated in the prediction mask, as in the illustration below (Figure 4).
We identify the regions with their matrix representations and denote them in the same manner. First, for each region of the ground truth mask, we calculate its positive coefficient $m^+$, which is then used to calculate the positive region coefficient $M^+$. To calculate $m^+$, from the initial matrix $T$ of the ground truth mask, we select the smallest submatrix containing the first region $T_R$, i.e., the blue region denoted as T1. This submatrix is a matrix of dimension $k \times l$ with nonzero rows and columns, represented by the rectangle surrounding the T1 region, as shown in Figure 5.
When we select the same submatrix from the prediction mask matrix $P$, it may contain part of the prediction mask, which we call $P_R$. Here, parts of the orange regions P1 and P3 and the whole region P2 lie in the submatrix of $T_R$. Next, we calculate the intersection area $I_R$ of the region $T_R$ of the ground truth mask with the area $P_R$ of the prediction mask. Since all matrices consist entirely of zeros and ones, the $ij$-th pixel belongs to the intersection if and only if the product of the values of the $ij$-th pixel of the ground truth mask and the $ij$-th pixel of the prediction mask is equal to 1. Hence, to find the intersection area $I_R$, it is sufficient to multiply the matrices $T_R$ and $P_R$ element-wise and sum all the ones of the resulting matrix, thus

$$I_R = \sum_{i=1}^{k} \sum_{j=1}^{l} T_{ij} P_{ij},$$

where $T_{ij} \in \{0, 1\}$ denotes the value of the $ij$-th pixel of $T_R$ and $P_{ij} \in \{0, 1\}$ denotes the value of the $ij$-th pixel of $P_R$. To obtain the positive coefficient $m^+$ for a given region, we use the following formula:

$$m^+ = \left( \frac{I_R}{\|T_R\|} \right)^{1/\alpha},$$

where $\|T_R\|$ denotes the area of the region of the ground truth mask that we are investigating, that is,

$$\|T_R\| = \sum_{i=1}^{k} \sum_{j=1}^{l} T_{ij},$$

and α is the exponent that determines the final shape of the function and is chosen experimentally, based on the analysed data. The main role of the α parameter is to set the significance of incomplete area hits in the evaluation of an image segmentation method. The higher the value of the α parameter, the more we reward incomplete area hits. In this way, we go through all regions in the ground truth mask and obtain the positive region coefficient, defined by the following formula:

$$M^+ = \frac{1}{\|T\|} \sum_{T_R \subseteq T} \|T_R\| \, m^+(T_R).$$

Negative region coefficients are calculated analogously. From the prediction matrix, we select the submatrix of dimension $m \times n$ as the smallest submatrix containing a region $P_R$ of the prediction mask $P$.
Similarly, after separating the same submatrix in the ground truth mask $T$, we note that parts of the ground truth mask may appear in it; let us denote them by $T_R$. Then, we calculate the intersection $I_R$ of the region $P_R$ with the region $T_R$ as before. The negative coefficient $m^-$ for a single region $P_R$ is calculated using the following formula:

$$m^- = \left( \frac{\|P_R\| - I_R}{\|P_R\|} \right)^{\beta},$$

where $\|P_R\|$ denotes the area of the region of the prediction mask we are investigating, that is,

$$\|P_R\| = \sum_{i=1}^{m} \sum_{j=1}^{n} P_{ij},$$

and β is the exponent that determines the final shape of the function. The main function of the β parameter is to set the significance of false positive areas. It is again selected through experimentation, taking into account the particular characteristics of the analysed data. The higher the value of the β parameter, the weaker the penalty for a false positive area.
As before, we go through all the regions of the prediction mask and obtain the formula for the negative region coefficient:

$$M^- = \frac{1}{\|P\|} \sum_{P_R \subseteq P} \|P_R\| \, m^-(P_R).$$

The entire process of going through the regions to calculate the $M^+$ and $M^-$ coefficients is shown in Figure 6 below.
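The region-by-region procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' code, and it rests on several assumptions: the per-region reward is taken as $(I_R/\|T_R\|)^{1/\alpha}$ and the penalty as $(1 - I_R/\|P_R\|)^{\beta}$, per-region values are aggregated with weights proportional to region area, regions are 8-connected, and masks are lists of rows containing 0 or 1. The names `label_regions` and `region_metric` are hypothetical. Iterating over the pixels of a region is equivalent to the bounding-box submatrix product, because pixels outside the region contribute zero.

```python
from collections import deque


def label_regions(mask):
    """Connected regions (8-connectivity assumed) as lists of (row, col) pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions


def region_metric(truth, pred, alpha=5.0, beta=5.0):
    """Return (M_plus, M_minus, delta) under the assumptions stated in the text."""
    truth_area = sum(map(sum, truth))
    pred_area = sum(map(sum, pred))
    reward = 0.0
    for region in label_regions(truth):
        hit = sum(pred[y][x] for y, x in region)          # intersection area I_R
        reward += len(region) * (hit / len(region)) ** (1.0 / alpha)
    penalty = 0.0
    for region in label_regions(pred):
        hit = sum(truth[y][x] for y, x in region)         # intersection area I_R
        penalty += len(region) * (1.0 - hit / len(region)) ** beta
    m_plus = reward / truth_area if truth_area else 0.0
    m_minus = penalty / pred_area if pred_area else 0.0
    return m_plus, m_minus, m_plus - m_minus


# Hypothetical example: two 2 x 2 truth regions, each hit in one pixel,
# plus one isolated false positive pixel.
truth = [[1, 1, 0, 0, 1, 1, 0],
         [1, 1, 0, 0, 1, 1, 0],
         [0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0]]
pred = [[1, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]]
m_plus, m_minus, delta = region_metric(truth, pred)
print(m_plus, m_minus, delta)
```

In this example, both truth regions are only 25% covered, yet with α = 5 the reward is about 0.76, whereas the isolated false positive pixel, which misses the mask completely, is penalised in full.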
Table 3 shows the coefficient values of the proposed metric for the parameters α and β equal to 5, calculated for the scenario discussed earlier (Figure 3). The $M^+$ coefficients differ only slightly between the Figure 3C and Figure 3D predictions; however, there is a very clear difference between the $M^-$ coefficients, which drives the difference in ∆, the difference between the two coefficients ($\Delta = M^+ - M^-$). This corresponds to an intuitive visual assessment of both predictions. The values of the parameters α and β, which affect the degree of non-linearity of the coefficients for the individual regions, can be chosen experimentally for specific images. In the previous scenario, as well as in the test scenarios described in the next section of this study, the parameters α and β are equal to 5.
The coefficient $M^+$ corresponds to the Recall metric, and for α = 1 it takes the same values. Only for values of α greater than 1 is the presence of regions taken into account. The coefficient $M^-$ is related to the Precision metric; for β = 1, the value of Precision is equal to $1 - M^-$.
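The two limiting cases can be checked with a short derivation. It is a sketch under the assumption that the region coefficients aggregate the per-region values with weights proportional to region area, i.e., $M^+ = \frac{1}{\|T\|}\sum_R \|T_R\|\,(I_R/\|T_R\|)^{1/\alpha}$ and $M^- = \frac{1}{\|P\|}\sum_R \|P_R\|\,(1 - I_R/\|P_R\|)^{\beta}$, where $I_R$ is the intersection area of the region being processed:

$$M^+\Big|_{\alpha=1} = \frac{1}{\|T\|}\sum_{T_R} \|T_R\| \cdot \frac{I_R}{\|T_R\|} = \frac{\sum_{T_R} I_R}{\|T\|} = \frac{TP}{TP + FN} = Recall,$$

$$M^-\Big|_{\beta=1} = \frac{1}{\|P\|}\sum_{P_R} \left(\|P_R\| - I_R\right) = \frac{\|P\| - TP}{\|P\|} = \frac{FP}{TP + FP} = 1 - Precision.$$

Both steps use only the fact that the regions of a mask are disjoint, so their intersection areas add up to TP.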

Results and Discussion
To facilitate the analysis of the behaviour of the tested metrics, including the new one proposed in this study, sets of test images were generated. All images were 200 px × 140 px, with a total area of 28,000 pixels.
The ground truth mask consists of 20 square regions, each with an area of 100 px. They were arranged in five columns and four rows. The distance between squares was 20 px (Figure 7A). Prediction images were generated as required, with the same size as the ground truth mask, but with varying degrees of correlation with the mask. For example, Figure 7B shows a prediction with 24 regions. The first eight regions (first two columns) were 25 px in area (5 × 5 px squares) and overlapped with parts of the ground truth mask regions. The next 12 regions (i.e., the next three columns) are 100 px squares, which exactly overlap with the mask regions, and the last 4 regions (last column) are 100 px squares, which do not overlap with any mask region. Figure 7C shows the intersection of the ground truth mask with the generated prediction using the colour convention described earlier (blue: TP, true positive; black: FP, false positive; red: FN, false negative). Further analysis of the results and discussion were performed using the easy-to-interpret graphical scenarios discussed below.
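A test mask of this kind is easy to generate. The sketch below assumes that the grid of squares is centred in the image and that the 20 px distance is measured between the edges of neighbouring squares; neither detail is stated in the text, and the function name `make_ground_truth` is hypothetical.

```python
def make_ground_truth(width=200, height=140, cols=5, rows=4, side=10, gap=20):
    """Binary mask (list of rows) with a centred grid of side x side squares."""
    grid_w = cols * side + (cols - 1) * gap
    grid_h = rows * side + (rows - 1) * gap
    left = (width - grid_w) // 2      # assumed: grid centred horizontally
    top = (height - grid_h) // 2      # assumed: grid centred vertically
    mask = [[0] * width for _ in range(height)]
    for i in range(rows):
        for j in range(cols):
            y0 = top + i * (side + gap)
            x0 = left + j * (side + gap)
            for y in range(y0, y0 + side):
                for x in range(x0, x0 + side):
                    mask[y][x] = 1
    return mask


mask = make_ground_truth()
mask_area = sum(map(sum, mask))
print(len(mask), len(mask[0]), mask_area)
```

With these defaults, the mask covers 2000 of the 28,000 pixels, i.e., 20 regions of 100 px each.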

In our research, we also employed larger images, scaling the images of the test set two, five, and ten times. Apart from the experiments with the abandoned logarithmic and exponential functions, however, this did not alter the values of the coefficients.

Cases I with Fixed Dice and IoU Values
Figure 8 shows a set of six tests (I-A to I-F). In all tests, the ground truth (T) mask was the same. In tests I-A to I-E, the total area of TP hits did not change (25%). In the last test, the TP area increased to 30% but was offset by FP hits (the FP area was chosen so that the Dice and IoU coefficients remained unchanged).

The values of the coefficients for the test images displayed in Figure 8 are listed in Table 4. As can be seen, these are cases where the Tau, IoU, and Dice metrics failed, at least given our previous assumptions. At the same time, the coefficient $M^+$ clearly shows the difference between images, rewarding small hits in additional regions. In the last test (Figure 8(I-F)), the incorrect prediction of several regions was reflected in the value of the coefficient $M^-$ and, therefore, in the value of ∆. For the relatively high fit in scenario I-E, the value of ∆ is high. In contrast, for scenario I-F, where not only is the detection of the regions itself worse but there are also additional FP areas, the value of ∆ is negative.
Table 4. The values of the coefficients for the test images in Figure 8.
For the same test cases (excluding case I-F), the effect of the parameter α on the M + coefficient was verified (results in Table 5; same TP area, different numbers of hit regions). As expected, increasing the value of α strengthens the effect of the number of hit regions on the coefficient value. Therefore, the choice of parameters can be optimised for a specific task, that is, for the predictions analyzed (number and size of regions) and the expected results (how much more important hitting a region is than covering its area accurately). The main role of parameter β is to amplify the penalty for the number of FP regions, not just their total area. Since parameter β, defined for the coefficient M − , behaves analogously to parameter α, defined for the coefficient M + , we skip its detailed description.

As mentioned previously, the α and β parameters depend on the type of data, so their values are chosen experimentally. The choice is left to the user, who decides to what extent to reward positive hits and penalize incorrectly detected areas.
Looking at Table 5, we can see that the higher the value of the α parameter, the greater the reward for detecting the same number of regions and, therefore, the greater the value of the M + coefficient.

Cases II-Hits in the Same Percentage of the Ground Truth Mask Area
Consider a situation in which the prediction always overlapped 25% of the ground truth mask area, but its regions were hit differently (Figure 9). Table 6 displays the coefficients computed for these tests. Note that in the absence of false-positive (FP) areas, Dice, IoU, and Tau were insensitive to the difference between detecting 25% of all regions (Figure 9(II-A)) and 25% of the area of each region (Figure 9(II-B)) and showed the same fit. For our applications, the latter prediction is preferred, as reflected in the corresponding values of the coefficient M + : it is largest when every region is hit, smaller when eight regions are hit, and smallest when only five are detected. When an FP area appeared, the values of the Dice, IoU, and Tau coefficients decreased. In our case, this is not a desirable effect, as we cannot tell whether the smaller coefficient is due to hitting a smaller area, hitting fewer regions, or the appearance of FP areas. Hence, the coefficient M − was introduced. It stores this information and makes it possible to compare predictions explicitly in terms of the appearance of FP areas. Let us further note that for the poor fit between mask and prediction in tests II-A and II-C, the value of the coefficient ∆ is low. In tests II-B and II-D, all regions were detected in the same manner, and the only difference was the appearance of a fairly large FP area in the second test. The value of the ∆ coefficient reflected this and decreased from 0.758 to 0.521.
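The insensitivity of Dice and IoU to how the hits are distributed can be reproduced on toy masks. The following sketch does not use the test images from Figure 9: the masks are hypothetical, regions are assumed to be 4-connected components, and the helper names are illustrative. It compares a prediction that covers one of four regions completely with one that covers a quarter of each region.

```python
def label_regions(mask):
    """Label 4-connected regions of a binary mask given as a list of lists of 0/1."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                count += 1
                labels[r][c] = count
                stack = [(r, c)]
                while stack:  # iterative flood fill
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count


def summarize(truth, pred):
    """Return (Dice, IoU, number of ground truth regions hit at least once)."""
    labels, _ = label_regions(truth)
    cells = [(t, p, l)
             for tr, pr, lr in zip(truth, pred, labels)
             for t, p, l in zip(tr, pr, lr)]
    tp = sum(1 for t, p, _ in cells if t and p)
    t_area = sum(t for t, _, _ in cells)
    p_area = sum(p for _, p, _ in cells)
    dice = 2 * tp / (t_area + p_area)
    iou = tp / (t_area + p_area - tp)
    hit_regions = len({l for t, p, l in cells if t and p})
    return dice, iou, hit_regions


# Hypothetical ground truth: four separate 2x2 regions.
truth = [[1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
         [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]]
# Prediction A: one region covered completely (25% of the regions).
pred_a = [[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
# Prediction B: one pixel in every region (25% of each region).
pred_b = [[1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

print(summarize(truth, pred_a))
print(summarize(truth, pred_b))
```

Both predictions give the same Dice and IoU because the TP area and the prediction area are identical; only the count of hit regions (one versus four) distinguishes them, which is the information a region-aware coefficient such as M + is meant to reward.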

Cases III Where the Same Ground Truth Mask Regions Are Hit
Let us now assume that we want to compare predictions that detect the same regions (Figure 10). Table 7 displays the coefficients computed for these tests. In tests III-A, III-B, and III-C, the increase in the coefficient M + was slower than that in the Dice and IoU metrics. This is because only the area of hits in the regions increases, not their number. Note that for scenarios III-C and III-D, the value of the coefficient M + is the same, while the FP area information is captured by the coefficient M − , which allows us to differentiate between the two predictions. When only the coefficient ∆ is used, this difference is also captured, as its value decreases. In scenarios III-A and III-B, in which there were no FP areas and only the area of hits in the same regions increased, the coefficient ∆ is simply equal to the coefficient M + .

Cases IV of Hits on All Ground Truth Mask Regions
To observe the general behaviour of the coefficient M + , let us examine tests in which every region was hit to a certain extent and no FP areas occurred (Figure 11). See Table 8 for a comparison of the coefficient values.
Table 8. The values of the coefficients for the test images in Figure 11.
We observe that when the intersection of the prediction and ground truth is large, the values of all the coefficients are similar. Even though every region of the ground truth mask was detected, the Dice and IoU coefficients indicated a lack of detection when this intersection was very small. In contrast, the Tau coefficient remained large, even though only 1% of the ground truth mask area was detected. The M + coefficient, owing to its nonlinearity (i.e., faster growth for small TP areas and slower growth for large TP areas), thus represents a compromise between an approach based only on region detection, ignoring the size of the detected area, and an approach based only on the intersection area of the prediction and ground truth masks, without considering how many regions of that mask were hit.
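As a quick check of the Dice and IoU behaviour for a very small intersection, consider the 1% case using the standard set-based definitions. With no FP areas, the prediction P lies entirely inside the ground truth T, so |P| = |P ∩ T| = 0.01|T| and |P ∪ T| = |T|. This is a worked illustration under those assumptions, not a value taken from Table 8.

```latex
\begin{align}
\mathrm{Dice} &= \frac{2\,|P \cap T|}{|P| + |T|}
               = \frac{2 \cdot 0.01\,|T|}{0.01\,|T| + |T|}
               = \frac{0.02}{1.01} \approx 0.0198, \\
\mathrm{IoU}  &= \frac{|P \cap T|}{|P \cup T|}
               = \frac{0.01\,|T|}{|T|} = 0.01 .
\end{align}
```

Both values are close to zero regardless of how many regions were hit, because only the total overlapping area enters the formulas.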

Cases V with a Fixed Number of FP Regions
For scenarios in which FP areas are observed, including them in the metric through a separate coefficient provides, as noted earlier, an additional, precise tool for evaluating the prediction quality.
In the following test pairs, V-A and V-B, V-C and V-D, and V-E and V-F (Figure 12), every region of the ground truth mask was detected, and an extra row of prediction mask regions appeared (i.e., FP areas). Table 9 compares the coefficients of each pair.

Conclusions
Traditional segmentation evaluation metrics, such as the Dice coefficient or IoU (Intersection over Union, also known as the Jaccard index), mainly examine the degree of overlap between the predicted mask and the true mask. Although these metrics are widely used, they are often insufficient in certain situations, as we have shown in the scenarios we prepared. This problem is not limited to vegetation studies and has often been pointed out by other researchers. For example, in a study on the quality of building segmentation [45] using satellite imagery, the author points out the different quality assessment results obtained with the Dice and Jaccard metrics, which were 84% and 70%, respectively. This suggests that not all metrics are equally reliable in assessing classification quality, and that reliability often depends on the spatial heterogeneity of the phenomenon under study. Similarly, in the previously cited paper by Müller et al. [24], the authors point out the problem of class imbalance that often occurs in medical analyses and propose the use of a wide range of metrics, such as the Dice coefficient, IoU, sensitivity, specificity, Rand index, ROC curves, Cohen's Kappa, and Hausdorff distance, which, when used together, aim to assess classification more accurately. They also point out that commonly used standard metrics, such as accuracy, sensitivity, and specificity, can introduce significant interpretation errors. Thus, a single, more flexible metric that works in more diverse scenarios is lacking. We therefore believe that our proposed M metric offers more flexibility in assessing the quality of prediction in many non-standard cases, as shown in the Results section.
Depending on the needs, the metric M can be used either in a compact form, as the coefficient ∆, or in a detailed form, as the pair of coefficients M + and M − . In the first case, we obtain a general indication of the fit of the prediction mask to the ground truth mask. The higher the value of the ∆ coefficient, the more accurate the prediction. Values of the ∆ coefficient lie in the range [−1, 1]. To obtain more detailed information about the prediction fit, we can use the component coefficients M + and M − , which contain information on the TP and FP areas, respectively. The values of the coefficients M + and M − are in the range [0, 1]. The higher the value of these coefficients, the higher the proportion of TP and FP areas in the ground truth mask. Furthermore, by changing the parameters α and β, we can influence how much more important the number of TP and FP regions is than their area. This allows us to predetermine the relevance conditions for hits in part of a ground truth mask region. This flexibility allows us to tailor the coefficient to the individual characteristics of the segmented object and thus better assess the quality of the prediction in real-life situations. Thus, we do not need to use many different metrics, but only one, which makes the work and the evaluation of the prediction itself much easier.
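The relation between the compact and detailed forms can be written as a small helper. This is a minimal sketch that covers only the final combination step, ∆ = M + − M − , with the stated ranges enforced; computing M + and M − themselves is not reproduced here, and the names `delta`, `m_plus`, and `m_minus` are illustrative. The input values in the usage lines are hypothetical.

```python
def delta(m_plus, m_minus):
    """Compact form of the metric: the difference m_plus - m_minus.

    Both inputs are expected in [0, 1], so the result lies in [-1, 1].
    """
    for name, value in (("m_plus", m_plus), ("m_minus", m_minus)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return m_plus - m_minus


# Hypothetical values: the same reward for TP regions, with and without FP penalty.
print(delta(0.75, 0.0))   # no FP areas: the compact value equals m_plus
print(delta(0.75, 0.25))  # FP areas present: the compact value drops
```

Reporting the pair alongside ∆ keeps the information that the difference alone discards: two predictions with the same ∆ may differ in how much of it comes from missed regions and how much from false positives.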
The proposed method was developed to verify the results obtained during a study of the automatic detection of vegetation patches, using Heracleum sosnowskyi Manden. as an example. Further field work is currently planned to verify the usefulness of the method in analysing the effectiveness of detecting communities of various invasive plants. We anticipate that the usefulness of the method is not limited to this application area. In general, it can find application in any field where we are more interested in detecting certain areas than in determining their exact boundaries, for example, during preliminary analysis of medical imaging. Thus, a logical direction for further work will be to explore the scope of applicability of the proposed method in fields other than remote sensing.
As noted in the article, the α and β parameters are currently selected arbitrarily, on an experimental basis, for a particular type of data. The usefulness of the method could be improved by developing, for example, a way to select these parameters automatically or by determining their optimal values for specific applications. This requires further experimental work with different types of data and will be the next element in the development of the proposed method.
The next step in the development of the method will be an attempt to modify it so that it is more versatile and allows the evaluation of classification results, rather than just segmentation. One limitation of the proposed method, compared with Dice or IoU, is its higher computational complexity. We estimate that at present it can be used effectively to evaluate final results, while using it as a metric in the learning process may slow that process down.

Figure 1. Localization of the study area.

Figure 2. Example of comparison of prediction results.

Figure 3. (A) Orthophotomap with marked contours of the range and lake, (B) ground truth mask: location of Heracleum sosnowskyi, (C) prediction compared with mask (model using three channels: RGB), (D) prediction compared with mask (model using five channels: RGB, red edge, near-infrared). For explanation of false negative, true positive and false positive terms, see the explanation on the previous page.

The example shown in Figure 2 illustrates several disadvantages of using the Tau, IoU, and Dice metrics to compare prediction quality. Even at first glance, one can see a significant improvement in the prediction from Figure 3D (infrared model) compared with the prediction from Figure 3C. False indications on the lake surface observed in Figure 3C have been eliminated. The values of the standard metrics given in Table 1 do not, however, reflect this. The changes in the metrics' values were minimal, despite the clear changes in the prediction structure. This is because of the linear dependence of the coefficient values on the total area of the correctly hit pixels.

• M + : positive region coefficient calculated for the TP regions, taking values from 0 to 1
• M − : negative region coefficient calculated for the FP regions, taking values from 0 to 1
• ∆ = M + − M − , taking values from −1 to 1.

Figure 4. The ground truth (blue) and the prediction mask (pale orange) on top of each other with separated regions.

Remote Sens. 2024, 16

Figure 5. The separated T1 region from the ground truth mask together with the surrounding rectangle (bounding box) and parts of the prediction mask.

Figure 6. Decomposition of the image into submatrices containing the ground truth mask regions ((A) passing through the ground truth regions: positive coefficient) and the prediction mask regions ((B) passing through the prediction regions: negative coefficient).

Figure 7. Example test set: (A) ground truth mask T, (B) prediction P, (C) comparison of P and T.

Figure 8. Test images for constant values of IoU and Dice coefficients (A-F: case examples).

Figure 9. Test images for a fixed TP surface.

Figure 10. Test images for fixed regions.

Table 7. The values of the coefficients for the test images in Figure 10.

Figure 11. Test images of hits in each region.

Table 1. Prediction rates of Figure 3C,D.

Table 2. Parameters of the regions of both predictions.

Table 3. Coefficient values of the proposed metric for the parameters α and β equal to 5, calculated for the scenario discussed in the introduction.

Table 5. Verification of the effect of the parameter α on the M + coefficient.

Table 6. The values of the coefficients for the test images in Figure 9.