Extraction of cropland field parcels with high resolution remote sensing using multi-task learning

ABSTRACT Parcel-level farmland information contains rich spatial distribution and boundary details, which are crucial for digital agriculture and agricultural resource surveys. However, the spatial complexity and heterogeneity of features at high resolution make it difficult to obtain parcel-level information quickly and accurately. In addition, existing methods do not sufficiently take into account spatial topological information, particularly for blurred boundaries. Here, we develop a multi-task network model to extract plot-level cropland information. Specifically, the model consists of a cascaded multi-task network that integrates semantic segmentation and edge detection, a refinement network with fixed edge local connectivity, and an integrated fusion model. To validate the performance of the model, two typical tests were conducted in Denmark (Europe) and Chongqing (Asia) with high-resolution remote sensing images provided by Sentinel-2 (10 m) and Google Earth (0.53 m) as data sources. The results show that the proposed model outperforms the baseline models. This study is expected to provide important support for the design of new global agricultural information management systems in the future.


Introduction
Timely and accurate parcel-level farmland information is important for policy-makers, agricultural information surveys and global food security (Fritz et al., 2015), and it relies largely on accurately quantifying the spatial distribution of agricultural land and individual parcel-level fields (Kuemmerle et al., 2009). Early information on field boundaries was mainly collected through manual interpretation, which involved huge costs and low efficiency. As more remotely sensed data have become available through open data policies, automated and semi-automated agricultural field extraction has gradually been adopted in modern agricultural information management (Garcia-Pedrero et al., 2019; Wagner & Oppelt, 2020; Waldner et al., 2015; Xiong et al., 2017; Yan & Roy, 2014).
In remotely sensed images, agricultural boundaries formed by ditches and roads break the connection between plots, and they are easily distinguished by their different ground materials. However, boundaries that are adjacent to natural features are often ambiguous and highly heterogeneous in remotely sensed imagery. Moreover, the varied cropping patterns around the globe make it challenging to extract field boundaries with a widely applicable method. Existing studies can be divided into handcrafted features-based (HFB) and convolutional neural network (CNN)-based methods. HFB methods mainly use object-inherent characteristics, such as topological and brightness characteristics, to aggregate homogeneous regions. For complex agricultural boundaries, a growing snakes active contour model has been employed to extract agricultural fields at the sub-pixel level. In addition, Hong et al. (2021) developed a parcel-level boundary extraction algorithm using different traditional methods, such as the Suzuki85 algorithm and Canny edge detection. However, this algorithm can only extract the boundaries of regularly arranged agricultural areas. For such cases, Torre and Radeva (2000) designed a region competition technique that integrates region growing and deformable models to segment agricultural fields, bypassing the extraction of boundaries. Belgiu and Csillik (2018) evaluated the performance of a time-weighted dynamic time warping (TWDTW) method that uses Sentinel-2 time series in pixel-based and object-based classifications of various crop types in three different case areas. Similarly, Graesser and Ramankutty (2017) proposed a combined method that consists of multi-spectral image edge extraction, multi-scale contrast limited adaptive histogram equalization, and adaptive thresholding, and used Landsat imagery to estimate cropland changes. However, the above methods have two issues: they are sensitive to intra-plot variability, which can produce excess segments, and they rely heavily on a correct selection of parameters (e.g. the similarity measure used to group the pixels of the image), which requires prior knowledge of the scene or trial-and-error tuning (Meyer & Beucher, 1990; Mueller et al., 2004).
Fortunately, CNNs, which have been successfully applied to image classification (Cheng et al., 2018; He et al., 2016), semantic segmentation (Ronneberger et al., 2015), and object detection (Redmon & Farhadi, 2018), are promising for automatic agricultural field extraction from remote sensing images. A technique based on a fully convolutional network (FCN) (Long et al., 2015) and combinatorial grouping was proposed to delineate agricultural fields in smallholder farms (Persello et al., 2019). For mountainous areas that lack continuous optical images, W. Liu et al. (2020) applied a CNN to obtain an accurate image object distribution map from very high resolution (VHR) optical images, and then adopted Long Short-Term Memory (LSTM) networks to identify farmland parcels in time-series synthetic aperture radar (SAR) data. Masoud et al. (2020) combined a dilated fully convolutional network and super-resolution mapping techniques to produce refined agricultural field boundaries. To further improve model performance, Garcia-Pedrero et al. (2019) proposed a methodology based on U-Net to delineate agricultural boundaries automatically and achieved good results.
However, a neural network constructed for a single task often cannot obtain good results across different agricultural landscapes. Multi-task learning (MTL) (Jou & Chang, 2016; Misra et al., 2016; Ruder, 2017; S. Liu et al., 2019) is a machine learning technique that solves multiple tasks simultaneously and overcomes this limitation to some extent. It uses information retrieved from related tasks as an inductive bias to improve generalization and transfer knowledge (Ruder, 2017). For example, a multi-task model has been applied to the extraction of building footprints (Bischke et al., 2019). Furthermore, some studies have introduced MTL into automatic road extraction from remote sensing images. A cascaded end-to-end model called CasNet (Cheng et al., 2017) was proposed first, in which the first network predicts the road segmentation map and the second network extracts the centerline based on the first one. RoadNet (Y. Liu et al., 2019) was subsequently developed as an updated network with three task branches. The interrelated tasks of these two models complement each other, and road segmentation provides an ideal initialization, with little complex background, for the other tasks. Although the two models achieve good results, they do not fully consider the spatial connectivity and structural characteristics of the road. To address these two issues, Wei et al. (2020) integrated a multi-point tracking method into a multi-stage framework, which forms a centerline graph based on the road segmentation prediction map and then fuses it with the segmentation map to obtain the final segmentation map and centerline. As a result, learning the road network topology improves road connectivity. Further, Ding and Bruzzone (2021) proposed a direction-aware residual network that adopts multiple loss functions to supervise the learning of the road surface, the structure of the road, and the directionality of the road to optimize the road extraction results. Meanwhile, MTL has also been employed for farmland plot extraction. Waldner and Diakogiannis (2020) proposed a multi-task semantic segmentation model using ResUNet-a (Diakogiannis et al., 2020). In this model, the decoder is divided into three branches to predict the extent of the fields, their boundaries, and the distance to the closest boundary in satellite images. The generalization ability of this model is further improved by averaging the model predictions from time-series data and adjusting the parameters of the watershed post-processing method.
Although recent deep learning-based approaches have made remarkable improvements in cropland extraction, boundary information has not yet been extracted for more complex and fragmented farmlands. Boundary detection models have been widely used in the field of remote sensing (Cheng et al., 2022; Lee et al., 2021). However, when an edge detection model such as that of H. Liu et al. (2019) is used directly to extract farmland plots, unclosed edges containing isolated segments or missing information make it impossible to obtain real cropland parcels. The segmentation strategy proposed by Xia et al. (2018) includes two edge tasks and one semantic task to extract soft boundaries, hard boundaries and the cropland surface, and merges the three results to construct plots. However, the shortcoming of this strategy is that the semantic segmentation model and the edge detection model are trained independently. As a result, the strong correlation between the two tasks is not reflected in the training phase.
The problem to be addressed at this point is how to combine the cropland features extracted by segmentation and edge detection models to realize their complementary strengths. The framework presented in this study adopts a cascaded CNN to extract cropland surfaces and edges simultaneously and to construct cropland parcels with better accuracy and completeness. Meanwhile, in the edge detection task, a point regression network is used to detect points that are potentially connectable with the local center, and the final breakpoint pair for a broken edge line is confirmed through a series of post-processing steps, which compensates for the lack of topology information learning. This study is expected to provide technical support for agricultural information surveys and cropland management.
The remainder of this paper is organized as follows. Section II elaborates on the overall framework of our plot extraction method, including the cascade network SGENet and the patch refinement model. Section III describes the specific conditions of farmland in the two case areas. Section IV presents ablation experiments to verify the effectiveness and generalization of our method on diverse data sets. Section V and Section VI present the discussion and the conclusion, respectively.

Overall framework
Farmland plot extraction aims to classify the pixels of an image into two categories: farmland and background. To accurately classify heterogeneous farmland, we propose a hybrid framework that combines the edge and semantic information obtained from a multi-task cascade network. We find that (1) semantic information can guide the extraction of edge information for plots; (2) implicit edge information, where edge extraction results are often disconnected, can be refined using local features; and (3) objects can be segmented more accurately by combining both kinds of information. These observations lead us to divide the final cropland extraction task into three stages, which are illustrated in Figure 1. First, the coarse semantic map (classification map) and imperfect boundaries are output sequentially by the cascaded network. Then, a patch-level network optimizes the connectivity of the edge results given by the previous model (patches with the breakpoint phenomenon). Finally, the optimized edges (edge map) and the coarse semantic results are merged through a voting process to produce fine segmentation maps (final prediction).

Multi-task learning model (MTL)
The cascade model architecture is presented in Figure 2(a). To enhance performance, the proposed MTL model creates three tasks related to the problem of agricultural land boundary detection: semantic segmentation, Gaussian extraction, and edge detection, which correspond to SSNet, GENet, and EDNet, respectively.
As shown in Figure 2(b), the function of SSNet is to simplify the complex cropland background and obtain the initial classification map (Figure 1). The semantic segmentation net (SSNet) is an encoder-decoder structure that takes the RGB channels of a 512 × 512 image as its input I of size H × W × 3. ResNet-101 pretrained on ImageNet is used as the encoder network. Its five residual modules extract high-level and low-level feature maps through downsampling convolutional layers. In the decoder stage, the feature maps are upsampled and joined through skip connections with the encoder feature maps of the same level until they return to the size of the input image I. A final softmax layer outputs the probability of each pixel belonging to the background, surface, and boundary classes. The resulting three-channel feature map S of size H × W × 3 contains the probability maps of the three categories. The background map is filtered out, and the remaining two feature maps S of size H × W × 2 are concatenated with the original image I of size H × W × 3 to generate the feature map SI of size H × W × 5, which is the input of the Gaussian extraction task.
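The construction of SI described above can be sketched in a few lines. This is only an illustration on nested lists: the function name `build_genet_input` and the assumption that the background probability sits at channel index 0 are hypothetical and are not stated in the text.

```python
def build_genet_input(image, probs, background=0):
    """Build the 5-channel SI map from an RGB image and 3-class probabilities.

    image[r][c] is an (R, G, B) triple; probs[r][c] is a list of 3 class
    probabilities. The background channel index is an assumption.
    """
    return [
        [
            # keep RGB, then append every probability except the background one
            list(rgb) + [p for k, p in enumerate(pr) if k != background]
            for rgb, pr in zip(img_row, prob_row)
        ]
        for img_row, prob_row in zip(image, probs)
    ]
```

For a single pixel with RGB values (10, 20, 30) and probabilities [0.7, 0.2, 0.1], the result holds the five values 10, 20, 30, 0.2 and 0.1.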
The Gaussian extraction net (GENet) uses the output SI of the previous task as input (Figure 2(a)). Its network structure is the same as that of the semantic task (Figure 2(b)); it differs from SSNet in the ground truth mask and the output. The ground truth masks are produced by a distance transformation of the vector reference data of agricultural land boundaries. Inspired by Road Connectivity (Batra et al., 2019), the Gaussian distribution is equally divided into eight intervals. Each interval is used as a category, so the original regression process is converted into a classification task, which is conducive to the convergence of the multi-task loss (learning edge buffer information). This improves semantic segmentation accuracy at locations where the agricultural land boundaries are ambiguous (e.g. where the inside of the boundary contains ditches, bridges, roads, and other features). The GENet output is an eight-channel feature map after softmax activation. A 1 × 1 convolution is then performed to reduce the number of channels to 1 so that the final features are merged. The final output G of size H × W × 1 is concatenated with I of size H × W × 3 to obtain GI of size H × W × 4, which is fed to the edge detection task (Figure 2(a)).
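One possible reading of the eight-interval ground truth is sketched below. It assumes that the distance to the nearest boundary is first mapped to a Gaussian response in (0, 1] and that this range is split into eight equal-width bins. The function names, the value of `sigma`, and the equal-width binning are assumptions made for illustration and are not taken from the paper.

```python
import math


def gaussian_class(distance, sigma=3.0, n_classes=8):
    """Map a distance-to-boundary value (in pixels) to one of n_classes labels."""
    # Gaussian response: 1.0 on the boundary, decaying towards 0 away from it
    g = math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
    # equal-width bins over [0, 1]; g == 1.0 is clamped into the top bin
    return min(int(g * n_classes), n_classes - 1)


def label_map(distance_map, sigma=3.0, n_classes=8):
    """Apply gaussian_class to every pixel of a 2D distance map."""
    return [[gaussian_class(d, sigma, n_classes) for d in row] for row in distance_map]
```

With these settings, pixels on the boundary fall into the highest class and pixels far from any boundary fall into class 0, so the classifier learns a buffer around each edge rather than a one-pixel line.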
The function of the edge detection net (EDNet) is to produce the final geometric distribution of farmland edges, in combination with the semantic feature cues obtained from the above two tasks. As shown in Figure 2(c), it also uses ResNet-101 as the backbone to extract different feature layers. Since edge information mostly comes from low-level feature maps, the stage 5 residual blocks of the backbone, which contain higher-level information, are discarded. The network structure of EDNet is a simplified version of the original HED (Xie & Tu, 2015): the output depth of each convolutional layer in stages 1 to 4 is halved, so the network has fewer parameters, which helps to avoid overfitting. A 1 × 1 − 1 conv layer follows each residual block of every stage. Upsampling and sigmoid activation layers are used jointly to generate side outputs at different scales in the four stages. The four side outputs are concatenated along the channel dimension and then passed through a 1 × 1 − 1 conv layer to obtain the final edge result E of size H × W × 2, which represents the intensity distribution map of boundary and non-boundary.
Each cascade stage yields its own prediction, namely S, G and E, which are the predictions of SSNet, GENet, and EDNet, respectively (Figure 2). A joint loss function supervises the training; it combines two segmentation losses and one edge loss, all of which are cross-entropy losses.
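As a rough illustration of such a joint loss, the sketch below sums three cross-entropy terms over flattened pixel lists. The equal default weights and the function names are assumptions; the text does not state how the three terms are weighted.

```python
import math


def cross_entropy(probs, targets):
    """Mean negative log-likelihood; probs[i] is a list of class probabilities."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[t] + eps) for p, t in zip(probs, targets)) / len(targets)


def joint_loss(s_probs, s_gt, g_probs, g_gt, e_probs, e_gt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the semantic (S), Gaussian (G) and edge (E) losses."""
    w_s, w_g, w_e = weights  # hypothetical weights, equal by default
    return (
        w_s * cross_entropy(s_probs, s_gt)
        + w_g * cross_entropy(g_probs, g_gt)
        + w_e * cross_entropy(e_probs, e_gt)
    )
```

With uniform predictions over 3, 8 and 2 classes, the three terms are ln 3, ln 8 and ln 2, so the joint loss equals ln 48.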

Patch-level model for connectivity
A patch-level model is introduced to estimate the connectivity between adjacent breakpoints using features at the local scale. Given an image patch centered on a breakpoint, our proposed patch refinement network infers whether there are other locations at the patch border that are connected to the center point. The network architecture is shown in Figure 3. After the network predicts the possible connection orientations between the center and the border of the patch, the best combination of breakpoints is selected for breakpoint connection.
As shown in Figure 3, our patch refinement network is mainly composed of a preprocessing network and hourglass modules. The preprocessing network uses large convolution kernels to extract initial features. Then, two hourglass modules (Newell et al., 2016) are stacked together. The symmetric design performs bottom-up processing (from high to low resolution) in the first half of the module and top-down processing (from low to high resolution) in the second half. To capture the local information of the points on the edge line at different scales, the bottom-up and top-down structures conduct intermediate supervision to make optional predictions for each hourglass module. Meanwhile, intermediate supervision is conducted again after the first module. As a result, the heatmap of the second module is the final output of the network, which is used to extract the connectivity from the center of the image.
Connectivity is not a property of a single point but of pairs of pixels. The local network is designed to estimate which points on the border of a patch are connected to a given input point. Therefore, given a patch, the positions of the input and output points need to be encoded and updated. Specifically, this study breaks the edge line into points in the design of the models, and each patch is cut out based on the position of a breakpoint, which is then used as the center point of the patch. The label of each patch is the Gaussian distribution map generated by the breakpoints in the patch. The network is trained with a set of k × k pixel patches from the training set. As a result, the output is a heatmap that reflects the probability that each location is connected to the central point of the patch.
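A patch label of this kind could be generated as in the sketch below, which places a Gaussian peak at each connected point of a square patch. The function name, the `sigma` value and the use of the maximum where peaks overlap are assumptions for illustration only.

```python
import math


def gaussian_heatmap(size, points, sigma=2.0):
    """Return a size x size heatmap with a Gaussian peak at each (row, col) point."""
    heat = [[0.0] * size for _ in range(size)]
    for pr, pc in points:
        for r in range(size):
            for c in range(size):
                v = math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2.0 * sigma ** 2))
                # where peaks overlap, keep the strongest response
                heat[r][c] = max(heat[r][c], v)
    return heat
```

For example, `gaussian_heatmap(48, [(0, 20), (47, 30)])` would label a 48 × 48 patch whose centre connects to one point on the top border and one on the bottom border.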
As shown in Eq. 1 and Figure 4, our post-processing method is divided into two stages: locating the breakpoints and matching related breakpoints. This study first converts the predicted intensity map E of size H × W × 2 into a binary image, extracts the skeleton line from the binary image, and slides a window along the skeleton line to obtain the breakpoints. Specifically, a 3 × 3 identity matrix is used as a filter that traverses all pixels of the binarized map, with the center of the matrix placed on each pixel in turn, to detect the breakpoints of the binarized map. Whether each pixel is a breakpoint is determined by Eqs. 1-2, where M(i, j), Pixel(i, j), w and h are, respectively, the dot-product sum of the 3 × 3 pixel matrix and the identity matrix at image position (i, j), the pixel value of the image at (i, j) after binarization, the width of the predicted image, and the height of the predicted image. Bool_breakpoint(i, j) indicates whether pixel (i, j) is a breakpoint.
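The breakpoint test can be illustrated with a small sketch. Because the equations are not reproduced here, the code makes two explicit assumptions: the 3 × 3 filter is treated as an all-ones window, and a skeleton pixel counts as a breakpoint when the window sum equals 2, meaning the pixel itself plus exactly one neighbour. The function name is hypothetical.

```python
def find_breakpoints(binary):
    """Return (i, j) positions of line endpoints in a 0/1 skeleton image."""
    h, w = len(binary), len(binary[0])
    points = []
    for i in range(h):
        for j in range(w):
            if binary[i][j] != 1:
                continue
            # window sum over the 3 x 3 neighbourhood, clipped at the image border
            m = sum(
                binary[y][x]
                for y in range(max(i - 1, 0), min(i + 2, h))
                for x in range(max(j - 1, 0), min(j + 2, w))
            )
            if m == 2:  # assumed criterion: the pixel plus exactly one neighbour
                points.append((i, j))
    return points
```

On a one-pixel-wide horizontal segment, only its two end pixels satisfy the criterion, while interior pixels have a window sum of 3.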
The breakpoints of the agricultural land boundary lines are detected to obtain a set of breakpoints. Then, slices, each centered on one of the breakpoints, are generated. Next, the proposed patch refinement network predicts the connectivity from the center point of the sliced image to the surrounding points, as shown by the green circle in Figure 4a. The central point is connected to the breakpoint that is predicted to be connectable, as shown in Figure 4b. Subsequently, line segment matching is performed on the values with a certain angle tolerance, and the results are shown in Figure 4c. According to prior knowledge of agricultural fields in high-resolution remote sensing images, the agricultural field boundaries are obvious and contrast sharply with the texture of the surrounding environment on a local scale. In particular, in the scenario of this paper, the orientation of the agricultural field boundary is estimated only at the locations of breakpoint seeds based on a small patch, in which the edges are much clearer and subject to less interference. Thus, the characteristics of the agricultural field boundary in these local areas are richer and clearer, and the main orientation can be estimated more robustly. The regularity and periodicity of regular texture allow texture primitives to embody some overall characteristics, such as orientation. Thus, the orientation of the texture, obtained by Canny edge detection, is used to represent the possible orientation of the edges. Finally, the orientation of the broken line at the center point is estimated, and the line segment is retained for repair when the orientation angles are complementary. This integrated process is illustrated in Figure 4d. After the prediction is finished, the paired breakpoints are removed from the breakpoint set, and the above process is repeated until the breakpoint set is empty. This ensures that the algorithm iterates over all breakpoints and terminates. In this way, we optimize the edge results of the cascaded network. Finally, the obtained edge maps (Figure 1) are prepared for voting fusion in the next stage.
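The iterative pairing loop can be sketched as follows. Note that this sketch replaces the patch refinement network with a simple nearest-candidate search and keeps only the angle-tolerance check and the removal of paired breakpoints. The function names, `max_dist` and `tol` are hypothetical, and orientations are undirected angles in degrees measured in the same (row, column) convention as the connecting segment.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two undirected orientations (degrees)."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def pair_breakpoints(points, orientations, max_dist=48.0, tol=20.0):
    """Greedily pair breakpoints whose connecting segment fits both local orientations."""
    remaining = list(range(len(points)))
    pairs = []
    while remaining:  # terminates because one index is removed per iteration
        i = remaining.pop(0)
        best, best_dist = None, max_dist
        for j in remaining:
            (y0, x0), (y1, x1) = points[i], points[j]
            dist = math.hypot(y1 - y0, x1 - x0)
            seg = math.degrees(math.atan2(y1 - y0, x1 - x0))
            if (
                dist <= best_dist
                and angle_diff(seg, orientations[i]) <= tol
                and angle_diff(seg, orientations[j]) <= tol
            ):
                best, best_dist = j, dist
        if best is not None:
            remaining.remove(best)  # paired breakpoints leave the set
            pairs.append((i, best))
    return pairs
```

Breakpoints without a compatible partner are simply dropped from the set in this sketch, which is one way to guarantee that the set becomes empty.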

Voting strategy
Considering spatial similarity and heterogeneity (Wu et al., 2013) (i.e. the same objects show similar characteristics in neighboring areas, and different features in the same area show high heterogeneity), farmland generally has uncertain texture features. Therefore, our study defines the visual features of cultivated land parcels as a combination of edges (boundary transformation) and textures (a mixture of artificial construction and natural growth, which is semantic information). The edges and textures are fused by the voting mechanism, which is illustrated in Figure 5.
It should be noted that the segmentation result only provides inaccurate pixel-wise output, and different kinds of pixels may be involved within a predicted edge. Therefore, this study utilizes a majority voting strategy (Ma et al., 2019): the classes of the pixels within each region r_j are counted, and the most frequent class is selected as the category of r_j. This process is represented as:

$$C(r_j) = \arg\max_{r} \sum_{(x, y) \in r_j} \mathrm{sign}\big(f(x, y) = r\big)$$

where C(r_j) is the category assigned to region r_j, f(x, y) is the predicted class of pixel (x, y), sign(·) denotes an indicator function with sign(true) = 1 and sign(false) = 0, and r is the possible class. The voting mechanism is applied within all predicted boundaries to obtain the parcels. In the multi-task learning model, the semantic results S of size H × W × 2 are merged with the optimized edge results. This paper superimposes the semantic segmentation map and the refined edge results for analysis: within a single contour, if at least half of the pixels inside the boundary belong to the farmland plot category, the contour is classified as a plot; otherwise, it is classified as background.
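The voting step can be sketched as follows, assuming the semantic map is a 2D list of class labels and each closed region is given as a list of pixel coordinates. The function names, the farmland label value of 1 and the handling of an exact 50% split are assumptions.

```python
from collections import Counter


def vote_region(semantic, region_pixels):
    """Return the most frequent semantic class among the pixels of one region."""
    counts = Counter(semantic[i][j] for i, j in region_pixels)
    return counts.most_common(1)[0][0]


def is_plot(semantic, region_pixels, farmland=1):
    """Treat a region as a plot when at least half of its pixels are farmland."""
    n_farm = sum(1 for i, j in region_pixels if semantic[i][j] == farmland)
    return 2 * n_farm >= len(region_pixels)
```

Here `vote_region` corresponds to the general majority vote over classes, while `is_plot` is the two-class special case used to decide between plot and background.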

Accuracy evaluation
In this study, the completeness, correctness, and quality introduced by

The intersection over union (IOU) metric, also known as the Jaccard index, evaluates the overlap between reference and classified objects. In this study, it is the number of pixels that describe agricultural land in both the reference and the classified image, divided by the total number of pixels that describe agricultural land in either image (refer to Section III). The calculation formula of IOU is:

$$\mathrm{IOU} = \frac{|R \cap C|}{|R \cup C|}$$

where R and C denote the sets of agricultural land pixels in the reference and classified images, respectively.
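The indicators can be computed as in the sketch below. The IOU function follows the Jaccard definition. The completeness, correctness and quality functions use the commonly adopted definitions based on true positives (TP), false positives (FP) and false negatives (FN); since the defining passage is incomplete here, these three formulas are an assumption, and the function names are hypothetical.

```python
def iou(reference, predicted):
    """Intersection over union of two binary masks given as 2D lists."""
    inter = union = 0
    for ref_row, pred_row in zip(reference, predicted):
        for r, p in zip(ref_row, pred_row):
            inter += 1 if (r and p) else 0
            union += 1 if (r or p) else 0
    return inter / union if union else 0.0


def line_metrics(tp, fp, fn):
    """Completeness, correctness and quality from TP / FP / FN counts (assumed definitions)."""
    completeness = tp / (tp + fn) if tp + fn else 0.0
    correctness = tp / (tp + fp) if tp + fp else 0.0
    quality = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return completeness, correctness, quality
```

Under these definitions, quality penalizes both missed and spurious boundary pixels, so it can never exceed either completeness or correctness.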

Denmark
The first study area is located in central Denmark (Figure 6). As shown in Figure 7a, there are heterogeneous geographic phenomena in the farmland areas, and the farmland varies in shape and size. The special cultivation method results in strips in the interior of the field objects (Figure 7b), which can easily be confused with the actual farmland edges. In the case of borders with forests and grasses or land parcels, the natural disruptions partition the locations where a change of crop type is not significant (Figure 7c). Due to

Chongqing
The second area, in Chongqing, China, is selected as an experimental case that follows the typical Asian agricultural pattern. Unlike European farmland, the cultivated land here is long, narrow and severely fragmented. It is representative of the hilly areas of southwest China.
As shown in Figure 9, the cultivated land in this area has the following characteristics: (1) regular cultivated land is relatively neat in spatial distribution and has clear edges and relatively uniform internal texture (Figure 9a); (2) the narrow and long cultivated land has clear boundaries and uniform internal texture (Figure 9b); (3) the edges of some cultivated land are relatively fuzzy and have different shapes, but they can be identified from the texture of the crop fields inside the plot (Figure 9c).
The satellite images of the study area were obtained from the Google Earth Platform (https://earth.google.com/), with RGB bands and a spatial resolution of 0.53 m. The southwestern region is cloudy and rainy, and most of the images have a high cloud cover, making it difficult to obtain single-period cloudless image data. To address this issue, the images used in this experiment were cropped and mosaicked from images of multiple phases, with a time span from March 2018 to April 2019. Generally, in high-resolution remote sensing images, cultivated land, forests, and grasses are easily confused. Therefore, the images selected in this experiment contain a large number of forests and grasses to verify the effectiveness of the proposed method.
The experimental data of cultivated land samples were collected, and the ArcGIS 10.2 software (https://www.esri.com/) was used for drawing the samples. In this way, a total of 1000 image slices with a size of 512 × 512 pixels were obtained. This paper randomly selects 600 of the images as the training set, 200 as the validation set and 200 as the test set. Similarly, the images are flipped and rotated for data augmentation.
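A minimal sketch of the 600/200/200 split and of flip and rotation augmentation is given below. The function names and the fixed random seed are assumptions; tiles are represented as plain 2D lists for illustration.

```python
import random


def split_dataset(items, n_train=600, n_val=200, n_test=200, seed=0):
    """Randomly split items into training, validation and test subsets."""
    shuffled = list(items)
    if len(shuffled) < n_train + n_val + n_test:
        raise ValueError("not enough items for the requested split")
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


def flip_horizontal(tile):
    """Mirror a 2D tile left to right."""
    return [row[::-1] for row in tile]


def rotate90(tile):
    """Rotate a 2D tile by 90 degrees clockwise."""
    return [list(col) for col in zip(*tile[::-1])]
```

When augmenting, the same flip or rotation has to be applied to an image tile and to its label mask so that the two stay aligned.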

Experiment and results
To investigate the capabilities of the method proposed in this paper, two experiments were conducted in the study area of Denmark and Chongqing.The experimental setup, results, and analysis are shown in the following sections.

Experiment setup and deployments
All network models were trained using the Adam optimizer. The learning rate was set to 0.001, and it was decayed by a factor of 0.1 when the performance stopped improving for two consecutive epochs. The models were trained with a batch size of 8, the largest number of images the available GPU could handle. The cascade model fully embodies the relationship between the area features and the line features of the farmland. To reflect how well the model learns these representative features, both area indicators and line indicators were evaluated on the test set. Existing studies suggest that transfer learning can accelerate network convergence and improve network performance (Oquab et al., 2014). In this study, all experiments were performed on clusters with strong processing power and memory, running the Ubuntu operating system (Linux OS). More specifically, all models were trained on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. In addition, the PyTorch deep learning framework (https://pytorch.org/features/) was used in this study.
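The learning rate rule can be sketched without any framework. The class below is a hypothetical stand-in for a reduce-on-plateau scheduler: it assumes that a higher metric is better and that the rate is multiplied by 0.1 as soon as two consecutive epochs bring no improvement. The exact monitored metric is not stated in the text.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Update the schedule with this epoch's validation metric and return the rate."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0  # start counting again after a decay
        return self.lr
```

In PyTorch, a similar behaviour is typically obtained by pairing `torch.optim.Adam` with `torch.optim.lr_scheduler.ReduceLROnPlateau`, although its patience counting differs slightly from this sketch.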

The performance of the multi-task cascade model
Our model was compared with two edge detection networks, DexiNed (Poma et al., 2020) and RCF (Y. Liu et al., 2017). These were combined with a semantic segmentation network so that the cultivated land plots were given semantic information. To facilitate experimental comparison, U-Net was used as the semantic segmentation network, the same as the cascaded semantic segmentation network in this study. From the instance segmentation literature, MaskRCNN (He et al., 2017) and PointRend (Kirillov et al., 2020) were also selected for comparison. Similarly, as a cascaded network, Bsinet was also used as a comparison model (Long et al., 2022).
It can be seen from Figures 10 and 11 that the proposed method effectively avoids false classification and achieves excellent extraction performance in remote sensing scenes, especially for complex landscape backgrounds with weak edges and narrow plots (Figure 10). With a single-task edge or semantic segmentation model, the edge features of farmland do not respond well in complex scenes, while our cascade network distinguishes the inner and outer parts of the farmland well without interference from the surrounding background. The corresponding final farmland extraction results are also significantly better than those of the single-task models. It can also be seen from Figure 11 that our proposed method extracts the largest number of correct plots. The farmland plots extracted by our method have better visual quality, and the plot edges are more consistent with their real spatial distribution.
Table 1 and Table 2 provide detailed comparisons of the line and area evaluation indicators, respectively. The quantitative results show that the cultivated land extraction method proposed in this article is superior to the existing methods in terms of both line and area evaluation indicators. Our method performs 11.31% and 11.72% better than the best existing method in terms of IOU and the Qua line indicator. Considering the complex internal structure of farmland at high resolution, the recognition ability of general segmentation models is not enough to deal with this kind of scene. The cultivated land extraction method of Xia et al. (2018) (XIA) produces a large number of false classifications at the edges of non-cultivated land plots.

Post-processing refinement
The farmland edge lines predicted by the above cascaded network may not be closed. At the end of our framework, the useful closed lines are selected so that the edge lines and the semantic segmentation results can be fused by voting. Meanwhile, to repair the connections of the farmland boundaries, a locally optimized post-processing method is adopted to further improve the farmland boundary extraction results.
To train the connectivity of the patch-level model, this paper randomly selected slices with a size of 48 × 48 pixels from each image in the training set. The center pixel of every slice lies on a farmland boundary. Figure 12 shows some results of applying the slice-level model of local connections. It can be seen that despite the complex planting structure of the plots and the fuzzy information at their edges, the points predicted by our model are basically located on the borders of the farmland.

Case analysis
The previous section showed that the multi-task model achieves better extraction results than the single-task methods. This section shows that our SGENet (S: semantic segmentation task, G: Gaussian segmentation task, E: edge detection task) is the optimal combination. The qualitative and quantitative results are also analyzed and discussed.

The study area in Denmark
The visual results of the farmland boundaries (Figure 13) indicate that the SGENet model performs notably well in areas where weak boundaries are not obvious. It also performs well in segmenting farmland that connects with artificial features, which illustrates the necessity of the Gaussian segmentation task. The addition of this task improves the ability to extract different types of farmland boundaries, and the topological structure of the extracted farmland boundaries becomes clearer, which is conducive to the subsequent local optimization.
The indicator comparison is shown in Tables 3 and 4 (all index calculations are based on plot data, that is, the lines and areas of non-plots are not included). It can be seen from Table 3 that the three-task SGENet model performs 9.72% better than the two-task SENet model in terms of Qua. Also, the SGENet model performs 4.59% and 2.07% better than the three-task model without patch refinement in terms of Qua and IOU.
After our model was optimized, the accuracy of the farmland boundary lines decreased slightly in the line indicator. As shown in Figure 14, because the boundaries of some adjacent farmland plots are blurred, these plots were merged into the same plot. In model extraction, however, these adjacent farmland boundaries were extracted, especially after post-processing. As a result, the overall index increased while the accuracy decreased.
It can be seen from Figure 15, compared with other ablation experiments, our final framework optimizes the edge detection results through the Gaussian segmentation task and local-repair post-processing methods.After the edge detection results are fused with the intermediate semantic output, the parcel omission can be well reduced.
Figure 16 shows the intermediate output of our cascade model, including the area probability map, the line probability map, and the final instance segmentation result map.From the binarized area probability map, the strong semantic information of the farmland can be observed, but the probability of the extracted farmland boundary information is small.In the line probability map, the farmland boundary is obvious, which complements the surface probability map.The final plot segmentation map is formed by combining the advantages of the two maps and adding a breakpoint connection method to make the broken line form closed contours.It can also be seen from the image map that instance segmentation can be achieved by rationally filtering out incorrect small patches from the process of combining surface elements and line elements and paying attention to the accuracy of farmland boundary extraction.
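One simple way to picture how the two probability maps complement each other is sketched below. It is an illustrative assumption rather than the fusion used in this paper: cropland pixels are thresholded from the area probability map, pixels on predicted boundaries are removed, the remaining pixels are grouped into 4-connected regions, and regions smaller than a hypothetical `min_pixels` are discarded. The thresholds and the function name are placeholders.

```python
from collections import deque

import numpy as np


def fuse_area_and_lines(area_prob, line_prob, area_thr=0.5, line_thr=0.5, min_pixels=4):
    """Return an integer label map (0 = background or boundary) of parcel candidates."""
    # Parcel interior: likely cropland that is not on a predicted boundary.
    interior = (area_prob >= area_thr) & (line_prob < line_thr)
    labels = np.zeros(interior.shape, dtype=np.int32)
    h, w = interior.shape
    next_label = 0
    for sy, sx in zip(*np.nonzero(interior)):
        if labels[sy, sx] or not interior[sy, sx]:
            continue  # already labelled, or removed as part of a small patch
        next_label += 1
        labels[sy, sx] = next_label
        queue, region = deque([(sy, sx)]), [(sy, sx)]
        while queue:  # breadth-first flood fill over 4-connected neighbours
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and interior[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
                    region.append((ny, nx))
        if len(region) < min_pixels:
            # Drop small spurious patches and reuse their label number.
            for y, x in region:
                labels[y, x] = 0
                interior[y, x] = False
            next_label -= 1
    return labels
```

In this sketch a gap in a boundary line lets two neighbouring parcels merge into one region, which is why closing broken lines before fusion matters.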
Figure 17 visualizes the cropland boundaries extracted by the SGENet model (green line) and SGENet + patch refinement (red line).We found that most of the farmland boundaries in the large remote sensing image area are extracted by our cascade network.However, some adjacent farmland borders are blurred, and the boundary extraction results with weaker features are intermittent and imperfect.Due to the inability to connect and close the edge lines wrapping farmland, some plots were missed.Our patch refinement strategy uses the local structured information of farmland to connect some unclosed edge lines and increases the number of correct plots  to be extracted.Moreover, the results also show that our method achieves excellent extraction results in the overall field areas in Denmark (Figure 18).

The study area of Chongqing
In this hilly area, much of the farmland is fragmented into plots of different shapes, and many farmland plots are separated by mountain roads or surrounding forests and grasses. As shown in Figure 19, compared with the DexiNed model, our model has obvious advantages for narrow and long cultivated land borders and for the junctions between forests, grasses, and cultivated land. SGENet can refine and highlight the borders of farmland under complex terrain, and the strength of the extracted borders is obvious.
The major boundaries in the study area of Chongqing are correctly detected by our proposed method (Figure 20), while a few are missing because closed lines cannot be formed without patch refinement. The farmland landscapes in mountainous areas are mostly composed of farmland boundaries with complex structures, and many adjacent farmland boundaries are interdependent. Because the texture features are not obvious, the strength of some extracted boundaries is weak. Patch refinement can make up for this deficiency and enhance the detection of farmland contours (Figure 20, a2, b2, a3, b3), thus effectively improving the extraction of plots.
Our method also significantly outperforms DexiNed in the Chongqing study area and provides more accurate results. For example, our method performs better than DexiNed by 31.55%, 11.38%, and 21.47% in Com, Qual, and IOU, respectively (Table 5). The value of the Cor indicator declined because the farmland boundaries in mountainous areas, like those in the Denmark study area, are ambiguous and are not marked. With patch refinement, all indicator values except Cor improve significantly.

Discussion
Monitoring of farmland plots is an important component of global digital agriculture practices. This study proposes a novel model to extract cropland parcels from high-resolution remote sensing images. The main results suggest that the proposed SGENet model has a clear advantage in extracting cropland field plots, especially in areas with severe and complex fragmentation.

Datasets and their uncertainty
High-resolution remote sensing images with rich spectral information will undoubtedly help our models to learn more and make accurate judgments. As the data used in this study contain only RGB bands and do not include multi-band satellite data, the recognition ability of the model may be affected. Nevertheless, as satellite data processing techniques improve, future high-resolution satellite products will greatly reduce these uncertainties, and the proposed model will benefit from them and extend its application scenarios from RGB bands to multispectral satellite data.
Considering the chromatic aberration of images from different periods, we used multi-period images for mosaicking to obtain cloud-free images, so there may be some minor mismatches between the reference data and the image data. Obvious errors were corrected through manual inspection. Nevertheless, our model can still tolerate some small labeling errors in accuracy evaluation.

Topology information and its benefits to the model
Due to the high resolution of the images used in this study, there is a large amount of heterogeneity and fragmentation in the landscape, especially in mountainous areas. Field boundaries are often indistinct or only vaguely visible because the textural patterns are often weak. In this case, parcels cannot be extracted well by a semantic segmentation model or an edge model alone. Compared with single-task semantic segmentation, our model achieves improved performance in cropland field extraction.
In addition, compared with the instance segmentation network, our model has significant advantages, especially in the IOU indicator of the final extracted plots. The main reason is that parcel objects at the border of a slice are incomplete, which affects the overall segmentation result of the instance segmentation model. The boundaries of farmland are connected, and this feature is more obvious in local areas. In our method design, based on the topology of the plot edge lines, intermittent farmland boundaries are connected using local features to recover more parcels, so as to maximize the recall rate of the plots.

Limitations of edge detection task and optimized strategy
In our model, edge features are trained under the guidance of semantic features. This effectively filters out the boundaries of non-cultivated land. Since farmland boundaries may be composed of different materials, it is difficult to define a single unified type of farmland boundary, and the edge detection task alone cannot learn to cover the many types of complex edge information. The Gaussian extraction task is therefore added in this study as a transition between the semantic task and the edge task. The structured location information of the Gaussian extraction task benefits the learning of the edge detection task. In addition, the patch refinement model further strengthens the extraction of weak boundaries. Finally, the attribute categories provided by the semantic task and the geometric information provided by the boundary detection task are merged to extract the final farmland parcels.
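The merging of category and geometry information can be illustrated with a majority vote inside each closed region, in the spirit of the voting mechanism mentioned in the caption of Figure 6. The sketch below is an assumption about how such a vote could look, not the code used in this study: `labels` is a hypothetical region map derived from the closed boundaries (0 = boundary or background), and `semantic` holds a non-negative integer class code for every pixel.

```python
import numpy as np


def vote_region_classes(labels, semantic):
    """Assign each region the most frequent semantic class among its pixels."""
    region_class = {}
    for region in np.unique(labels):
        if region == 0:
            continue  # skip boundary/background pixels
        # Count how many pixels of the region fall into each class code.
        counts = np.bincount(semantic[labels == region])
        # argmax breaks ties in favour of the lowest class code.
        region_class[int(region)] = int(counts.argmax())
    return region_class
```

A region whose pixels are mostly classified as cropland is kept as a parcel under this rule, even if a few pixels inside it were labelled otherwise by the semantic output.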

Breakpoint refinement strategy and its robustness
It should be noted that there are certain defects in the post-processing stage of our method. Even when the correct connection direction is inferred, our refinement strategy cannot effectively repair some breakpoints on the farmland boundary (Figure 21). When there is only a single breakpoint in the local area, no matching point can be found. This kind of connection failure can only be compensated for by adding logical judgments for such scenarios.
Such scenarios are rare, and traversing all of them in a loop would greatly increase the computational cost. More intelligent point connectivity models require further exploration.
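The single-breakpoint failure can be made concrete with a purely geometric toy example. The sketch below is not the learned patch refinement model of this study: it assumes a one-pixel-wide binary line map, treats a line pixel with exactly one 8-connected neighbour as a breakpoint, and greedily pairs breakpoints that lie within a hypothetical `max_dist`. The function names are placeholders.

```python
import numpy as np


def find_endpoints(lines):
    """Return (row, col) of line pixels that have exactly one 8-connected neighbour."""
    lines = lines.astype(bool)
    h, w = lines.shape
    padded = np.pad(lines, 1).astype(np.int32)
    # Sum the eight shifted copies of the map to count neighbours per pixel.
    neighbours = sum(
        padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return np.argwhere(lines & (neighbours == 1))


def pair_breakpoints(endpoints, max_dist):
    """Greedily pair the closest endpoints; return (pairs, unmatched indices)."""
    pts = np.asarray(endpoints, dtype=float)
    n = len(pts)
    candidates = []
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.hypot(*(pts[i] - pts[j])))
            if d <= max_dist:
                candidates.append((d, i, j))
    used, pairs = set(), []
    for _, i, j in sorted(candidates):
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    # A lone breakpoint has no partner in range and stays unmatched.
    # A fuller version should also reject pairs lying on the same line segment.
    unmatched = [i for i in range(n) if i not in used]
    return pairs, unmatched
```

Any index returned in `unmatched` corresponds to the situation described above: a breakpoint with no counterpart in its local area, which a pairing rule of this kind cannot close.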

Model generalization ability in regional tests
Unlike the case in Denmark, the second study area in the southwest of Chongqing features small fields of irregular shapes, and most of the fields are cultivated with mixed crops. We reclassified the metadata to unify all agricultural land into the plots that need to be extracted. Despite the complexity of the task, the proposed deep cascade network achieved better performance than the other methods in both cases. This at least indicates that our model is more robust to different farm geometries and crop types. To further verify the generalization ability of the model in other regions, we used a model trained on the Denmark Sentinel-2 imagery to directly predict Sentinel-2 imagery of Germany. As shown in Figure 22, the transferability of our cascade model needs to be further improved. However, the refinement model can still significantly improve the detection rate of the plots, which reflects the good applicability of the post-processing method. Further, we used a model pre-trained on the source image data to directly predict new data from the downsampled Denmark Sentinel-2 imagery. The robustness of the model decreased as the spatial resolution downsampling factor increased (Figure 23), which indicates that the model is sensitive to changes in the spatial resolution of the data. Different regions, different spatial resolutions of images, or different sensors often lead to very different data distributions and feature representations for the same category. Such differences often cause CNNs to produce less satisfactory segmentation results (Farahani et al., 2021). Heterogeneous domain adaptation (Tasar et al., 2020; Yao et al., 2021) can address the learning of such cross-domain samples with different probability distributions and feature representations (Dou et al., 2019; Pizzati et al., 2020; Voreiter et al., 2020; Vu et al., 2019; Yang & Soatto, 2020).
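A resolution test of this kind needs coarser copies of the source imagery. The text does not state how the downsampling was done, so the following is only one plausible sketch: block averaging by an integer factor, with the function name `downsample_mean` and the cropping of leftover rows and columns as assumptions.

```python
import numpy as np


def downsample_mean(image, factor):
    """Block-average an (H, W) or (H, W, C) image by an integer factor."""
    h, w = image.shape[:2]
    h2, w2 = h // factor, w // factor
    # Crop rows/columns that do not fill a complete factor x factor block.
    cropped = image[:h2 * factor, :w2 * factor].astype(np.float64)
    blocks = cropped.reshape(h2, factor, w2, factor, *image.shape[2:])
    return blocks.mean(axis=(1, 3))
```

With a factor of 2, for instance, a 10 m Sentinel-2 pixel grid would be turned into a nominal 20 m grid before being passed to the unchanged model.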

Impacts of model parameters on evaluation indicators
This paper introduces a buffer-based evaluation for the boundaries because it is difficult to directly compare the pixel difference between the extracted boundaries and the ground truth. Based on this, the influence of different buffer widths ρ on the performance of boundary extraction methods was investigated.
Figure 24 shows the performance variation of different boundary extraction methods with buffer widths (1, 3, 5, and 7 pixels) on the Denmark data set. The results indicate that the extraction accuracy improved significantly as the buffer width grew. Our method performed better than RCF and DexiNed on Qual, and it achieved a 20% improvement on Qual compared to RCF. Also, when the buffer width ρ changed from 5 to 7, the fusion performance improved slightly. Therefore, in this study we set buffer widths up to 7, which corresponds to a more relaxed constraint for the indicators.
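The effect of the buffer width can be reproduced on toy data with the commonly used buffer-based definitions of completeness (Com), correctness (Cor), and quality (Qual). The sketch below assumes those standard definitions, which may differ in detail from the formulas used in this study, and it treats ρ as a buffer radius in pixels with a square neighbourhood. Both function names are hypothetical.

```python
import numpy as np


def dilate(mask, radius):
    """Grow a boolean mask by `radius` pixels using a square neighbourhood."""
    mask = mask.astype(bool)
    if radius <= 0:
        return mask.copy()
    h, w = mask.shape
    padded = np.pad(mask, radius)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def buffer_metrics(pred, ref, rho):
    """Return (Com, Cor, Qual) for boundary maps under buffer radius rho."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    n_pred, n_ref = np.count_nonzero(pred), np.count_nonzero(ref)
    tp = np.count_nonzero(pred & dilate(ref, rho))   # extracted pixels near the reference
    fp = n_pred - tp                                 # extracted pixels far from the reference
    fn = np.count_nonzero(ref & ~dilate(pred, rho))  # reference pixels with no extraction nearby
    com = (n_ref - fn) / max(n_ref, 1)
    cor = tp / max(tp + fp, 1)
    qual = tp / max(tp + fp + fn, 1)
    return com, cor, qual
```

Evaluating `buffer_metrics(pred, ref, rho)` for each `rho` in `(1, 3, 5, 7)` shows why larger buffers give higher scores: a boundary that is offset by two pixels counts as a miss at ρ = 1 but as a match at ρ = 3.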

Implications for agricultural info-survey systems
Global agricultural land is widely distributed across different eco-climatic zones, and planting patterns and crop types vary considerably. Findability, Accessibility, Interoperability, and Reusability (FAIR) are crucial for agricultural land parcel data. Remote sensing and Geographic Information Systems (GIS) are playing an increasingly important role in the management of agricultural parcel information (Kocur-Bera, 2020; Lin & Zhang, 2021). However, traditional parcel surveys are associated with high economic costs. The combination of high-resolution remote sensing data and intelligent algorithms would make it feasible to manage parcel information from regional to global scale easily and efficiently (e.g. convergence of multiple sources of satellite data, high-performance cloud-computing platforms, online real-time data processing, and improved model reliability in the face of large amounts of data). Owing to the low cost and the global spatial and temporal consistency of open-policy satellite data, the multi-task model we propose is promising for constructing global-scale agricultural parcel information, which addresses the weak interoperability of previous parcel databases derived from multi-source satellite data.

Concluding remarks
In this study, we summarize the shortcomings of previous studies and recognize the importance of spatially local topological information, local repair, and enhanced Gaussian-based optimization strategies. We propose a multi-task cascade network model for cropland parcel extraction. This multi-task model can automatically learn multi-scale and multi-level features and perform the overall learning through a specially designed network for complex cropping scenarios and multi-scale cropland parcels.
To assess the robustness of the model, experiments were conducted in the study areas of Denmark and Chongqing, considering differences between global agricultural land characteristics. The results show that this model can extract accurate information about cropland parcels and achieve better consistency with the satellite datasets in terms of spatio-temporal characteristics. Our proposed model is expected to be integrated with practical agricultural observation systems, and thus to contribute to the development of a global agricultural information management platform.

Figure 1. Flowchart of the proposed framework for plot extraction.

Figure 3. The structure of the simplified HED model. Stages 1-4 indicate single-stream deep convolutional networks for learning multi-scale features and different levels of visual perception; a 1 × 1-1 conv layer follows the last conv layer of each stage. Then, the feature map is up-sampled with the up-sampling layer to obtain the corresponding lateral output. Fusion indicates that all upsampling layers are connected by a concatenated layer. Then a 1 × 1-1 conv layer is applied to fuse the feature maps obtained from each side output. Finally, the output of the fusion loss is obtained through a loss layer. K × K in K × K-s represents the size of the kernel, s in K × K-s represents the stride size, and N is the multiple of upsampling.
to obtain the final cropland parcels so that the various segmentation and boundary information are combined to obtain the best result. Let G denote the ground truth, and $\{\hat{r}_j\}_{j=i}^{N}$ denote the corresponding area of $\{\hat{r}_j\}_{j=i}^{N}$ in G. There are M pixels in the parcel, i.e. $M = |\hat{r}_j|$, and the category of the m-th pixel is represented as $l_m$, $m = 1, \ldots, M$. Then, the number of pixels in $\hat{r}_j$ belonging to each category is

Figure 4. Overview of the patch refinement network architecture.

Figure 5. Details of the breakpoint connection process.

Figure 6. Fusion process of edge results and semantic results based on the voting mechanism. a indicates the original cropland field; b indicates the superposition of outputs from the edge detection stage and the semantic segmentation stage; c indicates the final cropland parcels obtained using the voting mechanism.

Figure 7. The study area (left) and the detailed views (right) showing the geographical distribution of the training set, validation set, and test set.

Figure 8. Example 512 × 512 image chips with different agricultural fields. a is a long and narrow farmland plot; b is a farmland plot with a complex planting structure; and c is a farmland plot with fuzzy boundaries.

Figure 9. Example of the input data fed to the cascade network model. a, original image; b, semantic mask; c, Gaussian mask; and d, boundary mask.

Figure 10. The 512 × 512 image chips with different agricultural fields in Chongqing.

Figure 11. Comparison of the edge detection results by our cascaded network with other edge detection models. The first column presents the original image; the second column presents the ground truth; the third column presents the segmentation map predicted by the RCF model; the fourth column presents the edge detection result predicted by the DexiNed model; the fifth column presents the boundary result predicted by our cascaded network.

Figure 12. The extraction results of the cultivated land plots. The five columns present the original image, the ground truth, the instantiation result inferred by RCF, DexiNed, MaskRcnn, PointRend, and the final segmentation result obtained by our method framework (from left to right).

Figure 13. Visual results of the patch refinement model for connectivity applied to the patches of the Denmark dataset. The red dots represent the detection results.

Figure 14. Ablation experiment results of two cascade methods. SENet refers to the two-task cascade model with semantic segmentation and edge detection, and SGENet refers to the three-task cascade model with semantic segmentation, Gaussian extraction, and edge detection.

Figure 16. Comparison of plot extraction using the voting fusion strategy. The first row presents the ground truth; the second row presents the fusion result of the semantic output and the edge output predicted by the SENet model; the third row presents the fusion result by the SGENet model; the fourth row presents the fusion result by SGENet with patch refinement.

Figure 17. Field extraction examples in the study area of Denmark.

Figure 18. Extraction result of the study area.

Figure 19. Edge intensity graph. From left to right, the four columns present the original image, the ground truth, the edge output inferred by the DexiNed model, and the edge output inferred by our SGENet model (the edge strength predicted by the model is binarized by the Otsu method to obtain the edge line).

Figure 20. Comparative experimental results before and after boundary patch refinement in the study area of Chongqing. From top to bottom, the first row presents the extracted results of the farmland plots in the entire study area (red: SGENet, yellow: SGENet + patch refinement); the following two rows present the partial extraction results; the blue boxes mark the areas where the broken line is particularly obvious.

Figure 21. The performance variation of different boundary extraction methods with different buffer widths on the Denmark data set.

Figure 22. Results of agricultural field extraction using our method in Germany.

Figure 23. Extraction of farmland in Denmark under different spatial resolutions.

Figure 24. The performance variation of different boundary extraction methods with different buffer widths on the Denmark data set.
, left), which follows the typical European agricultural pattern. It is defined by the extent of the

Table 2. Comparison of different area indicators of different methods.

Table 1. Comparison of different line indicators of different methods.

Table 3. Comparison of line indicators.

Table 4. Comparison of area indicators.

Table 5. Comparisons of different evaluation indicators.