Shape and Style GAN-based Multispectral Data Augmentation for Crop/Weed Segmentation in Precision Farming

The use of deep learning methods for precision farming is gaining increasing interest. However, collecting training data in this application field is particularly challenging and costly, because information must be acquired across the different growth stages of the crop of interest. In this paper, we present a method for data augmentation that uses two GANs to create artificial images to augment the training data. To obtain a higher image quality, instead of re-creating the entire scene, we take original images and replace only the patches containing objects of interest with artificial ones containing new objects with different shapes and styles. In doing this, we take into account both the foreground (i.e., crop samples) and the background (i.e., the soil) of the patches. Quantitative experiments, conducted on publicly available datasets, demonstrate the effectiveness of the proposed approach. The source code and data discussed in this work are available as open source.


Introduction
Modern agriculture is undergoing a transformative revolution, driven by the integration of artificial intelligence (AI) technologies into farming practices. Among these AI-powered advancements, Precision Agriculture has emerged as a promising approach to optimize resource utilization and enhance crop yield, thus increasing productivity. New technologies can play a relevant role in this field, leading to more sustainable agricultural production and better management of natural resources. This paradigm shift relies on cutting-edge technologies such as deep learning algorithms for the accurate detection and management of crops and weeds in cultivated fields.
However, deep learning models require a large number of training examples to work properly. Collecting a large training dataset takes considerable time, especially in the case of pixel-wise labeling, where each pixel in each image has to be labeled individually. In addition to the difficulty of acquiring a large amount of data, it is also important to take into account the class distribution, which is usually imbalanced, meaning that one specific class has a higher number of instances than the others (Wang and Yao, 2012). In the case of class imbalance, the classifier could be less accurate when searching for the decision boundaries.
In some scenarios, unbalanced datasets are more frequent and common data augmentation techniques are not suitable, due to possible color and shape variation over time of the objects of interest and the presence of varying light conditions. Precision Agriculture is one of those scenarios. Classifying crops and weeds for targeted interventions on a single plant is a crucial point for applying precision agriculture technologies. However, collecting samples for this kind of task is still challenging due to the variability of the environmental conditions and the large variety of crops and weeds. In fact, samples should be acquired across different weather conditions and growth stages (including variations in shape, size, and color).
In this paper, we describe a solution for the class imbalance problem in crop/weed segmentation. Our idea is to generate only the objects that belong to the minority classes that are relevant for semantic segmentation purposes (see Fig. 1). To do so, we first generate new crop shapes using a Deep Convolutional Generative Adversarial Network (DCGAN). Then, we use a conditional Generative Adversarial Network (cGAN) to generate synthetic style samples of the target object. Finally, we replace the real target object (minority class) with a synthetically generated one, keeping the rest of the image (majority class) as it is (i.e., without modifications).
The contribution of this work is three-fold.
1. We propose an architecture composed of a DCGAN and a cGAN to achieve shape and style data augmentation.
2. We provide a solution for keeping the verisimilitude of the synthetic data high by conditioning both the shape and the style of the generated images.
3. Our approach is designed to work with multispectral images, which are very useful in precision agriculture applications.
Moreover, the source code and the data generated by our method are made publicly available at www.sites.google.com/diag.uniroma1.it/shapestyle.
The remainder of the paper is organized as follows. Section 2 presents a brief overview of related work. Our approach is detailed in Section 3, while experimental results are shown in Section 4. Section 5 provides the conclusions.

Related Work
The crop/weed segmentation problem has garnered significant attention in recent years, prompting extensive research efforts (Lu et al., 2022). In the early stage, many researchers used traditional machine learning methods that rely on hand-crafted features to discriminate between plant classes. For example, Nguyen Thanh Le et al. (2019) suggest using multi-feature algorithms based on shape and color features to detect weeds in a soybean field. Zhang et al. (2019) analyze different color spaces such as RGB, HSV, and HSI, trying to extract common features for different types of weeds at the pea seedling stage.
Other approaches aim at improving the generalization capability of traditional machine learning methods by using images captured in different wavelength ranges of the electromagnetic spectrum, such as multi-spectral images. For example, Lottes et al. (2017) propose to use a multi-spectral camera to detect weeds in sugar beet fields. Their method starts with detecting vegetation, then an object-based feature extraction is implemented, followed by a random forest classification, and finally, they apply a smoothing post-process through a Markov random field.
Hand-crafted feature-based (and derived) methods depend heavily on the choice of the features, and this can limit the robustness of the system. A solution to increase the robustness and the generalization capabilities of these systems comes from the use of Neural Network methods. For example, Potena et al. (2016) apply a cascade of two Convolutional Neural Networks (CNNs) to the crop/weed classification task, while McCool et al. (2017) propose a three-stage approach with the use of model compression techniques and mixtures of models.
While CNNs are very common and useful in classification, Semantic Segmentation Deep Neural Networks are convenient for achieving segmentation. One of the most commonly adopted approaches for crop/weed segmentation is SegNet (Badrinarayanan et al., 2015). For example, Di Cicco et al. (2017) train SegNet with real and synthetic images, achieving good segmentation performance. Sa et al. (2017) also use SegNet for dense semantic weed classification on multispectral images. Milioto et al. (2018) augment the RGB input image with task-relevant background knowledge, which increases the generalization capability of the network. Relying on the same mechanism, we present in Fawakherji et al. (2019) a pipeline with two CNNs, one for pixel-wise segmentation and the other one for classification, which exploits data coming from different contexts to achieve a good generalization with respect to different types of crop.
Although both CNNs and Semantic Segmentation Networks proved to be useful technologies, their applicability is limited by the need for a large quantity of data in the training phase. In the field of precision farming, the collection of large annotated datasets requires a notable effort in terms of time. First of all, data have to be collected across the weed growth stages and under different weather conditions. Then, once the data are available, the labeling process can be very time-consuming, especially when labeling is pixel-wise. To tackle this problem, it is possible to shrink an unlabeled dataset by preserving only the most informative images while keeping a sufficient segmentation performance (Potena et al., 2016). Also, a graphic engine can be used to generate synthetic farming scenes, which natively contain the corresponding ground truth data (Di Cicco et al., 2017). Milioto et al. (2018) propose a CNN that requires a limited amount of data to generalize to unseen environments with high segmentation accuracy.
New approaches have taken advantage of GANs. For example, Giuffrida et al. (2017) propose a GAN capable of generating Arabidopsis plants, where the generation can be conditioned on the desired number of leaves for the synthetically created plants. Another application of GANs is presented by Madsen et al. (2019). They generate synthetic image samples of plant seedlings to compensate for a lack of training data. In particular, nine distinct species of plants are generated, improving the overall accuracy. In Espejo-Garcia et al. (2021), synthetic RGB images of individual tomato and black nightshade plants are generated using a GAN to improve classification. In Khan et al. (2021), artificial data generated from UAV images by means of semi-supervised GANs are used for supporting crop/weed species identification at an early stage. Kim and Park (2022) propose a multi-task semantic segmentation convolutional neural network for detecting crops and weeds (MTS-CNN) using one-stage training. More recently, Divyanth et al. (2022) aim to curtail the effort needed to prepare very large image datasets by creating artificial images of maize and four common weeds through conditional GANs (cGANs). The style of leaves is preserved in Xu et al. (2022), where images in the source domain are translated into the target domain, while the variations unrelated to the domain are maintained to augment the dataset. Unlike other existing approaches, in our method the foreground and the background are generated together, and the new artificial samples are generated using RGB and NIR data jointly. A comparison with similar state-of-the-art approaches is shown in Table 1.

Materials and Methods
Before describing our strategy, we point out that, in this work, we focus on sugar beets. In particular, we use the publicly available Bonn sugar beet dataset (Chebrolu et al., 2017) to demonstrate the effectiveness of our approach. The Bonn sugar beet dataset has been collected through a Bonirob farm robot across different weeks on a sugar beet field. It consists of images captured by a four-channel JAI AD-13 camera (RGB + NIR), mounted on the robot and facing downwards, and annotated at the pixel level. Examples of RGB, NIR, and ground truth images from the Bonn sugar beet dataset are shown in Fig. 2.

Proposed Strategy
The main objective of our approach is to balance our dataset in order to improve the performance of the crop/weed segmentation task. To achieve our goal, we create semi-artificial images by synthesizing only the crop objects in real images. For the generation process, we consider both the shape and the style of the sugar beet crop. The main steps of the proposed approach are shown in Fig. 3.
We can summarize the steps of our approach as follows.
• Crop shape generation.
• Crop and background style generation.
• Scene composition.
In the first step, we generate the new crop shape using a Deep Convolutional Generative Adversarial Network (DCGAN) (Radford et al., 2016), which receives only noise sampled from a normal distribution as input. The shape generator creates small (256 × 256 pixels) binary patches containing white pixels for the crop and black pixels as background: This shape represents the mask for the synthetic crop. In the second step, we start building the RGB image that corresponds to the synthetic mask by generating the texture for the crop through a cGAN. We use as input the mask generated in the previous step and noise sampled from a normal distribution to generate a random style. To finish building the RGB image, we must generate the background style (or texture).
An important aspect has to be considered here: The synthetic background should fit with the full image background, so when replacing the real patches with the synthetic ones, we want to preserve the consistency of the background. For this reason, we encode the style of the original background by using an image variational autoencoder, getting mean and variance values. Then, we use them along with the generated mask as input to a cGAN to obtain a guided style generation of our background. The third step concerns building the new semi-artificial image by replacing the original crop patches with the synthetic ones.
The following sections contain the details of the three processing steps introduced above.

Crop Shape Generation
The first step in our method concerns the generation of the crop mask, for which we use the DCGAN. One of the problems to solve in the mask generation process is the difficulty in training a network with images presenting an abrupt change at the border between the crop and the soil. To solve this problem, we used blurred masks during the training process, and this helped the network to learn better how to behave when switching from the crop to the soil. We also performed several preprocessing steps on the images before feeding them to the DCGAN for training. These steps included resizing all images to a uniform size, normalizing pixel values to a range of [-1, 1] to stabilize training, and applying data augmentation techniques such as rotation, flipping, and cropping to increase dataset diversity. DCGANs are a direct extension of GANs; thus, in the same way as GANs, DCGANs are made of two distinct models, a generator and a discriminator.
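The mask preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not say which blur was used, so a simple mean (box) filter is assumed here, and the function names `box_blur` and `to_minus_one_one` are hypothetical.

```python
import numpy as np

def box_blur(mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Blur a 2-D mask with a (2*radius+1)^2 mean filter (edge-padded).

    Assumption: the paper only says the masks are blurred; a mean
    filter is one simple way to soften the crop/soil border.
    """
    k = 2 * radius + 1
    padded = np.pad(mask.astype(np.float64), radius, mode="edge")
    h, w = mask.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum the k*k shifted copies of the padded mask, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def to_minus_one_one(img: np.ndarray) -> np.ndarray:
    """Map values from [0, 1] to [-1, 1], the range used for training."""
    return img * 2.0 - 1.0

# Toy 256 x 256 binary mask: a square "crop" on a black background.
mask = np.zeros((256, 256), dtype=np.uint8)
mask[96:160, 96:160] = 1
soft = to_minus_one_one(box_blur(mask, radius=3))
```

Pixels well inside the crop stay at 1 and pixels far from it stay at -1, while the border takes intermediate values, which is the smooth transition the blurred masks are meant to provide.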
Generator. The generator takes as input the random noise distribution z with latent size 100. This layer is a dense layer. Then, we reshape the results into four-dimensional tensors and we implement a batch normalization. In particular, we start with an 8 × 8 size. The batch normalization module contains an up-sampling block, which consists of an up-sampling layer followed by a convolutional layer with filter size 3 × 3 and then a ReLU activation layer. This block is repeated six times to reach our target size, which is 256 × 256. After the up-sampling block, we add two convolutional layers and, at the end of the generator, we add an activation layer with tanh as the activation function.
Discriminator. The main objective of the discriminator is to distinguish between the real and the fake generated samples. The first input for the discriminator is the sample coming from the real dataset, which is a small patch of the crop mask. We represent the crop with white pixels and the background with black pixels. The second input for the discriminator is the fake generated sample created in the first stage. The output is a scalar that indicates whether the input is coming from the fake or from the real distribution. The discriminator starts with an input layer of shape 256 × 256 × 1, after which we add Gaussian noise to the samples coming from both the real and fake distributions. According to Arjovsky and Bottou (2017), this smooths both the data and the model probability distributions. After the input layer, we add a convolutional layer, followed by a LeakyReLU activation function and a dropout stage. Then, we have a downsampling block, which is composed of convolutional layers followed by LeakyReLU and dropout. Finally, we implement a Batch Normalization and we repeat this procedure five times to arrive at the size of 4 × 4. The model ends with flatten and dense layers with a sigmoid activation function.
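The Gaussian noise added to the discriminator inputs can be illustrated with a short NumPy sketch. The noise standard deviation (0.1), the batch size, and the helper name `add_instance_noise` are assumptions made for illustration; the text does not report these values.

```python
import numpy as np

def add_instance_noise(batch: np.ndarray, sigma: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise to a batch of discriminator inputs."""
    return batch + rng.normal(0.0, sigma, size=batch.shape)

rng = np.random.default_rng(0)
# Stand-ins for real and generated mask patches, shape (N, 256, 256, 1).
real = np.ones((4, 256, 256, 1))
fake = -np.ones((4, 256, 256, 1))

# The same perturbation is applied to both real and fake samples.
noisy_real = add_instance_noise(real, sigma=0.1, rng=rng)
noisy_fake = add_instance_noise(fake, sigma=0.1, rng=rng)
```

Applying the noise to both inputs is what makes the two distributions overlap, so the discriminator cannot separate them trivially early in training.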

Crop and Background Style Generation
After generating the shape of the crop, we need to add some style (i.e., the texture). We start with crop style generation, in which we use the SPADE generative adversarial network (Park et al., 2019). The inputs for our generator are the masks generated in the previous step plus noise sampled from a normal distribution. The generator output is the crop with style, as shown in Fig. 4.
When the style of the crop is ready, we generate the style of the background (i.e., the soil).To this end, we use again the SPADE cGAN but, since we need the background style to be aligned with the entire scene, we have to guide the generator of the SPADE cGAN.
To do so, we encode the style of the cropped original background patch from the real images by using a variational autoencoder.Then, we use the encoded style as input to our conditional GAN (SPADE) along with the mask generated in the previous step.
The style encoder is composed of a series of convolutional layers with stride 2, followed by two linear layers that output a mean vector μ and a variance vector σ². Then, we use μ and σ² to compute the noise input to the generator.
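One common way to turn the encoder outputs into the generator noise is the reparameterization used in variational autoencoders, z = μ + σ · ε with ε drawn from a standard normal. The sketch below assumes this is what is meant here; the function name `sample_style` and the two-dimensional toy vectors are illustrative only.

```python
import numpy as np

def sample_style(mu: np.ndarray, var: np.ndarray,
                 rng: np.random.Generator) -> np.ndarray:
    """Draw a style vector z = mu + sqrt(var) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(var) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])     # toy mean vector from the encoder
var = np.array([0.25, 4.0])    # toy variance vector from the encoder
z = sample_style(mu, var, rng)
```

With a variance of zero the sample collapses to the mean, so the encoded background style is reproduced exactly; larger variances give more varied soil textures around the same style.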
For both crop and soil, we train the cGAN with a four-channel image to generate the RGB and NIR images together. This allows the texture of the soil to match in both images.
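As a small illustration of the four-channel input, the RGB and NIR images of a patch can be stacked into a single array and split again after generation. The helper names below are hypothetical, and a channel-last layout is assumed.

```python
import numpy as np

def stack_rgb_nir(rgb: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Stack an (H, W, 3) RGB image and an (H, W) NIR image into (H, W, 4)."""
    if rgb.shape[:2] != nir.shape[:2]:
        raise ValueError("RGB and NIR images must have the same height and width")
    return np.concatenate([rgb, nir[..., None]], axis=-1)

def split_rgb_nir(x: np.ndarray):
    """Inverse of stack_rgb_nir: return the RGB part and the NIR part."""
    return x[..., :3], x[..., 3]

rgb = np.random.default_rng(0).random((256, 256, 3))
nir = np.random.default_rng(1).random((256, 256))
four_channel = stack_rgb_nir(rgb, nir)
rgb_back, nir_back = split_rgb_nir(four_channel)
```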

Scene Composition
The final step in our approach concerns the composition of the artificial crop and background with the real scene in order to generate the semi-artificial scene (see Fig. 5). To this end, we use the ground truth mask of the real scene to extract the real crop patches from the real RGB/NIR image. Then, we replace each crop patch in the real RGB/NIR images with the synthetically generated ones. We consider only the crop objects whose stem is located in the image. For the mask replacement, we extract the mask patch that corresponds to the real crop patch that we want to replace. For the crop mask patch, we simply replace it with the fake generated crop mask. For the weed mask, we deal with the overlapping between the fake generated crop and the weed by removing the ground truth pixels that belong to the weed mask.
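The composition step can be sketched in NumPy as below. Several details are assumptions, since the text does not specify them: the label ids (0 background, 1 crop, 2 weed), the function name `compose_scene`, and the choice to keep the real pixels of weeds that are not covered by the synthetic crop so that image and labels stay consistent.

```python
import numpy as np

BACKGROUND, CROP, WEED = 0, 1, 2  # assumed label ids

def compose_scene(image, labels, patch, patch_mask, top, left):
    """Paste a synthetic patch into a real scene and update the labels.

    image:      (H, W, C) real RGB+NIR image
    labels:     (H, W) ground truth label map
    patch:      (h, w, C) synthetic crop + soil patch
    patch_mask: (h, w) boolean mask of the synthetic crop
    """
    out_img, out_lab = image.copy(), labels.copy()
    h, w = patch_mask.shape
    img_roi = out_img[top:top + h, left:left + w]   # views into the copies
    lab_roi = out_lab[top:top + h, left:left + w]
    # Assumption: weeds not covered by the new crop keep their real pixels.
    keep_weed = (lab_roi == WEED) & ~patch_mask
    img_roi[~keep_weed] = patch[~keep_weed]
    lab_roi[lab_roi == CROP] = BACKGROUND           # drop the real crop labels
    lab_roi[patch_mask] = CROP                      # overlapping weed labels are removed
    return out_img, out_lab

# Toy scene: one real crop, one weed under the new crop, one weed beside it.
image = np.zeros((8, 8, 4))
labels = np.zeros((8, 8), dtype=np.int64)
labels[2:4, 2:4] = CROP
labels[4, 4] = WEED
labels[5, 5] = WEED
patch = np.ones((4, 4, 4))
patch_mask = np.zeros((4, 4), dtype=bool)
patch_mask[1:3, 1:3] = True
new_image, new_labels = compose_scene(image, labels, patch, patch_mask, top=2, left=2)
```

In the toy scene, the weed pixel under the synthetic crop is relabeled as crop, while the weed pixel next to it keeps both its label and its original pixel values.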

Training and Objective Function
To train the SPADE network, we used a learning rate of 0.0001 for the generator and 0.0004 for the discriminator, with the Adam optimizer ($\beta_1 = 0$ and $\beta_2 = 0.999$) for both. The objective function of SPADE contains the Multiscale Adversarial Loss: This loss is implemented in a multiscale way, where we create a pyramid from the generated image by resizing the image to different scales and then, for each scale, we compute the loss. Then, the Feature Matching Loss allows the generator to create images that not only fool the discriminator, but also capture the same statistical properties of the real images. To this end, we extract the feature maps from the discriminator for both fake and real images and then compute the $L_1$ distance between these two feature maps. This is repeated for all the scales of the generated images:

$$\mathcal{L}_{FM} = \sum_{k} \sum_{i} \frac{1}{N_i} \left\| D_k^{(i)}(x) - D_k^{(i)}(\tilde{x}) \right\|_1 \qquad (2)$$

where $D_k^{(i)}$ represents the feature maps, $N_i$ is the normalization for each feature map, $k$ represents the image scale, $x$ is a real image, and $\tilde{x}$ is the corresponding generated image. Finally, the VGG loss is computed as follows:

$$\mathcal{L}_{VGG} = \sum_{i} \frac{1}{N_i} \left\| \phi(i, x) - \phi(i, \tilde{x}) \right\|_1$$

where $\phi(i, x)$ represents the feature map $i$ of VGG19 and $x$ is the input.
It is worth noting that the VGG loss is obtained in the same way as the feature matching loss, with the difference that we compute the feature maps for both real and fake generated images by using a VGG19 model pre-trained on the ImageNet dataset, instead of using the discriminator.
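The feature matching computation described above can be sketched with plain NumPy arrays standing in for the feature maps. The nesting (a list of scales, each holding a list of layer feature maps) and the function name `feature_matching_loss` are assumptions; per-layer weights, which some implementations use, are omitted.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats) -> float:
    """Sum over scales and layers of the mean absolute feature difference.

    real_feats / fake_feats: list over scales, each a list over layers
    of arrays with matching shapes. Taking the mean corresponds to
    normalizing each L1 distance by the number of elements.
    """
    total = 0.0
    for scale_real, scale_fake in zip(real_feats, fake_feats):
        for f_real, f_fake in zip(scale_real, scale_fake):
            total += float(np.abs(f_real - f_fake).mean())
    return total

# Toy features: 2 scales x 3 layers, fake features offset by 0.5.
rng = np.random.default_rng(0)
real_feats = [[rng.random((4, 8, 8)) for _ in range(3)] for _ in range(2)]
fake_feats = [[f + 0.5 for f in scale] for scale in real_feats]
loss = feature_matching_loss(real_feats, fake_feats)
```

Feeding the same function with feature maps taken from a pre-trained VGG19 instead of the discriminator (and a single scale) gives the corresponding VGG-style loss.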
We include the encoder in the training process by adding the KL divergence loss:

$$\mathcal{L}_{KLD} = \mathcal{D}_{KL}\left( q(z \mid x) \,\|\, p(z) \right)$$

where $p(z)$ is the standard Gaussian prior distribution, $q(z \mid x)$ is the variational distribution, and $q$ is fully determined by a mean and a variance vector. This loss is similar to the loss in the Variational Auto-Encoder (Kingma and Welling, 2014), where the generator of the SPADE GAN plays the role of the decoder.
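For a diagonal Gaussian with mean μ and variance σ², the KL divergence to a standard Gaussian prior has the closed form -½ Σ (1 + log σ² - μ² - σ²). A minimal sketch of this term follows; the function name `kld_loss` is hypothetical, and any weighting of the term in the total objective is left out.

```python
import numpy as np

def kld_loss(mu: np.ndarray, var: np.ndarray) -> float:
    """KL( N(mu, diag(var)) || N(0, I) ) in closed form."""
    return float(-0.5 * np.sum(1.0 + np.log(var) - mu ** 2 - var))

# The loss is zero when the encoder output matches the prior exactly.
zero = kld_loss(np.zeros(4), np.ones(4))
# Shifting every mean by 1 costs 0.5 per dimension.
shifted = kld_loss(np.ones(4), np.ones(4))
```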
Fig. 6 shows some samples generated with different styles. Styles are presented in the column on the left and the masks on the top. Fig. 6 also shows that the network has learned how to generate different RGB and NIR styles.

Experimental Results
We carried out two different experiments: the first to demonstrate that our method yields better segmentation than traditional approaches, and the second to show the contribution of having both multi-spectral and synthetic data augmentation.

Training Generative Adversarial Networks for Shape and Style Generation
Our DCGAN training process for crop shape generation involves refining both the discriminator and generator networks iteratively. We trained the model using a dataset containing 1000 crop mask patches with a size of 256 × 256, extracted from the Bonn dataset (Chebrolu et al., 2017) at different growth stages of the crop, ensuring a diverse representation. Adversarial ground truths, 'valid' and 'fake', guide the discriminator's classification. The generator synthesizes fake images from random noise, aiming to produce realistic crop shapes. The discriminator learns to distinguish real from generated images, while the generator aims to create indistinguishable crop shapes. For training the cGAN, we utilize a dataset extracted from the Bonn dataset, comprising 1000 image patches with a size of 256 × 256 each for both NIR (Near-Infrared) and RGB channels, alongside corresponding masks as conditional inputs, ensuring a comprehensive representation of various crop and soil conditions.
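The role of the 'valid' and 'fake' adversarial ground truths can be illustrated with the binary cross-entropy losses of one training step. The discriminator outputs below are made-up numbers standing in for real network predictions, and the batch size and the averaging of the two discriminator terms are assumptions.

```python
import numpy as np

def bce(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy between sigmoid outputs and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

batch = 8
valid = np.ones((batch, 1))    # target for real mask patches
fake = np.zeros((batch, 1))    # target for generated mask patches

# Placeholder discriminator outputs (probabilities of "real").
d_on_real = np.full((batch, 1), 0.9)
d_on_fake = np.full((batch, 1), 0.2)

# Discriminator: real patches should score 1, generated ones 0.
d_loss = 0.5 * (bce(d_on_real, valid) + bce(d_on_fake, fake))
# Generator: wants its samples to be classified as 'valid'.
g_loss = bce(d_on_fake, valid)
```

The two networks are updated in alternation, each minimizing its own loss while the other is held fixed.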

Semantic Segmentation Results
This experiment aims to show the effectiveness of the proposed approach in improving the mIoU of semantic segmentation. Another objective of this experiment is to compare the proposed approach with traditional augmentation strategies like basic image manipulations (i.e., rotation, shifting, flipping, zooming, and cropping) and texture manipulations (i.e., Gaussian and median blurring, noise injection, contrast, and brightness variation).
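For reference, a few of the traditional augmentations listed above can be sketched with NumPy. This is only an illustrative subset (90-degree rotations, horizontal flips, brightness shifts, and noise injection) with made-up parameter ranges, not the exact baseline configuration used in the experiments.

```python
import numpy as np

def geometric_aug(image: np.ndarray, mask: np.ndarray, rng: np.random.Generator):
    """Apply the same random 90-degree rotation and flip to image and mask."""
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    return image.copy(), mask.copy()

def photometric_aug(image: np.ndarray, rng: np.random.Generator,
                    max_shift: float = 0.1, noise_sigma: float = 0.02) -> np.ndarray:
    """Random brightness shift plus Gaussian noise; the mask is left untouched."""
    shift = rng.uniform(-max_shift, max_shift)
    noisy = image + shift + rng.normal(0.0, noise_sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(1)
mask = np.zeros((16, 16), dtype=np.uint8)
mask[2:6, 3:9] = 1
image = np.stack([mask.astype(float)] * 4, axis=-1)   # toy RGB+NIR image in [0, 1]
aug_image, aug_mask = geometric_aug(image, mask, rng)
aug_image_photo = photometric_aug(aug_image, rng)
```

Geometric transforms must be applied to the image and its label mask together, whereas photometric changes affect only the image; neither produces new plant shapes, which is the gap the proposed approach targets.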
We trained Bonnet CNN (Milioto and Stachniss, 2019) with six different datasets, using data from the Bonn sugar beet dataset:
1. Original, which is a reduced version of the Bonn dataset. We used a total of 1,600 images, randomly
For the synthetic datasets, we have replaced with synthetic samples only those plants whose stems are fully framed in the image. For the plants that are mostly out of the frame, the original one is kept. We experimentally verified that it is necessary to have the stem of the plant roughly in the center of the mask to obtain an effective synthetic image generation.
To evaluate the semantic segmentation output, we used the Mean Intersection over Union (denoted as mIoU). Quantitative results of the semantic segmentation on real images from the Bonn dataset are shown in Table 2. The results show that the IoU increases when using the original dataset augmented with the synthetic ones compared to using only the original dataset. Additionally, the rate of correctly predicted crop and weed samples increases when we use the mixed dataset for training. The correctly predicted samples increase by more than 19% in the case of sugar beet, and by around 6% for weed samples, w.r.t. the Original dataset. Moreover, using only the synthetic dataset also leads to a competitive performance when compared to using only the original one.
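The mIoU metric can be computed from a predicted and a ground truth label map as in the sketch below. The class ids and the choice to skip classes absent from both maps are assumptions; evaluation toolkits differ on the latter.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union across classes."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (assumption)
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2 x 2 example with classes 0 (soil), 1 (crop), 2 (weed).
target = np.array([[0, 0],
                   [1, 2]])
pred = np.array([[0, 1],
                 [1, 2]])
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 0.5 for soil, 0.5 for crop, and 1.0 for weed, so the mean is 2/3.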

Synthetic Multi-Spectral Images Evaluation
To show the contribution of having both multi-spectral and synthetic data augmentation, we considered four different training sets, i.e., Original and Mixed containing RGB images only, and Original and Mixed containing both RGB and NIR images. Table 3 shows the segmentation results for this experiment. The segmentation capability improves when using the Mixed dataset, i.e., when the dataset containing real images is augmented with synthetic data. This supports the idea of creating artificial samples to improve the segmentation performance. Furthermore, the results in Table 3 show that using the Mixed RGB plus NIR dataset during the training process leads to a better performance. This supports our claim that the NIR channel generated using our approach also improves the segmentation capability of the convolutional network architecture used in our experiments.

Ablation Test
As a further demonstration of the validity of our approach, we extend the experiments by focusing only on 256 × 256 patches representing a single instance of the crop. We have performed the training on the following three datasets:
• Real: 2,000 crop and soil real patches extracted from the Bonn dataset.
• Real + style augmentation: Real dataset augmented with 500 patches generated by style GAN.
• Real + shape and style augmentation: Real dataset augmented with 500 patches generated by the proposed approach.
For testing, we used 400 images of sugar beet patches extracted from the Bonn sugar beet dataset and not used in the training phase. The comparison results, presented in Table 4, show that the model trained with the data augmented with shape and style outperforms both the model trained with the real data and the model trained with data augmented with style only.

Conclusions
In this paper, we have presented a data augmentation strategy for improving segmentation that exploits two types of GANs, namely DCGAN and cGAN, to generate entire agricultural scenes by synthesizing only the most relevant objects. The core of the proposed approach lies in exploiting the shapes of real objects to condition the trained generative models. The existing shapes are extracted from real-world labeled images. In addition, the generation process also synthesizes the NIR channel. The synthetically augmented dataset, obtained in this way, can then be used to train a semantic segmentation network.
We introduced a shape and style augmentation approach, in which we augment both the style and the shape of the target object: To generate the shape, we used a DCGAN, and then we used a cGAN to build the style of the target object. We applied our method to the crop/weed segmentation problem.
Different kinds of quantitative evaluation have been carried out to demonstrate that augmenting datasets with our approach can improve the performance of state-of-the-art segmentation architectures. The experimental results show that the segmentation quality increases when using the real dataset augmented with synthetic data.

Figure 1: Synthetic image generation. The real plant in the patch highlighted by a red box in the original image is replaced by a synthetic plant with a different shape and style.
using only 20% of labelled dataset.
Ours | Data augmentation that uses two GANs to create artificial images to augment the training data | Sugar beet | Shape and Style GAN | Intersection over union (mIoU) improved to 0.99 from 0.94 for the background class and to 0.93 from 0.76 for vegetation.

Figure 3: Our pipeline for synthetic shape and style generation. The main input is a real scene (RGB, NIR, and ground truth mask) along with Gaussian noise, and the final output is a semi-artificial scene.

Figure 5: Multispectral synthetic scene generation (RGB, NIR, and ground truth). Highlighted in green is the real scene and, on the left highlighted in blue, the new crop shape. The rest of the table represents the synthetic images obtained by inserting in the original image a plant sample generated with our method.

Figure 6: Examples of crop and soil generated with different styles. The first row represents a set of crop/soil masks, while the first column on the left represents the real sugar beet patches used to guide the generation process of the style. The rest represents the synthetic sugar beet patches obtained by using the guided style images and the synthetically generated mask.
This work is part of a project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101016906. This work has been partially supported by project AGRITECH Spoke 9 - Codice progetto MUR: AGRITECH "National Research Centre for Agricultural Technologies" - CUP CN00000022, of the National Recovery and Resilience Plan (PNRR) financed by the European Union "Next Generation EU".

Table 1
Comparison across recent approaches using GANs for synthetic data generation in precision agriculture

Table 2
Segmentation results of Bonnet architecture, trained on six different datasets, tested on Bonn test dataset

Table 3
Pixel-wise segmentation performance, networks trained on two different inputs (RGB and RGB + NIR), tested on Bonn test dataset.