1 Introduction

In the era of the Internet and technology, online shopping has become an integral part of our lives. Especially with the situation created by the COVID-19 pandemic, online shopping is becoming more and more popular. Providing an interactive shopping experience that goes beyond still images of products is an important problem of particular commercial value in the field of online fashion commerce. Various new features such as fashion tagging, compatibility prediction, clothes retrieval, and virtual try-ons have become hot topics in recent years. A virtual try-on visualizes fashion products on a person without requiring them to be physically tried on. Virtual try-on also aims to answer the question “How does this fashion style look on people in various poses and of various proportions?” This makes virtual try-ons an important tool for fashion research for companies, apart from their widely used application as a virtual trial room.

Building a try-on model is an extremely complicated process, since it involves jointly inferring many features such as the pose, shape, and size of the person as well as features of the cloth that they wish to try on. Most virtual try-ons aim to simplify and separate these problems. They allow a predefined image format for the person with minimal noise and a fixed or limited set of poses. Also, some recent approaches use a 3D virtual avatar that reconstructs the face and puts predefined 3D apparel on it. This is based on the fact that many deep learning approaches are computationally expensive and so are unable to provide the real-time results much needed for consumer products. None of these approaches addresses the issue of garment transfer of truly arbitrary clothes on people in arbitrary poses.

In this article, we present a review and an analysis of the recent developments and state-of-the-art in the field of virtual try-on. We will discuss various innovative loss functions, model architectures, and novel techniques used by these papers. Most of these developments and models involve deep learning as it is widely employed to tackle complex problems involving substantial amounts of data. These deep learning models formulate the virtual try-on as a conditional image generation problem.

The problem of garment transfer has been divided into two sub-problems to make training easier. The first part is commonly known as the warping stage. In this stage, the pose of the person is transferred to the cloth image and the cloth is warped accordingly, hence the name warping stage. The second part of the problem is known as the texturing stage. This stage can be further divided into two sub-problems: the first is to identify where the warped cloth is to be superimposed on the person’s image; once superimposed, textures are added to the cloth to make it appear realistic.
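
As a rough sketch of this decomposition (the function and module names below are placeholders, not the interface of any particular published model), the overall pipeline can be viewed as two composed stages:

```python
# High-level sketch of the two-stage garment-transfer decomposition.
# warp_module and texture_module stand in for the networks discussed later.
def garment_transfer(cloth_img, person_img, warp_module, texture_module):
    warped_cloth = warp_module(cloth_img, person_img)    # warping stage
    try_on = texture_module(warped_cloth, person_img)    # texturing stage
    return try_on
```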

The proposed model is novel and has certain significant advantages as listed below:

  1. As discussed later in Sect. 4, our model introduces a “teacher-student” like method of training; as a result, it generates highly realistic try-ons without human parsing, since the “teacher-student” mechanism enables it to learn from other models.

  2. Our model provides new architectural improvements, such as adding residual connections and using different combinations of loss functions, which reduce the model complexity and the number of steps before the final output is obtained.

  3. Our model is extremely suitable for industrial applications and commercial use, given its lightweight nature and fast training, supplemented by its testing speed.

  4. Since it does not use any parser and is a purely deep neural network, it supports parallelization better via GPUs and TPUs, offering an advantage for running multiple try-ons on a single model.

The rest of the paper is organized as follows: Sect. 2, Related Work, summarizes various models for virtual try-on; Sect. 3, Models, provides details of the three virtual try-on models used for comparison with our proposed model; Sect. 4, Architectural Choices, presents our proposed model in detail; Sect. 5, Implementation Details, brings out the details of the experiments performed; Sect. 6, Results, contains the results obtained; and finally, Sect. 7, Conclusion, furnishes the concluding remarks on the work.

2 Related work

It is a fundamental observation that clothes deform to a significant extent when put on by an individual, adapting to the shape and posture of the person’s body. Thus, implementing a virtual try-on network (VITON) (Han et al. 2018) for 3D clothing remains a major challenge compared to trying on shoes, masks, glasses, cosmetics, hats, watches, etc. Adding to the complexity of this task, a learning model should identify both the basic key points at the human body’s joints and the body shape in 3D, so that the AR experience is acceptable to and appreciated by users.

The most recent deep learning model for the task, DensePose, is also not suitable for augmented reality because its inference speed is slow for real-time applications (Han et al. 2018). While it aims to map pixels of an RGB image of a person to the 3D surface of the human body, the fitting of 3D clothing items is observed to entail many errors due to lower accuracy in the body mesh detections. Collecting and using more annotated data to improve the results is again a time- and resource-consuming process.

The alternative is to consider silhouettes of 2D images of people in order to align 2D images of the clothing items. In its undisclosed software logic, the Zeekit company uses this approach to align the 2D clothing images (Han et al. 2018), and hence users can apply many clothing types (dresses, pants, shirts, etc.) to their photograph.

The algorithm for 2D clothes transfer may have the following steps:

  1. Correspondences between individual body parts and the corresponding areas of the image are identified.

  2. The positions of the identified body parts are detected.

  3. The transferred clothing image is warped.

  4. The warped image is applied to the image of the person with minimal artifacts.

Although SieveNet (Jandial et al. 2020) and SwapNet (Raj et al. 2018) provide good results, they require external help, such as body parsers like Unite the People for segmentation, just as VITON (Han et al. 2018) does. These models require additional segmentation steps and hence are computationally quite expensive. The model proposed by Salimans et al. (2016), a combination of a warping model and a texturing model, supports garment transfer in any pose, and the transition from the warped region to the coarse result looks more natural. The warping model (Salimans et al. 2016) is implemented using generative models, which produce images of warped clothes from the given inputs. This model uses a variant of adversarial loss based on a pretrained VGG model as the discriminator and the warping model as the generator. The texturing model also uses a similar GAN-like training approach. However, it is computationally expensive. None of these approaches, including those based on GANs (Salimans et al. 2016; Heusel et al. 2017), addresses the issue of garment transfer of truly arbitrary clothes on people in arbitrary poses when the quality of the resulting images is assessed using SSIM (Wang et al. 2004).

Wang et al. (2018) applied their proposed model to people’s images in constrained and unconstrained conditions (any environment) and found that their model could generate photorealistic images with a much better perceptual quality and richer, fine details. However, when it comes to processing the images of people captured in varying lighting, different environmental conditions, and unusual poses, the performance of the model is unstable.

3 Models

In this section, we discuss the virtual try-on models with which our proposed model has been compared.

3.1 VITON (Han et al. 2018)

Virtual try-on network (VITON) is one of the most important models in the field of garment transfer. It is the first model to divide the problem into sub-problems and train a separate model for each sub-problem. It is also the first model to support garment transfer in any pose. It uses the Unite the People (UP) model to generate pose maps for the model’s image and uses the captured pose information to generate a new image of the model wearing the virtual garment. VITON can be divided into two sub-models, which are trained separately: the first is the warping model and the second is the texturing model.

3.1.1 Warping module

The warping model takes the target clothing and the model image, along with a pose encoding, as inputs. These inputs are concatenated, preprocessed, and passed through an encoder–decoder GAN. The output of this GAN is a 4-channel image, where the first three channels contain RGB values and the last channel contains the predicted segmentation mask. The image formed by the first three channels resembles a coarsely warped cloth on the model’s image. The loss function used by this encoder–decoder GAN is similar to the perceptual geometric loss.

$$L_{GC} = \sum_{i} \lambda_{i} \left\| \varphi_{i}\left( I' \right) - \varphi_{i}\left( I \right) \right\| + \left\| M - M_{0} \right\|$$
(1)

where I denotes the reference image, I′ refers to the image generated by the warping module, φi(y) is the feature map of the i-th layer of the network for image y, and M and M0 respectively represent the segmentation masks of the generated cloth and the original cloth.
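
As an illustration (not the original VITON code), the loss in Eq. (1) can be sketched with a pretrained VGG-19 feature extractor; the layer indices, layer weights, and the recent-torchvision weights API used below are assumptions of this sketch:

```python
# A hedged sketch of Eq. (1): a weighted VGG-19 perceptual term plus an L1 term
# on the predicted segmentation mask. Layer indices and weights are illustrative.
import torch
import torch.nn.functional as F
from torchvision import models

class VGGFeatures(torch.nn.Module):
    def __init__(self, layer_ids=(3, 8, 17, 26), layer_weights=(1.0, 0.75, 0.5, 0.25)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids, self.layer_weights = vgg, layer_ids, layer_weights

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

def coarse_loss(vgg_feats, I_pred, I_ref, M_pred, M_ref):
    # Perceptual term: weighted L1 distance between feature maps of I' and I.
    perc = sum(w * F.l1_loss(fp, fr)
               for w, fp, fr in zip(vgg_feats.layer_weights,
                                    vgg_feats(I_pred), vgg_feats(I_ref)))
    # Mask term: L1 distance between predicted and reference segmentation masks.
    return perc + F.l1_loss(M_pred, M_ref)
```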

3.1.2 Refinement module

The refinement model uses shape context matching on the segmentation mask generated by the warping model and generates a warped cloth image. The RGB image generated by the warping model and the warped cloth image are both used as inputs for the refinement model. The output of this model is a composition mask. We combine the RGB image and the warped cloth image using this composition mask,

$$I = \alpha \odot c + \left( 1 - \alpha \right) \odot I'$$
(2)

where α is the composition mask, c is the warped cloth image, and I′ is the coarse RGB image.

This network is trained using a combination of a perceptual loss, an L1 norm term, and a total variation (TV) norm term. The latter two terms serve the purpose of regularizing the composition mask output.

$$L_{GR} = L_{{\text{perc}}}\left( I', I \right) - \lambda_{{\text{warp}}} \left\| \alpha \right\| + \lambda_{TV} \left\| \nabla \alpha \right\|$$
(3)

where λwarp and λTV denote the weights for the L1 norm and TV norm losses, respectively. Minimizing the negative L1 term encourages the model to utilize more information from the clothing image. The TV term penalizes the gradients of the generated composition mask α to make it spatially smooth, so that the transition from the warped region to the coarse result looks more natural.
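
As a minimal sketch (assuming a `perceptual_loss` callable and illustrative λ values, not the published settings), Eqs. (2) and (3) can be written as:

```python
# Illustrative sketch of the refinement composition (Eq. 2) and loss (Eq. 3).
import torch

def compose(alpha, warped_cloth, coarse_rgb):
    # Eq. (2): blend the warped cloth and the coarse person image with mask alpha.
    return alpha * warped_cloth + (1.0 - alpha) * coarse_rgb

def refinement_loss(perceptual_loss, I_out, I_ref, alpha, lam_warp=0.1, lam_tv=1e-4):
    # Negative L1 on alpha encourages using more of the warped cloth;
    # the total-variation term keeps the mask spatially smooth.
    tv = (alpha[..., 1:, :] - alpha[..., :-1, :]).abs().mean() + \
         (alpha[..., :, 1:] - alpha[..., :, :-1]).abs().mean()
    return perceptual_loss(I_out, I_ref) - lam_warp * alpha.abs().mean() + lam_tv * tv
```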

3.2 SieveNet (Jandial et al. 2020)

Just like VITON, SieveNet uses the UP model to capture the pose information of the model. The pose encoding and the cloth image are then used as inputs for the warping stage.

3.2.1 Warping module

The warping stage is further divided into two sub-stages. The first stage is responsible for transferring the macro or coarse (I0) features of the pose, whereas the second stage is responsible for transferring the micro or finer (I1) features of the pose. A novel perceptual geometric matching (PGM) loss is used to train this GAN-based model. These two outputs, one containing macro features and the other containing micro features, are subjected to a matching loss against the segmented cloth image (Iorig) from the model. This loss is the weighted sum of the PGM loss and the differences between the segmented cloth image and the two generated images.

$$L_{{\text{warp}}} = L_{PGM} + \left| I_{0} - I_{{\text{orig}}} \right| + \left| I_{1} - I_{{\text{orig}}} \right|$$
(4)

The PGM loss is a weighted sum of two different losses. The first is the push loss, which compares I1 against Iorig to push I1 closer to Iorig than to I0.

$$L_{{\text{push}}} = \left| I_{1} - I_{{\text{orig}}} \right| - \left| I_{1} - I_{0} \right|$$
(5)

The second is the alignment loss, which measures the cosine similarity between the differences of the two generated images from the segmented cloth image.

$$L_{{{\text{align}}}} = CosineSimilarity\left( {\left( {I_{0} - I_{{{\text{orig}}}} } \right),\left( {I_{1} - I_{{{\text{orig}}}} } \right)} \right)$$
(6)
$$L_{PGM} = \lambda_{1} *L_{{{\text{push}}}} + \lambda_{2} *L_{{{\text{align}}}}$$
(7)
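
A hedged sketch of Eqs. (4)–(7) is given below; I0 and I1 are the coarse and fine warp outputs, Iorig the segmented cloth on the model, and the λ values are illustrative:

```python
# Illustrative implementation of the SieveNet-style warping loss (Eqs. 4-7).
import torch
import torch.nn.functional as F

def pgm_loss(I0, I1, I_orig, lam_push=1.0, lam_align=1.0):
    push = F.l1_loss(I1, I_orig) - F.l1_loss(I1, I0)                  # Eq. (5)
    align = F.cosine_similarity((I0 - I_orig).flatten(1),
                                (I1 - I_orig).flatten(1)).mean()      # Eq. (6)
    return lam_push * push + lam_align * align                        # Eq. (7)

def sievenet_warp_loss(I0, I1, I_orig):
    return (pgm_loss(I0, I1, I_orig)
            + F.l1_loss(I0, I_orig) + F.l1_loss(I1, I_orig))          # Eq. (4)
```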

3.2.2 Texture transfer

The texturing module is divided into two stages. The first stage generates a segmentation for the new clothing item on the person. This is done using a 12-layer encoder–decoder network similar to the U-Net architecture. The generated segmentation is subjected to a cross-entropy loss with respect to the person’s segmentation. This trained GAN then predicts the segmentation mask for the new cloth.

The second and final stage of this model is the texturing model. It takes the warped cloth image (I1), the predicted segmentation mask from the first stage, and the model’s image as inputs. These inputs are then passed through a similar 12-layer U-Net like encoder–decoder architecture. The output of this GAN is the final result of our model.

The model is trained with the following loss function:

$$L_{tt} = L_{l1} + L_{{{\text{percep}}}} + L_{{{\text{mask}}}}$$
(8)
$$L_{l1} = \left| {I_{{{\text{try}} - {\text{on}}}} - I_{m} } \right|$$
(9)
$$L_{{{\text{percep}}}} = \left| {VGG(I_{{{\text{try}} - {\text{on}}}} ) - VGG(I_{m} )} \right|$$
(10)
$$L_{{{\text{mask}}}} = \left| {M_{{{\text{cm}}}} - M_{{{\text{cloth }}gt}} } \right|$$
(11)

where Itry-on is the output of the final stage of the model, and Im is our initial image of the model.

3.3 SwapNet (Raj et al. 2018)

SwapNet is the implementation with the best results thus far. It also divides the garment transfer problem into two sub-problems: the warping module and the texturing module. It extensively uses encoder–decoder architectures and ResNet blocks, which provide higher feature resolution compared to the VGG used by SieveNet. It also possesses the ability to extract the cloth from another model’s image, so an explicit garment image is not needed to perform garment transfer.

3.3.1 Warping module

The warping module uses two encoders for capturing the features from clothing and pose, respectively. These encoded representations are equal in size when concatenated, which shows that equal weight is given to both the pose and the clothing representation during further feature resolution and when combining them to generate the output. Let us consider an image A containing a person wearing the desired clothing, and an image B containing the target person. The warping module takes the body segmentation of B, Bbs, and the cloth segmentation of A, Acs, as inputs. It gives a predicted cloth segmentation, Bcs, as output. Instead of a 3-channel RGB prediction, the output is an 18-channel probability map.

The loss function for the warping module is a combination of cross-entropy loss and GAN loss functions.

$$L_{{\text{adv}}} = E_{x\sim p\left( A_{cs} \right)}\left[ D\left( x \right) \right] + E_{z\sim p\left( f1_{{\text{enc}}}\left( A_{cs}, B_{bs} \right) \right)}\left[ 1 - D\left( f1_{{\text{dec}}}\left( z \right) \right) \right]$$
(12)
$$L_{{{\text{warp}}}} = \lambda_{{{\text{adv}}}} L_{{{\text{adv}}}} + L_{ce}$$
(13)

where λadv Ladv refers to the adversarial component of the loss, and f1enc and f1dec are respectively the encoder and decoder components of the warp module. Also, Lce represents the cross-entropy loss function. The segmentations, Acs and Bcs, are generated using the UP body pose model.
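
As a rough sketch (names such as D, f1enc, and f1dec follow the text; the implementation details are assumptions), the generator side of the warping objective in Eqs. (12)–(13) can be written as:

```python
# Illustrative generator-side warping objective: adversarial term plus
# cross-entropy against the parsed cloth segmentation.
import torch
import torch.nn.functional as F

def swapnet_warp_loss(D, f1_enc, f1_dec, A_cs, B_bs, B_cs_target, lam_adv=0.1):
    z = f1_enc(A_cs, B_bs)                       # encode cloth + body segmentations
    pred = f1_dec(z)                             # (N, 18, H, W) segmentation scores
    adv = (1.0 - D(pred)).mean()                 # second term of Eq. (12)
    ce = F.cross_entropy(pred, B_cs_target)      # B_cs_target: (N, H, W) class indices
    return lam_adv * adv + ce                    # Eq. (13)
```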

3.3.2 Texturing module

The texturing module is a classic 12-layer U-Net architecture. It initially does ROI pooling followed by encoder–decoder and feature resolution on the person image B. The output is then concatenated with the output of the warping stage before it is passed to the U-Net network. The output of the U-Net network is the final output of the SwapNet model.

Similar to the first stage, this stage is also weakly supervised. The U-Net architecture is a GAN. This GAN is trained using a pretrained VGG model as a discriminator. We use L1 norm loss, VGG loss, and GAN loss for training the U-Net model.

$${L}_{L1}=\Vert f2\left({z}_{cs}{\prime},A\right)-A\Vert$$
(14)
$${L}_{feat}={\sum }_{l}{\lambda }_{l}\Vert {\varphi }_{l}\left(f2\left({z}_{cs}{\prime},A\right)\right)-{\varphi }_{l}\left(A\right)\Vert$$
(15)
$${L}_{adv}={E}_{x\sim p\left(A\right)}\left[D\left(x\right)\right]+{E}_{z\sim p\left(f2enc\right)}\left[1-D\left(f1dec\left(z\right)\right)\right]$$
(16)

where \({\varphi }_{l}\) denotes the activations of the l-th layer of a pretrained VGG-19 network. The discriminator for this stage has the following objective:

$${L}_{{\text{adv}}}={E}_{x\sim p\left(A\right)}\left[D\left(x\right)\right]+{E}_{z\sim p\left(f2enc\right)}\left[1-D\left(f1dec\left(z\right)\right)\right]+{E}_{z\sim p\left(z\right)}\left[\Vert {\nabla }_{z}D\left(z\right)\Vert \right]$$
(17)

This network is more resistant to noise in the data, as it employs several encoder–decoder architectures, which suppress noise.

4 Architectural choices

The major problem with the above-discussed virtual try-on architectures is that they require external help, such as body parsers like Unite the People for segmentation. These require additional segmentation steps, and all of this parsing takes up excess memory, leaving less room for our model and training examples.

We propose a process to generate highly realistic try-ons without human parsing, which employs a “teacher-student” like mechanism enabling it to learn from other models. The overall process consists of two major steps: (1) warping the try-on cloth to align with the shape and pose of the model and (2) transferring the texture from the try-on cloth to the try-on model. First, we describe the inputs needed for our model. Next, we define the deep learning model architectures and loss functions for both steps. The following section brings out the details of the actual training process, along with other implementation details. The Results section contains details about testing and comparing our approach against the other cited approaches. The final section furnishes the concluding remarks on the work.

4.1 Inputs

The dataset used is the one collected by the authors of VITON (Han et al. 2018). The input to the entire algorithm consists of two main items, namely the try-on cloth (\({I}_{{\text{input}}-{\text{cloth}}}\)) and the model (\({I}_{{\text{input}}-{\text{model}}}\)). These two inputs are sufficient for testing the model. However, training the model requires two more inputs. First, we need a cloth-based segmentation mask (\({I}_{{\text{model}}-{\text{segm}}}\)) of the try-on model. We use JPPNet, as proposed by Liang et al. (2018), to compute this segmentation mask. This helps us overcome the unavailability of ideal training triplets, as discussed in Han et al. (2018). Second, we need garment transfer results from another model, which we use as a “teacher” for our “teacher-student” learning approach. Further details are provided in the next section.

4.2 Warping module

This is the first stage of our approach, which tries to warp the try-on cloth to align with the shape and pose of our try-on model. Warping is achieved using thin-plate spline (TPS)-based spatial transformers (Jaderberg et al. 2015), as introduced in Wang et al. (2018). The key difference between our approach and the method used in Wang et al. (2018) is that we use a two-step process with some guided steps and a different perceptual loss for training our deep learning model.

4.2.1 Pose variations and artifacts

The two major issues we faced while warping cloth are Pose Variation and Artifacts:

  • Pose Variation, as the name suggests, refers to the variety of poses the try-on model can be standing in. It also includes various shapes and sizes of people and their clothes.

  • Artifacts refer to various objects on the try-on model which distort the area of the cloth. For example, long hair overlapping the cloth can lead to an irregular cloth shape.

A successful warping requires taking into account, and subsequently overcoming, both these issues. We have divided the warping module into two stages to accomplish this. The first stage is a deep learning-based regression model, which is responsible for providing a coarse-level warping (\({I}_{{\text{coarse}}-{\text{warp}}}\)) of the try-on cloth according to the approximate shape of the try-on model. Its main objective is to account for and overcome pose variations. As the name suggests, it does not take into account smaller features such as artifacts. This output is then used in the second stage to calculate fine-level transformation parameters and the corresponding warping output. This stage is called the coarse-to-fine warping stage. In order to speed up the process and prevent overcompensation, instead of using multiple stages (as used in VITON), the final output (\({I}_{{\text{fine}}-{\text{warp}}}\)) in our model is calculated from the original try-on cloth instead of the first-stage output. This prevents overcompensating for the changes needed during transformation, since the first-stage changes are already taken into account when calculating the transformation parameters for the second stage. To further facilitate this hierarchical behavior, we introduce residual connections in the network so as to offset the parameters of the fine transformation with those of the coarse transformation.
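
A simplified sketch of this two-stage, residual warping idea is shown below. For brevity it uses affine spatial transformers in place of the TPS-based transformers actually used, and the regressor layout and channel sizes are placeholders:

```python
# Coarse-to-fine warping with a residual connection between the two parameter
# regressors; the fine warp is applied to the *original* try-on cloth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpRegressor(nn.Module):
    """Predicts 2x3 affine parameters from concatenated cloth/person inputs."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 6))

    def forward(self, x):
        return self.net(x).view(-1, 2, 3)

def warp(cloth, theta):
    grid = F.affine_grid(theta, cloth.shape, align_corners=False)
    return F.grid_sample(cloth, grid, align_corners=False)

class CoarseToFineWarp(nn.Module):
    def __init__(self, in_ch):                    # in_ch = cloth + person channels
        super().__init__()
        self.coarse = WarpRegressor(in_ch)
        self.fine = WarpRegressor(in_ch + 3)      # also sees the coarse warp

    def forward(self, cloth, person_repr):
        theta_c = self.coarse(torch.cat([cloth, person_repr], dim=1))
        coarse_warp = warp(cloth, theta_c)
        # Residual connection: the fine stage predicts an offset on top of the
        # coarse parameters, and the result warps the original cloth image.
        delta = self.fine(torch.cat([cloth, person_repr, coarse_warp], dim=1))
        fine_warp = warp(cloth, theta_c + delta)
        return coarse_warp, fine_warp
```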

4.2.2 Perceptual geometric loss

The warp loss (\({L}_{{\text{warp}}}\)), defined below, is a combination of three components:

$${L}_{{\text{warp}}}={\lambda }_{1}*{L}_{s}^{0}+{\lambda }_{2}*{L}_{s}^{1}+{\lambda }_{3}*{L}_{pg}$$
(18)

where \({L}_{s}^{0}\) and \({L}_{s}^{1}\) are L1 losses for the coarse and fine warping, respectively. The main idea behind these is to bring the coarse and fine warping close to the shape of the segmented cloth of the try-on model.

$${L}_{s}^{0}=\left|{I}_{{\text{model}}-{\text{cloth}}}-{I}_{{\text{coarse}}-{\text{warp}}}\right|$$
(19)
$${L}_{s}^{1}=\left|{I}_{{\text{model}}-{\text{cloth}}}-{I}_{{\text{fine}}-{\text{warp}}}\right|$$
(20)
$${I}_{{\text{model}}-{\text{cloth}}}={I}_{{\text{model}}-segm}*{I}_{{\text{input}}-{\text{model}}}$$
(21)

where, \({I}_{{\text{model}}-{\text{cloth}}}\) represents the segmented cloth of the try-on model.

\({L}_{pg}\) refers to the perceptual geometric loss. It is a combination of two components, namely the push loss (\({L}_{{\text{push}}}\)) and the alignment loss (\({L}_{{\text{align}}}\)).

$${L}_{pg}={\lambda }_{4}*{L}_{{\text{push}}}+{\lambda }_{5}*{L}_{{\text{align}}}$$
(22)

The push loss is responsible for “pushing” the model towards better coarse-to-fine warping.

$${L}_{{\text{push}}}=k*{L}_{s}^{1}-\left|{I}_{{\text{fine}}-{\text{warp}}}-{I}_{{\text{coarse}}-{\text{warp}}}\right|$$
(23)

where k is a hyperparameter that controls the “push” given by this loss and ensures a stricter bound on the difference.

For the alignment loss (\({L}_{{\text{align}}}\)), we first use an ImageNet-pretrained VGG-19 model to obtain feature maps for \({I}_{{\text{coarse}}-{\text{warp}}}\), \({I}_{{\text{fine}}-{\text{warp}}}\), and \({I}_{{\text{model}}-{\text{cloth}}}\). Then, we attempt to align the differences between \({I}_{{\text{coarse}}-{\text{warp}}}\) and \({I}_{{\text{model}}-{\text{cloth}}}\) and between \({I}_{{\text{fine}}-{\text{warp}}}\) and \({I}_{{\text{model}}-{\text{cloth}}}\) in this feature space.

$${V}_{{\text{coarse}}}=VGG\left({I}_{{\text{coarse}}-{\text{warp}}}\right)-VGG\left({I}_{{\text{model}}-{\text{cloth}}}\right)$$
(24)
$${V}_{{\text{fine}}}=VGG\left({I}_{{\text{fine}}-{\text{warp}}}\right)-VGG\left({I}_{{\text{model}}-{\text{cloth}}}\right)$$
(25)
$${L}_{{\text{align}}}={\left(CosineSimilarity\left({V}_{{\text{coarse}}},{V}_{{\text{fine}}}\right)-1\right)}^{2}$$
(26)

Minimizing the alignment loss also facilitates the goal of minimizing the push loss.
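
A hedged sketch of the full warp loss (Eqs. 18–26) follows; `vgg` is assumed to return a feature tensor from a pretrained VGG-19, and the λ and k values default to those listed in Sect. 5:

```python
# Illustrative implementation of the proposed warp loss (Eqs. 18-26).
import torch
import torch.nn.functional as F

def proposed_warp_loss(vgg, coarse, fine, model_cloth,
                       lam1=1.0, lam2=1.0, lam3=1.0, lam4=0.5, lam5=0.5, k=3.0):
    Ls0 = F.l1_loss(model_cloth, coarse)                        # Eq. (19)
    Ls1 = F.l1_loss(model_cloth, fine)                          # Eq. (20)
    L_push = k * Ls1 - F.l1_loss(fine, coarse)                  # Eq. (23)
    v_coarse = vgg(coarse) - vgg(model_cloth)                   # Eq. (24)
    v_fine = vgg(fine) - vgg(model_cloth)                       # Eq. (25)
    cos = F.cosine_similarity(v_coarse.flatten(1), v_fine.flatten(1)).mean()
    L_align = (cos - 1.0) ** 2                                  # Eq. (26)
    L_pg = lam4 * L_push + lam5 * L_align                       # Eq. (22)
    return lam1 * Ls0 + lam2 * Ls1 + lam3 * L_pg                # Eq. (18)
```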

4.3 Refinement module

Once we have the warped try-on cloth image, the next step is to transfer this warped cloth onto the try-on model image. Compared to warping, texture transfer is a more complex task, as it requires attention to finer details, leading to a higher model complexity. We first considered a U-Net-like architecture because of its encoder–decoder structure and multiple layer-to-layer connections: the encoder–decoder structure helps combine the desired features from the try-on model and cloth into a feature space and decode them to achieve the desired garment transfer, while the layer-to-layer connections help keep a balance between the finer and coarser features transferred. However, in initial experiments, a simple 12-layer U-Net architecture proved insufficient for our use case. We therefore adopted the Res-UNet architecture (Ronneberger and Fischer 2015), which is built like a U-Net in combination with residual connections that preserve the details of the warped clothes and generate realistic try-on results. The output of this Res-UNet is the rendered image (\({I}_{{\text{rendered}}}\)), which is expected to contain the final cloth warp. We use the warped cloth generated in the first stage to create a segmentation mask for the try-on cloth (\({M}_{{\text{cloth}}}\)) and a skin mask (\({M}_{{\text{skin}}}\)), and then use the following formulas to generate the final try-on image (\({I}_{{\text{generated}}-{\text{try}}-{\text{on}}}\)).

$${I}_{{\text{generated}}-{\text{try}}-{\text{on}}}={M}_{{\text{cloth}}}*{I}_{{\text{rendered}}}+\left(1-{M}_{{\text{cloth}}}\right)*{I}_{{\text{input}}-{\text{model}}}$$
(27)
$${M}_{{\text{skin}}}=1-{M}_{{\text{cloth}}}$$
(28)

4.3.1 Perceptual path loss

The Texture Loss is defined below as a combination of three losses:

$${L}_{{\text{texture}}}={PPL}_{{\text{generated}}-{\text{try}}-{\text{on}}}+{PPL}_{{\text{rendered}}}+{PPL}_{{\text{skin}}}$$
(29)

where PPL refers to perceptual path loss.

$${PPL}_{{\text{generated}}-{\text{try}}-{\text{on}}}={\lambda }_{6}*\left|{I}_{{\text{generated}}-{\text{try}}-{\text{on}}}-{I}_{{\text{ground}}-{\text{truth}}}\right|+{\lambda }_{7}*VGG\left({I}_{{\text{generated}}-{\text{try}}-{\text{on}}},{I}_{{\text{ground}}-{\text{truth}}}\right)$$
(30)
$${PPL}_{{\text{rendered}}}={\lambda }_{8}*\left|{I}_{{\text{rendered}}}-{I}_{{\text{ground}}-{\text{truth}}}\right|+{\lambda }_{9}*VGG\left({I}_{{\text{rendered}}},{I}_{{\text{ground}}-{\text{truth}}}\right)$$
(31)
$${PPL}_{{\text{skin}}}={\lambda }_{10}*\left|\left({M}_{{\text{skin}}}*{I}_{{\text{rendered}}}\right)-\left({M}_{{\text{skin}}}*{I}_{{\text{ground}}-{\text{truth}}}\right)\right|+{\lambda }_{11}*VGG\left(\left({M}_{{\text{skin}}}*{I}_{{\text{rendered}}}\right),\left({M}_{{\text{skin}}}*{I}_{{\text{ground}}-{\text{truth}}}\right)\right)$$
(32)

Here, \({PPL}_{{\text{generated}}-{\text{try}}-{\text{on}}}\) is used to ensure that the final generated try-on image is as close to the ground truth image (\({I}_{{\text{ground}}-{\text{truth}}}\)) as possible.

However, we also need to keep in mind that the try-on model’s skin and other background details should remain intact. In addition, we need to maintain the shape and features of the warped cloth. To account for these two details, we introduce \({PPL}_{{\text{rendered}}}\) and \({PPL}_{{\text{skin}}}\). These are responsible for keeping the overfitting of the model in check and for guiding it down the correct learning path. Figure 1 depicts the architecture of our model.
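
A minimal sketch of the texture loss (Eqs. 29–32) is given below, assuming a `vgg_dist(a, b)` callable that returns a VGG-based perceptual distance; the λ values default to those listed in Sect. 5:

```python
# Illustrative implementation of the texture loss (Eqs. 29-32).
import torch.nn.functional as F

def ppl(pred, target, vgg_dist, lam_l1, lam_vgg):
    return lam_l1 * F.l1_loss(pred, target) + lam_vgg * vgg_dist(pred, target)

def texture_loss(rendered, generated_try_on, ground_truth, skin_mask, vgg_dist):
    L_gen = ppl(generated_try_on, ground_truth, vgg_dist, lam_l1=5, lam_vgg=1)    # Eq. (30)
    L_ren = ppl(rendered, ground_truth, vgg_dist, lam_l1=5, lam_vgg=1)            # Eq. (31)
    L_skin = ppl(skin_mask * rendered, skin_mask * ground_truth,
                 vgg_dist, lam_l1=30, lam_vgg=2)                                  # Eq. (32)
    return L_gen + L_ren + L_skin                                                 # Eq. (29)
```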

Fig. 1 Two-step architecture of our model

5 Implementation details

The warping stage of the model requires two-step training. First, the try-on cloth and model encoders are trained in an encoder–decoder architecture that uses the same image as input and output, with an L1 loss as the learning criterion. Next, we train the coarse-to-fine refining network by using the model’s own cloth image as the try-on cloth and leveraging the cloth-based segmentation mask (\({I}_{{\text{model}}-segm}\)) of the model to generate the warped cloth of the model, which is then used as the ground truth against which the output is compared. We use the warp loss (\({L}_{{\text{warp}}}\)) defined earlier to train this model.

Once the warping model is trained, we train the texture stage separately. Training the texture stage requires existing garment-transfer results. We decided to use VITON (Han et al. 2018) and SieveNet (Jandial et al. 2020) for this purpose. We used the texture loss (\({L}_{{\text{texture}}}\)) described earlier for training.

The hyperparameter configurations are as follows: batch size = 16, epochs = 15, optimizer = Adam (Kingma and Ba 2015), lr = 0.002, \({\lambda }_{1}\) = \({\lambda }_{2}\)= \({\lambda }_{3}\)= \({\lambda }_{7}\)= \({\lambda }_{9}\)= 1, \({\lambda }_{4}\)= \({\lambda }_{5}\)  = 0.5, \({\lambda }_{6}\)= \({\lambda }_{8}\)= 5, \({\lambda }_{10}\)= 30, \({\lambda }_{11}\)= 2 and k = 3.
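
As an illustration only (the model, dataset, and loss function are assumed to be defined elsewhere), the warping-stage training loop with the stated hyperparameters could look like this:

```python
# Illustrative training loop: Adam, lr = 0.002, batch size 16, 15 epochs.
import torch
from torch.utils.data import DataLoader

def train_warp_stage(model, dataset, loss_fn, device="cuda"):
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=0.002)
    model.to(device).train()
    for epoch in range(15):
        for cloth, person_repr, target_cloth in loader:
            cloth, person_repr, target_cloth = (
                t.to(device) for t in (cloth, person_repr, target_cloth))
            coarse, fine = model(cloth, person_repr)
            loss = loss_fn(coarse, fine, target_cloth)   # e.g., the warp loss above
            opt.zero_grad()
            loss.backward()
            opt.step()
```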

6 Results

For quantitative analysis, we use Inception Score (IS) (Salimans et al. 2016), Frechet Inception Distance (FID) (Heusel et al. 2017), and Structural Similarity (SSIM) (Wang et al. 2004). Except for IS, the metrics were computed against the ground truth to obtain an evaluation score. IS is an unpaired metric that tries to assess whether a given image looks real; the higher the IS, the better the result. FID is a distance metric, so the lower the distance, the better the result.
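
As a sketch of how these metrics can be computed (assuming a recent torchmetrics release; the batches are uint8 images of shape (N, 3, H, W)):

```python
# Illustrative evaluation with IS, FID, and SSIM via torchmetrics.
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

def evaluate(generated: torch.Tensor, real: torch.Tensor):
    ssim = StructuralSimilarityIndexMeasure(data_range=255.0)
    ssim_score = ssim(generated.float(), real.float())   # paired, against ground truth

    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real, real=True)
    fid.update(generated, real=False)
    fid_score = fid.compute()                            # lower is better

    inception = InceptionScore()
    inception.update(generated)                          # unpaired metric
    is_mean, is_std = inception.compute()                # higher is better
    return is_mean, fid_score, ssim_score
```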

According to the calculated results (Table 1), our model performed best on the SSIM metric, indicating that it retains the facial features along with the general body structure and posture of people. This can be visually observed from the zoomed-in images shown in Fig. 2. In Fig. 2a, the model images are shown; in Fig. 2b, the model images after try-on are shown. In Fig. 2c, the first three images show that facial artifacts remain intact in all three major scenarios: the first scenario, when the facial features overlap with the warped cloth; the second scenario, where the features are in the background; and the third scenario, when there are no outlying features, in which our model does not generate new artifacts. The fourth image shows how the arms are assimilated with the warped cloth. The fifth image shows how the lower body fits well with the warped cloth. A careful reading of the “Failure cases” given in Sect. 5.2 of SieveNet (Jandial et al. 2020) and of the “Limitations” included in Sect. 4.2.1 of SwapNet (Raj et al. 2018) reveals that dealing with large changes in pose and handling artifacts are two problems associated with SieveNet and SwapNet. It is clear from the sub-sections “Person representation analysis” and “Failure cases” of Sect. 4.2.2, "Qualitative Results," of VITON (Han et al. 2018) that VITON too has some drawbacks with respect to handling complicated poses and artifacts. Finally, the weaker FID score can be attributed to differences in garments, as new garment-based features can increase FID.

Table 1 Comparison of IS, FID, and SSIM Scores
Fig. 2 Example images showing that the proposed model retains the facial features and artifacts along with the general body structure and posture of people; a zoomed-in versions of the images before virtual try-on; b zoomed-in versions after virtual try-on, where we can visually see that facial artifacts remain intact, as also shown in (c); c no new artifacts are observed when there are no outlying features

As for qualitative results, Fig. 3 shows a few examples of the results delivered by our network. Our model is extremely lightweight and flexible. It takes up ~ 1.5 GB of GPU memory, which is significantly less than any of the comparative models, which take up from five to a couple of dozen GB of memory. The training time for the model is roughly 4 h, which is also significantly shorter than that of any other model. The model takes ~ 0.15 s to evaluate an image, which again is much faster than these models, which take 5 to 15 s per evaluation. All these tests were performed on a Tesla K80 GPU in a standard Google Colaboratory environment. Upon comparing it to a similar parser-free model, PF-AFN (Ge et al. 2021), we find that our model is much simpler in terms of complexity. Also, it is not dependent on a specific model for training purposes. Further, it is faster in terms of training and testing time because of its lower complexity.

Fig. 3 Garment transfer with various clothing and postures

Our model is extremely suitable for industrial applications and commercial use, given its lightweight nature and fast training, supplemented by its testing speed. The model can easily be re-trained for garment types other than upper-body clothes, given an appropriate dataset. Further, our model uses only the image of the model and the cloth and requires no extra parsing. Both VITON and SieveNet require parsing for segmentation and body pose representations from specific frameworks to function properly. SwapNet does not require segmentation, but it still requires body pose encodings. Our model, in contrast, is self-reliant.

7 Conclusion

In this article, we propose a parser-free virtual try-on model, which is essentially a fully automated end-to-end image-based virtual try-on model for upper-body clothing. We have improved upon the previous models in terms of Structural Similarity, and have also used a refinement module that enables us to remove unwanted noise generated at times by the GAN. Experiments on the VITON dataset also demonstrate that the model handles people wearing various clothing, such as long-sleeved, short-sleeved, and sleeveless garments, as well as people in various postures, and can still perform garment transfer at an acceptable level. We also kept the model parser-free, allowing users to use the model on the go after deployment. Such an advantage over other models, which depend on human pose parsing algorithms, can provide a much wider scope of application, ranging from fashion research to general public shopping usage.