1 Introduction

In the era of the Internet and technology, online shopping has become an integral part of our lives. Especially with the situation created by the COVID-19 pandemic, online shopping is becoming more and more popular. Providing an interactive shopping experience that goes beyond still images of products is an important problem of particular commercial value in the field of online fashion commerce. Various new features such as fashion tagging, compatibility prediction, clothes retrieval, and virtual try-ons have become hot topics in recent years. A virtual try-on visualizes fashion products on a person without requiring them to be physically tried on. Virtual try-on also aims to answer the question “How does this fashion style look on people in various poses and of various proportions?” This makes virtual try-ons an important tool for fashion research for companies, apart from their widely used application as a virtual trial room.

Building a try-on model is an extremely complicated process, since it involves jointly inferring many features such as the pose, shape, and size of the person as well as features of the cloth that they wish to try on. Most virtual try-ons aim to simplify and separate these problems. They allow a predefined image format for the person with minimal noise and a fixed or limited set of poses. Also, some recent approaches use a 3D virtual avatar that reconstructs the face and puts predefined 3D apparel on it. This is based on the fact that many deep learning approaches are computationally expensive and so are unable to provide the real-time results much needed for consumer products. None of these approaches addresses the issue of garment transfer of truly arbitrary clothes on people in arbitrary poses.

In this article, we present a review and an analysis of the recent developments and state-of-the-art in the field of virtual try-on. We will discuss various innovative loss functions, model architectures, and novel techniques used by these papers. Most of these developments and models involve deep learning as it is widely employed to tackle complex problems involving substantial amounts of data. These deep learning models formulate the virtual try-on as a conditional image generation problem.

The problem of garment transfer has been divided into two sub-problems to make training easier. The first part is commonly known as the warping stage. In this stage, the pose of the person is transferred to the cloth image and the cloth is warped accordingly, hence the name warping stage. The second part of the problem is known as the texturing stage. This stage can be further divided into two sub-problems: the first is to identify where the warped cloth is to be superimposed on the person’s image; once superimposed, textures are added to the cloth to make it appear realistic.
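
As a rough sketch of this decomposition (the function and module names below are placeholders, not the interface of any particular published model), the overall pipeline can be viewed as two composed stages:

```python
# High-level sketch of the two-stage garment-transfer decomposition.
# warp_module and texture_module stand in for the networks discussed later.
def garment_transfer(cloth_img, person_img, warp_module, texture_module):
    warped_cloth = warp_module(cloth_img, person_img)    # warping stage
    try_on = texture_module(warped_cloth, person_img)    # texturing stage
    return try_on
```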

The proposed model is novel and has certain significant advantages as listed below:

  1. As discussed later in Sect. 4, our model introduces a “teacher-student” like method of training; as a result, it generates highly realistic try-ons without human parsing, since the “teacher-student” mechanism enables it to learn from other models.

  2. Our model provides new architectural improvements, such as adding residual connections and using different combinations of loss functions, which reduce the model complexity and the number of steps before the final output is obtained.

  3. Our model is extremely suitable for industrial applications and commercial use, given its lightweight nature and fast training, supplemented by its testing speed.

  4. Since it does not use any parser and is a purely deep neural network, it supports parallelization better via GPUs and TPUs, offering an advantage for running multiple try-ons on a single model.

The rest of the paper is organized as follows: Sect. 2, Related Work, summarizes various models for virtual try-on; Sect. 3, Models, provides details of the three virtual try-on models used for comparison with our proposed model; Sect. 4, Architectural Choices, presents our proposed model in detail; Sect. 5, Implementation Details, brings out the details of the experiments performed; Sect. 6, Results, contains the results obtained; and finally, Sect. 7, Conclusion, furnishes the concluding remarks on the work.

2 Related work

It is a fundamental observation that clothes deform to a significant extent when put on by an individual, adapting to the shape and posture of the person’s body. Thus, implementing a virtual try-on network (VITON) (Han et al. 2018) for 3D clothing remains a major challenge compared to trying on shoes, masks, glasses, cosmetics, hats, watches, etc. Adding to the complexity of this task, a learning model should identify both the basic key points at the human body’s joints and the body shape in 3D, so that the AR experience is acceptable to and appreciated by users.

The most recent deep learning model for the task, DensePose, is also not suitable for augmented reality because its inference speed is slow for real-time applications (Han et al. 2018). While it aims to map pixels of an RGB image of a person to the 3D surface of the human body, the fitting of 3D clothing items is observed to entail many errors due to lower accuracy in the body mesh detections. Collecting and using more annotated data to improve the results is again a time- and resource-consuming process.

The alternative is to consider silhouettes of 2D images of people in order to align 2D images of the clothing items. In its undisclosed software logic, the Zeekit company uses this approach to align the 2D clothing images (Han et al. 2018), and hence users can apply many clothing types (dresses, pants, shirts, etc.) to their photograph.

The algorithm for 2D clothes transfer may have the following steps:

  1. Correspondences between individual body parts and the corresponding areas of the image are identified.

  2. The positions of the identified body parts are detected.

  3. The transferred clothing image is warped.

  4. The warped image is applied to the image of the person with minimal artifacts.

Although SieveNet (Jandial et al. 2020) and SwapNet (Raj et al. 2018) provide good results, they require external help, such as body parsers like Unite the People for segmentation, just as VITON (Han et al. 2018) does. These models require additional segmentation steps and hence are computationally quite expensive. The model proposed by Salimans et al. (2016), a combination of a warping model and a texturing model, supports garment transfer in any pose, and the transition from the warped region to the coarse result looks more natural. The warping model (Salimans et al. 2016) is implemented using generative models, which produce images of warped clothes from the given inputs. This model uses a variant of adversarial loss based on a pretrained VGG model as the discriminator and the warping model as the generator. The texturing model also uses a similar GAN-like training approach. However, it is computationally expensive. None of these approaches, including those based on GANs (Salimans et al. 2016; Heusel et al. 2017), addresses the issue of garment transfer of truly arbitrary clothes on people in arbitrary poses when the quality of the resulting images is assessed using SSIM (Wang et al. 2004).

Wang et al. (2018) applied their proposed model to people’s images in constrained and unconstrained conditions (any environment) and found that their model could generate photorealistic images with a much better perceptual quality and richer, fine details. However, when it comes to processing the images of people captured in varying lighting, different environmental conditions, and unusual poses, the performance of the model is unstable.

3 Models

In this section, we discuss the virtual try-on models with which our proposed model has been compared.

3.1 VITON (Han et al. 2018)

Virtual try-on network (VITON) is one of the most important models in the field of garment transfer. It is the first model to divide the problem into sub-problems and train a separate model for each sub-problem. It is also the first model to support garment transfer in any pose. It uses the Unite the People (UP) model to generate pose maps for the model’s image and uses the captured pose information to generate a new image of the model wearing the virtual garment. VITON can be divided into two sub-models, which are trained separately: the first is the warping model and the second is the texturing model.

3.1.1 Warping module

The warping model takes the target clothing and the model image, along with a pose encoding, as inputs. These inputs are concatenated, preprocessed, and passed through an encoder–decoder GAN. The output of this GAN is a 4-channel image, where the first three channels contain RGB values and the last channel contains the predicted segmentation mask. The image formed by the first three channels resembles a coarsely warped cloth on the model’s image. The loss function used by this encoder–decoder GAN is similar to the perceptual geometric loss.

$$L_{GC} = \sum_{i} \lambda_{i} \left\| \varphi_{i}\left( I' \right) - \varphi_{i}\left( I \right) \right\| + \left\| M - M_{0} \right\|$$
(1)

where I denotes the reference image, I′ refers to the image generated by the warping module, φi(y) is the feature map of the i-th layer of the network for image y, and M and M0 respectively represent the segmentation masks of the generated cloth and the original cloth.
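
As an illustration (not the original VITON code), the loss in Eq. (1) can be sketched with a pretrained VGG-19 feature extractor; the layer indices, layer weights, and the recent-torchvision weights API used below are assumptions of this sketch:

```python
# A hedged sketch of Eq. (1): a weighted VGG-19 perceptual term plus an L1 term
# on the predicted segmentation mask. Layer indices and weights are illustrative.
import torch
import torch.nn.functional as F
from torchvision import models

class VGGFeatures(torch.nn.Module):
    def __init__(self, layer_ids=(3, 8, 17, 26), layer_weights=(1.0, 0.75, 0.5, 0.25)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids, self.layer_weights = vgg, layer_ids, layer_weights

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

def coarse_loss(vgg_feats, I_pred, I_ref, M_pred, M_ref):
    # Perceptual term: weighted L1 distance between feature maps of I' and I.
    perc = sum(w * F.l1_loss(fp, fr)
               for w, fp, fr in zip(vgg_feats.layer_weights,
                                    vgg_feats(I_pred), vgg_feats(I_ref)))
    # Mask term: L1 distance between predicted and reference segmentation masks.
    return perc + F.l1_loss(M_pred, M_ref)
```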

3.1.2 Refinement module

The refinement model uses shape context matching on the segmentation mask generated by the warping model and generates a warped cloth image. The RGB image generated by the warping model and the warped cloth image are both used as inputs for the refinement model. The output of this model is a composition mask. We combine the RGB image and the warped cloth image using this composition mask,

$$I = \alpha \odot c + \left( 1 - \alpha \right) \odot I'$$
(2)

where α is the composition mask, c is the warped cloth image, and I′ is the coarse RGB image.

This network is trained using a combination of a perceptual loss, an L1 norm term, and a total variation (TV) norm term. The latter two terms serve the purpose of regularizing the composition mask output.

$$L_{GR} = L_{{\text{perc}}}\left( I', I \right) - \lambda_{{\text{warp}}} \left\| \alpha \right\| + \lambda_{TV} \left\| \nabla \alpha \right\|$$
(3)

where λwarp and λTV denote the weights for the L1 norm and TV norm losses, respectively. Minimizing the negative L1 term encourages the model to utilize more information from the clothing image. The TV term penalizes the gradients of the generated composition mask α to make it spatially smooth, so that the transition from the warped region to the coarse result looks more natural.
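
As a minimal sketch (assuming a `perceptual_loss` callable and illustrative λ values, not the published settings), Eqs. (2) and (3) can be written as:

```python
# Illustrative sketch of the refinement composition (Eq. 2) and loss (Eq. 3).
import torch

def compose(alpha, warped_cloth, coarse_rgb):
    # Eq. (2): blend the warped cloth and the coarse person image with mask alpha.
    return alpha * warped_cloth + (1.0 - alpha) * coarse_rgb

def refinement_loss(perceptual_loss, I_out, I_ref, alpha, lam_warp=0.1, lam_tv=1e-4):
    # Negative L1 on alpha encourages using more of the warped cloth;
    # the total-variation term keeps the mask spatially smooth.
    tv = (alpha[..., 1:, :] - alpha[..., :-1, :]).abs().mean() + \
         (alpha[..., :, 1:] - alpha[..., :, :-1]).abs().mean()
    return perceptual_loss(I_out, I_ref) - lam_warp * alpha.abs().mean() + lam_tv * tv
```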

3.2 SieveNet (Jandial et al. 2020)

Just like VITON, SieveNet uses the UP model to capture the pose information of the model. The pose encoding and the cloth image are then used as inputs for the warping stage.

3.2.1 Warping module

The warping stage is further divided into two sub-stages. The first stage is responsible for transferring the macro or coarse (I0) features of the pose, whereas the second stage is responsible for transferring the micro or finer (I1) features of the pose. A novel perceptual geometric matching (PGM) loss is used to train this GAN-based model. These two outputs, one containing macro features and the other containing micro features, are subjected to a matching loss against the segmented cloth image (Iorig) from the model. This loss is the weighted sum of the PGM loss and the differences between the segmented cloth image and the two generated images.

$$L_{{\text{warp}}} = L_{PGM} + \left| I_{0} - I_{{\text{orig}}} \right| + \left| I_{1} - I_{{\text{orig}}} \right|$$
(4)

The PGM loss is a weighted sum of two different losses. The first is the push loss, which compares I1 against Iorig to push I1 closer to Iorig than to I0.

$$L_{{\text{push}}} = \left| I_{1} - I_{{\text{orig}}} \right| - \left| I_{1} - I_{0} \right|$$
(5)

The second is the alignment loss, which measures the cosine similarity between the differences of the two generated images from the segmented cloth image.

$$L_{{{\text{align}}}} = CosineSimilarity\left( {\left( {I_{0} - I_{{{\text{orig}}}} } \right),\left( {I_{1} - I_{{{\text{orig}}}} } \right)} \right)$$
(6)
$$L_{PGM} = \lambda_{1} *L_{{{\text{push}}}} + \lambda_{2} *L_{{{\text{align}}}}$$
(7)
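
A hedged sketch of Eqs. (4)–(7) is given below; I0 and I1 are the coarse and fine warp outputs, Iorig the segmented cloth on the model, and the λ values are illustrative:

```python
# Illustrative implementation of the SieveNet-style warping loss (Eqs. 4-7).
import torch
import torch.nn.functional as F

def pgm_loss(I0, I1, I_orig, lam_push=1.0, lam_align=1.0):
    push = F.l1_loss(I1, I_orig) - F.l1_loss(I1, I0)                  # Eq. (5)
    align = F.cosine_similarity((I0 - I_orig).flatten(1),
                                (I1 - I_orig).flatten(1)).mean()      # Eq. (6)
    return lam_push * push + lam_align * align                        # Eq. (7)

def sievenet_warp_loss(I0, I1, I_orig):
    return (pgm_loss(I0, I1, I_orig)
            + F.l1_loss(I0, I_orig) + F.l1_loss(I1, I_orig))          # Eq. (4)
```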

3.2.2 Texture transfer

The texturing module is divided into two stages. The first stage generates a segmentation for the new clothing item on the person. This is done using a 12-layer encoder–decoder network similar to the U-Net architecture. The generated segmentation is subjected to a cross-entropy loss with respect to the person’s segmentation. This trained GAN then predicts the segmentation mask for the new cloth.

The second and final stage of this model is the texturing model. It takes the warped cloth image (I1), the predicted segmentation mask from the first stage, and the model’s image as inputs. These inputs are then passed through a similar 12-layer U-Net like encoder–decoder architecture. The output of this GAN is the final result of our model.

The model is trained with the following loss function:

$$L_{tt} = L_{l1} + L_{{{\text{percep}}}} + L_{{{\text{mask}}}}$$
(8)
$$L_{l1} = \left| {I_{{{\text{try}} - {\text{on}}}} - I_{m} } \right|$$
(9)
$$L_{{{\text{percep}}}} = \left| {VGG(I_{{{\text{try}} - {\text{on}}}} ) - VGG(I_{m} )} \right|$$
(10)
$$L_{{{\text{mask}}}} = \left| {M_{{{\text{cm}}}} - M_{{{\text{cloth }}gt}} } \right|$$
(11)

where Itry-on is the output of the final stage of the model, and Im is our initial image of the model.

3.3 SwapNet (Raj et al. 2018)

SwapNet is the implementation with the best results thus far. It also divides the garment transfer problem into two sub-problems: the warping module and the texturing module. It extensively uses encoder–decoder architectures and ResNet blocks, which provide higher feature resolution compared to the VGG used by SieveNet. It also possesses the ability to extract the cloth from another model’s image, so an explicit garment image is not needed to perform garment transfer.

3.3.1 Warping module

The warping module uses two encoders for capturing the features from clothing and pose, respectively. These encoded representations are equal in size when concatenated, which shows that equal weight is given to both the pose and the clothing representation during further feature resolution and when combining them to generate the output. Let us consider an image A containing a person wearing the desired clothing, and an image B containing the target person. The warping module takes the body segmentation of B, Bbs, and the cloth segmentation of A, Acs, as inputs. It gives a predicted cloth segmentation, Bcs, as output. Instead of a 3-channel RGB prediction, the output is an 18-channel probability map.

The loss function for the warping module is a combination of cross-entropy loss and GAN loss functions.

$$L_{{\text{adv}}} = E_{x\sim p\left( A_{cs} \right)}\left[ D\left( x \right) \right] + E_{z\sim p\left( f1_{{\text{enc}}}\left( A_{cs}, B_{bs} \right) \right)}\left[ 1 - D\left( f1_{{\text{dec}}}\left( z \right) \right) \right]$$
(12)
$$L_{{{\text{warp}}}} = \lambda_{{{\text{adv}}}} L_{{{\text{adv}}}} + L_{ce}$$
(13)

where λadv Ladv refers to the adversarial component of the loss, and f1enc and f1dec are respectively the encoder and decoder components of the warp module. Also, Lce represents the cross-entropy loss function. The segmentations, Acs and Bcs, are generated using the UP body pose model.
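
As a rough sketch (names such as D, f1enc, and f1dec follow the text; the implementation details are assumptions), the generator side of the warping objective in Eqs. (12)–(13) can be written as:

```python
# Illustrative generator-side warping objective: adversarial term plus
# cross-entropy against the parsed cloth segmentation.
import torch
import torch.nn.functional as F

def swapnet_warp_loss(D, f1_enc, f1_dec, A_cs, B_bs, B_cs_target, lam_adv=0.1):
    z = f1_enc(A_cs, B_bs)                       # encode cloth + body segmentations
    pred = f1_dec(z)                             # (N, 18, H, W) segmentation scores
    adv = (1.0 - D(pred)).mean()                 # second term of Eq. (12)
    ce = F.cross_entropy(pred, B_cs_target)      # B_cs_target: (N, H, W) class indices
    return lam_adv * adv + ce                    # Eq. (13)
```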

3.3.2 Texturing module

The texturing module is a classic 12-layer U-Net architecture. It initially does ROI pooling followed by encoder–decoder and feature resolution on the person image B. The output is then concatenated with the output of the warping stage before it is passed to the U-Net network. The output of the U-Net network is the final output of the SwapNet model.

Similar to the first stage, this stage is also weakly supervised. The U-Net architecture is a GAN. This GAN is trained using a pretrained VGG model as a discriminator. We use L1 norm loss, VGG loss, and GAN loss for training the U-Net model.

$${L}_{L1}=\Vert f2\left({z}_{cs}{\prime},A\right)-A\Vert$$
(14)
$${L}_{feat}={\sum }_{l}{\lambda }_{l}\Vert {\varphi }_{l}\left(f2\left({z}_{cs}{\prime},A\right)\right)-{\varphi }_{l}\left(A\right)\Vert$$
(15)
$${L}_{adv}={E}_{x\sim p\left(A\right)}\left[D\left(x\right)\right]+{E}_{z\sim p\left(f2enc\right)}\left[1-D\left(f1dec\left(z\right)\right)\right]$$
(16)

where \({\varphi }_{l}\) denotes the activations of the l-th layer of a pretrained VGG-19 network. The discriminator for this stage has the following objective:

$${L}_{{\text{adv}}}={E}_{x\sim p\left(A\right)}\left[D\left(x\right)\right]+{E}_{z\sim p\left(f2enc\right)}\left[1-D\left(f1dec\left(z\right)\right)\right]+{E}_{z\sim p\left(z\right)}\left[\Vert {\nabla }_{z}D\left(z\right)\Vert \right]$$
(17)

This network is more resistant to noise in the data, as it employs several encoder–decoder architectures, which suppress noise.

4 Architectural choices

The major problem with the above-discussed virtual try-on architectures is that they require external help, such as body parsers like Unite the People for segmentation. These require additional segmentation steps, and all of this parsing takes up excess memory, leaving less room for our model and training examples.

We propose a process to generate highly realistic try-ons without human parsing, which employs a “teacher-student” like mechanism enabling it to learn from other models. The overall process consists of two major steps: (1) warping the try-on cloth to align with the shape and pose of the model and (2) transferring the texture from the try-on cloth to the try-on model. First, we describe the inputs needed for our model. Next, we define the deep learning model architectures and loss functions for both steps. The following section brings out the details of the actual training process, along with other implementation details. The Results section contains details about testing and comparing our approach against the other cited approaches. The final section furnishes the concluding remarks on the work.

4.1 Inputs

The dataset used is the one collected by the authors of VITON (Han et al. 2018). The input to the entire algorithm consists of two main items, namely the try-on cloth (\({I}_{{\text{input}}-{\text{cloth}}}\)) and the model (\({I}_{{\text{input}}-{\text{model}}}\)). These two inputs are sufficient for testing the model. However, training the model requires two more inputs. First, we need a cloth-based segmentation mask (\({I}_{{\text{model}}-{\text{segm}}}\)) of the try-on model. We use JPPNet, as proposed by Liang et al. (2018), to compute this segmentation mask. This helps us overcome the unavailability of ideal training triplets, as discussed in Han et al. (2018). Second, we need garment transfer results from another model, which we use as a “teacher” for our “teacher-student” learning approach. Further details are provided in the next section.

4.2 Warping module

This is the first stage of our approach, which tries to warp the try-on cloth to align with the shape and pose of our try-on model. Warping is achieved using thin-plate spline (TPS)-based spatial transformers (Jaderberg et al. 2015), as introduced in Wang et al. (2018). The key difference between our approach and the method used in Wang et al. (2018) is that we use a two-step process with some guided steps and a different perceptual loss for training our deep learning model.

4.2.1 Pose variations and artifacts

The two major issues we faced while warping cloth are Pose Variation and Artifacts:

  • Pose Variation, as the name suggests, refers to the variety of poses the try-on model can be standing in. It also includes various shapes and sizes of people and their clothes.

  • Artifacts refer to various objects on the try-on model which distort the area of the cloth. For example, long hair overlapping the cloth can lead to an irregular cloth shape.

A successful warping requires taking into account, and subsequently overcoming, both these issues. We have divided the warping module into two stages to accomplish this. The first stage is a deep learning-based regression model, which is responsible for providing a coarse-level warping (\({I}_{{\text{coarse}}-{\text{warp}}}\)) of the try-on cloth according to the approximate shape of the try-on model. Its main objective is to account for and overcome pose variations. As the name suggests, it does not take into account smaller features such as artifacts. This output is then used in the second stage to calculate fine-level transformation parameters and the corresponding warping output. This stage is called the coarse-to-fine warping stage. In order to speed up the process and prevent overcompensation, instead of using multiple stages (as used in VITON), the final output (\({I}_{{\text{fine}}-{\text{warp}}}\)) in our model is calculated from the original try-on cloth instead of the first-stage output. This prevents overcompensating for the changes needed during transformation, since the first-stage changes are already taken into account when calculating the transformation parameters for the second stage. To further facilitate this hierarchical behavior, we introduce residual connections in the network so as to offset the parameters of the fine transformation with those of the coarse transformation.
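
A simplified sketch of this two-stage, residual warping idea is shown below. For brevity it uses affine spatial transformers in place of the TPS-based transformers actually used, and the regressor layout and channel sizes are placeholders:

```python
# Coarse-to-fine warping with a residual connection between the two parameter
# regressors; the fine warp is applied to the *original* try-on cloth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpRegressor(nn.Module):
    """Predicts 2x3 affine parameters from concatenated cloth/person inputs."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 6))

    def forward(self, x):
        return self.net(x).view(-1, 2, 3)

def warp(cloth, theta):
    grid = F.affine_grid(theta, cloth.shape, align_corners=False)
    return F.grid_sample(cloth, grid, align_corners=False)

class CoarseToFineWarp(nn.Module):
    def __init__(self, in_ch):                    # in_ch = cloth + person channels
        super().__init__()
        self.coarse = WarpRegressor(in_ch)
        self.fine = WarpRegressor(in_ch + 3)      # also sees the coarse warp

    def forward(self, cloth, person_repr):
        theta_c = self.coarse(torch.cat([cloth, person_repr], dim=1))
        coarse_warp = warp(cloth, theta_c)
        # Residual connection: the fine stage predicts an offset on top of the
        # coarse parameters, and the result warps the original cloth image.
        delta = self.fine(torch.cat([cloth, person_repr, coarse_warp], dim=1))
        fine_warp = warp(cloth, theta_c + delta)
        return coarse_warp, fine_warp
```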

4.2.2 Perceptual geometric loss

The warp loss (\({L}_{{\text{warp}}}\)), defined below, is a combination of three components:

$${L}_{{\text{warp}}}={\lambda }_{1}*{L}_{s}^{0}+{\lambda }_{2}*{L}_{s}^{1}+{\lambda }_{3}*{L}_{pg}$$
(18)

where \({L}_{s}^{0}\) and \({L}_{s}^{1}\) are L1 losses for the coarse and fine warping, respectively. The main idea behind these is to bring the coarse and fine warping close to the shape of the segmented cloth of the try-on model.

$${L}_{s}^{0}=\left|{I}_{{\text{model}}-{\text{cloth}}}-{I}_{{\text{coarse}}-{\text{warp}}}\right|$$
(19)
$${L}_{s}^{1}=\left|{I}_{{\text{model}}-{\text{cloth}}}-{I}_{{\text{fine}}-{\text{warp}}}\right|$$
(20)
$${I}_{{\text{model}}-{\text{cloth}}}={I}_{{\text{model}}-segm}*{I}_{{\text{input}}-{\text{model}}}$$
(21)

where, \({I}_{{\text{model}}-{\text{cloth}}}\) represents the segmented cloth of the try-on model.

\({L}_{pg}\) refers to the perceptual geometric loss. It is a combination of two components, namely the push loss (\({L}_{{\text{push}}}\)) and the alignment loss (\({L}_{{\text{align}}}\)).

$${L}_{pg}={\lambda }_{4}*{L}_{{\text{push}}}+{\lambda }_{5}*{L}_{{\text{align}}}$$
(22)

The push loss is responsible for “pushing” the model towards better coarse-to-fine warping.

$${L}_{{\text{push}}}=k*{L}_{s}^{1}-\left|{I}_{{\text{fine}}-{\text{warp}}}-{I}_{{\text{coarse}}-{\text{warp}}}\right|$$
(23)

where k is a hyperparameter that controls the “push” given by this loss and ensures a stricter bound on the difference.

For the alignment loss (\({L}_{{\text{align}}}\)), we first use an ImageNet-pretrained VGG-19 model to obtain feature maps for \({I}_{{\text{coarse}}-{\text{warp}}}\), \({I}_{{\text{fine}}-{\text{warp}}}\), and \({I}_{{\text{model}}-{\text{cloth}}}\). Then, we attempt to align the differences between \({I}_{{\text{coarse}}-{\text{warp}}}\) and \({I}_{{\text{model}}-{\text{cloth}}}\) and between \({I}_{{\text{fine}}-{\text{warp}}}\) and \({I}_{{\text{model}}-{\text{cloth}}}\) in this feature space.

$${V}_{{\text{coarse}}}=VGG\left({I}_{{\text{coarse}}-{\text{warp}}}\right)-VGG\left({I}_{{\text{model}}-{\text{cloth}}}\right)$$
(24)
$${V}_{{\text{fine}}}=VGG\left({I}_{{\text{fine}}-{\text{warp}}}\right)-VGG\left({I}_{{\text{model}}-{\text{cloth}}}\right)$$
(25)
$${L}_{{\text{align}}}={\left(CosineSimilarity\left({V}_{{\text{coarse}}},{V}_{{\text{fine}}}\right)-1\right)}^{2}$$
(26)

Minimizing the alignment loss also facilitates the goal of minimizing the push loss.
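
A hedged sketch of the full warp loss (Eqs. 18–26) follows; `vgg` is assumed to return a feature tensor from a pretrained VGG-19, and the λ and k values default to those listed in Sect. 5:

```python
# Illustrative implementation of the proposed warp loss (Eqs. 18-26).
import torch
import torch.nn.functional as F

def proposed_warp_loss(vgg, coarse, fine, model_cloth,
                       lam1=1.0, lam2=1.0, lam3=1.0, lam4=0.5, lam5=0.5, k=3.0):
    Ls0 = F.l1_loss(model_cloth, coarse)                        # Eq. (19)
    Ls1 = F.l1_loss(model_cloth, fine)                          # Eq. (20)
    L_push = k * Ls1 - F.l1_loss(fine, coarse)                  # Eq. (23)
    v_coarse = vgg(coarse) - vgg(model_cloth)                   # Eq. (24)
    v_fine = vgg(fine) - vgg(model_cloth)                       # Eq. (25)
    cos = F.cosine_similarity(v_coarse.flatten(1), v_fine.flatten(1)).mean()
    L_align = (cos - 1.0) ** 2                                  # Eq. (26)
    L_pg = lam4 * L_push + lam5 * L_align                       # Eq. (22)
    return lam1 * Ls0 + lam2 * Ls1 + lam3 * L_pg                # Eq. (18)
```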

4.3 Refinement module

Once we have the warped try-on cloth image, the next step is to transfer this warped cloth onto the try-on model image. Compared to warping, texture transfer is a more complex task, as it requires attention to finer details, leading to a higher model complexity. We first considered a U-Net-like architecture because of its encoder–decoder structure and multiple layer-to-layer connections: the encoder–decoder structure helps combine the desired features from the try-on model and cloth into a feature space and decode them to achieve the desired garment transfer, while the layer-to-layer connections help keep a balance between the finer and coarser features transferred. However, in initial experiments, a simple 12-layer U-Net architecture proved insufficient for our use case. We therefore adopted the Res-UNet architecture (Ronneberger and Fischer 2015), which is built like a U-Net in combination with residual connections that preserve the details of the warped clothes and generate realistic try-on results. The output of this Res-UNet is the rendered image (\({I}_{{\text{rendered}}}\)), which is expected to contain the final cloth warp. We use the warped cloth generated in the first stage to create a segmentation mask for the try-on cloth (\({M}_{{\text{cloth}}}\)) and a skin mask (\({M}_{{\text{skin}}}\)), and then use the following formulas to generate the final try-on image (\({I}_{{\text{generated}}-{\text{try}}-{\text{on}}}\)).

$${I}_{{\text{generated}}-{\text{try}}-{\text{on}}}={M}_{{\text{cloth}}}*{I}_{{\text{rendered}}}+\left(1-{M}_{{\text{cloth}}}\right)*{I}_{{\text{input}}-{\text{model}}}$$
(27)
$${M}_{{\text{skin}}}=1-{M}_{{\text{cloth}}}$$
(28)

4.3.1 Perceptual path loss

The Texture Loss is defined below as a combination of three losses:

$${L}_{{\text{texture}}}={PPL}_{{\text{generated}}-{\text{try}}-{\text{on}}}+{PPL}_{{\text{rendered}}}+{PPL}_{{\text{skin}}}$$
(29)

where PPL refers to perceptual path loss.

$${PPL}_{{\text{generated}}-{\text{try}}-{\text{on}}}={\lambda }_{6}*\left|{I}_{{\text{generated}}-{\text{try}}-{\text{on}}}-{I}_{{\text{ground}}-{\text{truth}}}\right|+{\lambda }_{7}*VGG\left({I}_{{\text{generated}}-{\text{try}}-{\text{on}}},{I}_{{\text{ground}}-{\text{truth}}}\right)$$
(30)
$${PPL}_{{\text{rendered}}}={\lambda }_{8}*\left|{I}_{{\text{rendered}}}-{I}_{{\text{ground}}-{\text{truth}}}\right|+{\lambda }_{9}*VGG\left({I}_{{\text{rendered}}},{I}_{{\text{ground}}-{\text{truth}}}\right)$$
(31)
$${PPL}_{{\text{skin}}}={\lambda }_{10}*\left|\left({M}_{{\text{skin}}}*{I}_{{\text{rendered}}}\right)-\left({M}_{{\text{skin}}}*{I}_{{\text{ground}}-{\text{truth}}}\right)\right|+{\lambda }_{11}*VGG\left(\left({M}_{{\text{skin}}}*{I}_{{\text{rendered}}}\right),\left({M}_{{\text{skin}}}*{I}_{{\text{ground}}-{\text{truth}}}\right)\right)$$
(32)

Here, \({PPL}_{{\text{generated}}-{\text{try}}-{\text{on}}}\) is used to ensure that the final generated try-on image is as close to the ground truth image (\({I}_{{\text{ground}}-{\text{truth}}}\)) as possible.

However, we also need to keep in mind that the try-on model’s skin and other background details should remain intact. In addition, we need to maintain the shape and features of the warped cloth. To account for these two details, we introduce \({PPL}_{{\text{rendered}}}\) and \({PPL}_{{\text{skin}}}\). These are responsible for keeping the overfitting of the model in check and for guiding it down the correct learning path. Figure 1 depicts the architecture of our model.
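
A minimal sketch of the texture loss (Eqs. 29–32) is given below, assuming a `vgg_dist(a, b)` callable that returns a VGG-based perceptual distance; the λ values default to those listed in Sect. 5:

```python
# Illustrative implementation of the texture loss (Eqs. 29-32).
import torch.nn.functional as F

def ppl(pred, target, vgg_dist, lam_l1, lam_vgg):
    return lam_l1 * F.l1_loss(pred, target) + lam_vgg * vgg_dist(pred, target)

def texture_loss(rendered, generated_try_on, ground_truth, skin_mask, vgg_dist):
    L_gen = ppl(generated_try_on, ground_truth, vgg_dist, lam_l1=5, lam_vgg=1)    # Eq. (30)
    L_ren = ppl(rendered, ground_truth, vgg_dist, lam_l1=5, lam_vgg=1)            # Eq. (31)
    L_skin = ppl(skin_mask * rendered, skin_mask * ground_truth,
                 vgg_dist, lam_l1=30, lam_vgg=2)                                  # Eq. (32)
    return L_gen + L_ren + L_skin                                                 # Eq. (29)
```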

Fig. 1 Two-step architecture of our model

5 Implementation details

The warping stage of the model requires two-step training. First, the try-on cloth and model encoders are trained in an encoder–decoder architecture that uses the same image as input and output, with an L1 loss as the learning criterion. Next, we train the coarse-to-fine refining network by using the model’s own cloth image as the try-on cloth and leveraging the cloth-based segmentation mask (\({I}_{{\text{model}}-segm}\)) of the model to generate the warped cloth of the model, which is then used as the ground truth against which the output is compared. We use the warp loss (\({L}_{{\text{warp}}}\)) defined earlier to train this model.

Once the warping model is trained, we train the texture stage separately. Training the texture stage requires existing garment-transfer results. We decided to use VITON (Han et al. 2018) and SieveNet (Jandial et al. 2020) for this purpose. We used the texture loss (\({L}_{{\text{texture}}}\)) described earlier for training.

The hyperparameter configurations are as follows: batch size = 16, epochs = 15, optimizer = Adam (Kingma and Ba 2015), lr = 0.002, \({\lambda }_{1}\) = \({\lambda }_{2}\)= \({\lambda }_{3}\)= \({\lambda }_{7}\)= \({\lambda }_{9}\)= 1, \({\lambda }_{4}\)= \({\lambda }_{5}\)  = 0.5, \({\lambda }_{6}\)= \({\lambda }_{8}\)= 5, \({\lambda }_{10}\)= 30, \({\lambda }_{11}\)= 2 and k = 3.
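
As an illustration only (the model, dataset, and loss function are assumed to be defined elsewhere), the warping-stage training loop with the stated hyperparameters could look like this:

```python
# Illustrative training loop: Adam, lr = 0.002, batch size 16, 15 epochs.
import torch
from torch.utils.data import DataLoader

def train_warp_stage(model, dataset, loss_fn, device="cuda"):
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=0.002)
    model.to(device).train()
    for epoch in range(15):
        for cloth, person_repr, target_cloth in loader:
            cloth, person_repr, target_cloth = (
                t.to(device) for t in (cloth, person_repr, target_cloth))
            coarse, fine = model(cloth, person_repr)
            loss = loss_fn(coarse, fine, target_cloth)   # e.g., the warp loss above
            opt.zero_grad()
            loss.backward()
            opt.step()
```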

6 Results

For quantitative analysis, we use Inception Score (IS) (Salimans et al. 2016), Frechet Inception Distance (FID) (Heusel et al. 2017), and Structural Similarity (SSIM) (Wang et al. 2004). Except for IS, the metrics were computed against the ground truth to obtain an evaluation score. IS is an unpaired metric that tries to assess whether a given image looks real; the higher the IS, the better the result. FID is a distance metric, so the lower the distance, the better the result.
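
As a sketch of how these metrics can be computed (assuming a recent torchmetrics release; the batches are uint8 images of shape (N, 3, H, W)):

```python
# Illustrative evaluation with IS, FID, and SSIM via torchmetrics.
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

def evaluate(generated: torch.Tensor, real: torch.Tensor):
    ssim = StructuralSimilarityIndexMeasure(data_range=255.0)
    ssim_score = ssim(generated.float(), real.float())   # paired, against ground truth

    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real, real=True)
    fid.update(generated, real=False)
    fid_score = fid.compute()                            # lower is better

    inception = InceptionScore()
    inception.update(generated)                          # unpaired metric
    is_mean, is_std = inception.compute()                # higher is better
    return is_mean, fid_score, ssim_score
```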

According to the calculated results (Table 1), our model performed best on the SSIM metric, indicating that it retains the facial features along with the general body structure and posture of people. This can be visually observed from the zoomed-in images shown in Fig. 2. In Fig. 2a, the model images are shown; in Fig. 2b, the model images after try-on are shown. In Fig. 2c, the first three images show that facial artifacts remain intact in all three major scenarios: the first scenario, when the facial features overlap with the warped cloth; the second scenario, where the features are in the background; and the third scenario, when there are no outlying features, in which our model does not generate new artifacts. The fourth image shows how the arms are assimilated with the warped cloth. The fifth image shows how the lower body fits well with the warped cloth. A careful reading of the “Failure cases” given in Sect. 5.2 of SieveNet (Jandial et al. 2020) and of the “Limitations” included in Sect. 4.2.1 of SwapNet (Raj et al. 2018) reveals that dealing with large changes in pose and handling artifacts are two problems associated with SieveNet and SwapNet. It is clear from the sub-sections “Person representation analysis” and “Failure cases” of Sect. 4.2.2, "Qualitative Results," of VITON (Han et al. 2018) that VITON too has some drawbacks with respect to handling complicated poses and artifacts. Finally, the weaker FID score can be attributed to differences in garments, as new garment-based features can increase FID.

Table 1 Comparison of IS, FID, and SSIM Scores
Fig. 2 Example images showing that the proposed model retains the facial features and artifacts along with the general body structure and posture of people; a zoomed-in versions of the images before virtual try-on; b zoomed-in versions after virtual try-on, where we can visually see that facial artifacts remain intact, as also shown in (c); c no new artifacts are observed when there are no outlying features

As for qualitative results, Fig. 3 shows a few examples of the results delivered by our network. Our model is extremely lightweight and flexible. It takes up ~ 1.5 GB of GPU memory, which is significantly less than any of the comparative models, which take up from five to a couple of dozen GB of memory. The training time for the model is roughly 4 h, which is also significantly shorter than that of any other model. The model takes ~ 0.15 s to evaluate an image, which again is much faster than these models, which take 5 to 15 s per evaluation. All these tests were performed on a Tesla K80 GPU in a standard Google Colaboratory environment. Upon comparing it to a similar parser-free model, PF-AFN (Ge et al. 2021), we find that our model is much simpler in terms of complexity. Also, it is not dependent on a specific model for training purposes. Further, it is faster in terms of training and testing time because of its lower complexity.

Fig. 3 Garment transfer with various clothing and postures

Our model is extremely suitable for industrial applications and commercial use, given its lightweight nature and fast training, supplemented by its testing speed. The model can easily be re-trained for garment types other than upper-body clothes, given an appropriate dataset. Further, our model uses only the image of the model and the cloth and requires no extra parsing. Both VITON and SieveNet require parsing for segmentation and body pose representations from specific frameworks to function properly. SwapNet does not require segmentation, but it still requires body pose encodings. Our model, in contrast, is self-reliant.

7 Conclusion

In this article, we propose a parser-free virtual try-on model, which is essentially a fully automated end-to-end image-based virtual try-on model for upper-body clothing. We have improved upon the previous models in terms of Structural Similarity, and have also used a refinement module that enables us to remove unwanted noise generated at times by the GAN. Experiments on the VITON dataset also demonstrate that the model handles people wearing various clothing, such as long-sleeved, short-sleeved, and sleeveless garments, as well as people in various postures, and can still perform garment transfer at an acceptable level. We also kept the model parser-free, allowing users to use the model on the go after deployment. Such an advantage over other models, which depend on human pose parsing algorithms, can provide a much wider scope of application, ranging from fashion research to general public shopping usage.