ImprovedSSGAN: Ranking Discriminator With Semi-Supervised GAN for Ordinal Information

We propose Improved SSGAN, a multi-generator/multi-discriminator semi-supervised GAN architecture that addresses the well-known problem of mode collapse while also improving classification of ordinal information. To reduce the vulnerability of the generator to a relatively superior discriminator, the semi-supervised GAN was introduced to make the discriminator's job harder. However, that architecture does not solve the collapse problem, where the generator gets stuck generating only specific modes of the data. In this work, N−1 rank discriminators with two-dimensional outputs are proposed for ordinal information by applying rank estimation techniques. The first dimension of each discriminator predicts binary rank information, which is aggregated to make the final prediction. The second dimension of each discriminator is used independently to train one or more generators, so that a collapse in any one discriminator is compensated by the others. We also extend the architecture to conditioned generators, where the output of one generator is fed into another, which improves image quality. Weight sharing among the discriminators further yields faster convergence during training. Through extensive experiments on facial age data, we demonstrate that Improved SSGAN outperforms the semi-supervised GAN in both image generation quality and age estimation.


I. INTRODUCTION
Generative Adversarial Networks (GANs) [4] are a powerful class of generative models based on game theory. A GAN sets up an adversarial game between a discriminator and a generator network. The generator is trained to produce synthetic data that appears to be sampled from the real distribution given some noise source, and the discriminator is trained to distinguish the generator's output from real images. Since their first introduction, GANs have yielded dramatic performance gains and achieved great success in generating realistic, sharp-looking images. Furthermore, GANs have proven useful in various applications including style transfer [24], super-resolution [25], music generation [27], natural language generation [26], and medical image generation [28].

The associate editor coordinating the review of this manuscript and approving it for publication was Krishna Kant Singh.
Despite their success, GANs are difficult to train. On the one hand, a relatively superior discriminator might quickly learn to discriminate between real and fake, and consequently there will be no signal left to train the generator. On the other hand, the generator might learn only some specific modes to fool the discriminator.
To ease these training difficulties, Salimans et al. [5] proposed the semi-supervised GAN (SSGAN), which extends the discriminator to multi-class classification. The discriminator is simultaneously trained with supervised learning, mapping real data to one of its classes, and with unsupervised learning, discriminating between generated and real images. This makes the task of the discriminator tougher, which improves the balance between the discriminator and generator. In addition, SSGAN improves classification performance over a baseline classifier with no generative components [5], [18], [20]. However, such an approach still suffers from collapse, where the generator learns only specific modes to fool the single discriminator. Furthermore, the challenge of balancing the supervised multi-class classification, which solves a high-dimensional problem, against the unsupervised binary classification limits its performance when used for estimation. To tackle these problems, we propose an Improved Semi-supervised GAN for ordinal information by applying rank estimation techniques [1], [12]. Specifically, we propose multiple discriminators with two-dimensional outputs, each trained in a semi-supervised manner. Ordinal information such as age [1] and medical imagery [29] can be estimated more accurately by using a set of basic binary classifiers instead of a single multi-class mapping. In our proposed architecture, both the supervised and unsupervised tasks of each discriminator are binary classifications, which improves the balance and results in better estimation performance than multi-class age estimation [1].

VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Since we use multiple discriminators, collapse caused by either the generator or the discriminator is also mitigated, which enables the proposed model to be trained longer and to improve image quality. Different aggregation functions over the discriminators, ranging from a soft critic to a hard critic, can be used to train the generator. We also extend the model to multiple generators, where the output of one generator is fed into the next, resulting in improved image quality. A weight-sharing variant of the proposed model also shows faster convergence and improved estimation.
The rest of the paper is organized as follows. Section II concisely reviews related work on improving GAN training. Section III briefly describes GAN, SSGAN, and rank estimation techniques. Section IV presents the proposed architecture, and Section V its variants. Section VI presents the implementation details and the evaluation of both estimation and generation. Finally, Section VII presents the conclusion.

II. RELATED WORK
Since the first introduction of GAN [4], several variations have been proposed to address mode collapse. Deep Convolutional GAN (DCGAN) [6] introduced a convolutional architecture to improve training and performance. EBGAN [9] proposed an energy-based model that views the discriminator as an energy function, assigning lower energy values to regions of high data density and higher energy values outside these regions. Nowozin et al. [10] showed that the GAN objective is a special case of a more general variational divergence estimation, and that any f-divergence can be used to train a GAN model. Accordingly, LSGAN [3] adopts a least-squares loss function for the discriminator. While those early variations of GAN lacked convergence measures, WGAN [7] provided a theoretical analysis of the convergence properties of the objective functions used to optimize GANs and proposed the Wasserstein distance, which is continuous and differentiable. Gulrajani et al. [8] proposed a gradient penalty instead of weight clipping to enforce a Lipschitz constraint, improving image quality and convergence; enforcing Lipschitz constraints remains an active research area. BEGAN [11] applied the Wasserstein distance to the EBGAN architecture. Our proposed work is orthogonal to this line of research: any objective function that follows a convergence path similar to cross-entropy can be used.
On the other hand, several previous works have incorporated semi-supervised learning into GANs to improve training and performance. References [18] and [5] were the first works to force the discriminator to output a class label and to train it with both supervised and unsupervised learning while keeping the original GAN objective function. Reference [19] proposed the categorical generative adversarial network (CatGAN), which substitutes the binary discriminator with a multi-class classifier and trains the model using unlabeled data. Reference [20] showed theoretically why semi-supervised GANs perform better. Our work extends such semi-supervised models to multiple discriminators with binary classification for ordinal data.
With respect to architectural modifications, Ghosh et al. [13] proposed multiple generators with a single discriminator, called MAD-GAN. Its discriminator identifies which generator produced a fake image in addition to classifying the image as real or fake; unlike our work, it uses a single discriminator and relies only on unsupervised learning. Unlike the original GAN, which uses the JS divergence (a combination of forward and reverse KL divergence), D2GAN [14] proposes two discriminators, trained with the reverse and forward KL divergence respectively. Even with two discriminators, both the reverse and forward KL divergences are more susceptible to gradient problems than the JS divergence [7]. Sun et al. [16] used the concept of ranking for image synthesis, whereas our objective is age estimation together with image generation. GMAN [15] proposed multiple discriminators against a single generator. Unlike GMAN, the proposed work uses semi-supervised learning, where each discriminator has a distinct objective. Moreover, our architecture is extended to multiple generators conditioned on one another.
In general, to the best of our knowledge, this work is the first one to propose ordinal ranking on semi-supervised GAN for image generation and estimation.

III. BACKGROUND

A. GAN AND SEMI-SUPERVISED GAN
The GAN framework is composed of a generator and a discriminator model. Given a data space, the generator maps a latent vector from a known distribution into the data space. The discriminator accepts a fake sample from the generator or a true sample from the data space and tries to distinguish real from fake. Formally, let x be a true sample from the data distribution p_data, and let z be a latent vector in R^d sampled from a noise distribution p(z). Let G(z; θ_G) represent the generator model, where G is a differentiable function represented by a multi-layer perceptron with parameters θ_G. Similarly, let a multi-layer perceptron D(x; θ_D) represent the discriminator model, with a scalar output representing the probability that x comes from p_data rather than from the generator. The two models play adversaries to each other in a two-player min-max game optimizing the following objective function:

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]

Goodfellow et al. [4] show that, given enough capacity for G and D, this training criterion allows the generator's distribution to converge to the real data distribution p_data. However, the min-max objective requires finding a Nash equilibrium of what may be a non-convex function with continuous, high-dimensional parameters. Since GANs are typically trained using gradient descent techniques, training may fail to converge while searching for the Nash equilibrium [2], [5].

Now consider a standard N-class classifier model that accepts a data point x and maps it to one of N possible outputs. Such a model outputs the class probability for each class, p_model(y|x), 1 ≤ y ≤ N. In supervised learning, the model is trained by minimizing the cross-entropy between the real label and the predictive distribution, i.e., maximizing E_{x,y∼p_data}[log p_model(y|x)]. Any such standard classifier can be improved with semi-supervised learning by simply adding new unlabeled data generated by the GAN generator [5].
This unlabeled data can be treated as a new class y = N + 1 and used for unsupervised learning. In the supervised case, the discriminator behaves like the standard classifier, D_s(x) = p_model(y|x), 1 ≤ y ≤ N. In the unsupervised case, p_model(y = N + 1|x) corresponds to the probability that x is fake, and 1 − p_model(y = N + 1|x) corresponds to the probability that x is real, playing the role of D(x) in the original GAN objective. The objective of the semi-supervised GAN is then:

L = L_supervised + λ_S L_unsupervised,
L_supervised = −E_{x,y∼p_data}[log p_model(y|x, y < N + 1)],
L_unsupervised = −(E_{x∼p_data}[log(1 − p_model(y = N + 1|x))] + E_{z∼p(z)}[log p_model(y = N + 1|G(z))]).

The hyperparameter λ_S is added to balance the unsupervised and supervised losses.
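As a concrete illustration, the combined loss above can be sketched numerically for a toy batch (a minimal sketch; `ssgan_losses` and the toy logits interface are our own, not the paper's implementation):

```python
import numpy as np

def ssgan_losses(logits_real, labels, logits_fake, lam=1.0):
    """Toy SSGAN discriminator loss with N real classes plus a fake
    class y = N + 1 (last column of the logits).

    logits_real: (batch, N+1) scores for real images; labels: their
    true class indices in 0..N-1. logits_fake: scores for generated
    images. lam plays the role of the hyperparameter lambda_S.
    """
    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    p_real = softmax(logits_real)
    p_fake = softmax(logits_fake)
    # Supervised part: cross-entropy over the N real classes.
    sup = -np.mean(np.log(p_real[np.arange(len(labels)), labels]))
    # Unsupervised part: real samples should avoid the fake class,
    # generated samples should be assigned to it.
    unsup = (-np.mean(np.log(1.0 - p_real[:, -1]))
             - np.mean(np.log(p_fake[:, -1])))
    return sup + lam * unsup
```

With confident, correct logits the loss is near zero; swapping the real and fake assignments makes it large, which is the balance λ_S mediates.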

B. RANK ESTIMATION
Based on the concept of ''learning to rank'' from information retrieval, several studies have applied ranking to ordinal age information [1], [12], where relative order provides more stable information. Standard multi-class classification assumes the class labels are uncorrelated, whereas ordinal data such as age exhibit strong relationships among labels. Intuitively, it is easier to predict whether a person is older or younger than a given age than to estimate the exact age. Moreover, for a person aged 20, a predicted label of 19 or 21 is more plausible than 10 or 30. This kind of relationship is not reflected in multi-class classification.
In ranking-based classification techniques, the model consists of multiple sets of basic classification models with binary output. This binary output is then aggregated to make the final decision. For instance, if there are 5 age groups in order, 4 binary models are needed to make the prediction. The first binary model tries to predict 0 if the sample image is from age group 1, and 1 otherwise. Similarly, the second binary model predicts 0 if the image is from group 1 or 2, and 1 otherwise. The final prediction is calculated by aggregating all the binary outputs.
More formally, let X = {(x_1, y_1), ..., (x_n, y_n)} denote a dataset of labeled images. Let C_j denote the j-th binary classifier model, which takes an image x_i and outputs 0 if the label y_i is less than or equal to j and 1 if the label is greater than j. If there are K groups, there are K − 1 binary classifiers. The final prediction is then inferred by aggregating the binary outputs as follows:

ŷ(x) = 1 + Σ_{j=1}^{K−1} C_j(x)

Chen et al. [1] show mathematically that the total ranking error is bounded by the maximum of the binary ranking errors of the individual models. This means optimizing each model improves the overall ranking estimation.
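The aggregation step can be sketched in a few lines (a minimal sketch; the function name and the 0.5 threshold for soft scores are our own choices):

```python
def rank_aggregate(binary_outputs):
    """Aggregate K-1 binary rank decisions into a group label in 1..K.

    binary_outputs[j] is classifier C_{j+1}'s output: 1 (or a score
    >= 0.5) when the predicted label is greater than group j+1, else 0.
    The predicted group is 1 plus the number of positive decisions.
    """
    return 1 + sum(1 for o in binary_outputs if o >= 0.5)
```

For the 5-group example above, outputs [1, 1, 0, 0] from the 4 binary models aggregate to group 3.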

IV. IMPROVED SEMI-SUPERVISED GAN
Rank estimation can be further optimized by improving each binary classifier with unsupervised training, adding one more task to each binary classifier. Consider such a classifier as one of our semi-supervised discriminators: given an image x, it tries to predict the ranking label and also whether x is real or fake. Since the unsupervised and supervised tasks are trained simultaneously within a single discriminator, they help each other improve the shared feature extraction, which improves the performance of the binary classifier and thus the overall classification. Moreover, since each discriminator is trained independently against the generator, a mode collapse in one discriminator is compensated by the others.
Once again, consider K groups; correspondingly we have K − 1 discriminators, each with two outputs. Let D[i] represent the i-th discriminator model and let D_BC[i] denote its first output. From the perspective of this output, the input data is divided in two: X+ with label y ≤ i and X− with label y > i. The objective of this classifier is therefore to maximize E_{x,y∼p_data}[log p_model(r_i(y)|x)], where r_i(y) ∈ {0, 1} indicates whether y > i. Note that all of this data comes from the real dataset. In the unsupervised case, the second output of the discriminator is used to judge whether the data is fake; adversarial to this, the generator tries to fool the discriminator. If we denote the second output by D_A[i], the discriminator tries to maximize E_{x∼p_data}[log D_A[i](x)] + E_{z∼p(z)}[log(1 − D_A[i](G(z)))], whereas the generator tries to minimize E_{z∼p(z)}[log(1 − D_A[i](G(z)))]. The total objective function for discriminator i is then:

min_G max_{D[i]} V_i(D[i], G) = λ_BC^i E_{x,y∼p_data}[log p_model(r_i(y)|x)] + E_{x∼p_data}[log D_A[i](x)] + E_{z∼p(z)}[log(1 − D_A[i](G(z)))]

The parameter λ_BC^i is included to balance the training of the adversarial and binary-classification tasks; jointly minimizing the two losses is expected to yield better estimation.
The proposed architecture is depicted in figure 1(a). The aggregation is used to compute estimation accuracy only at test time; each discriminator is therefore optimized independently and can be trained in parallel. Using multiple discriminators mitigates mode collapse: if a single discriminator learns too fast, the generator still receives feedback from the other discriminators, which brings the first one back into the game. The generator also has to learn to fool all the discriminators with the same set of images, which is unlikely. Algorithm 1 shows the basic training and testing procedure for Improved SSGAN. In the presented algorithm, the generator is trained after each discriminator; other schedules are also possible, where the generator is trained after two or more discriminators have been updated. The stopping criterion can be selected according to the user's needs: estimation accuracy, Inception Score [5], or FID [30] can be used.
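The alternating schedule just described can be sketched as follows (a hypothetical callback interface, not the paper's code; `update_discriminator` and `update_generator` stand in for the actual gradient steps on the losses above):

```python
def train_improved_ssgan(num_discriminators, num_steps,
                         update_discriminator, update_generator):
    """Sketch of the Algorithm 1 schedule: the generator takes one
    step after each discriminator update.

    update_discriminator(i) would apply the supervised rank loss and
    the unsupervised real/fake loss for D[i]; update_generator(i)
    would take a generator step against D[i]'s adversarial feedback.
    """
    for _ in range(num_steps):
        for i in range(num_discriminators):
            update_discriminator(i)
            update_generator(i)
```

Variants where the generator steps only after every second or third discriminator amount to moving the `update_generator` call out of the inner loop.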

V. VARIANT OF IMPROVED SSGAN
We have explored several variants of Improved SSGAN that improve prediction performance, training time, and the quality of image generation.

A. WEIGHT SHARING
Instead of using fully separate discriminators, we applied weight sharing, where the discriminators share the convolution layers as shown in figure 1(b). In the implementation, we used two separate methods: Rank and Ordinal. The first simply uses the weight-sharing architecture shown in figure 1(b). The second uses the techniques proposed in [23] with the consistency update of [33], using two output neurons with a softmax instead of a sigmoid for the cost function. Both methods improved the training time and the semi-supervised age estimation.

B. FROM SOFT CRITIC TO HARD CRITIC
From the generator's perspective, we explored different ways to combine the feedback from the different discriminators, as shown in figure 2(a). The possibilities range from a soft critic to a hard critic. The generator objective can be formulated as min_G f(λ_1 V(D_1, G), ..., λ_{K−1} V(D_{K−1}, G)), where λ_i is the weight for each discriminator's feedback and f is an aggregation function. With f := min, the generator is trained against the discriminator giving the lowest feedback, which criticizes the generator softly. With f := max, the generator receives hard criticism from the best discriminator with the highest feedback. In practice, training the generator against a very soft or a very hard critic can impede learning: with such feedback the generator is unlikely to compete with the discriminator to produce a good image, and so it receives uniformly negative feedback. The best approach is to balance the two extremes, for instance by ensembling the discriminators. With f := mean, the generator is trained against the aggregated feedback from all the discriminators; here we applied an equal weight λ_i to each discriminator. The weights offer further design freedom to push toward a harder or softer critic.
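The three aggregation choices can be sketched as follows (the function name and weight handling are our own illustrative choices):

```python
def aggregate_feedback(scores, f="mean", weights=None):
    """Combine per-discriminator feedback values V(D_i, G) into one
    training signal for the generator; f selects the critic style.
    """
    if weights is None:
        weights = [1.0] * len(scores)  # equal lambda_i, as in the paper
    weighted = [w * s for w, s in zip(weights, scores)]
    if f == "min":   # softest critic: the weakest discriminator
        return min(weighted)
    if f == "max":   # hardest critic: the strongest discriminator
        return max(weighted)
    if f == "mean":  # balanced ensemble over all critics
        return sum(weighted) / len(weighted)
    raise ValueError(f"unknown aggregation: {f}")
```

Non-uniform `weights` interpolate between these extremes, which is the extra design freedom mentioned above.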

C. CASCADED IMPROVED SSGAN
We have also proposed a multiple-generator, multiple-discriminator architecture for semi-supervised learning, called Cascaded Improved SSGAN, shown in figure 2(b). The task of the discriminators is the same as in the original Improved SSGAN. The first generator takes a latent vector as input and generates an image; this image is fed both to the first discriminator and to the second generator. The second generator takes the output of the first generator as input instead of a latent vector, and the remaining generators are trained in the same way. Except for the first generator, we adopted the U-NET [17] architecture for all generators, since they accept images from the previous generator. We found that the U-NET architecture helps stabilize training and improves the generated images.
To apply the aggregated effect of the discriminators, similar to what is described in section V-B, we propose the architecture shown in figure 2(c), which combines the cascaded version with the aggregation functions.
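The cascade described above amounts to simple function composition, which can be sketched as follows (hypothetical `generators` callables standing in for the trained networks):

```python
def cascaded_generate(generators, z):
    """Chain generators: the first maps a latent vector z to an image,
    and each subsequent (U-NET-style) generator refines the previous
    output. Returns every intermediate image so each stage can be fed
    to its discriminator.
    """
    outputs = [generators[0](z)]       # latent vector -> first image
    for g in generators[1:]:
        outputs.append(g(outputs[-1])) # image -> refined image
    return outputs
```

Returning all intermediate outputs mirrors figure 2(b), where every generator's image is also passed to a discriminator.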

VI. EXPERIMENTS

A. IMAGE DATABASES
We used the MORPH [21] dataset to evaluate the proposed architecture. It is one of the largest and most commonly used datasets for age estimation [1], [12], [22], [23]. It contains 55,107 images of more than 13,000 individuals, with ages ranging from 16 to 77. Table 1 shows the dataset divided into 5 age groups. Of the total images, 80% are used for training and the rest for testing.

B. IMPLEMENTATION DETAILS
The implementation details for the proposed Improved SSGAN are shown in Listing 1. Both the discriminator and the generator are optimized using the Adam optimizer with β1 = 0.5, β2 = 0.999, and a learning rate of 2e−4. The input image size is 128 × 128. In the cascaded Improved SSGAN, the first generator has the same architecture as the generator used in Improved SSGAN. The remaining generators have 4 convolutions with kernel size 4, stride 2, and padding 1 in the contracting path; in the expansive path, 4 transposed convolutions with kernel size 4, stride 2, and padding 1 are used. Each convolution and transposed convolution is followed by ReLU and Batch Normalization.
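As a sanity check on these layer settings, standard convolution arithmetic shows why kernel size 4, stride 2, padding 1 exactly halves the spatial size (and the transposed version doubles it), taking a 128 × 128 input down to 8 × 8 through the 4-layer contracting path (helper names are ours):

```python
def conv_out_size(size, kernel=4, stride=2, padding=1):
    # Standard convolution output size: floor((n + 2p - k) / s) + 1.
    return (size + 2 * padding - kernel) // stride + 1

def transpose_conv_out_size(size, kernel=4, stride=2, padding=1):
    # Transposed convolution inverts the mapping above:
    # (n - 1) * s - 2p + k.
    return (size - 1) * stride - 2 * padding + kernel
```

With these settings, the contracting path maps 128 → 64 → 32 → 16 → 8, and the expansive path maps 8 back up to 128, so the cascaded generators preserve the 128 × 128 image size.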

C. EVALUATION
To compare the performance of the proposed model in terms of age estimation, we implemented a plain CNN and a Rank CNN as described in [1] and [23]. We did not apply any preprocessing to the image data for either the baseline or the proposed model. We used the mean absolute error as the evaluation metric for age estimation:

MAE = (1/n) Σ_{i=1}^{n} |l_i − l̂_i|,

where l_i is the ground-truth age for image i and l̂_i is the corresponding estimated age. The variant based on [23] and [33] exhibits better performance than the basic Improved SSGAN as the number of labeled images used in training increases.
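The MAE metric can be computed directly (a minimal sketch with our own function name):

```python
def mean_absolute_error(true_ages, estimated_ages):
    """MAE = (1/n) * sum_i |l_i - l_hat_i| over paired age labels."""
    assert len(true_ages) == len(estimated_ages)
    n = len(true_ages)
    return sum(abs(t, ) if False else abs(t - e)
               for t, e in zip(true_ages, estimated_ages)) / n
```

Lower values indicate better age estimation.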
To measure image generation quantitatively, we used the Generative Adversarial Metric (GAM) [31], the Inception Score (IS) [5], and FID [30]. The Inception Score uses an Inception V3 network pre-trained on ImageNet and computes statistics of the network's outputs on generated images. Figure 3(a) shows the IS for SSGAN and Improved SSGAN; higher is better. Although IS correlates with human judgment on datasets like CIFAR-10, which contain many different kinds of images, it has limitations on data containing only face images. FID measures the similarity between two image datasets by computing the Fréchet distance. In each epoch, we generated 50K images to compare with the real dataset. Figure 3(b) shows the FID scores for SSGAN_DCGAN and Improved_SSGAN_DCGAN; lower is better. Like IS, FID is also indirect and relies heavily on the choice of classifier. Therefore, we used GAM to further support the generation performance.
GAM [31] computes the relative performance of two models by judging each generator under the opponent's discriminator. As shown in figure 5, during training each discriminator competes with its own generator, but during testing it competes with the opponent's generator. Consider two generative adversarial models, M_A = (D_A, G_A) and M_B = (D_B, G_B), each containing an independent generator and discriminator. GAM calculates the ratios of classification error rates ε(·), using the predefined test set to get r_test and the generated images to get r_sample:

r_test = ε(D_A(x_test)) / ε(D_B(x_test)), r_sample = ε(D_A(G_B(z))) / ε(D_B(G_A(z))).
From the GAM calculation, r_test tells which model generalizes better, since it is based on test data, and r_sample tells which model can fool the other more easily, since the discriminators classify samples generated by their opponents (the sample ratio determines which model is better at generating good, data-like samples). r_test ≈ 1 ensures that neither discriminator is more overfitted than the other. The battle then favors model A when r_sample < 1 and r_test ≈ 1, favors model B when r_sample > 1 and r_test ≈ 1, and is otherwise inconclusive. Table 3 shows the GAM results for the architecture proposed in figure 2(a), using the three functions f := mean, f := max, and f := min. Our architecture consistently performs better when using the mean function; using min or max may criticize the generator too softly or too harshly, preventing it from receiving proper feedback. Figure 4(a) shows the detailed results of the experiment: in all iterations, Improved SSGAN performs better than SSGAN and DCGAN.
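Our reading of the GAM battle rule [31] can be sketched as follows (the function name, argument order, and the 0.1 tolerance on r_test are our own illustrative choices):

```python
def gam_battle(err_a_test, err_b_test, err_a_sample, err_b_sample,
               tol=0.1):
    """Sketch of the GAM battle decision.

    err_*_test: each discriminator's error rate on held-out real data;
    err_*_sample: each discriminator's error rate on the opponent
    generator's samples.
    """
    r_test = err_a_test / err_b_test
    r_sample = err_a_sample / err_b_sample
    if abs(r_test - 1.0) > tol:
        return "inconclusive"  # one discriminator is more overfitted
    if r_sample < 1.0:
        # D_A catches G_B's fakes more often than D_B catches G_A's,
        # i.e. G_A fools the opponent more easily: model A wins.
        return "A"
    if r_sample > 1.0:
        return "B"
    return "tie"
```

The r_test gate rejects comparisons where the sample ratio would merely reflect discriminator overfitting rather than generator quality.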
Evaluation results for the architecture proposed in figure 2(c) are presented in figure 6. The two generators (networks) can use the available aggregation functions independently; we evaluated the architecture with the different combinations mean, min, max, min−max, and max−min. The experiments demonstrated that the conditional generator (network 1) is better than the unconditional generator (network 0) and the baseline DCGAN.
Due to space limitations, we cannot present sample generations from every model variant. Figure 7 shows sample images generated by each generator in the cascaded architecture of figure 2(b); visually, image quality improves from the first generator to the last.

VII. CONCLUSION
An Improved SSGAN is proposed to estimate and generate ordinal data using the rank estimation method. To overcome mode collapse during GAN training, the proposed architecture uses multiple discriminators with a common goal. To strengthen the generator, we also proposed a cascaded GAN in which multiple generators are connected so that the output of one generator conditions the next. We further proposed a weight-sharing discriminator, which shows improved early convergence and estimation. We have observed both visual and quantitative improvements over the baseline semi-supervised GAN on face generation and age estimation.
As future work, the proposed architecture can be extended to other objective functions such as LSGAN [3] and WGAN [7]. It is also possible to extend the proposed architecture to multi-label [32] non-ordinal datasets.