PTcomp: Post-Training Compression Technique for Generative Adversarial Networks

In the era of virtual spaces, the use of generative adversarial networks is inevitable. Generative adversarial networks (GANs) are generative deep-learning models that can generate realistic data. GANs have been used in many applications such as text-to-image translation, image-to-image translation, image synthesis, and speech synthesis. Their power lies in the diversity and novelty of the generated data. Despite their advantages, GANs are resource-hungry. GANs' high output resolution and high correlation make them more challenging to compress and to fit within the storage and power budget of edge devices. Hence, traditional compression techniques are not the best fit for GANs. Additionally, GAN training instability adds another dimension of difficulty; therefore, compression techniques that require retraining are challenging to apply to GANs. In this paper, we develop a weight-clustering technique that compresses GANs without the need for retraining, hence the name post-training compression technique (PTcomp). We also propose a cluster-based pruning method that adds further savings. Experiments on Cyclegan, deep convolution GAN (DCGAN), and Stargan using several datasets show the superiority of our technique over traditional post-training quantization. Our technique provides a 4x to 8x compression ratio with quality comparable to the original models and 14% fewer MAC operations due to pruning.


I. INTRODUCTION
Generative Adversarial Network (GAN) [1] is one of the most prominent generative deep-learning models. It is used to generate data that mimics real-world data. Its power lies in the diversity and novelty of the generated data. The main challenges of such models are their memory and power consumption. These make their application to low-power devices a big challenge.
GANs have been used in many applications, such as text-to-image translation, where an input text is turned into an image [2], [3]. Another application is image-to-image translation, which translates images from one domain to another, such as translating preliminary sketches into full designs [4]. Other commonly used applications are image super-resolution and image deblurring [5], [6]. Additionally, GANs are used in music generation [7], video synthesis [8], [9], etc.
Although GANs have many applications, their deployment is challenging. The challenge arises from their large storage and power requirements, which hinder their usage. Hence, we need to optimize GANs to run on low-power devices. In other words, our goal is to use GANs with minimum power and memory requirements while maintaining the quality of the results.
GAN challenges can be summarized as follows: (1) the demand for high quality and resolution of the generated output; (2) the instability of training, which makes even fine-tuning a challenging task; and (3) the absence of a concrete measure of success. Although many measures exist, none of them captures the correctness or quality of the image with certainty; rather, they give a good indication. This is due to the diversity of applications that GANs solve. Additionally, such measures depend on the difference between features extracted from the generated data and from sample data using some good discriminator, so they are bounded by both the quality of that ''good'' discriminator and the diversity of the sample data. Recent works have studied compressing GANs by reconstructing a small generator from a teacher generator; those techniques use either pruning or network architecture search (NAS) to reconstruct the small generator. These methods rely heavily on training or retraining the models, and hence they face the instability problem.
On the other hand, quantizing the network is the simplest and most straightforward approach. Several papers studied quantizing networks, such as QGAN [10], mobile GANs [11], GAN Slimming [12], etc. All of those works share one common step, which is retraining. As a consequence, this step makes them face the training-instability problem.
In contrast, our technique does not require retraining, because it optimizes each layer as a stand-alone layer and uses the clustering algorithm to minimize error instead of GAN metrics, avoiding both training instability and the absence of concrete GAN measures. To make it more usable and attractive, we also provide a parallelized version of the algorithm.
Our contributions can be summarized as follows:
• Devising an algorithm to compress GANs without the need for retraining.
• Scoring a 4∼8x compression ratio using clustering without degradation of visual quality or a need for retraining.
• Devising a new methodology to combine traditional pruning techniques with our compression methods.
• Devising a new pruning methodology that uses the compression method output as a hint for pruning.

II. BACKGROUND
Basic GAN models are built from two families of networks: generators (G) and discriminators (D), organized as in Fig. 1. The generator network(s) is responsible for generating fake data that should mimic real data, while the discriminator network(s) is responsible for validating whether its input is fake or real. GANs have many structures; we classify them into two main families: (a) the synthesis family and (b) the style-transfer family. The synthesis family consists of networks that take noise as input and generate near-realistic data. The style-transfer family, on the other hand, consists of networks that take real data and transfer it to another data domain. For example, face generation is a synthesis task, while changing a 2D image into a 3D image is a style-transfer task. Common generators for the synthesis and style-transfer families are shown in Fig. 2.
In GANs, the generator's main task is to generate data that cannot be differentiated from real data, while the discriminator's main task is to discriminate between real and fake data. Thus, the discriminator competes against the generator, hence the name generative ''adversarial'' networks. A GAN is trained by training its discriminator and generator alternately. Since the generator generates new data, it has no ground truth; hence, we use the discriminator to train the generator.
In training, the generator takes an input (z). Depending on the application, this input could be noise or data from a certain domain. The generator then generates data G(z). The generated fake data G(z), alongside the real data x, are provided as inputs to the discriminator network, and the discriminator produces a decision D(x) or D(G(z)) depending on its input, as seen in Fig. 1. The GAN loss is formulated as in eq. 1:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$

where p_data(x) represents the original data distribution and p_z(z) represents the input distribution. The first part of the equation, E_{x∼p_data(x)}[log D(x)], represents the probability (expectation) that the discriminator classifies real data as real, while the second part, E_{z∼p_z(z)}[log(1 − D(G(z)))], represents the ability of the discriminator to classify fake data as fake. The discriminator tries to maximize both parts of the equation, which is why we need max_D V(D, G). On the other hand, the generator tries to fool the discriminator into classifying fake images as real, so it tries to minimize the second part of the equation, min_G V(D, G). Combining the two network optimizations, we get eq. 1 [1].
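To make the objective concrete, the following is a minimal NumPy sketch of a Monte-Carlo estimate of V(D, G) from eq. 1; the function and variable names are illustrative, and in practice the two networks are optimized alternately on this value.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte-Carlo estimate of V(D, G): D ascends this value, G descends its second term.

    d_real: discriminator outputs D(x) on a batch of real samples, in (0, 1).
    d_fake: discriminator outputs D(G(z)) on a batch of generated samples, in (0, 1).
    """
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
```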

III. RELATED WORK
Memory compression is achieved by reducing the amount of memory required by the model. This can be done by either reducing the number of weights used in a model or reducing the number of bits used by each weight. In this section, we review previous work related to GAN quantization, which reduces the number of bits required by each weight.
Standard quantization is usually used to convert floating-point numbers to fixed-point numbers. It uses the affine mapping in eq. 2:

$$Q = \frac{W - \text{zero\_point}}{\text{scale\_factor}} \qquad (2)$$

The affine transformation has the benefit of allowing multiplication and addition to be performed without the requirement to reverse the mapping [14]. Other types of quantization might need some calculation or adjustment before carrying out the computation.
In [10], Wang et al. developed a quantization technique for GANs known as QGAN. They demonstrated that normal quantization methods (uniform, log, and tanh) are not sufficient for a stable and convergent GAN. Moreover, the sensitivity of G and D to quantization is different; to allow the compressed generator to converge, it must be tuned using a balanced discriminator. This results in the proposal of a separate multi-precision quantization scheme for G and D. Wang et al. then utilized the expectation-maximization (EM) technique to search for the best scale_factor and zero_point values in eq. 2; the EM algorithm aims to minimize the mean square error between the quantized (W_q) and non-quantized weights. Then eq. 3 is used to calculate the quantized weight:

$$W_q = \text{scale\_factor} \times \mathrm{round}(Q) + \text{zero\_point} \qquad (3)$$
where Q is the quantized number, W_q is the quantized weight, and round(Q) is the integer part of the fixed-point number that is used in the calculation. Wang et al. applied QGAN to several models such as Wasserstein GAN (WGAN) [15], deep convolution GAN (DCGAN) [16], and least squares GAN (LSGAN) [17]. Using only 1-4-bit quantization, QGAN scored a compression ratio in the range of 8x-32x with negligible loss in inception score.
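As a concrete illustration of the mapping in eq. 2 and eq. 3, the following is a minimal sketch in which scale_factor and zero_point are derived directly from the weight range; QGAN instead searches for these two parameters with EM, so this snippet only illustrates the affine form, not the search.

```python
import numpy as np

def affine_quantize(w, n_bits=8):
    """Affine quantization sketch: map weights onto an n-bit integer grid and back."""
    levels = 2 ** n_bits - 1
    zero_point = float(w.min())                              # offset so the grid starts at the smallest weight
    scale_factor = (float(w.max()) - zero_point) / levels    # step size between levels
    q = np.round((w - zero_point) / scale_factor)            # eq. 2: grid coordinate Q
    w_q = scale_factor * q + zero_point                      # eq. 3: quantized weight W_q
    return q.astype(np.int32), w_q
```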
According to a study by Deng et al. [11], which used a PatchGAN-like generator to reconstruct facial images, the peak signal-to-noise ratio (PSNR) worsens as the number of bits drops while the compression ratio improves. Using QMGAN, the PSNR dropped by 2.5x for a 1-bit network compared to a 32-bit network, while the memory size of the 1-bit quantized network is 35 times smaller than that of single precision. They experimented with various quantization values and noted that the PSNR of 32-bits is nearly identical to that of 6-bits, despite the 6-bit network being roughly 5.4x smaller. They also used fine-tuning for their networks.
By utilizing both quantization and knowledge distillation, Wang et al. [12] presented their unified framework, GAN Slimming. On the generator model, they applied uniform quantization to both the activations and the weights. To make the quantization hardware-friendly, they unified the quantization across all layers. They achieved 4x∼8x compression ratios on style-transfer tasks. However, they use distillation to retrain the model.

Flexpoint, a non-standard float-like format, was suggested in a paper by Köster et al. [18]. This format uses one shared exponent per tensor. As a result, the tensor operations are handled as if they were fixed-point operations, and an additional circuit is required to manage just one exponent per tensor (matrix) operation. The format is considered a compromise between fixed point and floating point. The tensor is stored as a single shared 5-bit exponent for the entire tensor and a 16-bit mantissa for each element of the tensor (matrix).
Models using the Flexpoint format scored a lower (better) Fréchet inception distance (FID) [19] than fixed-point or float16 networks. Additionally, the Flexpoint networks' scores were comparable to networks using float32. In contrast to fixed point, Flexpoint can be used in training with floating-point techniques and even faster calculation. On the other hand, it needs special hardware to reap its benefits.
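As a toy illustration of the shared-exponent idea (one exponent per tensor, a 16-bit mantissa per element), consider the sketch below; it is only illustrative and does not reproduce the exact Flexpoint format or its exponent-management logic [18].

```python
import numpy as np

def to_shared_exponent(tensor, mantissa_bits=16):
    """Encode a tensor as int16 mantissas plus one shared exponent (toy illustration)."""
    exp = int(np.ceil(np.log2(np.abs(tensor).max() + 1e-30)))   # one exponent for the whole tensor
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1)
    mant = np.clip(np.round(tensor / scale), -limit, limit - 1).astype(np.int16)
    return mant, exp

def from_shared_exponent(mant, exp, mantissa_bits=16):
    """Decode back to float32 using the shared exponent."""
    return mant.astype(np.float32) * 2.0 ** (exp - (mantissa_bits - 1))
```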

IV. METHODOLOGY
A. INTRODUCTION
A GAN model is a type of deep-learning model where one layer can have millions of parameters. As the output resolution increases, the model size increases, since we add more layers to the model. Hence, compressing the model is pivotal. Traditionally, many deep-learning applications use 8-bit fixed point to compress data, but using fixed point requires retraining or fine-tuning the model. As stated before, GAN training has many challenges. Thus, we propose using a clustering technique that surpasses fixed-point performance without retraining.
In this section, we explore using weight clustering on the generator part of the GAN. The choice of clustering is based on several considerations. First, weight clustering offers storage optimization. Second, instead of retraining the model, we only need to choose good clusters, and choosing clusters is a more stable task than training GANs. The overhead of storing clusters per layer is negligible compared to the savings offered by indexing (clustering) the weights. Finally, to the best of our knowledge, this is the first work to explore clustering with GANs.

B. CLUSTERING METHOD
Our clustering algorithm is based on the well-known k-means algorithm. To keep the overhead small without loss of accuracy, we cluster each layer independently. All channels in the same layer are clustered together, but each layer has different centroids from the other layers. The clustering algorithm for one layer is explained in algo. 1.
The algorithm starts by defining the number of clusters k and two stopping parameters, τ and ϵ: τ is the maximum number of clustering iterations, while ϵ is the centroid-movement tolerance. The algorithm then flattens the weight matrix into a 1-D vector. Following that, it selects k points as initial centroids. The centroids are sorted in ascending order; sorting the centroids simplifies the calculation, as elaborated later. Next, it assigns each weight to the nearest centroid. Subsequently, it adjusts each cluster center by calculating the average of the points assigned to that centroid. The weight-assignment (algo. 3) and centroid-recalculation (algo. 2) steps are repeated until one of the two criteria is satisfied: either the maximum number of iterations (τ) is reached or the centroids no longer move by more than ϵ.
Choosing the initial centroids is an important task. Hence, we use three different techniques to select the centroids: 1) random selection (RS), 2) quantized-based selection (QS), and 3) data-distribution-based selection (DS). The first method is straightforward: it picks random weights as centroids. Its advantage is that it requires almost no calculation and can be applied directly. The second method starts with the quantized values as centroids and allows the clustering algorithm to refine them; since it begins with quantized values, applying the clustering method decreases the mean absolute error. The third method uses the data distribution: it calculates the histogram of the data and then chooses the centroids as the bins with the highest counts, as sketched below. In all cases, this step is done only once at the beginning of the algorithm. In the experiments section, we test the three options and elaborate more on their differences.
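As an illustration, the three initialization options could look like the NumPy sketch below; the function name, the histogram bin count, and the sampling details are our own illustrative choices.

```python
import numpy as np

def init_centroids(flat_weights, k, method="RS", rng=None):
    """Pick k initial centroids from a flattened 1-D weight vector."""
    rng = rng or np.random.default_rng()
    if method == "RS":   # random selection: k weights picked at random
        return np.sort(rng.choice(flat_weights, size=k, replace=False))
    if method == "QS":   # quantized-based: k uniform levels over the weight range
        return np.linspace(flat_weights.min(), flat_weights.max(), k)
    if method == "DS":   # distribution-based: centers of the k densest histogram bins
        counts, edges = np.histogram(flat_weights, bins=max(4 * k, 64))
        centers = (edges[:-1] + edges[1:]) / 2.0
        return np.sort(centers[np.argsort(counts)[-k:]])
    raise ValueError(f"unknown initialization method: {method}")
```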

Algorithm 1 Weight Clustering per Layer
input: W ∈ R^{h×w×ic×oc} (the layer weights), k (the number of clusters), ϵ (the tolerance), τ (the maximum number of iterations); output: idx ∈ Z^{h×w×ic×oc} (the per-weight cluster indices) and the k centroids.

[Pseudocode of Algorithm 1, Algorithm 2 (centroid recalculation), and Algorithm 3 (Calculating Nearest Cluster Based on Sorted Centroids, Proc update_indices(w, centroids, indices)); a code sketch follows below.]

As previously mentioned, clustering is more stable than GAN training because its criteria are objective rather than subjective. Thus, it does not need retraining or fine-tuning of the whole model. However, when combining it with other techniques like pruning, we have two alternatives: either perform pruning and then clustering, or apply clustering to the weights and then pruning.
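For concreteness, the following is a minimal NumPy sketch of the per-layer clustering loop described above: flatten the weights, pick and sort k initial centroids (RS initialization here), assign each weight to its nearest centroid using the sorted-search shortcut of algo. 3, recompute centroids as the mean of their members (algo. 2), and stop after τ iterations or when no centroid moves by more than ϵ. Helper names and implementation details are illustrative, not the exact implementation.

```python
import numpy as np

def nearest_sorted(flat, centroids):
    """Nearest centroid via binary search on sorted centroids plus one comparison."""
    i = np.clip(np.searchsorted(centroids, flat), 1, len(centroids) - 1)
    left_is_closer = (flat - centroids[i - 1]) < (centroids[i] - flat)
    return np.where(left_is_closer, i - 1, i)

def cluster_layer(w, k=16, eps=1e-4, tau=25, seed=0):
    """Cluster one layer's weights into k centroids; returns per-weight indices and centroids."""
    rng = np.random.default_rng(seed)
    flat = w.ravel().astype(np.float32)                            # flatten to a 1-D vector
    centroids = np.sort(rng.choice(flat, size=k, replace=False))   # RS initialization, sorted
    for _ in range(tau):                                           # stop after tau iterations...
        idx = nearest_sorted(flat, centroids)                      # assign weights to clusters
        new_centroids = centroids.copy()
        for c in range(k):                                         # recompute each centroid
            members = flat[idx == c]
            if members.size > 0:
                new_centroids[c] = members.mean()
        new_centroids = np.sort(new_centroids)
        moved = float(np.max(np.abs(new_centroids - centroids)))
        centroids = new_centroids
        if moved < eps:                                            # ...or when centroids stop moving
            break
    idx = nearest_sorted(flat, centroids)
    return idx.reshape(w.shape), centroids
```

For a convolution layer, this returns a tensor of small integer indices plus k full-precision centroids; only the indices and the per-layer centroid table need to be stored.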

C. PRUNING METHOD
Pruning's main advantage is reducing the number of multiplication operations by replacing the least effective weights with zeros. When combining clustering and pruning, a challenge arises: should we prune then cluster, or the opposite? Additionally, how would the pruned weights impact the clustering process? To answer these questions, we devised two methods, ''pre-cluster pruning'' and ''guided pruning.'' Pre-cluster pruning is the process of pruning before clustering. Whether this pruning is per-layer or global for the whole model, it gives nearly the same results. However, how should clustering deal with pruned weights?
There are two ways to deal with pruned weights: (1) treat them as normal weights with zero values and perform clustering, or (2) exclude them from the other weights and then perform clustering. While the first method seems simpler, it reverses the pruning process, since the zero weights would be clustered to a centroid that might be non-zero. Additionally, it would bias that centroid towards zero, since all the pruned weights are stuck at zero. Thus, we chose the second method.
The second method excludes the zero weights and then clusters the remaining weights with k − 1 centroids. It assigns the 0-centroid to the pruned weights; thus, index zero represents an actual zero value, not just an index. This method avoids biasing the centroids, and we call it ''pre-cluster pruning.'' A sketch is given below.
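The following is a minimal sketch of pre-cluster pruning under these choices; it assumes the cluster_layer helper sketched earlier, and the magnitude-quantile threshold is just an illustrative way to reach the target pruning percentage.

```python
import numpy as np

def pre_cluster_prune(w, k=16, prune_frac=0.20):
    """Prune the smallest weights, then cluster only the survivors with k-1 centroids."""
    flat = w.ravel()
    threshold = np.quantile(np.abs(flat), prune_frac)   # prune_frac of the weights fall at or below this
    keep = np.abs(flat) > threshold
    idx = np.zeros(flat.size, dtype=np.int32)           # index 0 is reserved for pruned weights (true zero)
    survivor_idx, survivor_centroids = cluster_layer(flat[keep], k - 1)
    idx[keep] = survivor_idx.ravel() + 1                # shift survivors to indices 1..k-1
    centroids = np.concatenate(([0.0], survivor_centroids))
    return idx.reshape(w.shape), centroids
```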
Guided pruning combines pruning with clustering information to speed up and improve the selection of elements to prune. It is performed after an initial clustering and depends on sorting the clusters' centroids. With sorted centroids, it is easy to eliminate the smallest centroid (in absolute value) together with its associated weights and to update the rest of the centroids accordingly. This technique chooses the threshold automatically by eliminating a whole cluster, as sketched below.
Because the centroids are sorted and each centroid is a single value rather than a vector, assigning a weight to a cluster needs only three distance calculations instead of k: the comparison is between the current centroid, the one just above it, and the one just below it. So the distance calculation costs O(3n) instead of O(n × k).
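A possible sketch of guided pruning on top of an existing clustering is shown below; here, ''eliminating'' the smallest-magnitude cluster is implemented simply by snapping its centroid to zero, which is one reading of the description above, and a re-assignment or recalculation pass over the remaining centroids could follow.

```python
import numpy as np

def guided_prune(idx, centroids):
    """Zero out the centroid with the smallest magnitude; its weights become skippable zeros."""
    centroids = centroids.copy()
    drop = int(np.argmin(np.abs(centroids)))   # the cluster to eliminate
    centroids[drop] = 0.0
    return idx, centroids
```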

D. ANALYSIS
In the rest of this section, we provide timing, error, and storage analyses of our method, using the following notation.
• K is the number of clusters per layer.
• K_lj is the j-th centroid in the l-th layer.
• The number of bits required to represent the centroids is log2(K).
• W_l is the weight matrix/vector in layer l; each weight element in the matrix/vector lies in the open interval (−1, 1).
• W_lji is the i-th weight belonging to the l-th layer and to the cluster with centroid K_lj.
• We use the ''#'' symbol to represent a count (i.e., #K is the number of centroids).

1) TIMING ANALYSIS
The algorithm complexity depends on three parameters: (a) the number of clusters K, (b) the size of the weights W, and (c) the number of iterations t required to execute the algorithm. A naive implementation has a per-layer time complexity of O(t × W × K). By sorting the centroids, the complexity is divided into three parts: (a) sorting complexity, (b) first-iteration complexity, and (c) subsequent-iteration complexity. First, sorting is done only once, at the first iteration; we start from a random order of centroids, so a full sort requires O(K × log2(K)). Second, the first iteration assigns each weight to exactly one cluster; this can be performed using binary search since the centroids are already sorted, so its complexity is O(W × log2(K)). Subsequent iterations only readjust each assigned weight, which requires at most three distance calculations: the distance to the current centroid, to the previous centroid, and to the next centroid. Hence, their complexity is O(t × (W × 3)). In conclusion, the final time complexity is O(t × (W × 3) + W × log2(K) + K × log2(K)). Since K ≪ W, we can simplify the complexity to O(3 × t × W).

2) ERROR ANALYSIS
By clustering the weights into K clusters, the clustering algorithm tries to minimize the distance (objective function) between the cluster centroids K_lj and the points belonging to the same cluster, as seen in eq. 4:

$$\min \sum_{j} \sum_{i} \left| K_{lj} - W_{lji} \right| \qquad (4)$$

where j ∈ [0, #K) and i ∈ [0, #W_lj), with #W_lj the number of weights in cluster j. The worst mean absolute error (MAE) occurs for points uniformly distributed over a distance of 1/K. Using this worst-case MAE, the relation between the number of clusters and the MAE is shown in Fig. 3. We clearly see that as the number of clusters increases, the error decreases. However, our main goal is compressing the model, so we want a smaller number of clusters. Although the calculated error is per layer rather than the whole-network error, it is an indication of the final error. As will be seen in the experiments section, this bound is very loose compared to real data, because real weights are not uniformly distributed, so the actual error is smaller than the theoretical error.
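For instance, under the assumption that the weights assigned to a centroid are uniformly distributed over an interval of width d = 1/K centered at that centroid, the worst-case MAE can be derived as

$$\mathrm{MAE}_{\max} = \frac{1}{d}\int_{-d/2}^{d/2} |x| \, dx = \frac{d}{4}, \qquad d = \frac{1}{K} \;\Rightarrow\; \mathrm{MAE}_{\max} = \frac{1}{4K},$$

so the bound shrinks inversely with the number of clusters, matching the trend in Fig. 3 (the exact constant depends on the assumed cluster width).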

3) STORAGE ANALYSIS
The amount of storage savings depends on the number of clusters and hence the number of bits. Each layer in the model requires storage = W_l × 32 bits. A model with clustered weights requires storage = W_l × log2(K) + K × 32 bits. The compression ratio is given in eq. 5. For example, for K = 16, we get 8x less storage than the floating-point model. Fig. 4 displays the relation between the number of clusters (and hence the number of bits) and the total size of the weights for different traditional layers.

$$cr = \frac{\text{storage of the uncompressed model}}{\text{storage of the compressed model}} = \frac{W_l \times 32}{W_l \times \log_2(K) + K \times 32} \qquad (5)$$

In the previous equations, we ignored the term K × 32, since it is usually negligible compared to the total weight storage. This overhead is shown in Fig. 5; even for 256 clusters, the overhead is almost negligible.
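As a quick check of eq. 5, the following helper computes the per-layer clustering compression ratio; the example layer shape is hypothetical.

```python
from math import ceil, log2

def clustering_compression_ratio(num_weights, k, float_bits=32):
    """Eq. 5: float storage over (index storage + centroid table), per layer."""
    index_bits = ceil(log2(k))
    uncompressed = num_weights * float_bits
    compressed = num_weights * index_bits + k * float_bits
    return uncompressed / compressed

# A hypothetical 3x3 convolution with 256 input and 256 output channels:
# clustering_compression_ratio(3 * 3 * 256 * 256, 16) -> ~8.0 (the K x 32 term is negligible)
```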

4) PRUNING ANALYSIS
If the hardware supports skipping zero multiplications, then the percentage of multiplication savings equals the pruning-threshold percentage. For example, if the number of MAC operations in a model is M, then the total number of MACs after pruning is M × (1 − p), where p is the pruning percentage.
In summary, we cluster weights layer by layer. This clustering requires no fine-tuning, since it depends on the clustering-algorithm loss instead of the training loss; the clustering loss is more stable, simpler to compute, and requires no additional data. The error and storage savings depend on the number of clusters used.
Additionally, pre-cluster and guided pruning require no retraining, since they depend on the clustering algorithm; moreover, the pruning threshold is selected automatically based on the clustering. Pruning provides energy savings if the hardware supports skipping zeros.
In conclusion, the clustering algorithm requires no retraining, which makes it more appealing for GANs. Additionally, flattening the weights into 1-D vectors and sorting the centroids reduce the clustering complexity by orders of magnitude. Moreover, pruning combined with clustering reduces the number of required multiplication operations if well designed; in guided pruning, the threshold is selected automatically based on the clustering algorithm.

V. EXPERIMENTS
In this section, we present the experimental studies for applying clustering. First, we compare different techniques for choosing the cluster centroids, using both the mean absolute error (MAE) and the FID ratio. Then, we study the different hyperparameters of the clustering algorithm. Following that, we compare the clustering method with other numeric formats without retraining. In each experiment, we study the metric measure, the visual quality, and the expected savings.
We used 8 datasets, 4 of which are used as dual datasets. Dual datasets are used in style transfer, where we can transfer an image from domain A to B or vice versa. The datasets are listed in Table. 1, showing the number of images in each dataset, the classes (varieties) in the dataset, the input/output resolution, and the GAN task (synthesis or style transfer). We used different models for both synthesis and style-transfer tasks. The synthesis models used are the original DCGAN [16], DCGAN with a U-Net discriminator [20], and Animegan [21]. For style-transfer models, we used Cyclegan [22] and Stargan [4].

A. CENTROID SELECTION
As mentioned in the methodology section, the choice of initial centroids impacts the clustering performance. Thus, we implemented three methods for initializing the centroids: 1) random selection (RS), 2) quantized-based selection (QS), and 3) data-distribution-based selection (DS). We report the MAE for each method and compare it with the quantized results.
We carried out 13 experiments using the datasets and models shown in Table 2. By calculating their average MAE, we get the results in Fig. 6. The MAE is calculated per layer and then averaged across all layers of each model. We can see that clustering always achieves a better MAE than quantization for both 6-bits and 4-bits. Furthermore, the RS and DS methods have comparable MAE, while QS is slightly worse.
If we were to split the data into equal uniform partitions and use random selection, we would have a high probability of choosing the partitions with higher density. DS does the same, but it selects the highest-density partitions exactly instead of relying on probability. That is why the RS and DS results are close: their initializations have a high probability of being very similar. However, RS gives more variability to handle skew in the data.
On the other hand, both the RS and DS methods performed better than the QS method. This is attributed to the fact that the quantized method starts with centroids uniformly distributed over the data range. While the weight distribution has the final say on which method is better, we have rarely seen weights distributed uniformly; rather, they mostly follow multi-modal distributions. Nevertheless, clustering with QS initialization still performs better than weight quantization without clustering, since it refines the centroids instead of staying stuck at the uniform distribution.

B. HYPERPARAMETERS SELECTION
In this section, we study the impact of the clustering-algorithm hyperparameters. The algorithm has three parameters: 1) the number of clusters, 2) the centroid-movement threshold (ϵ), and 3) the number of iterations (τ). Since the number of clusters impacts the compression ratio (our main goal), we used two settings, one with K=16 and the other with K=64. The reason for this choice is the balance between the compression gain, data alignment in memory, and the quality of the results; more elaboration is given in the next section. For the other parameters, we study how both the MAE and the FID ratio change with the parameter value.

1) CENTROID MOVEMENT THRESHOLD (ϵ)
To study the centroid-movement parameter, we removed the maximum-iteration condition τ from the clustering algorithm and set the stopping condition to ϵ only. Additionally, we used the same set of experiments shown in Table. 2. We chose the RS selection algorithm to factor out the impact of the centroid-selection method, although the methods give very close results.
As seen in Fig. 7, the change in MAE is almost zero, and different values of ϵ give almost the same MAE score. Reporting the average FID ratio in Fig. 8, the variance of the average FID is 0.01 (1 percent) for 4-bits and even smaller for 6-bits. This result is aligned with the MAE result: changing ϵ below 0.0001 does not make much difference. The reason for this is the number resolution: when using 4-bits, the expected error is 1/(2 × 2^4) ≈ 0.03, while for 6-bits it is 1/(2 × 2^6) ≈ 0.008, so using an ϵ below the expected error results in convergence.

2) NUMBER OF ITERATIONS (τ )
To study different values for the number of iterations τ, we set ϵ to a constant value of 0.0001. Similar to the other parameters, we used the set of experiments shown in Table. 2, and we again chose the RS selection algorithm.
Running our set of experiments shows that most experiments terminate before reaching the assigned τ, as seen in Fig. 9. In the case of 4-bits, all experiments terminate in fewer than 75 iterations, while in the case of 6-bits it takes up to 150 iterations. This is because 6-bits allows more variability than 4-bits and thus takes more iterations to converge.
To study the impact of τ on performance, we report both the average MAE score in Fig. 10 and the average FID ratio in Fig. 11. After 25 iterations, the MAE score becomes almost constant; despite the extra iterations, their added value is almost negligible. A similar result is seen for the FID score, with very small variations of less than 0.005 (0.5%). Thus, we chose 25 iterations, since this saves processing time with no significant impact on the results.

C. CLUSTERING EXPERIMENTS
In this section, we present the experimental studies made for applying clustering and compare it to other numeric formats without retraining. Through each experiment, we study the metric measure, the visual quality and the expected savings.
For a fair comparison, we compare the clustered weights with fixed-point weights using the same number of bits. For example, clustered weights using 64 clusters (C64) are compared to fixed-point quantization using 6 bits (Q6), and clustered weights using 16 clusters (C16) are compared to fixed-point quantization using 4 bits (Q4). In our experiments, each layer is quantized/clustered alone. Additionally, we compare our results with the default model using single-precision representation, and we compress only the model weights using both the clustering and quantization techniques. Based on the previous section, we set τ to 25 iterations, ϵ to 0.0001, and the centroid selection to the RS method.
The visual results in Fig. 12 show that, using 6-bits, both clustering and quantization are very close to floating point, with the clustered results being smoother. On the other hand, when using 4-bits, the quantized method fails dramatically, while the clustered method's quality degrades only slightly. These visual results align with the FID scores.
In Table. 2, we show the ratio between the FID score of the original model and that of the modified models, FID ratio = (FID of the float-32 model) / (FID of the modified model). We can see that the clustering technique is either better than or within about 1% of the quantized performance. There are a few cases where the quantization results are better than clustering, even though, when comparing MAE, the clustering methods were always superior to quantization (Fig. 6). The cases where quantization beats clustering all have a high FID ratio of 0.98+ and may even have a ratio higher than the float model (>1). These cases indicate that the floating-point network does not have optimal weights and could be tuned further: while the clustered model tries to stay close to the float model, the quantized model happens to beat the clustered model and even the float (original) model itself. The clustering results are not only close to the floating-point results but, in some cases, achieve a better FID (ratio larger than 1). We also notice that, for the last three models, clustering is better than quantization by more than 7%. This difference can be justified by the nature and type of the task: for the synthesis task with small-sized images, both methods are very close, but as the output resolution increases, quantization's need for retraining increases, and thus the clustering algorithm surpasses its performance.
By analyzing the results, we can see that Cyclegan and DCGAN performed well under both compression methods. This is due to the redundancy in their weights; as we will see in the next sections, a significant amount of pruning can be applied to those models, and smaller models can give the same results. In other words, those models have many ineffective weights, so they are not sensitive to numeric imprecision. On the other hand, when the models are not that large, the clustered methods surpass quantization, since clustering manages to retain the numeric precision of each cluster.
In Table. 3, we compare the compression methods using only 4-bits. Here, traditional fixed-point quantization starts to fail dramatically, with Animegan's FID ratio scoring 28% and Cyclegan scoring 50%. On the other hand, clustering using 4-bits scored 94% on average, which is close to the average score using quantization with 6-bits (94.4%). The reason for this failure of quantization is that 4-bit precision fails to capture the weight range due to the uniform nature of quantization, while clustering is free to have different steps between clusters. Moreover, the cluster centroids are stored in full precision, which again gives more precision to the model.
We extended our experiments to compress both model weights and activations (inputs to each layer). Since activations change depending on the inputs, the overhead of clustering activations would be too large. Thus, we used quantization for compressing activations, while we alternate between clustering and quantization for the weight compression. An expected degradation happened to the results when using 6-bits. However, when using 4-bits the model faced total failure, as seen in Fig. 13. As previously analyzed, quantization starts to degrade in low-bit settings, so when combined with other techniques it degrades the whole result. In Fig. 14, we show the average FID ratio when compressing only weights and when compressing both weights and inputs, for both clustering and quantization. Clustering techniques always beat quantization.

[Figure caption: Comparison between different compression methods, shown as (method for weights + method for activations). Colors represent the compression technique used for the weights, whether clustering or quantization. The first element of the configuration stands for the number of bits used to represent the weights, where 6-bits refers to 64 clusters and 4-bits to 16 clusters. Q6 and Q4 represent quantizing the input to 6 and 4 bits, respectively.]

In summary, the main advantage of clustering is achieving a high compression ratio while preserving result quality and without the need for retraining. It scored an average FID ratio of 97% and 94% for 6 and 4-bits, respectively. As shown in Fig. 15, the compression ratio is almost the ratio between the numbers of bits used. Using K=16 (4-bits), the compression ratio is ∼8x with a centroid overhead of less than 0.06%. This saves both storage space and the power consumed transferring data from/to the chip.

D. PRUNING EXPERIMENTS
In this section, we present the experimental studies made for applying pruning with clustering. Pruning is applied as a post-processing step. Through each experiment, we study the metric measure, the visual quality, and the expected savings.
We compared pre-cluster pruning with guided pruning; both are applied per layer, meaning that pruning is applied to each layer separately. Additionally, we compare our results with pruning alone and clustering alone. For a fair comparison, we set the pruning threshold to 20% in all experiments.
Since pruning eliminates weights by setting them to zero, it does not by itself decrease the size of the model. However, it allows the usage of off-the-shelf methods to perform compression and take advantage of the huge number of zeros introduced by pruning. Thus, we compress the model using the gzip algorithm [26] to measure the actual compression ratio.
On the other hand, the clustering method uses fewer bits for each number, which directly compresses the model. However, the current platform does not support fewer than 8 bits. Thus, we emulate 6 and 4 bits by setting the remaining bits to zero and then use the same gzip compression to get the actual compression ratio, as sketched below. The actual compression ratio is 3∼4% less than the theoretical compression ratio.
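A sketch of how such a measurement could be made is shown below; it assumes both the baseline float32 weights and the emulated low-bit indices are gzip-compressed, and the byte layout (one index per byte with the unused high bits left at zero) is our emulation rather than an exact file format.

```python
import gzip
import numpy as np

def measured_compression_ratio(float_weights, cluster_indices):
    """gzip size of the float32 weights over gzip size of byte-packed cluster indices."""
    base = gzip.compress(float_weights.astype(np.float32).tobytes())
    emulated = gzip.compress(cluster_indices.astype(np.uint8).tobytes())  # 6/4-bit indices, one per byte
    return len(base) / len(emulated)
```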
In Table. 4 and Table. 5, we show the compression ratio of the different techniques against the base model, for k=64 and k=16 respectively, where k is the number of clusters. The techniques used in the comparison are pruning only, combined clustering and pruning, and clustering only. Pruning is divided into per-layer, pre-clustered, and guided pruning: per-layer is pruning only, pre-clustered is pruning followed by clustering, and guided pruning is only applicable to clustering, as explained in the methodology section. As seen in both tables, our devised technique of guided pruning scores the best compression ratio. It is also noticeable that, although the pruning threshold is 20%, the compression ratio from pruning is only around 14%. Additionally, the difference between clustering only and combined clustering and pruning is very small; most of the compression results from clustering, not pruning, and pruning only adds around 2∼4%. This can be best explained by the fact that clustering has two advantages that make it the biggest contributor: 1) the number of bits per weight is significantly reduced, and 2) the centroid indices are zero-based, which means some weights have a zero index, and hence the gzip algorithm uses this to compress the model further.
In Table. 6, we report the FID ratio of the different techniques compared to the base model. For k=64, we notice that clustering only has the highest FID ratio, with an average of 97% compared to 95% for pruning, even though the former scores a 5x better compression ratio than the latter. However, combining both techniques accumulates the losses and scores a 91% FID ratio. Despite that, in certain cases the combined technique performs better than, or only slightly worse than, clustering alone. In all cases, examining the visual results in Fig. 16 shows that the results are very close despite the differences. Additionally, we can see that guided and pre-cluster pruning perform closely, with an average difference of only 0.6%.
In Table. 7, we see a small degradation in FID due to the aggressive compression with k=16. However, the FID ratio for clustering alone is almost equal to that of pruning, even though the former compresses 6x better than the latter. The artifacts appearing in some models in Fig. 16 are due to the numeric format used rather than the pruning itself. We also notice that pre-cluster pruning performs better than guided pruning here; this is because clustering uses very few bits, which places more points in each cluster, leading to a higher compression ratio but a lower output quality.
In summary, clustering is the main factor affecting both the quality of the results and the compression value, as seen in Fig. 17. This is due to the zero-based centroid indices, which allow big savings when using gzip, and the smaller number of bits per weight. However, combining it with pruning not only adds extra compression but also saves power if the hardware supports zero-multiplication skipping, giving around 20% fewer multiplication operations (assuming a pruning threshold of 20%).

VI. CONCLUSION
In this work, we used a clustering algorithm to compress the GAN generator. The convergence of the clustering algorithm is much faster and more stable than GAN retraining; thus, by using clustering, we do not need to fine-tune the model. The algorithm provides a theoretical compression ratio of 4∼8x with an average FID ratio of 97∼93%.
Additionally, we combined clustering with pruning techniques. Performing this combination was challenging because pruning would bias the cluster centroids; hence, we devised a new methodology to combine pruning with clustering. Moreover, we introduced guided pruning, which relies on the sorted cluster centroids to perform pruning on centroids instead of on individual weights. Although pruning contributed only 3∼4% to the compression ratio, it provides energy savings proportional to the pruning threshold due to zero-multiplication skipping.
Optimizing GANs is a research topic that still has many unsolved challenges. In future work, we plan to study other clustering methods. Another important topic is performing post-training ''structured'' compression instead of unstructured pruning. Additionally, more research is needed to enhance the stability of GANs and thus speed up the training and generation methodology.