Machine Learning with Applications

StyleGAN2 is able to generate very realistic, high-quality human faces using a training set (FFHQ). Instead of using one of the many commonly used metrics to evaluate the performance of a face generator (e.g., FID, IS, and P&R), this paper uses a more humanlike approach, providing a different outlook on the performance of StyleGAN2. The generator within StyleGAN2 tries to learn the distribution of the input dataset. However, this does not necessarily mean that higher-level human concepts are preserved. We examine whether general human attributes, such as age and gender, are transferred to the output dataset and whether StyleGAN2 is able to generate actual new persons according to facial recognition methods. It is crucial for practical implementations that a face generator not only generates new humans, but that these humans are not clones of the original identities. This article addresses these questions. Although our approach can be used for other face generators, we focus only on StyleGAN2. First, multiple models are used to predict general human attributes. This shows that the generated images have the same attribute distributions as the input dataset. However, if truncation is applied to limit the latent variable space, the attribute distributions change towards the attributes corresponding with the latent variable used in truncation. Second, by clustering using face recognition models, we demonstrate that the generated images do not belong to an existing person from the input dataset. Thus, StyleGAN2 is able to generate new persons with similar human characteristics as the input dataset.


Introduction
Think of an unknown face. Humans are capable of imagining faces they have never seen before, combining facial attributes from multiple sources to create a new identity. Can a machine do the same? By looking at real images of humans, can it learn how to generate a unique and realistic face? And if so, are humans still able to distinguish between authentic and computer-generated faces? These questions are part of a larger quest of discovering the capabilities and boundaries of machines. Speech, music, paintings, images, and even videos are among the many things a computer is now able to generate. The quality of the produced content has increased rapidly since the introduction of generative adversarial networks (GAN; Goodfellow et al., 2014). Generating realistic faces shows the power, capabilities, and limitations of these approaches.
In 2019, the successor (Karras et al., 2019) of the well-known StyleGAN paper (Karras, Laine, & Aila, 2018) was published. When StyleGAN was released in 2018, it immediately showed impressive results. At that time, this architecture improved the state-of-the-art performance considerably by injecting the generator at different stages with a style-based latent variable. Although humans can still distinguish between computer-generated and real images, the images look very realistic at first glance, which is a huge achievement. Such generators are commonly evaluated with statistical metrics, but this is not a human approach: a person would rather compare human characteristics to evaluate the images. Although it is infeasible to compare a large number of generated images by hand, a humanlike approach is necessary. Zhou et al. (2019) did use humans to decide whether images generated by StyleGAN were fake or real. The results showed that StyleGAN was capable of generating faces that humans found hard to distinguish from the input images. Our research focuses on two different aspects: are human traits transferred from the input to the output dataset, and are the generated images new identities? The lack of human-tagged attribute and identity labels for StyleGAN2 leads us to use existing models that were trained on different humanly labeled data.
Hence, we take two separate paths to evaluate how well StyleGAN2 performs from a 'human' perspective. First, multiple models are used to predict general attributes of the images, such as age, gender, and race. In this way, we can determine if higher-level concepts are preserved. Second, we examine for multiple face recognition models if the generated images can be considered to be different persons, compared to the images from the input dataset.
With this two-pronged approach, we are able to show that the human attribute distributions are very similar for the input and output dataset, but that the generated images nonetheless depict different persons according to the facial recognition models. Thus, StyleGAN2 has the best of both worlds: it is able to copy high-level concepts from the input dataset, whilst still creating different persons. Furthermore, if truncation (see Section 2.1) is used to limit the latent variable space, the attribute distributions change significantly towards the attributes corresponding with the latent variable used in truncation. To our knowledge, this humanlike approach to comparing high-level concepts of facial datasets is new. While we only use our two-pronged approach for StyleGAN2, it can also be used to evaluate other face generators.
To summarize, in this paper we:
• introduce a new two-pronged humanlike approach to evaluate face generators, by predicting human attributes and clustering using face recognition models;
• show that the state-of-the-art StyleGAN2 generates images that have the same attribute distributions as the input dataset;
• determine that StyleGAN2 generates faces that often do not belong to persons in the input dataset according to face recognition models;
• observe that adding truncation to the latent variable space changes the attribute distributions towards the attributes corresponding with the latent variable used in truncation.
The remainder of this paper is organized as follows. First, the relevant datasets are discussed in Section 2. Next, in Section 3 the methods are explained that are used to predict facial attributes, embed faces, and cluster on these embeddings. Furthermore, we also define how a clustering is evaluated and why clustering is a natural approach. The results are discussed in Section 4. Finally, Section 5 summarizes the general findings and discusses possible future research opportunities.

Datasets
Karras et al. (2019) made three datasets publicly available that are used in this research: the input dataset (FFHQ), consisting of 70,000 real facial images (without identity annotation), and two output datasets of StyleGAN2, both consisting of 100,000 generated images. The only difference in the creation of these output datasets is the so-called truncation parameter (Brock, Donahue, & Simonyan, 2018; Karras et al., 2018, 2019). All images are high-quality pictures (1024 × 1024 pixels).

Truncation
To explain how truncation works, it is useful to take a look at the structure of StyleGAN (see Fig. 1). Note that there are some differences with the architecture of StyleGAN2; however, the following core principles still hold. A latent variable z ∈ Z from latent space Z is first normalized using pixelwise feature vector normalization and then passed through a mapping network f, yielding an intermediate latent variable w = f(z). Truncation replaces w by

w′ = w̄ + ψ(w − w̄),

where w̄ is the average intermediate latent variable and ψ ∈ ℝ is called the truncation parameter. Note that ψ = 1 gives w′ = w, which is the same as not applying truncation at all. In Fig. 2, five faces are shown that are generated with w̄ as intermediate latent variable, which is equivalent to generating images with ψ = 0. Furthermore, noise is injected in the synthesis network to increase stochasticity (see Fig. 1). However, this leads to only minor changes if the intermediate latent variable is constant. As can be seen in Fig. 2, the faces all look very similar.
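As a minimal sketch (our own illustration, not code from the paper), the truncation trick is a linear interpolation towards the average intermediate latent variable; the function name and toy dimensionality below are ours:

```python
import numpy as np

def truncate(w, w_mean, psi):
    """Truncation trick: pull an intermediate latent w towards the
    average latent w_mean. psi = 1 leaves w unchanged; psi = 0
    collapses every latent to w_mean (the 'average face')."""
    return w_mean + psi * (w - w_mean)

# Toy example with a 4-dimensional intermediate latent space.
w_mean = np.zeros(4)
w = np.array([2.0, -1.0, 0.5, 3.0])

assert np.allclose(truncate(w, w_mean, 1.0), w)       # psi = 1: no truncation
assert np.allclose(truncate(w, w_mean, 0.0), w_mean)  # psi = 0: average latent
```

With ψ between 0 and 1, every sampled latent is pulled towards the average, which trades diversity for image quality; this is consistent with the attribute shifts observed later for ψ = 0.5.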

Methodology
To compare the input images of a face generator with its output, two separate paths are taken. First, multiple models are used to predict human attributes. This allows for a high-level comparison between the different datasets. Are characteristics, such as age and gender, the same for the input and output datasets? Secondly, clustering using face recognition models could determine if the generated faces belong to an existing person from the input dataset. Do the output datasets consist of different persons, or are they embedded similarly compared to the input dataset? Combining these two approaches gives a clear view of the performance of a face generator.
The output of StyleGAN has already been examined to some extent. Karras et al. (2019) evaluated the generated images in order to eliminate artifacts. For different datasets, FID and PPL were compared between StyleGAN and StyleGAN2 (Karras et al., 2018). Furthermore, efforts have been made to understand and steer the latent space (Shen, Yang, Tang, & Zhou, 2020). By manipulating the latent space, one could change certain attributes of an image. For example, Shen et al. (2020) showed that it is possible to alter the age, gender, smile, and pose, and to add or remove eyeglasses. However, to our knowledge, our humanlike approach to comparing high-level concepts of facial datasets is new. While we only look at datasets from StyleGAN2, our two-pronged approach can also be used to evaluate other face generators. Within our novel approach, existing methods are used for predicting, embedding, and clustering. These methods are discussed in the upcoming sections. A general overview of our proposed approach can be found in Fig. 3.

Attribute prediction
A face has many characteristics. This leads to a wide variety of attribute predictions: pose (Hu, Chen, Zhou, & Zhang, 2004), skin color (Vezhnevets, Sazonov, & Andreeva, 2003), and even attractiveness (Xu et al., 2017), to name just a few. We select the following group of features to examine: age, gender, race, horizontal rotation, and vertical rotation. Note that these features cover general human concepts, but additional attribute models can always be added to or removed from this framework. To clarify what we mean by 'human concept', we argue that every identifiable aspect of a face (or body) that has been given a name can be called a human concept. For example, the eyebrow is identified by humans as a specific part of the face. Likewise, somebody can look young or old. We consider these examples as 'human concepts', because we abstract information from a group of pixels with respect to some convention. In our view, any model that predicts a general human attribute, trained with humanly labeled data, could give some insight into the difference between the input and output dataset. Adding more attribute models does give additional information, but to show how our approach works, we limit ourselves to the attribute prediction models that we introduce in the following sections. Note that adding or removing other attribute prediction models does not affect the results of an individual attribute model, as each model is assessed separately.

Predicting age, gender, and race
One of the main guiding papers for this research is the Diversity in Faces paper by Merler, Ratha, Feris, and Smith (2019). The aim of their research was to create an annotated dataset in order to improve the accuracy of face recognition and increase the facial diversity within commonly used datasets. Lack of diversity could harm the effectiveness of face recognition in practical implementations. It could even be discriminatory against minorities (Buolamwini & Gebru, 2018). Merler et al. (2019) use different models to annotate images from the YFCC-100M dataset (Thomee et al., 2016). These models predict a plethora of attributes for each face. The same kind of models, implemented in deepface (Serengil & Ozpinar, 2020), are used to predict the age, gender, and race of a person.
For each attribute, a similar procedure is followed. deepface uses a pre-trained VGG-Face network (Parkhi et al., 2015) as the starting point; only the last few layers are replaced and retrained to fit the objective. There are some important details about these models (see Serengil and Ozpinar (2020) for technicalities):
• Counterintuitively, age prediction is not made using regression. Rothe et al. (2018) claim that using classification instead of regression improved the performance and also stabilized the training process. The output layer consists of 101 variables, each corresponding to an age in years (0-100). The last layer has a softmax activation function, which ensures that the output of the last layer is a probability distribution over the different output variables. The age is finally predicted by taking the expectation over these output variables, see Rothe et al. (2018).
• Gender prediction is made using two output variables, corresponding to woman and man.
• For race prediction, a distinction is made between the following races: Asian, Indian, Black, White, Middle Eastern, and Latino Hispanic.
• deepface uses the haarcascade_frontalface_default detector from OpenCV (Bradski, 2000) to center, trim, and resize an image. However, it can occur that the facial detector does not recognize a face. When this happens, the image is simply omitted from the analysis of the corresponding attributes.
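The expectation-over-classes age prediction in the first bullet can be sketched as follows. This is our own illustration of the scheme described by Rothe et al. (2018), not deepface's actual code:

```python
import numpy as np

def expected_age(logits):
    """Predict age as the expectation over 101 age classes (0-100):
    apply a softmax to the raw outputs, then take the probability-
    weighted average of the class indices."""
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()
    return float(np.dot(np.arange(101), probs))

# A sharply peaked output at class 30 yields an expected age near 30.
logits = np.zeros(101)
logits[30] = 50.0
print(expected_age(logits))  # close to 30.0
```

Because the prediction is an expectation rather than an argmax, it can take non-integer values and varies smoothly with the network output, which is part of why this scheme trains more stably than plain regression.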
Serengil and Ozpinar (2020) self-reported on the performance of the models. The mean absolute error of the age model was 4.65 and the accuracy of the gender model was 97.44% with 96.29% precision and 95.05% recall. However, the models were not evaluated on the datasets that will be used in this research, because there exist no annotated labels of these features yet. It is therefore unclear how well these models perform for the datasets that are used. Nevertheless, we want to stress the fact that these models are only used to compare the characteristics of each dataset globally. Even if the models perform worse (due to domain shift), they can still be insightful for comparing the datasets.

Predicting horizontal and vertical rotation
To measure the horizontal and vertical rotation, dlib (King, 2009) is used. It can predict the position of 68 general landmarks on a face (see Fig. 4). These landmarks can be used to crop an image or measure attributes such as face and nose width/height. We use the landmarks to estimate the horizontal and vertical position of a head. It must be noted that these points remain a prediction. Especially when a head is rotated too much, these predictions lose accuracy.
There are many ways to estimate the horizontal rotation (yaw) and vertical rotation (pitch) of a head (Breitenstein, Kuettel, Weise, van Gool, & Pfister, 2008;Díaz Barros, Mirbach, Garcia, Varanasi, & Stricker, 2019;Hu et al., 2004). However, we are mainly interested in the differences between the datasets. Therefore, we are not much concerned about obtaining the best accuracy for each individual image. Thus, we use a simple concept to estimate the horizontal and vertical rotation, see Figs. 5 and 6.
Let x_i, y_i denote the horizontal and vertical position of landmark i, respectively. Observe that when a head rotates sideways, the horizontal distance between the tip of the nose and the corners of the eyes changes. To scale this measure properly, this distance is compared with the horizontal distance between both lateral eye corners. Thus, the fraction

|x_right lateral eye corner − x_nose tip| / |x_right lateral eye corner − x_left lateral eye corner|   (1)

is measured to approximate horizontal rotation (see Fig. 5). When a head is straight, the tip of the nose is assumed to be in the middle. But, when it rotates to a side, the fraction becomes smaller or larger, depending on the side it is rotating towards. Note that this fraction could be used to approximate the horizontal rotation in degrees using known facial rotations. However, varying nose shapes and facial asymmetries could influence the results.
Using a similar key insight, vertical rotation can be measured by comparing the vertical distance between the nose root and nose tip with the vertical distance between the nose root and the chin. Thus, the fraction

|y_nose root − y_nose tip| / |y_nose root − y_chin|   (2)

is measured to approximate the vertical rotation, see Fig. 6. Note that this measure is more subject to personal traits, as nose lengths can vary. Although this may raise issues for an individual image, we believe that this method is sufficient for comparing the datasets generally, as individual errors will not have a large impact on the general comparison.
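Both fractions can be sketched directly from the landmark coordinates. The landmark indices below follow the common 68-point dlib convention (an assumption on our part; the paper does not list the exact indices it uses):

```python
# Landmarks as a mapping from index to (x, y). Assumed dlib 68-point
# indices: 36/45 = left/right lateral eye corners, 27 = nose root,
# 30 = nose tip, 8 = chin.
def horizontal_rotation(lm):
    # Eq. (1): nose tip position relative to the eye-corner baseline.
    return abs(lm[45][0] - lm[30][0]) / abs(lm[45][0] - lm[36][0])

def vertical_rotation(lm):
    # Eq. (2): nose length relative to the nose-root-to-chin distance.
    return abs(lm[27][1] - lm[30][1]) / abs(lm[27][1] - lm[8][1])

# A perfectly frontal synthetic face: nose tip centered between the eyes.
lm = {36: (0, 0), 45: (100, 0), 30: (50, 30), 27: (50, 0), 8: (50, 100)}
print(horizontal_rotation(lm))  # 0.5 for a straight head
print(vertical_rotation(lm))    # 0.3 for this synthetic geometry
```

In practice the landmark dictionary would come from dlib's shape predictor; the functions themselves only encode the two ratios of Eqs. (1) and (2).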

Facial embedding with face recognition models
Are new individuals created or are the generated images too similar to individuals from the input dataset? To compare the images from a human perspective, some kind of facial embedding is necessary. It is imperative that the dimensionality of each image is reduced. Every image consists of 1024 × 1024 pixels and each pixel consists of three color values (RGB). Given the size of these datasets, it is unfeasible to compare the pixels for each pair of images. Furthermore, a human does not compare two images pixel by pixel. Instead, one matches facial features such as eyes, nose, hair, and mouth to evaluate if these two images belong to the same person. This is why we decide to use face recognition models, where a face is first embedded to a point in a latent space, such that distances can be measured between faces. If two points are close, they are assumed to be similar. In this way, we can determine if new individuals are created. The four facial embedding methods used for facial recognition are outlined below.
FaceNet. The model of Schroff, Kalenichenko, and Philbin (2015) is well suited for our objective. It is a deep convolutional network that converts an image (160 × 160 pixels) to a 128-dimensional vector that lies on the 128-dimensional hypersphere. To find an appropriate embedding, FaceNet uses a triplet loss function.
OpenFace. Amos, Ludwiczuk, and Satyanarayanan (2016) follows the same concept as FaceNet (Schroff et al., 2015). It is, however, open-source and focuses on real-time face recognition. It converts an image (96 × 96 pixels) to a 128-dimensional vector that lies on the 128-dimensional hypersphere.
DeepFace. Taigman, Yang, Ranzato, and Wolf (2014) uses 3D face modeling and a large deep neural network to recognize faces. It converts an image (152 × 152 pixels) to a 4096-dimensional vector, which is then used to identify individuals using a classification layer. Taigman et al. (2014) call this vector the ''raw face representation feature vector''.
VGG-Face. Parkhi et al. (2015) uses the well-known VGG-16 architecture (Simonyan & Zisserman, 2014) trained specifically for facial recognition. It converts an image (224 × 224 pixels) to a 2622-dimensional vector. This model also uses the triplet loss function from FaceNet (Schroff et al., 2015) to train for facial recognition.

Dimensionality reduction
The output vectors of DeepFace (Taigman et al., 2014) and VGG-Face (Parkhi et al., 2015) are too large to properly cluster on. Therefore, the dimensionality is reduced with singular value decomposition (SVD) from a 4096- and 2622-dimensional vector, respectively, to a 128-dimensional vector. This enforces the same output dimension for each embedding method. This dimensionality reduction could weaken the accuracy of these models, as some information is lost. However, if there is still a clear distinction between the datasets in this lower dimension, there must be a similar or larger difference in the higher dimension. We call these models Reduced DeepFace and Reduced VGG-Face from now on.
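A minimal sketch of this reduction using a plain truncated SVD; the function name and toy sizes are ours, and the actual implementation details (e.g., whether the embeddings are centered first) may differ:

```python
import numpy as np

def reduce_embeddings(X, k=128):
    """Project an n x d matrix of face embeddings onto its top-k right
    singular vectors (truncated SVD), mirroring the paper's reduction
    of 4096- and 2622-dimensional vectors to 128 dimensions."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

# Toy stand-in: 10 embeddings of dimension 40, reduced to 5 dimensions.
X = np.random.default_rng(0).normal(size=(10, 40))
Z = reduce_embeddings(X, k=5)
print(Z.shape)  # (10, 5)
```

Because the projection keeps the directions of largest variance, pairwise distances between embeddings are approximately preserved, which is what the subsequent clustering relies on.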

Clustering
Once the faces are embedded using face recognition models, images can be compared. There are many options; however, we will show why clustering is the most natural approach in our view. In the end, we want to answer the question whether actual new persons are generated. The face recognition methods enable us to measure the distance between each pair of images. If the distance between images A and B is below a defined threshold, the images are considered to belong to the same person. However, if the distance between images B and C is also below the threshold, images A, B, and C all belong to the same person and form a cluster. Thus, a clustering approach naturally arises from this logic. Each cluster of images represents a single person, according to the face recognition methods.
To investigate if the output dataset contains the same identities as the input dataset, two combinations are made:
• FFHQ ∪ Gen(ψ = 1): the input dataset combined with the generated images without truncation;
• FFHQ ∪ Gen(ψ = 0.5): the input dataset combined with the generated images with truncation.
The clustering is done on the embeddings of these two combinations.
Due to the size of the datasets (170,000 images in total), a clustering method with few parameters is preferred. Furthermore, there is little domain knowledge of proper parameter values, making most clustering methods too computationally expensive, as a range of values for the parameters needs to be evaluated. This leads to the decision to use HDBSCAN (Campello, Moulavi, Zimek, & Sander, 2015). The idea behind this algorithm is that instances a and b are neighbors if the distance between them is less than or equal to ε, and two instances a and b are in the same cluster if there exists a sequence of instances from a to b such that each successive instance is a neighbor of the previous. HDBSCAN allows ε to be altered post-completion. In this research, we use the implementation of McInnes, Healy, and Astels (2017) with the Euclidean distance function. Although HDBSCAN has computational complexity O(n²) (Campello et al., 2015) with n the number of samples, McInnes, Healy, and Astels (2016) show that HDBSCAN performs reasonably fast for large datasets. Furthermore, it returns a hierarchical clustering, which is useful to determine different statistics post-completion. If instead the very similar DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) is used, some information about the parameter ε is necessary, as ε determines the neighborhood of each point. The relevant range for ε varies greatly for different embeddings. Without large computational costs, it is possible to determine the results for different values of ε using the hierarchical clustering, after running HDBSCAN.
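The neighbor-chaining idea can be sketched in a few lines: build the ε-neighbor graph and take its connected components. This is our own simplified illustration; real HDBSCAN additionally builds a density-based hierarchy over all values of ε:

```python
from itertools import combinations

def chain_clusters(points, eps):
    """Minimal sketch of the neighbor-chaining idea behind (H)DBSCAN:
    two points are neighbors if their distance is <= eps, and a cluster
    is a connected component of the neighbor graph. Singletons are
    treated as noise (minimum cluster size 2, as in the paper)."""
    parent = list(range(len(points)))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)  # merge the two components

    components = {}
    for i in range(len(points)):
        components.setdefault(find(i), []).append(i)
    return [c for c in components.values() if len(c) >= 2]

pts = [(0, 0), (0.5, 0), (10, 10), (10.2, 10), (50, 50)]
print(chain_clusters(pts, 1.0))  # two pairs; the last point is noise
```

Running this for many values of eps would recover exactly the kind of ε-sweep that HDBSCAN's hierarchy provides in a single run.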
HDBSCAN has a single primary input parameter, the minimum cluster size (Campello et al., 2015). This parameter determines if a group of samples is large enough to be considered an actual cluster. If two images are embedded closely together, they should be able to form a cluster, as they can belong to the same person. Thus, a minimum cluster size of 2 is a natural choice, as it allows all cluster sizes except a cluster containing a single image.

Cluster evaluation
The goal of clustering is to investigate if the input and output datasets consist of different individuals. Therefore, purity (Manning, Raghavan, & Schütze, 2008) is used to measure the intertwinedness of the clustering, as this metric evaluates if subclusters consist of purely real or generated images. Purity is measured by counting the samples of the most frequent class in each cluster and dividing by the total number of samples. More formally, let a clustering C of N samples consist of subclusters C_k for k ∈ {1, …, K}, for some K ∈ ℕ>0. Each sample carries the label of the dataset it comes from. For each subcluster C_k, let m_k denote the number of samples carrying the dataset label that occurs most frequently in C_k; then

Purity(C) = (1/N) Σ_{k=1}^{K} m_k.

If Purity(C) = 1, every subcluster only contains samples of one class. If there are only two classes, a lower bound of purity is Purity(C) = 0.5, as in the worst case every subcluster is split 50/50 between the classes.
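The purity computation itself is only a few lines; this is our own illustrative sketch:

```python
from collections import Counter

def purity(clusters):
    """clusters: list of subclusters, each a list of dataset labels
    (e.g. 'real' / 'generated'). Purity = sum of the majority-label
    count per subcluster, divided by the total number of samples."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# One mixed subcluster and one pure one: (2 + 2) / 5 = 0.8.
print(purity([['real', 'real', 'generated'], ['generated', 'generated']]))
```

A clustering in which every subcluster contains only real or only generated images scores 1.0; heavily intertwined subclusters pull the score towards 0.5 in the two-class case.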
Baseline purity. Note that the upper and lower bound, previously given, cannot always be achieved. This is dependent on the distribution of the labels and the structure of a clustering. For example, if there is only one cluster and the labels are divided 80/20, the purity score will be 0.8. Therefore, a better baseline is necessary to evaluate how good/bad a purity score of a clustering is.
Assume that the input and output datasets are sampled from the same distribution. Then there is no way of telling which image is drawn from which dataset. For each parameter combination, HDBSCAN returns a clustering with a certain structure: a number of subclusters, each with a corresponding size. If there were no difference between the two datasets, this would correspond to randomly assigning each sample to a position in the clustering. As we have seen before, the structure of the clustering is important for the purity score. Therefore, we approximate the expected purity score of a randomly assigned clustering with the same structure as provided by HDBSCAN. Under the hypothesis that there is no difference between the datasets, we get an average purity score that is ultimately used to compare the results. If the results are close to this baseline, it means that the datasets are very similar. On the other hand, if there is a clear distinction between the baseline and the results, it means that the datasets are not alike.
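This baseline can be approximated with a small Monte Carlo simulation, in the spirit of the paper's 100-simulation estimate; the code below is our own sketch with hypothetical names:

```python
import random
from collections import Counter

def baseline_purity(cluster_sizes, labels, trials=100, seed=0):
    """Approximate the expected purity if the dataset labels were
    assigned at random to a clustering with the given subcluster sizes,
    i.e. under the null hypothesis that the two datasets are
    indistinguishable."""
    rng = random.Random(seed)
    total = sum(cluster_sizes)
    assert total == len(labels)
    scores = []
    for _ in range(trials):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        score, start = 0, 0
        for size in cluster_sizes:
            chunk = shuffled[start:start + size]
            score += Counter(chunk).most_common(1)[0][1]
            start += size
        scores.append(score / total)
    return sum(scores) / trials

# One cluster containing everything, labels split 80/20:
# the baseline purity is exactly 0.8, as in the paper's example.
print(baseline_purity([10], ['real'] * 8 + ['gen'] * 2))
```

Comparing the observed purity of a clustering against this structure-matched baseline is what makes the gap in Fig. 19 interpretable: the structure of the clustering, not just the label split, determines the achievable purity.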

Analysis
The images from the datasets are analyzed in two ways. First, models are used to predict certain attributes of each face (e.g., gender and age). This will determine the distribution of these features, which can be used to compare the datasets globally. Second, multiple embedding methods are used in combination with a clustering method. By looking inside each subcluster and evaluating the purity score (see Section 3.3.1), a comparison between the datasets can be made. The results of both approaches are discussed below.

Results attributes
For each image in every dataset, models were used to predict the following attributes: age, gender, race, horizontal rotation, and vertical rotation. The results are grouped together per dataset. This gives a global overview of these attributes for each dataset. In particular, we are interested in the similarity of the distributions. If these distributions are different, it would suggest that the underlying datasets are in fact different.

Results age
Without truncation (ψ = 1), the age distribution of the generated images is almost identical to that of the input dataset (FFHQ), see Fig. 7. Even the small peak around 40 years is similar for these two datasets. If truncation is added (ψ = 0.5), it can be observed that the distribution shifts towards the younger age groups.

Results gender
The model returns a probability value for each of the two classes (woman and man); the dominant gender is the class with the highest probability. Note that without truncation (ψ = 1), the distribution is almost the same as the input dataset (FFHQ), see Fig. 8. With truncation (ψ = 0.5), relatively more females are generated compared to the input.

Results race
Table 1 shows the average probability mass for each race. Table 2 shows the distribution of the dominant race, i.e., the class that obtained the maximum probability given by the model. Without truncation (ψ = 1), the distribution is very similar to the input dataset (FFHQ). However, if truncation is added (ψ = 0.5), white is predicted more often.

Results horizontal rotation
As explained by Fig. 5, the horizontal rotation is measured using the predicted landmarks of dlib (see Eq. (1)). In Fig. 9, it can be observed that without truncation (ψ = 1), the distribution is nearly identical. When truncation is added (ψ = 0.5), the distribution narrows around 0.5, which means that more straight faces are generated or the faces are more symmetric.

Results vertical rotation
As explained by Fig. 6, the vertical rotation is measured using the predicted landmarks of dlib (see Eq. (2)). Again, the distribution without truncation (ψ = 1) is identical to the distribution of the input dataset FFHQ (see Fig. 10). If truncation is added (ψ = 0.5), the distribution shifts to the right. There are two possible explanations. First, it could mean that the generated images are rotated more downwards. Second, it is possible that the generated images have a longer nose. In Fig. 11, the distributions of the nose length can be found. There is a significant shift when truncation is added (ψ = 0.5). Thus, it can be concluded that the nose lengths are on average larger for ψ = 0.5.

Failed detections deepface
The models that predict the age, gender, and race were trained using a specific face detector. When the detector finds a face, it automatically trims and resizes the image. However, this detector (Bradski, 2000) sometimes fails to detect a face; in this case, the image is simply omitted from the attribute analysis. Fig. 12 shows how often the detector is successful. Note that with truncation (ψ = 0.5) this failure probability decreases drastically. The results for FFHQ and no truncation (ψ = 1) are similar and relatively high. A failure rate of around 10 percent is rather substantial.
In Figs. 13, 14, and 15, the first images of each dataset are shown where the deepface detector fails. Only for ψ = 0.5, it is not very clear why these images fail. However, we suspect that the following factors contribute to the general failure of the detector:
• eyewear;
• headwear;
• rotated heads;
• multiple persons;
• young age;
• obstructed eyes;
• structural errors (deformation, glitches, missing parts, etc.).
Note that these are only visual observations and should be investigated further.

Failed detections dlib
Dlib uses another face detector. The failure rate of this detector is also measured. As can be seen in Fig. 17, the failure probability is much lower compared to Fig. 12. It is notable that if truncation is added (ψ = 0.5), the failure probability is even zero. However, the differences between the probabilities are so small that it is hard to draw any meaningful conclusions for the different datasets. The failure rate is very small for each dataset.
In Figs. 16 and 18, the first images of each dataset are shown where the dlib detector fails. Note that for ψ = 0.5, there are no failures. The same elements we observed in the failures of the deepface detector are prevalent in the dlib detector failures. However, the dlib detector seems to be more robust compared to the deepface detector.

Results clustering
In Section 3.3, it is discussed why clustering is a natural approach to determine if the newly generated images belong to an existing person. The clustering results can be seen in Fig. 19. Note that different parameter values of ε are relevant for each embedding method. This makes HDBSCAN (Campello et al., 2015) very useful, as the value of ε can be changed post-computation. Given the parameters, HDBSCAN returns a clustering. Two measures are of interest. First, the number of subclusters within each clustering, which indicates how many unique persons exist in the data according to the embedding methods. Second, the purity of a clustering (see Section 3.3.1). We cluster on the combination of the input dataset (FFHQ) and the output dataset with either no truncation (ψ = 1) or with truncation (ψ = 0.5).
In this way, the output dataset can be compared with the input dataset.
For each facial embedding, the maximum number of subclusters (Table 3) is determined with a minimum cluster size of 2 (see Section 3.3).

Purity results
The purity results for a minimum cluster size of 2 are shown in Fig. 19. The relevant range for ε is chosen based on the number of clusters. Two main conclusions can be drawn from these graphs. First of all, 7 out of 8 clusterings show a clear distinction between the baseline and the actual purity score. Only OpenFace without truncation (ψ = 1) shows no obvious separation. Therefore, it can be concluded that there is a definite difference between the input and the output datasets. Thus, the generated images belong to different persons compared to the input dataset, according to the facial recognition methods. Second, the gap between the baseline and the actual purity score is much larger with truncation (ψ = 0.5) than without truncation (ψ = 1.0). Thus, truncation makes it more likely that a cluster is predominantly real or generated.
Fig. 19. Clustering purity: Using HDBSCAN with a minimum cluster size of 2 determines which clustering is made. The red line (1) denotes the number of subclusters of each clustering. The blue line (2) is the purity score (see Section 3.3.1). The black line (3) shows the approximated purity score under the hypothesis that the two datasets are similarly distributed, using 100 simulations (see Section 3.3.1). The left side is the combination FFHQ ∪ Gen(ψ = 1), whereas the right side is the combination FFHQ ∪ Gen(ψ = 0.5). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Conclusion and further research
We presented a general two-pronged approach that compares, in a humanlike way, the input and output datasets of a given face generator.
Here, we applied this approach to the state-of-the-art generator StyleGAN2 (Karras et al., 2019). We started by comparing the input dataset (FFHQ) and the output datasets based on their attributes. Multiple models were used to predict attributes (age, gender, race, and horizontal and vertical rotation) for each image. The results were very clear: the attribute distributions were the same for the input dataset and the images generated without truncation (ψ = 1). However, when truncation is added (ψ = 0.5), the attribute distributions shift significantly towards the attributes corresponding with the latent variable used in truncation.

Although many evaluation measures for GANs exist (Borji, 2018), the three most commonly used are the Fréchet Inception Distance (FID), the Inception Score (IS), and Precision and Recall (P&R) (Borji, 2021; Shmelkov, Schmid, & Alahari, 2018). FID measures the difference between the input and output images by embedding them into the feature space of an Inception Net (trained on ImageNet) (Borji, 2018). IS also uses the Inception Net, measuring the diversity of the generated images compared to the mean. P&R quantifies how similar the generated images are to the input dataset and how well the entire training dataset is covered. Additionally, StyleGAN2 (Karras et al., 2019) evaluates the perceptual path length (PPL), which measures the difference between the VGG-16 embeddings (Simonyan & Zisserman, 2014) of two consecutive images, where a path in the latent space is subdivided into linear segments (Karras et al., 2018). These measures have been used to evaluate the performance of StyleGAN2 (Karras et al., 2019). Thus, the observation that StyleGAN2 is able to learn the input dataset is not new. It is known that GANs can learn the input distribution, although training sometimes appears successful whilst the target distribution is actually far from the trained distribution (Arora, Ge, Liang, Ma, & Zhang, 2017).
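For two Gaussian fits, FID is the Fréchet distance d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). In the one-dimensional case this reduces to (μ₁ − μ₂)² + (σ₁ − σ₂)², which the following sketch computes on hypothetical scalar features; real FID uses Inception-Net feature vectors, which this toy version deliberately omits.

```python
def gaussian_stats(xs):
    """Mean and standard deviation of a 1-D feature sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var ** 0.5

def fid_1d(real, gen):
    """Scalar case of the Frechet distance between Gaussian fits:
    (mu1 - mu2)^2 + (sigma1 - sigma2)^2."""
    m1, s1 = gaussian_stats(real)
    m2, s2 = gaussian_stats(gen)
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

real_feats = [0.0, 1.0, 2.0, 3.0]    # hypothetical embedded features
gen_feats  = [0.5, 1.5, 2.5, 3.5]    # same spread, mean shifted by 0.5
fid = fid_1d(real_feats, gen_feats)  # -> 0.25: a pure mean shift
```

The example makes the measure's character visible: FID reacts to shifts in the feature statistics, but says nothing about whether higher-level human concepts are preserved — the gap this article addresses.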
However, it has not previously been shown that higher-level human concepts are also preserved. It could be that these measures somehow indirectly assess such human concepts, although this has not yet been shown. This article gives a direct approach and demonstrates that such human concepts are indeed preserved, which further strengthens the work of Karras et al. (2019).
In addition, four facial embedding models (FaceNet, OpenFace, Reduced DeepFace, and Reduced VGG-Face) were used to embed the images. This allowed us to cluster each combination of input and output dataset. By determining the purity score, which measures how intertwined each subcluster is, we were able to show that the generated images are not grouped together with the input dataset. This means that StyleGAN2 is able to generate new persons that do not exist in the input dataset, according to the facial embeddings. Recently, Khodadadeh et al. (2022) pursued a similar idea of combining a face recognition method with StyleGAN2. They used FaceNet in a loss function to generate faces with StyleGAN2 that belong to the same identity. Furthermore, they used 35 attribute methods to steer the latent space in order to generate faces with modified attributes, which differs from our research. The insight that StyleGAN2 is capable of generating new identities is novel and one of the contributions of our research.
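The identity question behind the clustering can be stated as a simple decision rule: a generated face counts as a new identity if its embedding is farther than the recognizer's match threshold from every real embedding. A minimal sketch, with hypothetical 2-D embeddings and an illustrative threshold (not FaceNet's actual operating point):

```python
def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def is_new_identity(gen_emb, real_embs, threshold):
    """A generated embedding is 'new' if no real embedding
    lies within the recognizer's match threshold."""
    return min(l2(gen_emb, r) for r in real_embs) > threshold

real_embs = [(0.0, 0.0), (1.0, 0.0)]  # hypothetical real-face embeddings
clone     = (0.05, 0.0)               # lies next to a real identity
novel     = (3.0, 3.0)                # far from all real identities
# threshold value is illustrative only
new1 = is_new_identity(clone, real_embs, threshold=0.5)  # False: a "clone"
new2 = is_new_identity(novel, real_embs, threshold=0.5)  # True: a new person
```

The clustering approach of Section 3.3 generalizes this pairwise check: instead of a fixed threshold, cluster membership over a range of ε decides whether generated and real faces are grouped together.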
Summarizing, by using a two-pronged humanlike approach, consisting of predicting human attributes (Section 3.1) and clustering using face recognition models (Sections 3.2 and 3.3), the following conclusions can be drawn:
• The images generated by StyleGAN2 (without truncation) have the same attribute distributions as the input dataset, according to the prediction models.
• The generated faces belong with high probability to different persons than those in the input dataset, according to the clustering using face recognition models.
• Adding truncation to the latent variable space changes the attribute distributions towards the attributes corresponding with the latent variable used in truncation, according to the prediction models.
Generalizing, our approach can also be used for other face generators; it is not specifically tailored to StyleGAN2. Furthermore, our approach is modular in the sense that different attribute prediction and facial embedding methods can be added or removed. It should therefore be used in conjunction with other evaluation measures, such as FID and PPL, to give a broader perspective on the performance of a face generator, as it addresses different questions and concerns than previous measures. If our approach shows, for example, that the generated images belong to identities in the input dataset, the generator could be adapted when this effect is undesirable for privacy reasons.

Future work
Finally, we address a number of topics for future research. Section 4 provides multiple insights that should be explored further. First, note that the maximum number of subclusters is relatively small (see Table 3): at most 5592 subclusters are formed for a dataset consisting of 170,000 images. Many images are considered anomalies, meaning that no other face closely resembles them. It could be that either the dataset is too small, due to the wide variety of possible faces, or the embedding methods are too specific.
Second, truncation ensures that the latent variables lie closer to the expected intermediate latent variable w̄ (see Section 2.1). In the results, the attributes were very similar for the input dataset and the output dataset without truncation (ψ = 1). However, when truncation was added (ψ = 0.5), there was a shift in the attribute distributions. Our hypothesis is that this shift stems from the attribute values of the images generated with w̄ as intermediate latent variable (see Fig. 2). Taking the average of the predicted attribute values for the first 1000 images, generated with w̄, leads to Table 4.
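In StyleGAN's truncation trick, a sampled intermediate latent w is pulled toward the mean latent w̄ via w' = w̄ + ψ(w − w̄): ψ = 1 leaves w unchanged, while ψ = 0.5 halves its deviation from w̄, which is why the attributes drift toward those of the face generated from w̄. A minimal sketch on toy vectors (the real w lives in StyleGAN2's high-dimensional W space):

```python
def truncate(w, w_bar, psi):
    """StyleGAN truncation trick: w' = w_bar + psi * (w - w_bar),
    applied per component of the intermediate latent vector."""
    return [wb + psi * (wi - wb) for wi, wb in zip(w, w_bar)]

w_bar = [0.0, 0.0]   # expected intermediate latent (mean of W)
w     = [2.0, -4.0]  # a sampled intermediate latent
assert truncate(w, w_bar, 1.0) == w            # psi = 1: unchanged
assert truncate(w, w_bar, 0.5) == [1.0, -2.0]  # psi = 0.5: halfway to w_bar

# The hypothesis test proposed below amounts to swapping w_bar for a
# different anchor; with psi = 0 the output collapses onto the anchor.
anchor = [5.0, 5.0]
assert truncate(w, anchor, 0.0) == anchor
```

This makes the mechanism behind the observed attribute shift explicit: lowering ψ interpolates every latent, and hence every generated face, toward the single face produced by the anchor latent.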
When we examine the differences in the attribute distributions between truncation (ψ = 0.5) and no truncation (ψ = 1), and compare these with the difference between Tables 4 and 5, we see that they coincide. Thus, we hypothesize that adding truncation pulls the attribute distributions towards the attribute values belonging to the faces generated with the expected intermediate latent variable w̄. Karras et al. (2019) regularize the generator to smoothen the perceptual path length of generated images under small perturbations in the latent space. This could be a reason why the human attributes are also similar under small perturbations. The hypothesis can be tested by replacing w̄ in the truncation procedure with a different intermediate latent variable and investigating the attribute distributions of the newly generated images. If the hypothesis holds, this method can also be used to generate images with desired attributes. Future research is needed to explore how w̄ and the truncation parameter ψ influence the attribute distributions.

The attribute prediction methods that were used are all trained on other datasets, and it is unclear how well their performance transfers to the data used in this research. Nevertheless, they still provide the insight that the input and output distributions are similar, and it remains interesting to evaluate how well these models transfer their learned knowledge to this dataset. Additionally, the goal of our approach was to evaluate the generator in a more humanlike way. We decided to use methods trained on humanly labeled data, as it was unfeasible for us to label this dataset ourselves. However, it remains uncertain how 'humanlike' these methods are: are they actually predicting correct attribute labels for our datasets? Although this question is beyond our scope, it is interesting to evaluate whether these models can replace human experts.

Lastly, the facial recognition methods are trained on datasets of real images. The results showed that the generated images are embedded differently than the input images. If the facial embedding methods were also trained on generated faces, a better comparison could possibly be made.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Availability of data and material
All data used in this research is cited in the appropriate sections.