ChildNet: Structural Kinship Face Synthesis Model With Appearance Control Mechanisms

Kinship face synthesis is an increasingly popular topic within the computer vision community, particularly the task of predicting a child's appearance from parental images. Previous work has been limited by model capacity and inadequate training data, which comprised low-resolution and tightly cropped images, leading to lower synthesis quality. In this paper, we propose ChildNet, a method for kinship face synthesis that leverages the facial image generation capabilities of a state-of-the-art Generative Adversarial Network (GAN) and resolves the aforementioned problems. ChildNet is designed within the GAN latent space and is able to predict a child appearance that bears a high resemblance to real parents' children. To ensure fine-grained control, we propose an age and gender manipulation module that allows precise manipulation of the child synthesis result. ChildNet is capable of generating multiple child images per parent pair input, while providing a way to control the image generation variability. Additionally, we introduce a mechanism to control the dominant parent image. Finally, to facilitate the task of kinship face synthesis, we introduce a new kinship dataset, called Next of Kin. This dataset contains 3690 high-resolution face images with a diverse range of ethnicities and ages. We evaluate ChildNet in comprehensive experiments against three competing kinship face synthesis models, using two kinship datasets. The experiments demonstrate the superior performance of ChildNet in terms of identity similarity, while exhibiting high perceptual image quality. The source code for the model is publicly available at: https://github.com/MartinPernus/ChildNet.


I. INTRODUCTION
Kinship face synthesis is an emerging area of the computer vision field that involves the creation of facial images that resemble real kin-related persons, based on a single or multiple input face images. Techniques capable of automatic kinship face synthesis have vast potential in various fields such as entertainment, historical research, and forensics involving the search for long-lost family members. As a result, research efforts have recently begun to focus on kinship face synthesis, which has yielded convincing results [1], [2], [3].
The methods for kinship face synthesis have recently made great progress, largely thanks to the success of image generation models that use Generative Adversarial Networks (GANs) [4]. These methods leverage the low-dimensional GAN latent space by first converting the input images to GAN latent codes before processing the codes with specialized latent operations. To obtain a latent code from an image, a GAN-inversion approach is typically used, which extracts a latent code that approximately corresponds to the input image when decoded with the GAN generator. By utilizing GAN-inversion on parental images, parental codes are obtained, which are then processed using specialized latent code operations. The operations are usually inspired by genetic mechanisms, such as heritable properties of facial attributes or mixing of genes. The result of the operations is a new (child) latent code that, when processed by a GAN decoder, corresponds to a facial image that can be considered a real descendant of the parents.

FIGURE 1. ChildNet, a model for kinship face synthesis. ChildNet is capable of high-resolution kinship face synthesis, where the synthesized images are visually compelling and bear a strong resemblance to the real child images. Additionally, we introduce a new kinship dataset called Next of Kin (NoK), which is suitable for kinship face synthesis model training due to its high image quality. Example images of the NoK dataset are shown in the first two rows.
The key challenge in enhancing the performance of existing models is to increase the model capacity in a manner that improves the resemblance of the synthesized result to the appearance of a real child. Furthermore, models lack adequate fine-grained control of the synthesis process, such as control over the variety of synthesized facial images and control over the dominant parent image.
In this paper we present a novel model named ChildNet that further improves the performance of kinship face synthesis and is capable of generating high-resolution, visually convincing results, as shown in Figure 1. ChildNet overcomes the limitations of model capacity by training a neural network that operates within the extended latent space of StyleGAN2 [5], a state-of-the-art model for facial synthesis. The ChildNet architecture is based on two key mechanisms: an attention mechanism, which selectively attends to parental latent codes when predicting the child's code, and a mutation mechanism, which further modifies the code. The model is learned end-to-end, so no hard-coded external knowledge (e.g., hereditary characteristics) needs to be specified in the model implementation. Furthermore, we develop a separate age and gender module in the extended latent space that allows us to further refine the synthesized child image given additional information about the child's desired appearance. (To maintain consistency with the language used in previous studies, we use a binary framework for gender in this paper. We acknowledge that this may be a limitation and that future research should take a more nuanced approach to the gender concept in kinship face synthesis.) ChildNet is able to synthesize multiple child appearances per fixed input, where the synthesis variability can be explicitly controlled. Additionally, a mechanism for the control of the dominant parental image is presented, which can enforce a more father- or mother-like appearance. Based on experiments with two kinship datasets, ChildNet achieves higher facial similarity with real child images than competing methods, while exhibiting a high degree of perceptual image quality.
A critical problem in kinship face synthesis research is the lack of appropriate datasets for training a high-quality synthesis model. Specifically, kinship face synthesis models are trained on datasets whose primary goal is kinship recognition, and which typically contain tightly cropped facial images of highly variable quality. This is in contrast to modern datasets used for training facial image generation models [6], [7], which typically contain high-resolution, high-quality face images with consistent positioning. Therefore, we present a new kinship dataset, named Next of Kin (NoK), that includes 3690 high-quality, high-resolution 512 × 512 facial images of 553 subjects that correspond to 161 families, i.e., parents and their children.
In summary, we make the following main contributions in this paper:
• We propose ChildNet, a state-of-the-art model for kinship face synthesis. ChildNet is based on a novel latent-space manipulation scheme designed around the StyleGAN2 model and leads to convincing child synthesis results with photorealistic, high-resolution and artifact-free synthesized images.
• We introduce several synthesis control mechanisms that involve age and gender manipulation, image variability control, and determination of the dominant parental image. These mechanisms facilitate conditional child synthesis and attribute control for the synthesized images.
• We present the Next of Kin (NoK) dataset, a dataset of high-resolution face images with various metadata, e.g., kinship relation, age, ethnicity, gender and emotion.
• We conduct rigorous quantitative and qualitative evaluations to demonstrate that ChildNet outperforms competing models in accurately synthesizing child images.

II. RELATED WORK
In this section we present prior work closely related to our paper. We first discuss GANs with a focus on their use in image processing; the GAN model represents the core synthesis component in our paper due to its ability to synthesize high-resolution, high-quality facial images. Next, we focus on existing kinship datasets, discussing their characteristics and limitations. This highlights the necessity for a dataset tailored specifically for synthesis purposes. Finally, we examine image-based kinship research, which has mostly focused on the task of kinship recognition, with a recent shift towards the task of kinship synthesis.

A. GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE GENERATION AND EDITING
1) GENERATIVE ADVERSARIAL NETWORKS (GANs)
GANs have become one of the most popular image generation models in recent years. Since the model's conception [4], GANs have been improved in terms of both model design and training procedures. DCGAN [8] defined a convolutional GAN architecture and noted several useful design principles. Karras et al. [6] proposed ProGAN, which used a progressive learning strategy that enabled the generation of convincing megapixel-resolution images. The model was further improved with the introduction of StyleGAN [7], whose design was inspired by the style transfer architecture. The next iteration, StyleGAN2 [5], modified the design to remove visual artifacts in the synthesized images, improving the overall image synthesis performance. StyleGAN3 [9] defined a continuous interpretation of network signals to prevent the synthesized image's dependence on absolute pixel coordinates, enabling smoother latent-based image interpolations. Considerable progress has also been made in the field of GAN training strategies using various training loss functions and regularization techniques [10], [11], [12], [13], [14]. Additional information on GANs can be found in one of the recent surveys on this topic [15], [16].

2) GAN-INVERSION
GAN-inversion methods are concerned with inverting an image into the latent space of a pre-trained GAN model before performing image editing operations. Abdal et al. [17] first introduced an iterative latent code inversion technique within the extended latent space of StyleGAN, demonstrating its significant image reconstruction capabilities. This work was further extended in [18], which enhanced the inversion algorithm. The pSp model [19] defined an image encoder that directly predicts GAN latent codes without the need for iterative latent code optimization. The E4e model [20] proposed an encoder that focuses not only on image reconstruction but also on preserving the original latent code distribution, resulting in improved image editing. ReStyle [21] introduced an iterative encoder-based refinement of the latent code. HyperStyle [22] presented a hyper-network that modifies the StyleGAN weights based on the input image. Some works proposed other types of latent spaces, e.g., StyleSpace [23], which achieved a more disentangled representation of the latent code. For an overview of GAN-inversion methods, readers are referred to [24].

3) GAN-INVERSION BASED IMAGE EDITING
Recently, a considerable number of methods have emerged that perform image manipulation based on a pre-trained GAN. Abdal et al. [17] performed various image manipulation techniques, e.g., face morphing, style transfer and expression transfer. Their follow-up work [18] demonstrated additional manipulations, e.g., image crossover, local image editing and inpainting. Shen et al. [25] proposed fitting a linear support vector machine to the StyleGAN latent space based on face attribute labels of the corresponding images.
The computed directions were then used to edit images in a manner that displayed the desired facial attribute. Some methods focused on the unsupervised discovery of semantic directions in the latent space [26], [27]. MaskFaceGAN [28] introduced several constraints during the latent code optimization for face editing that achieved disentangled face editing. Richardson et al. [19] used an encoder model to synthesize frontalised face images, as well as perform other tasks such as conditional image synthesis, inpainting, and super-resolution. StyleCLIP [29] demonstrated latent-based image edits conditioned on text input. To maintain high perceptual image quality while achieving the fast computation times offered by modern image processing methods [30], we use the E4e encoder [20]. Since ChildNet defines various mathematical operations that manipulate the latent codes, it is important to stay within the well-defined latent code space to preserve the GAN image quality. E4e is designed to predict latent codes that closely follow the distribution of the original latent codes, making it particularly suitable for our model. To further tailor our approach to the task of kinship face synthesis, we also propose custom latent code operations that allow precise image manipulation.

B. KINSHIP DATASETS
To train kinship models, it is crucial to have an image dataset annotated with kinship relationships. Over the years, multiple datasets with different characteristics in terms of image quality, cropping, number of images, kin relationships, and metadata have been introduced.
Lu et al. [31] introduced the KinFaceW-I and KinFaceW-II datasets. The face images were captured under uncontrolled conditions with no constraints on pose, occlusion, or lighting. Robinson et al. [32] presented the Families In the Wild (FIW) dataset. It consists of 1,000 families with over 10,000 family photographs and metadata on 11 types of kinship relationships. The FIW dataset was further expanded [33] to include additional data on underrepresented families, increasing the number of photos to 30,275. Qin et al. [34] proposed the TSKinFace database, which consists of families with 2589 individuals. Fang et al. [35] presented the CornellKin dataset of 150 pairs of public figures and celebrities, along with images of their parents or children. The Family-101 dataset [36] was introduced in order to study the similarity of characteristics among family members. This dataset consists of 101 families with 607 individuals and 14,816 images. Xia et al. [37] proposed the UB Kinface dataset to evaluate and analyze algorithms for the kinship verification task. The images are based on real collections of celebrities and politicians. The Siblings Database [38] consists of two main subsets: a high-quality database (HQFaces) and a low-quality database (LQFaces). The HQFaces subset consists of 184 images (92 sibling pairs) taken by a professional photographer with a uniform background and controlled lighting. The LQFaces subset contains 98 sibling pairs found on the internet, most of which are celebrities.
A common aspect of the existing datasets is that the faces are either tightly cropped, of poor quality, or have an insufficient number of samples, which makes them difficult to use for the task of face synthesis. The Next of Kin dataset, presented in this paper, fills an obvious research need and features high-quality face images that enable the development and training of contemporary generative models, capable of producing photo-realistic high-resolution (artificial) face images.

C. KINSHIP RECOGNITION AND SYNTHESIS
1) KINSHIP RECOGNITION
Most kinship-based research has focused on the task of kinship recognition. One of the first methods for classifying parent-child pairs was presented in [35], which used low-level features, e.g., the average skin region color and gradient histograms. Xia et al. [39] proposed a subspace learning method to obtain a more discriminative representation for the kinship verification problem. This work was extended in [40] using Gabor filters, metric learning, and subspace transfer learning to obtain improved results. Fang et al. [36] presented a face reconstruction method based on a linear combination of database parts, which was used to determine the family membership of a particular person. Wang et al. [41] proposed a denoising autoencoder based on metric learning for kinship recognition. The latest FIW kinship recognition challenge [42] comprised three main tasks: kinship verification, tri-subject verification, and search & retrieval of family members for missing children.

2) KINSHIP SYNTHESIS
Compared to kinship recognition, there are far fewer methods that deal with the task of kinship face synthesis. The most popular task is the synthesis of the child's appearance based on parental images. DNANet [1] defined a neural network in which the image encoder generated "gene" features, which were then merged using certain gene selection rules. The generated gene features were then utilized as the input to the decoder, which synthesized the child's facial image. Cui et al. [2] introduced a model that fused parental images based on genetic biology knowledge of heritable traits concerning the size of individual facial features. StyleDNA [3] defined an encoder-decoder model that controlled the age and gender of the synthesized child image, while its gene fusion strategy was inspired by DNANet.
In this work, we propose ChildNet, a high-capacity model that, unlike previous models, features a structural design that allows it to define individual operations on the parts of the latent code that correspond to different facial characteristics. The ChildNet design is based on attention [43], [44] and mutation mechanisms that are learned end-to-end without the need to manually design the gene-merging strategy. Furthermore, in comparison to existing works, we define additional synthesis mechanisms, such as child appearance variability control and control of the dominant parent image.

III. ChildNet
A. ChildNet OVERVIEW
The goal of ChildNet is to synthesize a child face image I_C ∈ R^{3×n×n} from two parent images, I_F ∈ R^{3×n×n} as the father's image and I_M ∈ R^{3×n×n} as the mother's image. Additional information in the form of an age class ρ and a gender class γ is used to further control the desired child appearance. Specifically, ChildNet defines the following mapping:

I_C = ChildNet_θ(I_F, I_M, ρ, γ),

where θ denotes the trainable parameters. A high-level overview of the ChildNet model is shown in Figure 2. The model is divided into two main sub-modules: the Kinship Module, which determines the child's latent code based on the parents' latent codes, and the Age & Gender Manipulation Module, which optionally modifies the child's latent code based on the provided age and gender information. After the final child latent code is synthesized, it is passed through a decoder model. ChildNet uses StyleGAN2 [5], which is capable of synthesizing high-resolution face images, as the decoder model.
Following the practice of contemporary models that edit face images using StyleGAN models [17], [18], [29], we base the mapping model in the extended latent space W + , which encompasses multiple 512-dimensional latent vectors. The exact number of vectors depends on the model resolution.
Since we use a model with 1024 × 1024 image resolution, the latent code space consists of 18 (concatenated) 512-dimensional vectors. The extended latent code can be divided into coarse, medium, and fine parts as these have been shown to influence different facial aspects of the synthesis process [7].
The focus of ChildNet lies in the latent space of a pre-trained GAN decoder G, training a mapping model that manipulates the latent codes of the parents to generate the optimal latent code of the child. We denote (w_F, w_M) ∈ W+ as the father's and mother's latent codes, respectively. To map a parent image to the W+ latent space, we use a GAN-inversion encoder as w = E(I). The child image is synthesized by finding the optimal child code w_C ∈ W+ given the parental code pair (w_F, w_M).
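To make the overall pipeline concrete, the following minimal sketch traces the inference path described above in PyTorch-style code; the callable interfaces of E, M, D, and G are illustrative assumptions rather than the released API.

```python
import torch

def synthesize_child(E, M, D, G, img_father, img_mother,
                     age_cls=None, gender_cls=None):
    """End-to-end inference sketch; E, M, D, G stand for the e4e encoder,
    Kinship Module, Age & Gender Module, and StyleGAN2 generator."""
    w_f = E(img_father)   # (B, 18, 512) extended latent code of the father
    w_m = E(img_mother)   # (B, 18, 512) extended latent code of the mother
    w_c = M(w_f, w_m)     # Kinship Module predicts the initial child code
    if age_cls is not None and gender_cls is not None:
        w_c = w_c + D(w_c, age_cls, gender_cls)   # optional residual refinement
    return G(w_c)         # decode to a 1024 x 1024 face image
```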

B. KINSHIP MODULE
The Kinship Module is the first submodule of ChildNet. Given the input parent images, the primary aim of this module is to synthesize a child appearance that can be realistically regarded as a genuine descendant of the provided parents. Figure 3 schematically shows the architecture and training regime of the module.

FIGURE 2. Schematic overview of ChildNet inference. The input data consists of a mother and a father image. Optionally, we can also provide the age and the gender values for further control of the final appearance. The father and mother images are encoded with the GAN-inversion encoder E, resulting in corresponding latent codes. These latent codes are then processed with the Kinship Module M, which predicts the initial child latent code. The age and gender values are converted into their respective embedding vectors, which are concatenated together and repeated. This embedding and the child latent code are then processed with the Age & Gender Manipulation Module D to produce the manipulated child code. Finally, the child latent code is decoded using the GAN decoder G. ChildNet capabilities include fixed-input image sampling and age & gender manipulation, as shown in the bottom right.

1) ARCHITECTURE
We assume that facial features can be predominantly influenced by one of the parents; however, the extent of dominance may vary across individual facial features. Facial shape, for example, may be predominantly influenced by one parent, whereas the other parent may exert a stronger influence on hair color and freckles. We therefore design a structured model that divides the latent codes into coarse, medium, and fine parts, as these parts have been shown to primarily influence only a subset of the facial features [7]. The latent vector parts are processed by a corresponding mapping layer M^p, where p ∈ {c, m, f} denotes the coarse, medium, or fine part of the model. The extended latent code indexes that correspond to individual model parts are denoted as i_p, where i_c ∈ {1, ..., 4} for the coarse part, i_m ∈ {5, ..., 8} for the medium part, and i_f ∈ {9, ..., 18} for the fine part. Each latent code is processed by its corresponding model part to obtain the predicted (child) latent code parts, which are then concatenated.
The Kinship Module ensures that the child's latent code w_C is positioned close to the latent subspace defined by interpolating between the parental latent codes (w_F, w_M). Additionally, the module slightly shifts the predicted code away from this subspace in order to disentangle various entangled facial features encoded in the parental latent codes. Each module part M^p takes the corresponding latent code parts as input and performs two tasks. First, it attends to individual parental latent components by predicting interpolation coefficients. These coefficients form the basis for predicting the child's latent code and define a latent subspace that indicates the approximate region where the child code should be located. Second, it predicts mutation (residual) coefficients that slightly shift the predicted child latent code away from the latent subspace.
The Kinship Module part M^p consists of two sublayers with identical architecture, called BaseNet. The Attention BaseNet sublayer M^p_att attends to parental components by predicting latent code interpolation coefficients α^{i_p} and serves as the primary prediction for the child latent code. The Mutation BaseNet sublayer M^p_mut then slightly shifts the predicted child latent code using the residual coefficients ϵ^{i_p}. Concretely, given the latent codes of the father and mother (w_F^{i_p}, w_M^{i_p}), the child latent code is defined as:

w_C^{i_p} = α^{i_p} ⊙ w_F^{i_p} + (1 − α^{i_p}) ⊙ w_M^{i_p} + ϵ^{i_p},    (2)

where ⊙ denotes element-wise vector multiplication and the α^{i_p} elements belong to the range [0, 1]. The final child latent code w_C is a concatenation of all w_C^{i_p} parts. The BaseNet is defined as a multilayer perceptron, as shown in Figure 3(b). It takes two input vectors, denoted (x_1, x_2). These input vectors can represent a latent code part or a conditioning input in the form of an embedding vector. They are first processed by separate multilayered fully connected networks. The outputs of the two networks are then concatenated and processed further with another multilayered fully connected network, resulting in an output vector y. In the case of the Attention BaseNet, the final layer applies a sigmoid function, and the output is a vector of interpolation coefficients α that attend to individual parental latent codes. In the case of the Mutation BaseNet, no activation function is applied in the final layer, and the output represents a vector of residuals ϵ that are added to a latent code. Additional details on the BaseNet architecture are provided in the Appendix.
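The following PyTorch sketch illustrates the structure described above. The coarse/medium/fine slicing and the combination rule of Eq. (2) follow the text, while the layer widths, depths, and LeakyReLU slope are simplifying assumptions (the actual BaseNets are five-layered; see the Appendix for the exact configuration).

```python
import torch
import torch.nn as nn

class BaseNet(nn.Module):
    """Two input branches -> concatenate -> joint head (simplified sketch)."""
    def __init__(self, dim=512, hidden=512, sigmoid_out=True):
        super().__init__()
        def mlp(i, o):
            return nn.Sequential(nn.Linear(i, hidden), nn.BatchNorm1d(hidden),
                                 nn.LeakyReLU(0.2), nn.Dropout(0.5),
                                 nn.Linear(hidden, o))
        self.branch1, self.branch2 = mlp(dim, hidden), mlp(dim, hidden)
        self.head = mlp(2 * hidden, dim)
        self.out_act = nn.Sigmoid() if sigmoid_out else nn.Identity()

    def forward(self, x1, x2):
        y = self.head(torch.cat([self.branch1(x1), self.branch2(x2)], dim=-1))
        return self.out_act(y)

class KinshipPart(nn.Module):
    """One coarse/medium/fine part: attention + mutation, i.e., Eq. (2)."""
    def __init__(self, dim=512):
        super().__init__()
        self.att = BaseNet(dim, sigmoid_out=True)    # interpolation coeffs alpha
        self.mut = BaseNet(dim, sigmoid_out=False)   # residual coeffs epsilon

    def forward(self, w_f, w_m):                     # (N, dim) each
        alpha, eps = self.att(w_f, w_m), self.mut(w_f, w_m)
        return alpha * w_f + (1.0 - alpha) * w_m + eps

class KinshipModule(nn.Module):
    """Structured model over the W+ code: vectors 1-4 coarse, 5-8 medium,
    9-18 fine (1-indexed in the paper; 0-indexed slices here)."""
    def __init__(self):
        super().__init__()
        self.parts = nn.ModuleDict({p: KinshipPart() for p in ("c", "m", "f")})
        self.slices = {"c": slice(0, 4), "m": slice(4, 8), "f": slice(8, 18)}

    def forward(self, w_f, w_m):                     # (B, 18, 512) each
        out = []
        for p in ("c", "m", "f"):
            s = self.slices[p]
            b, k, d = w_f[:, s].shape                # process each vector separately
            w_c = self.parts[p](w_f[:, s].reshape(b * k, d),
                                w_m[:, s].reshape(b * k, d))
            out.append(w_c.reshape(b, k, d))
        return torch.cat(out, dim=1)                 # (B, 18, 512)
```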

2) LOSS FUNCTION
The loss function contains three main terms. The identity loss term ensures that the synthesized child image and the ground-truth child image look visually similar. We achieve this with the ArcFace recognition model [45] R by maximising the cosine similarity between the model embeddings:

L_id = 1 − cos(R(G(w_C)), R(I_C)),

where I_C denotes the ground-truth child image. Our next objective is to ensure that ChildNet synthesizes diverse child appearances. This can be achieved efficiently in the latent space, since the StyleGAN2 training regime aims to map a fixed-size latent code step to a fixed-magnitude image change [5]. Therefore, we propose a triplet loss term in the latent space. We denote the anchor (predicted) child latent code as w_C, a positive code as w_C^+ = E(I_C^+) and a negative code as w_C^- = E(I_C^-), where I_C^+ denotes the ground-truth child image and I_C^- denotes an image of a non-related child. The loss term is then defined as:

L_tri = max(‖w_C − w_C^+‖₂ − ‖w_C − w_C^-‖₂ + δ, 0),

where δ denotes the triplet margin. Finally, to ensure that the child code remains relatively close to the latent subspace defined by the parental latent codes, we define a magnitude loss term that limits the magnitude of the latent code residual ϵ:

L_mag = ‖ϵ‖₂².

The final loss function is defined as a weighted sum of the loss terms described:

L = λ_id L_id + λ_tri L_tri + λ_mag L_mag,    (8)

where each λ denotes a scalar weight of the corresponding loss term.
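A compact sketch of this objective follows; R is assumed to be a frozen ArcFace embedder returning (B, 512) features, and the squared-norm form of the magnitude term is an assumption.

```python
import torch
import torch.nn.functional as F

def kinship_loss(w_c, w_pos, w_neg, eps, img_hat, img_gt, R,
                 lam_id=1.0, lam_tri=1e-3, lam_mag=1.0, margin=0.1):
    """Identity + latent triplet + residual magnitude terms (sketch)."""
    # Identity: maximize cosine similarity between synthesized and real child.
    l_id = 1.0 - F.cosine_similarity(R(img_hat), R(img_gt), dim=-1).mean()
    # Triplet in latent space: real child code positive, unrelated child negative.
    l_tri = F.triplet_margin_loss(w_c.flatten(1), w_pos.flatten(1),
                                  w_neg.flatten(1), margin=margin)
    # Keep the mutation residual small so codes stay near the parental subspace.
    l_mag = eps.pow(2).mean()
    return lam_id * l_id + lam_tri * l_tri + lam_mag * l_mag
```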

C. AGE AND GENDER MANIPULATION MODULE
The second submodule of ChildNet is a mechanism for age and gender manipulation that allows for finer control of the synthesized child image. The Age & Gender Module takes a facial image (in the form of its latent code), a target age, and a target gender as inputs. Its goal is to synthesize a facial image that preserves the person's identity while convincingly manipulating their age and gender towards the target values. Specifically, given a latent code w, an age value ρ, and a gender value γ, the Age & Gender Manipulation Module D defines the following mapping:

w′ = D(w, ρ, γ),

where the predicted code w′ corresponds to a decoded facial image that exhibits the input age and gender semantics. A schematic overview of the architecture and training scheme is shown in Figure 4.

FIGURE 4. Training scheme of the Age & Gender Manipulation Module. The module uses the provided ground-truth age and gender data to synthesize the reconstructed image I_rec. Conversely, by providing randomly sampled age and gender data, the module synthesizes the target image I_tar. The appropriate target image semantics are enforced with the age L_age and gender L_gender loss terms. Using the original data, the target image I_tar is mapped into the cycle image I_cyc. The preservation loss is applied to both the reconstructed image I_rec and the cycle image I_cyc.

1) ARCHITECTURE
The age and gender values are converted into embedding vectors and processed together with the latent code. The module predicts a residual ϵ that is added to the w embedding: w ← w + ϵ, before being decoded with the StyleGAN decoder, producing the final image.

2) LOSS FUNCTION
The training regime of the Age & Gender Module is inspired by the cycle consistency method, as outlined in [46]. The training data consists of images I ∈ R^{3×n×n} and their associated age class values ρ_orig ∈ {1, 2, ..., 10} and gender class values γ_orig ∈ {1, 2}. During training, we synthesize three types of images: the reconstructed image I_rec, the target image I_tar, and the cycled image I_cyc. Additionally, we produce the corresponding residuals:

ϵ_rec = D(w, ρ_orig, γ_orig),
ϵ_tar = D(w, ρ_rand, γ_rand),
ϵ_cyc = D(w + ϵ_tar, ρ_orig, γ_orig),

so that I_rec = G(w + ϵ_rec), I_tar = G(w + ϵ_tar), and I_cyc = G(w + ϵ_tar + ϵ_cyc), where ρ_orig and γ_orig denote the original age and gender values, while ρ_rand and γ_rand denote uniformly randomly sampled age and gender values. To achieve the intended semantics of the age and gender attributes on the target image I_tar, we use a pre-trained age model A and a pre-trained gender model G. We define an age loss L_age and a gender loss L_gender as cross-entropy loss terms:

L_age = −Σ_c y_c log A_c(I_tar),
L_gender = −Σ_c y_c log G_c(I_tar),

where y denotes the ground truth, A_c denotes the age prediction for the c-th age class, and G_c denotes the gender prediction for the c-th gender class.
To prevent adversarial-like solutions, we use a magnitude loss term:

L_mag = ‖ϵ_rec‖₂² + ‖ϵ_tar‖₂² + ‖ϵ_cyc‖₂².

Finally, in order to ensure the cycle consistency, we apply a preservation loss term to the reconstructed and cycled images:

L_pre = LPIPS(I_rec, I) + LPIPS(I_cyc, I),

where LPIPS denotes the perceptual loss term [47]. The final loss term is defined as a weighted sum of the described loss terms:

L = λ_age L_age + λ_gender L_gender + λ_mag L_mag + λ_pre L_pre,    (18)

where each λ denotes a scalar weight of the corresponding loss term.
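The cycle-consistent training step can be sketched as follows; G, A, Gd, and lpips are assumed to be pre-built callables (decoder, frozen age and gender classifiers, and an LPIPS distance), and the residual definitions mirror the reconstruction above.

```python
import torch
import torch.nn.functional as F

def age_gender_step(D, G, A, Gd, lpips, w, age, gender, lam):
    """One cycle-consistent training step (sketch). `w` is the W+ code of a
    training image, `age`/`gender` its 1-indexed class labels, `lam` a dict
    of loss weights; all interfaces here are illustrative assumptions."""
    img = G(w)                                           # original image
    age_r = torch.randint(1, 11, age.shape, device=age.device)
    gen_r = torch.randint(1, 3, gender.shape, device=gender.device)
    eps_rec = D(w, age, gender)                          # reconstruct
    eps_tar = D(w, age_r, gen_r)                         # retarget
    eps_cyc = D(w + eps_tar, age, gender)                # cycle back
    img_rec, img_tar = G(w + eps_rec), G(w + eps_tar)
    img_cyc = G(w + eps_tar + eps_cyc)
    l_age = F.cross_entropy(A(img_tar), age_r - 1)       # classes are 1-indexed
    l_gen = F.cross_entropy(Gd(img_tar), gen_r - 1)
    l_mag = sum(e.pow(2).mean() for e in (eps_rec, eps_tar, eps_cyc))
    l_pre = lpips(img_rec, img) + lpips(img_cyc, img)
    return (lam["age"] * l_age + lam["gender"] * l_gen
            + lam["mag"] * l_mag + lam["pre"] * l_pre)
```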

IV. NEXT OF KIN DATASET
A. MOTIVATION
As shown in the Related Work section, the existing datasets are limited both in terms of subjects and kinship relationships (e.g., mother-son/daughter, father-son/daughter), while more recent datasets contain a larger number of families with kinship relationships extending beyond just two generations. However, as shown in Table 1 and Figure 5, the existing datasets are still mainly designed for the study of automatic kinship recognition techniques and generally do not contain data that allow development and training of high-quality image synthesis models. We introduce the Next of Kin (NoK) dataset, which contains high-resolution face images suitable for the task of kinship face synthesis.

B. DATA COLLECTION
We collected the dataset using the family relationship data from the FIW dataset. Following the common practice established with in-the-wild datasets [51], [52], [53], we collected the NoK images from the internet and augmented them with additional families not present in the FIW dataset. To ensure high image quality, we manually filtered the images, removing images with occluded faces and poor visual quality. We selected 512 × 512 as the optimal image resolution and set the minimum acceptable resolution to 256 × 256. All images were rescaled to 512 × 512 using the CelebAHQ image preprocessing steps [6]. For the subjects with low image counts, we searched for additional high-resolution images.

TABLE 1. Overview of the existing kinship datasets in comparison to the proposed dataset Next of Kin. The abbreviations describing the Subjects column stand for parent-child (P-C), grandparent-grandchild (GP-GC) and siblings (Sib). For datasets containing images with different resolutions, we report the median resolution in parentheses. The last column indicates the average image quality as evaluated by MANIQA [48], a no-reference image quality assessment model.

C. DATA PROCESSING
The image preprocessing steps are closely related to the ones used in the creation of the CelebAHQ dataset [6]. The first step involved predicting facial landmarks and cropping the face image with the correct orientation. To achieve this, we used RetinaFace [54] as our primary model for predicting facial landmarks. All landmarks were visually inspected and manually corrected for any irregularities. Based on the corrected landmarks, each face image was rotated and cropped. Finally, we added a corresponding w ∈ W+ latent code for each image by projecting each image onto the StyleGAN2 FFHQ model. We also added information about the gender and ethnicity of each subject, as well as information about the age and emotion for each image.
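As a rough illustration of the alignment step, the following sketch rotates a face upright from two eye landmarks and crops it; the crop margin and the use of only two landmarks are simplifications of the actual CelebAHQ-style procedure [6].

```python
import numpy as np
import cv2

def align_face(img, left_eye, right_eye, out_size=512):
    """Rotate the face upright using the two eye landmarks, then crop a square
    around the eye midpoint; the margin factor is an illustrative assumption."""
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = float(np.degrees(np.arctan2(dy, dx)))       # in-plane roll angle
    cx, cy = (left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2
    rot = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    img = cv2.warpAffine(img, rot, (img.shape[1], img.shape[0]))
    half = int(2.0 * np.hypot(dx, dy))                  # assumed crop margin
    x0, y0 = max(0, int(cx) - half), max(0, int(cy) - half)
    crop = img[y0:int(cy) + half, x0:int(cx) + half]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_AREA)
```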

D. DATASET CHARACTERISTICS
The NoK dataset consists of 3,690 high-resolution images from 553 subjects that correspond to 161 families. The average number of images per subject is 6.67 ± 4.48, with a total of 127,719 triplets. We provide official training, validation, and test splits, taking special care not to overlap images of the same individual between splits. We analyzed the data using DeepFace, a face analysis model, to obtain per-person and per-image metadata in the form of gender, ethnicity, age, and emotion. The distribution of the metadata statistics is shown in Figure 6.
To assess the image quality of the NoK dataset compared to other kinship datasets, we perform an image quality assessment. The assessment is based on the state-of-the-art no-reference image quality assessment model MANIQA [48], which estimates a quality score for each image. Since the model predicts scores for 224 × 224 images, the lower-resolution images are upsampled using bilinear interpolation, while higher-resolution images are center-cropped to the target resolution. The model results are shown in the last column of Table 1. The NoK dataset has a higher average image quality score than the other kinship datasets, and at the same time contains more images than the other high-quality kinship datasets.
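The stated resize rule reduces to a short helper; `img` is assumed to be a CHW float tensor.

```python
import torch.nn.functional as F

def prepare_for_maniqa(img, target=224):
    """Bilinear upsampling for smaller images, center crop for larger ones."""
    _, h, w = img.shape
    if min(h, w) < target:
        s = target / min(h, w)
        nh, nw = max(target, round(h * s)), max(target, round(w * s))
        img = F.interpolate(img[None], size=(nh, nw), mode="bilinear",
                            align_corners=False)[0]
        _, h, w = img.shape
    top, left = (h - target) // 2, (w - target) // 2
    return img[:, top:top + target, left:left + target]
```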

V. EXPERIMENTAL SETUP
A. DATASETS AND EXPERIMENTAL SPLITS
1) KINSHIP MODELLING
We compare ChildNet with existing methods using two datasets. The first dataset, FIW, has traditionally been used to train kinship face synthesis models. The second dataset is the proposed NoK dataset. We use the official training and validation splits for both datasets. The testing procedure for FIW is based on the official parent-child test triplets, but we extend it by using up to five images per person and creating all possible combinations of triplet images to obtain more test data. Furthermore, we remove the images whose diagonal is less than 90 pixels long. In the end, our test data consists of 15,815 triplet images for FIW and 12,265 triplet images for NoK.
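A sketch of this triplet expansion follows; the `families` structure and the per-image `diag` attribute (a precomputed face diagonal) are assumed metadata for illustration.

```python
from itertools import product

def build_test_triplets(families, max_imgs=5, min_diag=90):
    """Expand each (father, mother, child) triplet into all image combinations,
    keeping at most `max_imgs` images per person and filtering small faces."""
    triplets = []
    for f_imgs, m_imgs, c_imgs in families:
        keep = lambda imgs: [i for i in imgs if i.diag >= min_diag][:max_imgs]
        triplets += list(product(keep(f_imgs), keep(m_imgs), keep(c_imgs)))
    return triplets
```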

2) AGE AND GENDER MANIPULATION MODULE
We use the FFHQ Aging dataset [56], which contains various annotations of the original FFHQ dataset [7], including gender and age. The ages are divided into 10 age classes, each representing a certain age range. We divide the dataset into a training and a validation split, with the training split containing the first 90% of the data and the validation split containing the rest.

B. IMPLEMENTATION DETAILS
1) DECODER
The GAN decoder model G is a pre-trained StyleGAN2 [5] generator model, trained on the FFHQ dataset [7], capable of synthesizing 1024 × 1024 resolution high-quality face images.

2) GAN-INVERSION ENCODER
The GAN-inversion encoder E is a pre-trained E4e [20] model, trained on the FFHQ dataset [7]. E4e predicts extended latent codes w ∈ W + , which closely follow the original latent space distribution.

3) KINSHIP MODULE
Each BaseNet of the Kinship Module consists of five-layered fully-connected networks. We use dropout with a rate of p_d = 0.5, batch normalization, and Leaky ReLU as the activation function in all hidden layers. We train ChildNet on the FIW and NoK datasets using the official training and validation splits. The batch size is set to 16. We use the Adam optimization algorithm [57] with a learning rate of η = 3·10^−4. The weighting factors are determined experimentally and are set to λ_id = 1, λ_tri = 10^−3, and λ_mag = 1. The triplet margin value is set to δ = 0.1.
We evaluate ChildNet against several competing models: DNANet [1], HeredityGAN [2], and StyleDNA [3]. The DNANet and HeredityGAN models are implemented from scratch, whereas the StyleDNA model is based on the officially released code. To ensure stability in training the DNANet model, we maintain the image cropping approach used in DNANet when evaluating the results. We define two types of models: those for unconditional synthesis, where the predicted child appearance is conditioned solely on the parent images, and those for conditional synthesis, where the model is provided with age and gender information in addition to the parent images. Implementation details for each model can be found in the Appendix.

4) AGE AND GENDER MANIPULATION
Our age model A and gender model G are based on the ResNet50 architecture [58]. We use the cross-entropy loss and the Adam optimization algorithm with a learning rate η = 3 · 10 −4 and batch size of 64.

VI. RESULTS
A. COMPARISON TO STATE-OF-THE-ART MODELS
1) VISUAL ANALYSIS
We compare the visual results of the models for the unconditional child synthesis task, i.e., without additional age and gender information, as shown in Figure 7. The DNANet results follow the learned distribution with tightly cropped faces. The image quality of its faces is limited, as DNANet does not use a state-of-the-art decoder model like the other methods. Although HeredityGAN synthesizes convincing images, the synthesized faces lack the expected characteristics of a younger child due to the absence of a de-aging mechanism. StyleDNA's results are compelling, but the variability of the synthesized images is low compared to the other methods, and the facial expression remains relatively constant with respect to the input images of the parents. ChildNet achieves the most visually appealing results, synthesizing images that can be regarded as real child images.

2) IDENTITY SIMILARITY, UNCONDITIONAL EXPERIMENT
To assess the accuracy of the synthesized identity, we evaluate identity similarity using the FaceNet [59] and ArcFace [45] face recognition models. These models are used to extract face embeddings from the synthesized and the real child images. We propose an evaluation strategy that does not use negative child pairs due to the stochastic nature of sampling. Furthermore, we argue that the similarity of the synthesized child to the real child (positive pair) is more important than the similarity to a random child (negative pair). Therefore, our evaluation is based solely on the similarities of positive pairs, i.e., similarities between the synthesized and real child images. Compared to previous kinship face synthesis approaches, our testing procedure also uses a much larger number of testing samples, consisting of 15,815 triplets for the FIW dataset and 12,265 triplets for the NoK dataset.
We evaluate the models by constructing a histogram of the predicted cosine similarities and a histogram of perfect cosine similarities (all values equal to 1). We measure the discrepancy between these two histograms using the earth mover's distance. Additionally, we report the mean of the cosine similarities. The histogram results for the FIW and NoK databases are displayed in Figure 8, showing that ChildNet produces higher similarity scores than the compared models. The quantitative results in Table 2 confirm this finding, with ChildNet achieving the best results in terms of the earth mover's distance and average identity similarity metrics.
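The two reported metrics reduce to a few lines; the sketch below uses SciPy's earth mover's distance between the observed positive-pair similarities and the ideal distribution in which every similarity equals 1.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def identity_metrics(sims):
    """Positive-pair cosine similarities vs. the ideal case where every
    synthesized child matches the real child perfectly (similarity 1)."""
    sims = np.asarray(sims)
    emd = wasserstein_distance(sims, np.ones_like(sims))
    return emd, sims.mean()
```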

3) IDENTITY SIMILARITY, CONDITIONAL EXPERIMENT
First, we assess the quality of the proposed Age & Gender Manipulation Module. We demonstrate several examples of manipulation on the FFHQ Aging dataset in Figure 9. The input image is encoded into a latent code by the GAN-inversion encoder E and the corresponding (decoded) image reconstructions are shown in the second column. The model is able to convincingly change the facial appearance to match the specified gender and age group.
Next, we design a conditional experiment where all compared models are given age and gender information, in addition to the parent images, as inputs. The results are evaluated in terms of identity similarity. We test the models using both the NoK dataset, which already contains the necessary information, and the FIW dataset, which only includes gender information. To obtain the age information for the FIW dataset, we use the pre-trained age model A. The synthesized results are analyzed using the identity similarity as computed by the ArcFace model. The identity similarities are again compared in terms of the earth mover's distance and average similarity metrics. Table 3 shows that ChildNet achieves the best results on both datasets. Figure 10 shows example visual results for the NoK dataset, comparing the methods with high visual quality. ChildNet achieves the most visually compelling, high-quality child synthesis results that bear a high resemblance to the real child images.

4) PERCEPTUAL SIMILARITY
In this experiment, we focus on measuring the perceptual quality between the synthesized and the real images. Higher image quality corresponds to less distortion and fewer image artifacts. We measure the perceptual quality using the LPIPS metric [47] with the AlexNet backend [60], which was shown to align well with human perception of image quality. The results are shown in Table 4. The perceptual differences between the methods using the StyleGAN decoder are relatively small. The StyleDNA model achieves the best results for the FIW dataset, with ChildNet being a very close competitor. ChildNet achieves the best results on the NoK dataset.

B. ChildNet CHARACTERISTICS
1) KINSHIP MODULE ABLATION STUDY
We conduct an ablation study of the Kinship Module by removing certain loss terms and by altering the Kinship Module architecture. The impact of the individual model settings is measured using the identity similarity as computed by the ArcFace model. The results are reported in terms of the earth mover's distance and average cosine similarity. To accelerate the training convergence, all ablation tests are performed at a reduced model resolution of 256 × 256 and with an increased learning rate of η = 1·10^−3.
We analyze the impact of the loss terms in the final loss function of the Kinship Module in Eq. (8) on the identity similarity of the synthesized child images, as measured by the ArcFace model. For this purpose, we train the Kinship Module on the NoK dataset with different combinations of loss terms. The results are shown in Table 5. A randomly initialized network already achieves decent results, as the initialized parameter weights tend to average the latent codes of the parents. The lack of the magnitude loss term is not reflected in the quantitative results; however, the resulting images show a considerable amount of visual artifacts, as shown in Figure 11. This can be attributed to the unbounded values of the residual coefficients ϵ, which can push the latent codes outside the learned StyleGAN latent code distribution. Adding the triplet loss term helps achieve slightly better identity similarity results without compromising the visual quality.
Next, we investigate the impact of the Kinship Module architecture. The following architecture settings are considered:
• presence of the Mutation BaseNet, i.e., the impact of the latent residual,
• structured model vs. unstructured model, i.e., the impact of separating the model into coarse, medium, and fine parts,
• the formulation of the BaseNet output, i.e., the impact of a scalar prediction (or a vector with repeated elements) as compared to a vector prediction.
The results are shown in Table 6. The largest degradation in results is observed when the BaseNet model outputs a single value instead of a vector. We hypothesize that the improvement in results is due to the increased ability to disentangle variational factors in the GAN latent space when using the vector formulation. The impact of a structured model is most noticeable when used with the Mutation BaseNet, indicating its significance in the Kinship Module.

2) AGE & GENDER MANIPULATION MODULE ABLATION STUDY
Next, we conduct an ablation study of the Age & Gender Manipulation Module by analyzing the impact of the loss terms in Eq. (18). Concretely, we retain the age loss term L_age and the gender loss term L_gender, as those are required in order to achieve the desired semantic information on the target image I_tar. We analyze the significance of the magnitude loss term L_mag and the preservation loss term L_pre by zeroing out their corresponding scalar weights (λ_mag, λ_pre).

FIGURE 11. Impact of the magnitude loss term L_mag. The first row shows the synthesis results when the magnitude term is disabled, which corresponds to λ_mag = 0. The second row shows the synthesis results when the magnitude term is enabled. The absence of the regularization term affects the synthesis results, leading to the presence of visual artifacts.
The visual results are shown in Figure 12. We observe that the lack of the magnitude loss causes image artifacts on the target image. Since the residual vector ϵ is not regularized, the predicted latent code may fall outside the learned StyleGAN latent code distribution, causing undesired image artifacts when processed with the StyleGAN decoder. Meanwhile, the lack of the preservation loss effectively breaks the cycle model architecture. This, in turn, can cause the target image to exhibit undesired changes in the person's appearance, such as a change of the skin tone, hair color, or other characteristics. Including both terms ensures high image fidelity due to the magnitude loss term and the effective retention of the person's facial characteristics due to the preservation loss term.

FIGURE 12. The row text indicates the target age and gender class. We observe that the omission of the magnitude loss term L_mag (second column) causes adversarial image artifacts. Zeroing out the preservation loss term L_pre (third column) causes undesired changes in the person appearance, such as the change of the skin tone and the hair colour. Our approach ensures high image quality and retention of the main facial characteristics.

3) REPLACING THE PARENT IMAGES
In this experiment, we analyze how ChildNet synthesizes images when presented with different images of the same parent identity. Figure 13 displays the visual results of the experiment, which demonstrate that the input image affects the synthesized image by changing certain facial features, e.g., skin color, hair color, face shape, and background. However, most synthesized images exhibit a similar child identity, showcasing that ChildNet emphasises the parents' identity information to synthesize the child image.

4) CHILD SAMPLING AND VARIABILITY CONTROL
In this experiment, we analyze the mechanism of image variability given fixed parent images. We propose to use the dropout mechanism [61] in the Kinship Module to ensure the variability of the child's appearance. Typically, the dropout mechanism is used only during the training phase, where elements are randomly zeroed out to prevent neuron co-adaptation, while in the testing phase no elements are dropped out and the elements are scaled appropriately. We use the dropout mechanism in test mode for point-estimate image synthesis, which is used for all quantitative experiments. By applying the random dropout mechanism (as in training mode), ChildNet is able to synthesize image variations given fixed parent input images. In Figure 14, example results are shown with the dropout rate set to the default training value p_d = 0.5. The image variability mainly affects the stochastic aspects of faces, e.g., facial expression or the exact placement of hair, while the fundamental facial attributes remain largely the same as in the original image. By varying the dropout rate, ChildNet can control the amount of variability present in the synthesized images given a fixed input. As shown in Figure 15, lowering the dropout rate results in a reduced amount of image variability, while a higher dropout rate increases it.
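In PyTorch terms, sampling mode amounts to keeping the dropout layers stochastic at inference time, as in the sketch below.

```python
import torch

def enable_sampling(kinship_module, p=0.5):
    """Keep dropout stochastic at inference so that repeated forward passes
    yield different child variations; p controls the variability."""
    for m in kinship_module.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = p
            m.train()   # training-mode dropout; everything else stays in eval

# Point estimates (all quantitative experiments): kinship_module.eval()
```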

5) ANALYZING THE INTERPOLATION COEFFICIENTS
In this experiment, we analyze the distribution of the interpolation coefficients α as predicted by the Attention BaseNet of the Kinship Module. A distribution of α values centered around 0.5 would indicate that the module tends to synthesize an average face between parents. The opposite case would indicate that the model emphasizes certain latent features of the parents with greater weight during the synthesis process.
We perform the experiment on the test split of the NoK dataset and record the interpolation coefficients for each latent code part (coarse, medium, and fine). The histogram of the coefficients is shown in Figure 16. Interestingly, we find that the interpolation coefficients are mostly distributed near their extreme values of 0 and 1, especially at the coarse level, which affects the most fundamental aspects of the facial appearance. The results suggest that the model is able to find certain latent components of the parents that should be inherited almost exclusively from one of the parents.

6) CONTROLLING THE DOMINANT PARENT IMAGE
We propose a mechanism to control how closely the child appearance resembles one of the parents. This is implemented by modifying the interpolation coefficients α^{i_p} in Eq. (2) through a scalar δ_d ∈ [−1, 1] that controls the dominant parent image. A more mother-like appearance is obtained for positive δ_d and a more father-like appearance for negative δ_d:

α′ = lerp(α, 0, δ_d) for δ_d ≥ 0,
α′ = lerp(α, 1, −δ_d) for δ_d < 0,

where α is a concatenation of the α^{i_p} parts, 1 and 0 denote vectors of ones and zeros of the same size as α, and lerp denotes element-wise linear interpolation. Visual examples of varying the δ_d value are shown in Figure 17, demonstrating that our mechanism indeed allows us to synthesize a more mother- or father-like appearance while retaining the high image quality.
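A sketch of the coefficient adjustment follows; note that mapping positive δ_d to the mother is an assumed sign convention, consistent with Eq. (2), where α → 1 favors the father.

```python
import torch

def shift_dominance(alpha, delta_d):
    """Bias the interpolation coefficients toward one parent (sketch)."""
    if delta_d >= 0:   # more mother-like: pull alpha toward the zero vector
        return torch.lerp(alpha, torch.zeros_like(alpha), delta_d)
    return torch.lerp(alpha, torch.ones_like(alpha), -delta_d)  # father-like
```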

7) RESULTS FOR PARENTS WITH DIFFERENT ETHNICITY
In this experiment, we visually analyze the synthesized child images when the model is presented with parent images of different ethnicities. According to genetic studies [62], a child's skin color tends to follow the average skin color of parents belonging to different ethnic groups. Ideally, a trained kinship synthesis model would also incorporate such genetic regularity. We choose facial images with African, Indian, Latino, and Caucasian ethnicities and input them into the ChildNet model. Figure 18 shows that ChildNet synthesizes faces with a skin color that resembles a blend of the two parents' skin colors, or closely resembles the skin color of one of the parents.

VII. SOCIAL IMPLICATIONS OF AUTOMATED KINSHIP FACE SYNTHESIS
The development of automatic kinship face synthesis methods could have both beneficial and detrimental social impacts. The positive impacts may include:
• The technology can be used for exploring personal family history and genealogy by synthesizing visual representations of both ancestors and descendants.
• The system could be employed in cases where separated family members are seeking biological relatives. Generating images of potential family members offers a means to potentially identify long-lost relatives.
• The methods have the potential to open up new avenues in the entertainment industry, allowing for the creation of unique characters by merging facial features of different actors or synthesizing images of fictional characters. This could lead to increased storytelling possibilities and more diverse representation in film and television.
The negative social impacts may manifest in the following ways:
• If the method's training data is not diverse and representative, the technology could perpetuate existing social biases. This could result in unfair representation or exclusion of certain ethnicities and cultures. Efforts should be made to ensure that kinship synthesis methods are trained on diverse and inclusive datasets.
• The discovery of previously unknown relatives could have psychological effects on individuals. The emotional consequences of such discoveries should be considered. It is crucial for researchers to carefully consider these effects to ensure ethical development and use of this technology.

VIII. CONCLUSION
In this paper we present ChildNet, a novel model for kinship face synthesis. It consists of two main modules: the Kinship Module and the Age & Gender Manipulation Module. The Kinship Module utilizes an attention mechanism to predict the child latent code, attending to the parental latent codes and adjusting the prediction with a mutation mechanism. The Age & Gender Manipulation Module serves to control the age and gender of the synthesized child image. ChildNet achieves state-of-the-art performance with respect to identity similarity while exhibiting high perceptual quality. Furthermore, it demonstrates versatility in terms of synthesizing multiple images per input, controlling the synthesis variability, and controlling the dominant parent influence. Finally, due to the limitations of existing kinship datasets for the task of kinship face synthesis, we introduce a novel kinship dataset, Next of Kin, which contains high-resolution face images and metadata on kinship relationships along with other facial image characteristics.

APPENDIX A ADDITIONAL VISUAL RESULTS
In the main manuscript, we demonstrated several visual results showcasing the excellent ChildNet synthesis capabilities and presented several mechanisms for precise manipulation of the synthesized child appearance. In this section, we present additional visual results.
Replacing the Parent Images: Figure 19 displays additional ChildNet results using different NoK input instances of parent images from the same subjects. The synthesized images inherit certain undesired image characteristics from the parent images, e.g., background characteristics. Nevertheless, ChildNet synthesizes child facial images that exhibit similar identity, regardless of the input images.
Results for Parents With Different Ethnicities: Figure 20 shows the results on the FIW dataset, where the parents belong to different ethnic groups. The visual results demonstrate that the skin complexion of the synthesized child images tends to reflect the skin complexion of one of the parents or a combination of the parents' complexions.
Age and Gender Manipulation: Figure 21 shows the conditional results on the FIW dataset, comparing ChildNet to the competing models. ChildNet synthesizes the most visually convincing, high-quality facial child images that closely resemble the identity of the real child image.

APPENDIX B IMPLEMENTATION DETAILS OF COMPARED MODELS
DNANet: DNANet is based on the CAAE encoder-decoder model [63]. The decoder model requires age and gender information in addition to the latent code to synthesize an image. For a fair comparison with other models in the task of unconditional kinship face synthesis, we retrain the CAAE model without any age and gender conditioning. Apart from changing the weight dimensionality of the first fully connected decoder layer, the model architecture remains intact.
When processing images with the DNANet model, we crop the input images to reflect the facial cropping used in the original work. DNANet provides two selection rules for gene shuffling, maximum or random selection. To obtain deterministic results, we use the maximum selection rule.
During DNANet training, we experimented with different model settings, but we could not achieve the visual quality of the examples presented in the original work. Due to the adversarial nature of the DNANet training scheme, simple training with validation loss monitoring is not possible. Therefore, we analyzed the image reconstructions at the end of each epoch. The visual results indicated that later checkpoints contained fewer visual artifacts at the expense of image crispness. The DNANet results presented in the main part of the paper are based on the model checkpoint that performs best visually given the two tradeoffs. In Figure 22, we further analyze DNANet performance by reverse searching the FIW dataset for the original DNANet paper images before processing them with ChildNet.

FIGURE 21. Providing additional information about age and gender helps ChildNet synthesize a child image that has a strong resemblance to the real child.

FIGURE 22. Comparison of ChildNet to DNANet results. The DNANet results are taken from the original paper [1]. Note that DNANet is trained on images with tight crops. We can observe that ChildNet synthesizes more convincing image results.
HeredityGAN: HeredityGAN defined semantic directions in the W+ latent code space that correspond to facial characteristics related to the size of facial regions, e.g., the size of the eyes, nose, and jaw. To generate the required data for the semantic vectors, we synthesize 100,000 StyleGAN2 images and process them with a landmark prediction model [64]. During the micro fusion step, where the semantic vectors are moved based on dominant and recessive features of the facial attribute, we design a deterministic algorithm instead of a stochastic one: we always choose the projections of the parent that has the dominant features.
HeredityGAN proposed a gender disentanglement mechanism where the father's latent code is moved towards the target gender latent direction to match the child's gender. We use the InterFace method [25] to identify the gender semantic latent vector. For age disentanglement, we use a similar method as for gender disentanglement, but due to the continuous nature of age data, we modify the control mechanism. Specifically, our goal is to determine the magnitude of the latent code step required to synthesize the desired age. To do this, we take an age scalar ρ ∈ {1, ..., 100}, normalize it to the range [−1, 1], and multiply it by the trainable parameter P_age. The resulting scalar is then multiplied by the InterFace-identified age vector and added to the original father latent vector:

w_F ← w_F + P_age · norm(ρ) · v_age,    (20)

where norm defines the age normalization operation and v_age is the semantic age vector obtained from InterFace. To determine the P_age value, we set it as a trainable parameter and train it on the NoK dataset using Eq. (20) to synthesize facial images. The training objective was set to the mean squared error loss between the predicted age ρ̂ of a pre-trained age model and the true age ρ. When changing the child gender, the father latent vector is moved as follows:

w_F ← w_F + P_gender · v_gender,

where P_gender ∈ {−1, 1} is a value that depends on the desired child gender and v_gender is the InterFace-identified [25] gender semantic vector. The P_gender value is obtained by experimenting with several different values and choosing the most appropriate one given the trade-off between expressed target semantics and identity preservation of the target image.

StyleDNA: We use the official StyleDNA online code implementation to evaluate the results on the FIW dataset. To evaluate performance on the NoK dataset, we re-train the DNANet-based mapping module of StyleDNA. StyleDNA uses a DNANet gene shuffling mechanism, for which we implement the maximum selection rule instead of the randomized rule to achieve deterministic results. StyleDNA inherently supports age and gender manipulation, so no other significant changes had to be made.
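The two latent edits can be summarized in a short sketch; the normalization constants follow the stated mapping of ρ ∈ {1, ..., 100} to [−1, 1], and the function name and argument layout are ours.

```python
def edit_father_code(w_f, age, p_age, v_age, p_gender, v_gender):
    """Latent edits from our HeredityGAN re-implementation (sketch of Eq. (20)
    and the gender step); `p_age` is the trained scalar, `p_gender` in {-1, 1}."""
    rho = (age - 50.5) / 49.5            # normalize age from [1, 100] to [-1, 1]
    w_f = w_f + p_age * rho * v_age      # age step along the InterFace direction
    w_f = w_f + p_gender * v_gender      # gender step along the gender direction
    return w_f
```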

APPENDIX C MATHEMATICAL SYMBOLS
This section presents the mathematical symbols frequently used throughout the manuscript for easier comprehension. The mathematical symbols and their brief descriptions are detailed in Table 7.

APPENDIX D ARCHITECTURE DETAILS
In this section, we provide more details about the architecture of the Kinship Module M and the Age & Gender Manipulation Module D. Both modules are split into three parts, namely the coarse, medium, and fine part.
The Kinship Module part consists of two BaseNets: the Attention BaseNet and the Mutation BaseNet. Both have the same basic architecture: they take two vectors of the same dimensionality as input and produce a new (single) vector of the same dimensionality. The difference between the Attention and Mutation BaseNets lies in their respective aims and their final activation functions. The Attention BaseNet is used to predict the interpolation coefficients α, which are used to attend to the parental latent codes, so it uses the sigmoid function in the last layer. The Mutation BaseNet, on the other hand, is used to predict the residual coefficients ϵ and does not apply any non-linear activation function in the last layer. The exact architecture of the Kinship Module BaseNet is shown in Table 8.
The Age & Gender Manipulation Module BaseNet follows the same design but does not feature any batch normalization or dropout layers. Table 9 shows a detailed look at the Age & Gender Manipulation Module BaseNet architecture.