UvA-DARE (Digital Academic Repository) A survey on kinship verification

In this survey, kinship verification is defined as the automatic process of verifying whether two or more persons are blood relatives (kin) by analyzing images of their faces. Kinship verification is an important research field in computer vision with many applications such as finding missing persons, family album organization, and online image search. Although substantial progress has been made in kinship verification in the past decade, there are still intrinsic (face-related, i.e., differences in facial appearance) and extrinsic (acquisition-related, i.e., varying imaging conditions) challenges, and there is still a demand for more diverse datasets. Therefore, this paper provides a survey on kinship verification methods and datasets. The survey starts with the definition of kinship verification and its corresponding intrinsic and extrinsic challenges. Then, an overview of kinship verification methods and datasets is given. Finally, a new multi-modal dataset (the Nemo-Kinship dataset) is proposed as a benchmark dataset addressing large inter-subject age variations, consisting of 4216 videos of 248 persons from 85 families. The newly collected dataset is used to systematically test and analyze state-of-the-art methods. © 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
In this survey, kinship verification is defined as the automatic process of verifying whether two or more persons, represented by images of their faces, are blood relatives, i.e., kin or non-kin [1][2][3][4][5]. Image-based kinship verification assumes facial resemblance between genetically related persons [6][7][8]. Fang et al. [9] are among the first to study kinship verification based on images of faces. Since then, kinship verification has attracted a lot of attention in computer vision and related research fields such as historical and genealogical studies [10,11], social media [12,11,13], behavior analysis [8,7,14,15], and inheritance [16]. Kinship verification is a challenging task mainly due to intrinsic (face-related, i.e., differences in facial appearance) and extrinsic (acquisition-related, i.e., varying imaging conditions) challenges. Intrinsic challenges are related to changes in age, gender, expression, ethnicity, and types of genetic relationships [4]. Extrinsic challenges correspond to the image acquisition process, such as changes in illumination, camera viewpoint, and face occlusion.
Kinship verification can be divided into three groups based on the process of feature extraction and learning: (1) (hand-crafted) feature extraction, (2) metric learning, and (3) deep learning. Early kinship verification methods focus on extracting features at facial landmarks such as eyes and noses. Hand-crafted descriptors include HOG [17], LBP [18], PEM [19], and Gabor [20] features. Later, metric learning methods are proposed to exploit (distance) metrics by maximizing inter-class and minimizing intra-class distances. More recently, deep learning is proposed to learn features and metrics simultaneously [21,22]. Different image datasets are proposed. The CornellKin dataset, proposed by Fang et al. [9] in 2010, is the first widely used image dataset. Then, the KinFaceW-I & II [23,24] public datasets are proposed containing four different kinship types (father-son, mother-son, father-daughter, and mother-daughter). Robinson et al. propose the Families In the Wild (FIW) dataset [1,25] to study kinship verification in more challenging and dynamic environments. In addition, a number of video datasets are provided [26][27][28]. Unfortunately, the major problem with these datasets remains the limited age range between subjects.
To this end, in this paper, we propose a multi-modal dataset for kinship verification containing a wider range of age variations than existing datasets. The newly collected Nemo-kinship dataset consists of 4216 videos of 85 families with 248 individuals.
This survey:
- provides a large survey on kinship verification methods and datasets;
- studies the challenges of existing kinship methods and discusses future directions;
- proposes the Nemo-Kinship dataset containing a large range of age differences between subjects.
This survey is organized as follows. In Section 2, kinship verification is discussed including kinship definition, biological background, and potential applications. An overview is given of different kinship datasets and methods in Sections 3 and 4, respectively. In Section 5, the Nemo-kinship dataset is presented. Evaluation protocols for kinship verification are given in Section 6. In Section 7, a benchmark is conducted on both public datasets as well as on the Nemo-kinship dataset. Conclusion, discussion, and future directions are outlined in Section 8.

Motivation and Background
2.1. Biological background

Kinship verification by humans
Facial information is the most commonly used identification cue for genetic similarity [7,8,29,30,31]. Images of faces contain important identification cues to determine, for example, the age, identity, gender, and ethnicity of a person [12,8]. In 1982, Daly and Wilson [6] propose to use facial similarity as a physiological cue for kinship detection, providing a basis for human kinship detection [32]. Moreover, kinship verification is used to measure direct (breeding behavior) and indirect (altruistic behavior) fitness [8]. For instance, paternal resemblance [13] has a positive effect on family relationships, and facial resemblance enhances cooperation as well as trust [33,29].

Significance and applications
The above factors indicate that kinship verification is beneficial for genealogical studies, but it also has important implications for other applications, such as arranging and managing hundreds of thousands of images online [34], and historic lineage and genealogical studies identifying otherwise inaccessible people based on their kinship similarity. Moreover, in forensic and criminal studies, kinship verification is used to reduce the number of suspects by narrowing down the search space, e.g., in the case of the Boston Marathon bombing [35]. Hence, kinship verification may have a positive influence on different domains such as genealogical studies, social media, and forensic investigation, with many applications such as automatic photo tagging and management, finding missing children, crime scene investigation, and surveillance. However, improper use of kinship verification can lead to privacy violations. Moreover, the security of a verification system may fail in the case of adversarial attacks [36][37][38][39][40] and fake facial images [41,42].

Kinship verification, recognition and identification
Mohammed et al. [43] state: "In general, kinship may indicate similarity, familiarity, or closeness between entities on the basis of some or all of the basic traits or features . . . [In] biology, kinship typically refers to the degree of genetic relatedness or coefficient of relationship between individual members of the same species [44,17,45] . . .".
In general, there are two kinship types: kinship with blood (consanguineal kinship) and marriage ties (affinal kinship) [46]. Kinship with blood ties corresponds to blood-based relationships with overlapping genes [47] and kinship with marriage ties addresses the connection based on marriage. This paper focuses on blood ties.
According to the degree of similarity between family members, kinship is classified into three groups [43,48,49].

Kinship Verification: definition
According to [43,1], kinship recognition is the task of studying blood relationships based on facial image information. Kinship verification is one of the subtasks of kinship recognition. The three major subtasks of kinship recognition are [1,43]: (1) kinship verification, defined as a binary classification task determining whether two or more persons are blood-related; (2) kinship identification, which aims to estimate the kin-type; and (3) kinship/family classification, identifying to which family an individual belongs [3,1,50]. These three tasks are interrelated and influence each other [43]. As shown in Fig. 1, kinship verification builds on the results generated by kinship classification. Furthermore, kinship verification analyzes different types of kinship relationships [43]. Hence, kinship verification plays a central role in kinship recognition.

Formulation
As discussed in Section 2.2.2, kinship verification is a binary classification task determining whether two (or more) people are kin or not. We now briefly discuss the formalized kinship verification task [51]. Most existing research focuses on bi-subject (one-versus-one) kinship verification. A canonical definition of the task is as follows. Let $S^{+} = \{(X_i, Y_i) \mid i = 1, 2, \ldots, N\}$ denote the training set of image pairs containing kin relationships for each kin-type, where $N$ is the number of positive pairs, and $X_i$ and $Y_i$ are parent and child images, respectively. Then, let the negative training set be denoted by $S^{-} = \{(X_i, Y_j) \mid i, j = 1, 2, \ldots, N;\ i \neq j\}$, representing the image pairs without kin relation. To verify the kin-types, a binary classifier $f(\cdot)$ and a feature extractor $g(\cdot)$ are used. The final output is formulated by $f(g(X_i), g(Y_j)) \in \{0, 1\}$, where 1 represents kin and 0 non-kin. There are special cases where two parents and a child are used as input. For this tri-subject (one-versus-two) kinship verification task, the positive training set is given by $\{(X_{f_i}, X_{m_i}, Y_{c_i}) \mid i = 1, 2, \ldots, N\}$ and the negative training set is denoted by $\{(X_{f_i}, X_{m_i}, Y_{c_j}) \mid i, j = 1, 2, \ldots, N;\ i \neq j\}$. The final output is given by $f(g(X_{f_i}), g(X_{m_i}), g(Y_{c_j})) \in \{0, 1\}$, where $X_{f_i}, X_{m_i}, Y_{c_i}$ denote the $i$-th sample of father, mother, and child.
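As a toy illustration of the bi-subject formulation, the sketch below instantiates $g(\cdot)$ and $f(\cdot)$ in a few lines of Python. Both the feature extractor (a simple flattening of a pixel grid) and the fixed distance threshold are hypothetical stand-ins; a real system would use a learned descriptor and a trained classifier.

```python
import math

def g(image):
    # Hypothetical feature extractor g(.): flattens a 2D pixel grid
    # into a feature vector (stand-in for a learned descriptor).
    return [p for row in image for p in row]

def f(x, y, threshold=0.5):
    # Binary classifier f(.): outputs 1 (kin) if the Euclidean distance
    # between feature vectors is below a threshold, else 0 (non-kin).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1 if dist < threshold else 0

parent   = [[0.9, 0.8], [0.7, 0.6]]
child    = [[0.8, 0.8], [0.6, 0.6]]   # resembles the parent
stranger = [[0.1, 0.9], [0.9, 0.1]]   # does not

print(f(g(parent), g(child)))     # -> 1 (kin)
print(f(g(parent), g(stranger)))  # -> 0 (non-kin)
```

The tri-subject case follows the same pattern, with $f$ taking three feature vectors instead of two.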

Kinship information
Obviously, $f(\cdot)$ needs to make full use of the kinship information in $X_i$ and $Y_j$. However, how to effectively extract kinship information is still an open question. Martello and Maloney [31] conduct experiments with 220 participants. They show that the upper half of the face contains more relevant kinship information than the lower half. Furthermore, eye regions contain slightly more useful cues than the rest of the upper half of the face. Hence, enhancing the eye, nose, and mouth areas may improve the accuracy of kinship verification. Studies [9,52,53,32] also show that such kinship-related cues can be exploited by machine-based kinship verification. However, there are conflicting findings between different studies. Gao et al. [52] show that mouth regions contain higher similarities between children and parents. In contrast, Martello and Maloney [31] show that people are better at predicting kinship without mouth regions. In addition, DeBruine et al. [11] show that the degree of similarity may vary between same-gender and different-gender pairs; same-gender pairs usually obtain higher similarities. Fig. 2b shows the important facial cues for different kin-types [54]. Features may vary for different kin-types [16].

Challenges on kinship datasets
As mentioned in Section 1, there are different intrinsic and extrinsic challenges. Compared to datasets for face recognition [21,55], kinship datasets are much smaller in size. Hence, new kinship datasets are required, in particular:
- Large-scale video-based kinship datasets.
- Kinship datasets for solving specific kinship-related problems.
In general, current kinship datasets consist of still images. However, video-based datasets contain dynamic facial and head cues, including head motion (gait), expressions, and mouth movement. Video-based datasets may increase the accuracy and robustness of kinship verification algorithms. Another important aspect of a kinship dataset is that it can be used for studying specific kinship relationships. For example, the face of a person changes over time (i.e., aging), which may negatively influence existing kinship verification methods. Therefore, a dataset containing pictures/videos of the same person over time is an important addition.

Architecture of Kinship Verification Systems
Studies [47,4,43,56] show that an automated kinship verification system can be divided into four phases: (1) face detection, (2) feature extraction, (3) similarity computation, and (4) verification. The pipeline of a kinship verification system is illustrated in Fig. 3.

Pre-processing phase
The pre-processing phase locates, detects, and segments the facial regions and separates them from the background. It ensures that the kinship verification system focuses on valuable regions to extract features. This phase also includes the normalization of head pose, illumination, and scale.
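A minimal sketch of this phase, assuming a face bounding box is already available from a detector. The box coordinates and the min-max intensity normalization below are illustrative stand-ins for a real face detector and illumination normalization, not any specific published pipeline.

```python
def preprocess(image, box, out_range=(0.0, 1.0)):
    # Toy pre-processing: crop the detected face region from the
    # background and normalize pixel intensities to a fixed range
    # (a simple stand-in for illumination/scale normalization).
    top, left, h, w = box
    face = [row[left:left + w] for row in image[top:top + h]]
    pixels = [p for row in face for p in row]
    lo, hi = min(pixels), max(pixels)
    a, b = out_range
    scale = (b - a) / (hi - lo) if hi > lo else 0.0
    return [[a + (p - lo) * scale for p in row] for row in face]

image = [
    [10, 10, 10, 10],
    [10, 50, 90, 10],
    [10, 90, 50, 10],
    [10, 10, 10, 10],
]
face = preprocess(image, box=(1, 1, 2, 2))
print(face)  # [[0.0, 1.0], [1.0, 0.0]]
```

Head-pose normalization (alignment) would typically be an additional warping step applied to the cropped region.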

Feature extraction phase
Feature extraction methods are proposed based on hand-crafted descriptors such as texture, appearance, and geometry features. Other feature extraction methods employ deep neural networks.

Similarity measurement phase
This phase measures the similarity between image pairs based on the extracted features. It includes selecting the best subset from the obtained feature or mapping the extracted features to a more prominent manifold. Different distance calculations (e.g., Euclidean and cosine distance) together with metric learning are used.
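The two distance functions mentioned above can be written as simple plain-Python helpers (shown here for illustration; libraries such as NumPy or scikit-learn provide optimized equivalents):

```python
import math

def euclidean(x, y):
    # Euclidean (L2) distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    # Cosine distance = 1 - cosine similarity; 0 for parallel vectors,
    # 1 for orthogonal vectors.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

x, y = [1.0, 0.0], [0.0, 1.0]
print(euclidean(x, y))        # -> 1.414... (sqrt(2))
print(cosine_distance(x, y))  # -> 1.0 (orthogonal vectors)
```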

Verification phase
The verification phase outputs the final result, i.e., kin or non-kin. Conventional machine learning methods that are commonly used are SVM and KNN. For deep learning methods, the classification results are usually obtained through a fully-connected (fc) layer or an MLP.
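A toy version of the KNN option: classify a test pair by the majority label among the training pairs whose distances are closest. The one-dimensional distance feature and the example values are hypothetical; a real verifier would operate on richer pair representations.

```python
def knn_verify(d, labeled_distances, k=3):
    # Toy k-NN verifier: take the k training pairs whose pair-distance
    # is closest to d and return the majority label
    # (label 1 = kin, 0 = non-kin).
    nearest = sorted(labeled_distances, key=lambda t: abs(t[0] - d))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

# Hypothetical training pairs: (pair distance, kin label).
train = [(0.1, 1), (0.2, 1), (0.3, 1), (0.8, 0), (0.9, 0), (1.1, 0)]
print(knn_verify(0.15, train))  # -> 1 (kin)
print(knn_verify(0.95, train))  # -> 0 (non-kin)
```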

Datasets
Fang et al. [9] collect the first kinship dataset. Since then, different datasets are collected to narrow the distribution discrepancy between training and real-world data. Increasingly larger datasets are proposed to support data-driven methods. Based on the number of kin-types, existing datasets can be divided into three categories: 4-types, 7-types, and 11-types (non-kin is not considered). The development of public kinship datasets is shown in Fig. 4. As depicted, blue boxes represent image-based datasets and green boxes correspond to video-based datasets. It is shown that, in the early days, kinship datasets are mainly image-based. Recently, video-related datasets are collected, and their labels are becoming more diversified. Table 1 lists the similarities and differences between datasets in a checkerboard manner; the darker the block is, the more similar the datasets are. Most datasets contain four kin-types, and most of these images are unconstrained. The number of images is usually less than 1000 (see Fig. 5).

Image-based datasets

CornellKin (2010) [9]: The first widely used image dataset, containing parent-child pairs of different race (around 50% Caucasians, 40% Asians, 7% African Americans, and 3% others), gender, and age.
UB KinFace (2011) [59,64]: Different from CornellKin, UB KinFace contains three images for each positive set, with 270 images collected in total, separated into 90 groups. Each group contains three types of images: child, young parent, and old parent. The dataset is updated to the so-called UB KinFace Ver2.0, in which the number of groups is extended from 90 to 200 and the influence of ethnicity is considered. There are four kin-types (F-S, F-D, M-S, M-D). To our knowledge, UB KinFace is the first database collecting children, young parents, and old parents for kinship verification. However, Yan et al. [65] show that the dataset is strongly imbalanced: nearly 80% of the pairs are father-son relationships.
Family101 (2013) [57]: Family101 is collected based on family trees. It contains 101 different family trees with 206 nuclear families. Each family tree contains 1 to 7 families, and each family contains 3 to 9 family members. The dataset consists of renowned (public) families, with 72% Caucasians, 23% Asians, and 5% African Americans of different gender and age. There are 607 individuals and 14816 images in total. Family101 is organized by family structure, providing a more structure-related task for kinship recognition.

Video-based datasets
As opposed to still images, videos contain face dynamics, including changes in head movements, expressions, and illumination conditions [28].
KIVI (2019) [62]: KIVI is collected from the Internet to include realistic in-the-wild variations. It contains 503 videos of individuals from 211 families, with 355 positive kin pairs. The average video duration is around 18.78 s, with a frame rate of 26.79 frames per second (fps). The total number of still frames in the database is over 250,000 [62]. In [68], over 13000 family photos of 1000 families are collected, and the number of pairs is increased from 418000 to 656954. In [69], the existing labels of each family are used as side information to add more data to under-represented families.

Others
In addition to the datasets mentioned above, datasets used for other applications also contain kinship information. The Family Face Database (FF-Database) [70] is used for the face prediction of children. It consists of 7488 parent and 8558 child faces with 128 × 128 resolution. Six facial attributes are labelled: expression, gender, age, glasses, moustache, and skin colour. The People in Social Context (PISC) dataset [71] is collected for the task of social relation recognition. It consists of common social relationships, including commercial, couples, family, friends, etc. The People in Photo Album (PIPA) dataset [72] is collected from Flickr photo albums and can be used for both person recognition and social relation recognition. Sixteen finer relationships are labelled, including fine-grained kinship relationships such as father-child. Although these datasets are created for other tasks, they can also be used for kinship verification.

Discussion
In Fig. 4 and Table 1, it is shown that image-based kinship datasets are well-developed for image-based kinship verification. In contrast, there is still a demand for video-based kinship datasets. According to Table 1, most of the datasets are collected in unconstrained settings, causing many external interference factors, and making it difficult to study kinship verification systematically.
Several kinship datasets can be used to study specific kinship problems. For example, TSKinFace can be used for the tri-subject kinship verification task, UB KinFace is suitable for kinship verification of elderly people, and TALKIN for multi-modal and sound-based kinship verification. In contrast to such specific kinship problems, general-purpose and generic kinship datasets are required. To this end, we collected the Nemo-kinship dataset for the purpose of child-adult kinship verification. This dataset is discussed in Section 5.

Kinship Verification Methods

Fig. 6 summarizes the challenges of kinship verification and the corresponding approaches. There are six intrinsic sub-challenges, i.e., age, race, gender, facial expression, posture, and kin-type, and four extrinsic ones, i.e., data imbalance, data size, unconstrained settings, and multi-modality. For each challenge, the corresponding approaches are selected and listed. Many challenges have corresponding methods, but some lack a specific solution. For example, there are currently no kinship verification methods proposed to deal with racial bias or low-quality facial images. Fig. 7 shows the development of existing methods. According to the type of input, kinship verification methods can be divided into image-based and video-based methods. Within each of these, we divide the methods into three categories according to their feature representation: (1) hand-crafted feature-based, (2) metric learning-based, and (3) deep learning-based. The hand-crafted feature-based category includes traditional hand-crafted descriptors; the extracted facial features are used by standard discriminators such as KNN and SVM. The metric learning-based category mainly focuses on projecting latent features onto more discriminative spaces; the goal of these methods is to decrease the intra-class distance of the projected features and to increase the inter-class distance. The third category is based on deep learning, such as CNNs, GANs, GCNs, and auto-encoders.

Handcrafted feature descriptors
The first kinship verification method is proposed by Fang et al. [9]. The method uses 22 hand-crafted (facial) features to represent the genetic information shared between parents and children. These features are low-level features such as color, facial geometry, and texture. Then, K-Nearest-Neighbors and SVM classifiers are trained on these features. The top 14 features are selected based on classification accuracy, showing that most of the informative parts are around the eyes. Since these features correspond to local parts, global features are also included. Later, Fang et al. [57] use the dense SIFT (dSIFT) descriptor for kinship verification. After this first publication, different hand-crafted feature extraction methods are proposed [73][74][75]46,76,58,[77][78][79][80]. Low-level features such as HOG [81], LBP [82], and LPQ are used for kinship verification.
Gabor [20,89]: Zhou et al. [90] utilize a Gabor wavelet and propose a Gabor-based gradient orientation pyramid feature for kinship verification. Xia et al. [91], and Shao et al. [92] partition the face into regions in multiple layers and then compute Gabor filters in each region to extract genetic-invariant features.
According to Yan and Lu [84], due to the large variations of faces caused by varying imaging conditions, low-level feature descriptors such as LBP and SIFT may fall short. Therefore, new approaches are proposed, such as the spatial pyramid learning-based (SPLE) feature descriptor [74] to automatically exploit both local appearance and global spatial information. The SPLE obtains improved results compared to PCA, LBP, HOG, and LE [93]. An extension of the method is provided using a new Gabor-based Gradient Orientation Pyramid (GGOP) [90].
Other methods focus on combining feature detectors such as Alirezazadeh et al. [94] targeting a combination of local and global hand-crafted features resulting in improved results (81.3% and 86.15% on dataset KinFaceW-I and KinFaceW-II, respectively). Later, Boutellaa et al. [88] use spatio-temporal features based on a combination of hand-crafted LBP, LPQ, and BSIF, and deep learning features.

Metric learning
Different from hand-crafted-feature-based methods, metric learning-based methods focus on the similarity measurement itself, i.e., decreasing the intra-class and increasing the inter-class distance of the facial features (samples) [98,84,99,100]. A distance metric is learned to measure the similarity between samples [95]. Metric learning can be divided into two categories [24,84]: unsupervised and supervised. Unsupervised methods use principal component analysis (PCA) [101], linear discriminant analysis (LDA) [102], and Laplacian eigenmaps (LE) [93]. For supervised methods, the Mahalanobis distance metric is often utilized. The distance function of an image pair is given by $d_M(x_i, y_j) = \sqrt{(x_i - y_j)^\top M (x_i - y_j)}$, where $x_i$ and $y_j$ are feature vectors of $X_i, Y_j$ extracted by $g(\cdot)$. Since $M$ is positive semi-definite, it can be decomposed as $M = W^\top W$; hence, the target is transformed from learning a distance metric $M$ to seeking a linear transformation $W$ which projects the inputs $x_i, y_j$ into a more suitable subspace. Ensemble metric learning [103], neighborhood repulsed metric learning (NRML) [24], large margin multi-metric learning (LMMML) [96], and discriminative multi-metric learning (DMML) [67,104] are representative methods.
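The equivalence between learning $M = W^\top W$ and learning a projection $W$ can be demonstrated numerically: the Mahalanobis distance under $M$ equals the Euclidean distance between the projected vectors $Wx$ and $Wy$. The matrices below are toy values, not a learned metric.

```python
import math

def mahalanobis_via_projection(x, y, W):
    # d_M(x, y) with M = W^T W equals the Euclidean distance between
    # Wx and Wy, so learning M amounts to learning the projection W.
    def project(v):
        return [sum(w * vi for w, vi in zip(row, v)) for row in W]
    px, py = project(x), project(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(px, py)))

# Identity projection recovers the plain Euclidean distance.
W = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis_via_projection([1.0, 2.0], [4.0, 6.0], W))   # -> 5.0

# A projection that down-weights the second dimension.
W2 = [[1.0, 0.0], [0.0, 0.1]]
print(mahalanobis_via_projection([1.0, 2.0], [4.0, 6.0], W2))  # sqrt(9.16) ≈ 3.03
```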
NRML [24]: Neighborhood repulsed metric learning (NRML) is proposed by Lu et al. [23], with an extension provided in [24]. NRML ensures that intra-class samples are close to each other while inter-class samples are repulsed as far as possible. Previous metric learning methods consider all samples equally, whereas NRML emphasizes the most informative (neighboring) samples by maximizing

$$J(d) = \frac{1}{Nk}\sum_{i=1}^{N}\sum_{t_1=1}^{k} d^2\big(x_i, y_{it_1}\big) + \frac{1}{Nk}\sum_{i=1}^{N}\sum_{t_2=1}^{k} d^2\big(x_{it_2}, y_i\big) - \frac{1}{N}\sum_{i=1}^{N} d^2\big(x_i, y_i\big),$$

where $y_{it_1}$ represents the $t_1$-th k-nearest neighbor of $y_i$, and $x_{it_2}$ denotes the $t_2$-th k-nearest neighbor of $x_i$. The optimization problem is solved by determining the k-nearest neighbors of $x_i$ and $y_i$ based on the Euclidean metric and then solving for $d$ iteratively.
DMML [65]: Yan et al. [65] propose discriminative multi-metric learning (DMML). DMML extracts multiple features to exploit complementary information by jointly learning multiple distance metrics. Unlike NRML, DMML maximizes a probability instead of directly minimizing the intra-class distance and maximizing the inter-class distance: each positive pair should have the highest probability of having a shorter distance than the most similar negative sample. In addition, the correlation between different features is maximized. DMML is formulated as a constrained optimization problem over the projections $W_k$, where $x_i^k$ and $y_j^k$ represent the $k$-th feature of $X_i$ and $Y_j$. The first term of the objective augments the probability that a negative pair distance is larger than the positive pair distance; the second term ensures that different features provide as much complementary information as possible. Since the objective has no closed-form solution, Yan et al. first initialize $W_k$ and $a$, and then update $W_k$ sequentially using gradient descent, with $a$ updated accordingly.
DML [67]: Discriminative metric learning uses a linear projection [104] in which $F = [x_1, \ldots, x_n, y_1, \ldots, y_n]$ denotes the training data, and $L_w$ and $L_b$ are the within-class and between-class Laplacian matrices. Wang et al. [67] propose denoising auto-encoder-based robust metric learning by combining a denoising auto-encoder (DAE) with metric learning. The projection matrix is constrained simultaneously by both the DAE and metric learning to obtain a nonlinear transformation. In the DML loss, $B_1$ and $B_2$ are the offset (bias) matrices, and the projection matrix $W$ is used as the encoding hidden layer. DML encodes the features non-linearly while maximizing the inter-class distance and minimizing the intra-class distance.
DDML [95]: Deeper non-linear representations are preferred, since linear transformations are shallow and may not be powerful enough. Similar to DML, discriminative deep metric learning (DDML) uses a deep neural network to learn a set of hierarchical nonlinear transformations that project sample pairs into an optimized feature space. Hu et al. [95] propose a deep neural network $f(\cdot)$ to generate representations of sample pairs, which are compared by the Euclidean distance between their representations. A margin framework is used to separate positive and negative pairs. As illustrated in Fig. 8a, a threshold $s$ ($s > 1$) enforces the distance of a positive pair ($l_{ij} = 1$) to be smaller than $s$ and the distance of a negative pair ($l_{ij} = -1$) to be larger than $s$. The optimization function is defined by

$$\arg\min_{f} J = \frac{1}{2}\sum_{i,j} g\big(1 - l_{ij}\,(s - d^2(x_i, y_j))\big) + \frac{\lambda}{2}\sum_{m=1}^{M}\big(\|W^{(m)}\|_F^2 + \|b^{(m)}\|_2^2\big),$$

where $g(z) = \frac{1}{\beta}\log(1 + \exp(\beta z))$ is a logistic loss function and $\beta$ is a sharpness parameter, $\|\cdot\|_F$ represents the Frobenius norm of a matrix, and $\lambda$ is a regularization parameter.
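The logistic loss $g(z)$ and the per-pair margin term can be sketched as below. The threshold and sharpness values are illustrative choices, and this is a toy sketch of the margin idea, not the authors' implementation.

```python
import math

def g(z, beta=2.0):
    # Smoothed hinge (logistic) loss: (1/beta) * log(1 + exp(beta * z)).
    return (1.0 / beta) * math.log(1.0 + math.exp(beta * z))

def pair_loss(dist_sq, label, s=3.0, beta=2.0):
    # Margin term for one pair; label = +1 for kin, -1 for non-kin.
    # Positive pairs are pushed below the threshold s, negatives above it.
    return g(1.0 - label * (s - dist_sq), beta)

# A well-separated positive pair (small distance) incurs little loss,
# while a violating negative pair at the same small distance is penalized.
print(pair_loss(0.5, +1))  # small loss (~0.02)
print(pair_loss(0.5, -1))  # large loss (~3.5)
```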
In conclusion, the combination of deep learning with the discriminative ability of metric learning is one of the promising directions of current methods.
CNNs: Zhang et al. [7] use, for the first time, a deep learning model. The proposed convolutional neural network consists of three convolutional layers and one fully connected layer. To use local information, images are cropped into different patches based on their facial landmarks. Then, the aligned patches are fed into the matched sub-models. A significant improvement is obtained compared to earlier methods [76,73,23,24]. From that moment on, different CNN-based methods [126,127,53,128,129,75,130,131,118,[132][133][134][135]116,115,120,136,137] are proposed.
In contrast to Zhang et al., Yan et al. [53] focus on attention mechanisms. They design a part-aware attention network to extract local facial information. Moreover, key point masks are added to the input images for better guidance. The architecture is illustrated in Fig. 9. Furthermore, Chen et al. [120] propose a two-stream convolutional neural network to learn parent-specific and child-specific features. Yan et al. [119] suggest a deep relational network, utilizing multi-scale features from different convolutional layers. Wang et al. [116] propose a reinforcement learning-based network. They design a negative example sampling network to select more suitable samples for learning discriminative features.
In addition to kinship information, other face-related information can be used. Zhang et al. [118] propose a two-stream adversarial convolutional network (AdvKin) model based on family ID information. A self-adversarial strategy is exploited to reduce feature distribution discrepancy. Hormann et al. [133] focus on opposite-gender pairs and propose a comparator framework with kinship relation information. Song et al. [115] propose a KinMix method to generate positive samples in the feature space. They assume that the linearly combined kinship features yield similar clustering.
Auto-encoders, GANs, and graph neural networks are also used for kinship verification. Auto-encoders: Due to their nature of preserving identity information, auto-encoders are often used to extract genetic information. Generally, the encoder obtains the latent representation by the deterministic mapping $e = f_\theta(x) = s(Wx + b)$, where $x$ denotes the input vector. The latent representation is mapped back to reconstruct the input as $\hat{x}$. The auto-encoder is optimized by minimizing the reconstruction loss $L(x, \hat{x}) = \|x - \hat{x}\|^2$ over the parameters $\theta = \{W, b\}$ [139]. Liang et al. [140] use auto-encoders to learn deep relational features. Dehghan et al. [54] propose to use gated auto-encoders with a discriminating neural network layer. Wang et al. [141] propose a deep kinship verification (DKV) model and utilize metric learning methods to extract features: first, a stacked auto-encoder network selects nonlinear low-dimensional features; then, deep kinship verification is performed by combining the stacked auto-encoder network with metric learning.
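The encoder mapping $e = s(Wx + b)$ and the reconstruction loss can be written out directly. The sketch below uses toy weights and a 2-to-1 bottleneck purely for illustration; real models are trained by backpropagation over many such layers.

```python
import math

def sigmoid(v):
    # Elementwise logistic activation s(.).
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def affine(W, x, b):
    # Computes Wx + b for a weight matrix W and bias vector b.
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def autoencode(x, W, b, W_dec, b_dec):
    # Encoder e = s(Wx + b), decoder x_hat = s(W_dec e + b_dec).
    e = sigmoid(affine(W, x, b))
    return sigmoid(affine(W_dec, e, b_dec))

def reconstruction_loss(x, x_hat):
    # L(x, x_hat) = ||x - x_hat||^2
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

x = [0.2, 0.8]
W, b = [[0.5, -0.3]], [0.1]                # 2 -> 1 bottleneck (toy weights)
W_dec, b_dec = [[0.4], [0.9]], [0.0, 0.0]  # 1 -> 2 decoder
x_hat = autoencode(x, W, b, W_dec, b_dec)
print(reconstruction_loss(x, x_hat))  # non-negative scalar to be minimized
```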
GANs: Although genetic-related information is used [140,54,57], these methods may fall short on (test) pairs with large age differences, yielding a drop in kinship verification accuracy [114,59,64]. To mitigate age and identity divergences, Wang et al. [114] propose a towards-young cross-generation model with a Sparse Discriminative Metric loss (SDM-loss). As shown in Fig. 10, images of the aged parents are regenerated at a young age while keeping the same identity. Then, the image pair is processed by a convolutional neural network constrained by the SDM-loss. The derived discriminative metric minimizes the feature gap between aged parents and children, alleviating the intrinsic side effects.
Graph neural networks: Li et al. [51] propose a graph-based kinship reasoning (GKR) network that performs relational reasoning on the extracted features. The overall framework of the GKR network is shown in Fig. 11. Features are extracted by the same convolutional neural network and built into a Kinship Relational Graph. A recursive message passing scheme is employed. The final results are computed by a predefined MLP.
Meta-learning: Deep learning-based methods show good performance in solving extrinsic challenges. One of the extrinsic challenges is that "kinship verification databases are born with unbalanced data" [117]. A kinship dataset with $N$ positive pairs contains $N(N-1)$ potential negative pairs, leading to a large imbalance. However, most current methods only use $N$ negative pairs. Recently, Li et al. [117] propose a Discriminative Sample Meta-Mining (DSMM) approach to exploit all possible pairs and learn discriminative information. As depicted in Fig. 12, a meta-miner is deployed to mine the distinctive samples by re-weighting the sample ratios in the training batch with a meta-learning strategy.
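The imbalance is easy to quantify: for $N$ annotated kin pairs $(X_i, Y_i)$, every cross pairing $(X_i, Y_j)$ with $i \neq j$ is a potential negative.

```python
def pair_counts(n_positive):
    # For N positive pairs (X_i, Y_i), any cross pairing (X_i, Y_j)
    # with i != j is a potential negative pair: N(N-1) in total.
    return n_positive, n_positive * (n_positive - 1)

pos, neg = pair_counts(1000)
print(pos, neg)  # -> 1000 999000
```

So a dataset with 1000 positive pairs offers nearly a thousand times as many candidate negatives, which is what sample-mining approaches such as DSMM set out to exploit.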

Others
There are two types of transfer learning: inductive and transductive transfer learning. For inductive transfer learning, the distribution of learning targets can be different [142,143]. In contrast, transductive transfer learning always keeps the learning target identical, but the embedded data distribution is often changed. Two transductive transfer learning methods are proposed by Xia et al. [59,64] aiming to improve the representation of latent features.
Transductive transfer learning: Xia et al. [59] formulate kinship verification as transfer subspace learning:

min_W F(W) + λ ( D_W(P_L ‖ P_U) + D_W(P_L ‖ P_V) ),

where F(W) is a general subspace learning objective (e.g., PCA [101], LDA [102], and DLA [144]). The distributions of the source, intermediate, and target sets correspond to P_U, P_L, and P_V, respectively, and D_W(P_L ‖ P_U) and D_W(P_L ‖ P_V) are Bregman divergence-based regularization terms. In effect, the intermediate set becomes the bridge connecting the other two sets. In later work, Xia et al. simplify the method by using two distributions based on pairwise differences, instead of transferring three distributions together to a general subspace. As shown in Fig. 13, the task corresponds to finding a subspace in which the two pair types (child-young parent and child-old parent) have similar distributions while remaining discriminative.

Inductive transfer learning: Inductive transfer learning is often used in deep learning methods, exploiting the feature-extracting capability of a pre-trained neural network. Robinson et al. [1] use several methods and benchmark them on the FIW data. The pre-trained convolutional neural network is taken as an off-the-shelf feature extractor. Specifically, the layers of the pre-trained VGG-Face model are frozen, except for the second-to-last fully-connected layer.

Video-based
By the end of 2017, existing kinship verification methods are mainly based on static images. However, important kinship-related information can be derived from facial dynamics/motion. For example, children may have facial expressions similar to their parents', such as smiling, anger, and astonishment [26]. Research [145] also shows that parents and children have genetic similarities in facial dynamics. Static images do not provide such information, i.e., pose variations, facial expression changes, dynamic movement, adequate 3D estimation, etc. Hence, video-based kinship datasets are required.

Handcrafted descriptors
Dibeklioglu et al. [26] are the first to use a video dataset for kinship verification. They exploit dynamic information from smiling using the UvA-NEMO Smile dataset: displacement signals of facial regions (around the eyes, cheeks, and mouth) are computed from tracked landmarks and combined with spatio-temporal appearance features.

Metric learning
Yan et al. [28] evaluate a number of metric learning-based methods using the KFVW dataset. One hundred frames are randomly extracted from each video, and the face region is cropped. Then, all images are converted to gray-scale, and LBP and HOG features are extracted for comparison. Information-theoretic metric learning (ITML), side-information-based linear discriminant analysis (SILD), KISS metric learning (KISSME), and cosine similarity metric learning (CSML) are evaluated. The final results show that LBP features obtain better performance than HOG features.
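The verification step shared by these metric-learning baselines can be sketched as follows; plain cosine similarity (an identity metric) and a hand-picked threshold stand in for the learned metrics, and the short feature vectors are toy stand-ins for LBP or HOG descriptors.

```python
# Minimal sketch of the scoring/thresholding step of kinship
# verification; the metric and threshold are illustrative assumptions.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify(feat_a, feat_b, threshold=0.8):
    """Declare 'kin' when the pair's similarity exceeds the threshold."""
    return cosine_similarity(feat_a, feat_b) >= threshold

kin = verify([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])      # similar descriptors
non_kin = verify([0.9, 0.1, 0.4], [0.1, 0.9, 0.1])  # dissimilar descriptors
```

A learned metric (ITML, KISSME, etc.) would replace the identity metric inside `cosine_similarity` with a Mahalanobis-style distance fitted on training pairs.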

Deep learning
DeepFeat: Inspired by [26], Boutellaa et al. [88] use spatio-temporal information for video-based kinship verification. Instead of using handcrafted features alone, they use a pre-trained VGG-Face network in an off-the-shelf way to extract deep features. The spatio-temporal features are extracted by three different handcrafted methods: LBP-TOP, LPQ-TOP, and BSIF-TOP. The results show that combining shallow with deep features obtains the best results.
SMNAE: Kohli et al. [27,62] propose a deep learning framework for kinship verification in unconstrained videos using a Supervised Mixed Norm Autoencoder (SMNAE). This autoencoder formulation introduces class-specific sparsity in the weight matrix W by combining an l_{2,p} norm with a pairwise class-based sparsity penalty in the loss function J_SMNAE, where φ is the activation function and L is the graph Laplacian, which can be written as L = D − M with D the diagonal degree matrix and M the adjacency matrix. They use this formulation to develop a three-stage framework, illustrated in Fig. 14. In the first stage, the video pair is split into non-overlapping vidlets, which are fed into a stacked SMNAE to yield a spatial representation. In the second stage, the learned spatial representations are concatenated pairwise and fed into the second stage's stacked SMNAE. The third stage mainly captures the global spatio-temporal information.
The encoded representation is used by an SVM for the final classification. The aim of the approach is to obtain spatial and temporal information using an autoencoder, resulting in a discriminative but sparse representation (see Fig. 15).
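The graph Laplacian L = D − M used in the SMNAE penalty can be computed as follows; the 3-sample adjacency matrix below (1 when two training samples share a kin class) is invented for illustration.

```python
# Compute the graph Laplacian L = D - M from a toy adjacency matrix M.
# D is the diagonal degree matrix holding the row sums of M.

def laplacian(M):
    n = len(M)
    D = [[sum(M[i]) if i == j else 0 for j in range(n)] for i in range(n)]
    return [[D[i][j] - M[i][j] for j in range(n)] for i in range(n)]

# Three samples: the first two share a class, the third stands alone.
M = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
L = laplacian(M)
# Every row of a graph Laplacian sums to zero.
assert all(sum(row) == 0 for row in L)
```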

Multi-label methods
Audio: As mentioned in [60], an experiment at the University of Nottingham shows that the human voice contains heritable information. Other research also shows that the voices of people with a close kin relationship degrade the performance of automatic speaker verification (ASV) [146,147]. Inspired by this observation, and assuming that the human voice contains kin-related cues, Wu et al. [60] fuse face and voice modalities to improve the accuracy and robustness of kinship verification systems. They propose a Siamese fusion network with a contrastive loss, utilizing a fine-tuned VGG-Face CNN cascaded with an LSTM network. To extract voice features, they pre-train a ResNet-50 [148] on VoxCeleb2 [149] and fine-tune it on TALKIN. These two models are trained using a contrastive loss to learn intra-class similarity and inter-class dissimilarity. After feature extraction, PCA is used to reduce the feature dimensions for both face and audio. Facial and vocal features are each reduced to 130 dimensions and concatenated to form a 260-dimensional feature. After the fully-connected layer, the outcome is evaluated using the cosine similarity. The results show that the vocal information improves the accuracy by around 3 percent.
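The late-fusion step can be sketched as follows; simple truncation stands in for PCA, the 512-dimensional input embeddings are assumed sizes for illustration, and only the 130 + 130 = 260 concatenation and cosine comparison follow the description above.

```python
# Sketch of face/voice late fusion: reduce each modality to 130-D,
# concatenate to 260-D, and compare pairs with cosine similarity.
import math
import random

random.seed(0)

def reduce_dim(feat, k=130):
    # Stand-in for PCA: keep the first k components.
    return feat[:k]

def fuse(face_feat, voice_feat):
    return reduce_dim(face_feat) + reduce_dim(voice_feat)  # 260-D vector

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

face = [random.random() for _ in range(512)]   # e.g. a face embedding
voice = [random.random() for _ in range(512)]  # e.g. a voice embedding
fused = fuse(face, voice)
assert len(fused) == 260
```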
Age: Previous studies [8,150,31,59,64] show that changes in age may negatively affect the accuracy of kinship verification. Because of the age gap, the parents' facial structures are deformed compared to their faces when they were young [151]. This suggests that transforming facial information across ages may improve accuracy. Similar ideas are proposed by Xia et al. [59,64], who transfer the distributions of children and parents to a general subspace, indirectly utilizing the age information. In addition, Wang et al. [114] regenerate parent images at younger ages, and Dehshibi et al. [152] propose an age-aware facial kinship verification method to counter aging effects when asserting kin relations.

Graph-based: Xia et al. [64] assume that people have a higher kin likelihood when they are located close together in an image. For example, in a family photo, senior people often sit in the middle surrounded by their family members. The paper utilizes this information to improve kinship verification by combining relative distance, gender relation, age difference, and kinship score. In [105], the potential relationships of people in one photo are transformed into a set of candidate graphs with all possible relationships. Then, the scores of each candidate graph are accumulated, and the graph with the highest score corresponds to the final kinship prediction.

Discussion
Details of the different methods are listed in Table 2 and Table 3. Most methods focus on the 4-type kinship verification task using public datasets collected online in an unconstrained environment. Only a few methods address kinship types with two generation skips (11 types). Family 101, KinFaceW-I&II, and CornellKin are the most frequently used datasets, and metric-learning-based methods are the most common. Many of the metric-learning-based methods that obtain high accuracy follow a similar strategy: they use multiple descriptors with different ranges of scales and deeper descriptors. On the other hand, some methods focus on specific challenges. As shown in Fig. 6, the challenge of unconstrained-image-based kinship verification is often approached by data-driven methods. However, only a few methods address the unconstrained challenges with a dedicated design. For example, to our knowledge, there is no approach that adjusts the methods to deal with pose variations and occlusion problems. As for intrinsic challenges, expression changes are mostly taken into account for video-based kinship verification. Recently, many methods focus on utilizing gender information, whereas the side effect of ethnicity is largely ignored so far. As for age differences, several methods focus on large age differences and old-parent-related tasks, but few methods address children-related pairs. The overview of methods shows that there are still many unsolved problems. The milestones in Fig. 7 show that more and more deep learning methods are used. There is also a trend to combine metric learning with deep learning.

Potential directions for kinship verification
Kinship verification is a challenging but promising task. Currently, there are still many open directions, see Fig. 16. For example, most current kinship verification methods are closed-set approaches: both testing and training data come from the same kin-type set. However, this evaluation protocol ignores many unknown relationships in real-world scenarios and omits the influence of other kin-type samples. Conducting kinship verification in an open-set environment is a promising direction. Another point is that there is a racial imbalance in the data collection and construction process, and positive and negative samples do not match real-world scenarios either. Debiasing kinship verification is therefore of great importance. In addition, there are many interfering factors: research in cross-age, cross-expression, make-up-based, and partial-face-based kinship verification is required. Due to the development of deepfakes [153], anti-spoofing becomes important, and increasing the stability of kinship systems is a promising direction. Finally, how to combine various types of data and features at different levels is still an unsolved problem.

Motivation
Kinship verification based on child-adult pairs is useful, as such pairs occur in many applications, for example child adoption and searching for missing children. The performance of kinship verification on child-adult pairs is negatively influenced by the large variations between children and adults. As shown in Fig. 17, the facial outline of the same person changes drastically between childhood and adulthood.
However, only a few researchers focus on child-adult kinship verification. One reason is the shortage of child-adult image-based kinship datasets. Given the composition of current public datasets, this specific kinship verification task cannot be studied systematically.

Data collection
The aim of the Nemo-Kinship Dataset is to collect child-adult-based kinship-related videos with multiple labels. To this end, we collect the kinship-related data from a deception-testing experiment at the NEMO Museum as part of the scientific experiments of the NEMO Science Live Program (https://www.nemosciencemuseum.nl/nl/wat-is-er-te-doen/activiteiten/sciencelive/).

Recording conditions
Fig. 15. Architecture of an audio-based method, cited from Wu et al. [148].

During the data collection process, the participants are divided into different groups based on family or friend relationships and the language they speak. Participants in each group take turns to undergo the experiment as test subjects. According to the allocated questions, all the participants' answers during the experiments are recorded and divided into 13 different video clips. The entire experiment is recorded by a web camera connected to a computer.
The web camera records video together with audio. The video has a resolution of 1920 × 1080 pixels at 60 frames per second; the audio codec is MPEG-4 AAC with stereo channels, a 48,000 Hz sample rate, and a 320 kbps bitrate. During the entire experiment, the camera's angle and position are kept the same, so all test subjects are recorded with a frontal view in a controlled environment. Incandescent lamps are arranged around the interview room to keep the lighting as stable as possible and eliminate interference caused by environmental changes.

Data annotation
After collecting all the practical information of the participants, 248 participants with kinship-related information are kept. An auto-clipping tool is created to divide each recording into separate clips according to the interview questions and answers. Each video is kept along with the audio containing the speech (see Fig. 19).

Data statistics
The dataset consists of a large proportion of children's videos, making it easier to focus on child-adult-related kinship verification tasks. It contains 4216 videos of 248 family members from 85 families, covering 11 kinship types and the age of each family member. The ages of the family members vary from 7 to 71; a child is defined as a person under 16. The age distribution of the dataset compared to KinFaceW-I is depicted in Fig. 18. Statistics of the Nemo-Kinship dataset are shown in Table 4. A comparison between the Nemo-Kinship dataset and other related datasets is listed in Table 5.
In conclusion, the Nemo-Kinship dataset provides a multi-label, child-adult-oriented video benchmark with kin-type, age, gender, and audio annotations.

Protocols
The kinship verification task is a binary classification problem. The most frequently used evaluation protocol for kinship verification is K-fold cross-validation. Because the number of training samples is limited, the test results may vary; K-fold cross-validation provides relatively more stable results. The most common choice is 5-fold cross-validation. Using the same cross-validation folds enables a fair comparison between a proposed method and methods in the literature.
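The K-fold protocol can be sketched as follows for K = 5; the round-robin fold assignment is one possible splitting scheme, and the integer pair IDs are illustrative.

```python
# Minimal 5-fold cross-validation split: each fold serves once as the
# test set while the remaining four folds form the training set.

def k_fold(items, k=5):
    folds = [items[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

pairs = list(range(20))  # e.g. 20 positive kin pairs
splits = list(k_fold(pairs))
for train, test in splits:
    assert len(test) == 4 and len(train) == 16
    assert sorted(train + test) == pairs  # every pair used exactly once
```

Sharing the fold lists (rather than re-sampling them per paper) is what makes the reported accuracies directly comparable.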

Metrics
The most frequently used evaluation metrics are classification accuracy [98] and the equal error rate (EER). For classification accuracy, the result for each kinship type is obtained by dividing the number of correct decisions by the total number of test pairs; the final accuracy is the average over all kinship types.
4 types: The best reported results on the 4-type datasets are listed in Table 6. It can be concluded that the video datasets are more difficult than the image datasets, since the highest accuracy on the KFVW dataset is 59.3%, and on TALKIN it is 74.1%. Among the image-based datasets, UB KinFace yields a lower performance when testing the 4 types with the same method: MNRML achieves an accuracy of 67.1% on UB KinFace versus 71.6%, 69.9%, and 76.5% on CornellKin, KinFaceW-I, and KinFaceW-II, respectively. KinFaceW-I and KinFaceW-II are the most frequently used public datasets. The traditional metric learning methods MNRML and DMML show relatively low performance. Compared to MNRML, DDMML performs better by utilizing multiple neural networks and exploiting the commonality of multiple feature descriptors. Based on the comparison, deeper features usually result in higher accuracy; for instance, fcDBN uses a convolutional neural network and SMNAE a mixed norm autoencoder. It can also be concluded that multiple descriptors provide better results: compared to NRML and DDML, which only use one specific feature descriptor, MNRML and DDMML achieve better results on the KinFaceW-I&II datasets. A part-aware feature extractor is also helpful: the Attention Network uses a part-aware attention module, and fcDBN uses hierarchical representations with local and global facial regions; both show good performance. In conclusion, deeper feature extractors, multiple descriptors, and part-aware extractors are useful for kinship verification.
7 types: Table 7 shows the performance of different methods on the 7-type kinship datasets. Compared to 4-type datasets, fewer methods focus on 7-type datasets. To our knowledge, fcDBN is the only method tested on the WVU kinship dataset. As for the video-based datasets, the UvA-NEMO Smile dataset is the most widely used; the best performance on this dataset is 93.6%. Among these methods, Dibeklioglu et al. [26] reach an average accuracy of 72.9% using traditional descriptors that combine facial dynamics and spatio-temporal appearance, while Boutellaa et al. [88] achieve an average accuracy of 89.8% by combining spatio-temporal information with deep features from VGG-Face. (See Table 8, Table 9, Table 10.)
11 types: FIW is the largest image-based dataset. All methods evaluated on the FIW dataset are based on neural network architectures.

Human evaluation
The evaluation of kinship verification by humans often occurs in social-analysis-related research [12,8,11,13,10,14,15,33,29]. The participants assessing the image pairs are usually divided by age, gender, race, career, etc. In a number of studies, the participants tend to be specialists or students with basic psychological knowledge. In [8], 59 undergraduate students with an average age of 21.6 took part in the kinship verification test; these students all received partial credit in an introductory psychology course. In early research, human evaluation of kinship verification is questionnaire-based: researchers show the images to the participants without any labels, and the participants write down their judgment of the kinship. In some experiments, the judgment time is recorded.
In recent years, machines are used to make the experimental data more accurate. In [8], random stimuli appear on the screen, and the participants judge the kinship between the pairs shown in the stimuli; the response time is limited to 20 s, and the responses of the participants are recorded. A degree of relatedness is finally derived from the recorded kinship assessments. In [56], the researchers use the Amazon Mechanical Turk (MTurk) crowd-sourcing service to evaluate a set of kinship verification pairs. In these experiments, the MTurk participants are anonymous. As in the previous experiment, the pair of face images is displayed on the screen, and the participants' answers are recorded by clicking the corresponding button. The final evaluation is the average score of all correct answers by all participants. Lopez et al. [56] show that on KinFaceW-I and KinFaceW-II, humans reach a performance within a range of 75% to 85%. Both [8,56] show that humans perform better especially on M-D (mother-daughter) relationships.

Fig. 19. Crowd-sourced human evaluation of kinship using Amazon Mechanical Turk, cited from Lopez et al. [56].

Table 4. Statistics of the Nemo-Kinship dataset (one column per kin type):
Pairs: 34 30 42 46 15 15 31 5 6 3 3
Children related: 33 26 40 44 12 14 28 5 6 3 3
Family numbers: 26 25 34 37 15 13 26 2 5 3 2
English speakers: 20 15 18 20 10 6 8 0 0 2 0
Dutch speakers: 40 40 57 64 20 21 49 6 10 4 5
Male: 60 25 40 0 0 27 28 4 0 3 5
Female: 0 30 35 84 30 0 29 2 10 3 0
Individuals: 60 55 75 84 30 27 57 6 10 6 5
Total individuals: 248

Different methods are re-implemented. All videos are pre-processed by a face detector and aligned with the same eye positions.

Data post-processing
The pipeline of the post-processing of the Nemo-Kinship dataset is shown in Fig. 21. Firstly, we extract the members and divide them into 11 categories according to their kin types. Since the number of samples with a secondary kinship is small, we only use seven kin relations for testing: M-D, M-S, F-D, F-S, B-B, B-S, and S-S. We extract one video per person from the Nemo-Kinship dataset with the "yes" answer. Secondly, we convert the video of each person into 100 frames. The faces are cropped to 160 × 160 pixels according to the bounding box of the detected face. Then, we align each image according to the landmarks, adjusting each face so that all eye positions are fixed. Thirdly, all family members are re-arranged into seven kinship-type folders. The entire dataset is trained and tested by 5-fold cross-validation; therefore, we generate a cross-validation list of five folds for each kinship type for training and testing.

Table 8. Results of the methods on 11-type relation datasets.

NRML [73], CNN-points [7], Attention Networks [53], Sphereface-baseline [16,173], and Vuvko [134] are used. NRML is the traditional and widely-used metric learning method. CNN-points is the first deep learning method. Attention Network and Vuvko are more recent methods. Sphereface-baseline is the benchmark method for the Recognizing Families In the Wild (RFIW) data challenge in 2020 and 2021.
Vuvko reaches state-of-the-art results on the kinship verification track of RFIW2020. The performance of the different methods is listed in Table 11. Among these methods, Vuvko shows the highest accuracy. Vuvko utilizes information from the face recognition task and selects arcface_r100_v1 [174] as the backbone. The results show that face verification information helps to improve kinship verification. Comparing the Attention Network with and without masks, it can be concluded that using masks improves the results. For video-based methods, the combination of deep and shallow features [88] is considered.

Table 9. Best reported performance on different kinship verification datasets.

Feature representations
To study the influence of different features, SIFT, LBP, HOG, VGG-Face, and Facenet are selected as basic descriptors. SIFT is one of the most widely used feature descriptors in image recognition and classification; following [25,24], the images are divided into 16 × 16 blocks with a stride of 8, and a 128-D SIFT feature is extracted from each block and concatenated. The LBP [85] features are extracted following the implementation of [23]: the image is first divided into 16 non-overlapping 16 × 16 blocks, the radius is set to 2, and the number of sampling points to 8. The extracted features are represented by 256-bin histograms per block, forming a 4096-D (256 × 16) feature. Unlike the traditional descriptors, VGG-Face and Facenet are used as off-the-shelf face encoders following the settings of [25]. The similarities of image pairs based on the different features are calculated by cosine similarity with a certain threshold. The ROC curves of the different features are shown in Fig. 20: Facenet and VGG-Face features provide the best results, and test pairs within the same generation (brother-brother, sister-sister, and brother-sister) obtain more distinctive features.
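The block-wise LBP descriptor can be sketched as follows; radius-1 neighbors are used for brevity (the experiments above use radius 2 with 8 sampling points), and the toy 3 × 3 image is illustrative only.

```python
# Minimal LBP sketch: each pixel is compared with its 8 neighbors to form
# an 8-bit code, and a 256-bin histogram of codes describes a block.

def lbp_code(img, y, x):
    c = img[y][x]
    # Clockwise 8-neighborhood at radius 1.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offs):
        if img[y + dy][x + dx] >= c:
            code |= 1 << bit
    return code

def block_histogram(img):
    """256-bin histogram of LBP codes over interior pixels of one block."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist

img = [[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]]
hist = block_histogram(img)  # one block histogram; 16 such blocks are
                             # concatenated in the setup described above
```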

Discussion
The results of the different methods on the public datasets and our newly proposed Nemo-Kinship dataset show that the current methods (NRML, CNN-basic, CNN-points, Attention Network, Vuvko) provide better results on the public datasets. This can be attributed to the fact that the Nemo-Kinship dataset contains more samples of children and adults, and these samples show larger differences in facial appearance.

Conclusion
This survey provides a comprehensive review of public datasets and representative methods for kinship verification. Representative methods are categorized and compared based on their feature representations: (1) hand-crafted feature-based, (2) metric learning-based, and (3) deep learning-based. Also, this review studies current kinship challenges according to intrinsic factors (face, i.e., differences in facial appearance) and extrinsic factors (acquisition, i.e., varying imaging conditions). New promising directions are discussed based on current advances in kinship research. Open-set kinship verification and debiasing kinship verification are largely ignored so far and are promising for the kinship verification task in the future. Through the analysis of current kinship verification datasets, we believe that there is still a need for more kinship datasets for specific problems; more video-based kinship datasets are in demand. Therefore, a new video dataset is presented as a benchmark for a child-adult-based kinship verification task. This dataset consists of 248 subjects from 85 families and contains age, gender, and audio information. This benchmark is used to systematically test and analyze current state-of-the-art methods.

Theo Gevers is a Professor of computer vision at the University of Amsterdam. He is the Director of the Computer Vision Laboratory and Co-Director of the Atlas Laboratory and Delta Lab in Amsterdam. He is the Co-Founder of the AI technology companies 3DUniversum and Sightcorp. His research area is artificial intelligence with a focus on computer vision and deep learning, in particular image processing, 3D (object) understanding, and human-behavior analysis. He has authored or co-authored more than 230 papers and three books.