Few-shot short utterance speaker verification using meta-learning

Short utterance speaker verification (SV) in practical applications is the task of accepting or rejecting the identity claim of a speaker based on a few enrollment utterances. Traditional methods use deep neural networks to extract speaker representations for verification. Recently, several meta-learning approaches have learned a deep distance metric to distinguish speakers within meta-tasks. Among them, the prototypical network learns a metric space in which the distance to the prototype center of each speaker can be computed to classify speaker identity. We use emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN) to implement the embedding function required by the prototypical network, a nonlinear mapping from the input space to the metric space, for the few-shot SV task. In addition, optimizing only for speakers in given meta-tasks may not be sufficient to learn distinctive speaker features. Thus, we use an episodic training strategy in which the classes of the support and query sets correspond to the classes of the entire training set, further improving model performance. The proposed model outperforms comparison models on the VoxCeleb1 dataset and has a wide range of practical applications.


INTRODUCTION
With the widespread application of information technology, more and more scenarios require user identity verification, such as online payments and application logins. Among biometric verification methods, speaker verification (SV) (Sarkar & Tan, 2021) has the advantages of convenience and contactless operation over other methods such as fingerprint recognition. The goal of SV is to verify whether the speaker of a given test sample is the enrolled speaker, given a few utterances for each speaker. However, existing SV methods need long speech of more than 15 s, or tens of utterances, to perform accurately, which limits their wide application. Therefore, research on short utterances within 10 s, or even 2 to 5 s, is of great significance to SV technology (Das & Prasanna, 2018; Poddar, Sahidullah & Saha, 2018; Liu et al., 2022).
One of the most popular meta-learning methods is the prototypical network (Ko, Chen & Li, 2020), which learns an embedding network that transforms the original input into a metric-space representation. In the metric space, classification is performed by calculating the distance from each test sample to the prototype center of each class (the classification loss in this process is called the prototypical network loss). Kumar et al. (2020) used the prototypical network (PN) as a generalized learning method for speaker embedding. Ko, Chen & Li (2020) used PN for the first time for SV tasks; when the number of samples per speaker is limited, PN performs better than traditional methods. Kye et al. (2020) combined PN with global classification over all samples, achieving significant performance for speaker recognition with imbalanced-length pairs.
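To make the prototypical-network idea concrete, the following toy sketch computes class prototypes as the mean of support embeddings and classifies a query by its nearest prototype. The embeddings and speaker names are illustrative placeholders, not outputs of a real embedding network.

```python
import math

def prototype(support_embeddings):
    """Mean of the support embeddings for one class (the class prototype)."""
    dim = len(support_embeddings[0])
    return [sum(e[d] for e in support_embeddings) / len(support_embeddings)
            for d in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query, prototypes):
    """Predict the class whose prototype is nearest to the query embedding."""
    return min(prototypes, key=lambda name: euclidean(query, prototypes[name]))

# Two speakers, two 2-D support embeddings each (toy numbers).
support = {"spk_a": [[0.0, 0.0], [0.2, 0.0]],
           "spk_b": [[1.0, 1.0], [0.8, 1.0]]}
protos = {name: prototype(embs) for name, embs in support.items()}
predicted = classify([0.1, 0.1], protos)  # a query near spk_a's prototype
```

In an actual system, `prototype` and `classify` would operate on embeddings produced by the learned network, and the distance would be computed in the learned metric space.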
The existing short utterance SV methods based on deep learning depend on large-scale datasets with thousands of speakers or tens of thousands of utterances (Xie et al., 2019; Nagrani et al., 2017). In addition, although the number of speakers per task is usually large, the classification objective of deep learning represents a single task, limiting the diversity of training tasks. Unlike deep learning methods, meta-learning aims to enhance the learning algorithm itself by considering the experience of multiple tasks. By training on different meta-tasks, meta-learning achieves fast generalization (Kumar et al., 2020). However, optimizing only for the classes in given meta-tasks may not be sufficient to distinguish speakers. Thus, we perform global classification (GC) in an episodic manner, in which the classes of the support set and the query set correspond to the classes of the entire training set. Kye et al. (2020) used global classification to address the poor performance of speaker recognition models in real-world scenarios where the lengths of the enrollment and test utterances are imbalanced; their model was trained to match long-short utterance pairs and achieved significant performance gains. We use PN and global classification with episodic training for few-shot short utterance SV. It is worth noting that a good embedding model can adjust the distance between class prototypes, making the prototypes easier to classify. ECAPA-TDNN has good feature extraction capabilities for the SV task, with channel- and context-dependent attention mechanisms, Squeeze-Excitation (SE), multi-layer feature aggregation, and residual blocks. Therefore, it is used to learn meta-task embeddings for few-shot short utterance SV, so that in the metric space the distance between a query and its prototype is closer than the distance between an unknown speaker and that prototype.
In summary, our main contributions are as follows: (1) We formulate a meta-learning approach with episodic training for few-shot short utterance SV. Meta-learning considers the experience of many meta-tasks, which helps distinguish speakers.
(2) ECAPA-TDNN is used to implement a nonlinear mapping from the original input to the embedding space on the meta-tasks, making the class prototypes far apart from each other in the embedding space while each query sample clusters toward its own class prototype. We call the ECAPA-TDNN-inspired prototypical network ETP.
(3) An episodic training strategy is designed to optimize the model for generating discriminative speaker features, which combines prototypical network and global classification.

PRELIMINARY
In this section, we introduce meta-learning, focusing on how it differs from machine learning methods in terms of definition and speaker verification protocol. Meanwhile, metric-based meta-learning is discussed. To make the narrative clearer, the frequently used notations in Section 2 are illustrated in Table 1.

Meta-learning
Meta-learning is usually understood as ''learning to learn'': it aims to learn from the experience of historical tasks so that the model learns how to better acquire knowledge and master new tasks quickly, while maintaining accuracy (Kumar et al., 2020; Hospedales et al., 2020). In short, it learns how to learn across tasks.
To further explain the concept of meta-learning, we compare machine learning and meta-learning. Machine learning learns a model from a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}. Given inputs and labels, a predictive model ŷ = f_θ(x) with parameters θ is trained so that the predicted values are as close to the true values as possible. The optimal model parameters are:

θ* = argmin_θ L(D; θ, φ)

where L(·) is the loss function that computes the error between the true and predicted values, and φ is pre-specified. Table 1 summarizes the mathematical notations and parameters used in Section 2.

D, D_t: the whole dataset for machine learning; the t-th meta-task (or episode) dataset.
S, Q: the support set S; the query set Q.
x, y: sample; label.
F_φ(·), φ: the learning algorithm that can learn the base model; φ is its learnable hyperparameters.
Meta-learning transfers knowledge across tasks, rather than learning from scratch for each task (Baik et al., 2021). It is assumed that φ is learnable rather than pre-specified. Figure 1 shows the meta-train phase; because images are more intuitive than speech, image classification is used as an example. Given T meta-tasks (or episodes) denoted as {D_t}, t = 1, ..., T, researchers train a learning algorithm F_φ(·) that can learn the base model ŷ = f_θ*(x) by solving:

φ* = argmin_φ Σ_{t=1}^{T} L(D_t; θ_t*, φ)

Each meta-task (or episode) dataset is denoted as D_t = (S, Q)^(t), consisting of a training set and a test set, also known as the support set S and the query set Q. The support set is used for learning and training F_φ(·). The query set is used to calculate the loss of the model f_θ(·) learned by F_φ(·); according to the loss value, the model parameters are updated by backpropagation. The parameters of the t-th meta-task base model are:

θ_t* = F_φ(S^(t))

In summary, in the base learning process, base tasks such as speaker recognition, defined by a single task dataset and training objective, are solved. In the meta-learning process, the meta-task, based on the meta-objective and the meta-task datasets, is used to update the base model (Sun et al., 2019; Lang et al., 2022). Most meta-learning methods are applied to few-shot tasks (Chang et al., 2022): a model trained on a small number of samples can quickly adapt to and master a new few-shot task. The architecture of a meta-learning model is similar to that of a deep learning model; it is logically divided into a classifier and a feature extractor, where the feature extractor is a deep neural network.

Metric-based meta-learning
Metric-based meta-learning aims to learn an embedding network that transforms the raw input into a metric-space representation. In the metric space, the class is predicted by comparing the similarity between query-set samples and support-set samples. The most popular metric-based meta-learning methods include prototypical networks, siamese networks (Koch, Zemel & Salakhutdinov, 2015), relation networks (Sung et al., 2018), and matching networks (Vinyals et al., 2016). The predicted probability over a set of known labels y is a weighted sum of the labels of the support-set samples, where the weight is generated by a metric function d_θ(·) that computes the similarity between two samples.
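The weighted-sum prediction described above can be sketched as follows, in the style of matching networks; the cosine similarity, softmax weighting, and the toy support set are illustrative assumptions, not the specific metric function used by any one of the cited methods.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def predict(query, support):
    """support: list of (embedding, label) pairs. Returns {label: probability},
    a similarity-weighted sum over the support-set labels."""
    weights = softmax([cosine(query, emb) for emb, _ in support])
    probs = {}
    for w, (_, label) in zip(weights, support):
        probs[label] = probs.get(label, 0.0) + w
    return probs

support = [([1.0, 0.0], "spk_a"), ([0.0, 1.0], "spk_b")]
probs = predict([0.9, 0.1], support)  # query closer to spk_a's support sample
```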

Speaker verification protocol
The deep learning-based SV process can be divided into three phases. During the training phase, a large number of speaker utterances are fed into the neural network, which learns a predictive model to classify the speakers. During the enrollment phase, utterances of new speakers (different from the speakers in the training phase) are input into the trained model, with the classification layer removed, to generate a speaker model; each new speaker has its own speaker model. During the evaluation phase, the utterance to be verified is input into the trained model to obtain its embedding representation. Then, the similarity between this embedding and the target speaker model is calculated, and the speaker is judged to be the target speaker according to the similarity score and a preset threshold: if the score exceeds the threshold, the speaker of the tested utterance is confirmed as the target speaker, and vice versa. The meta-learning-based SV process differs from the deep learning-based one and includes a meta-train phase and a meta-test phase. During the meta-train phase, a large number of training meta-task sets are input into the neural network. In each episode, the support set is used to train the model F_φ(·), and the query set is used to calculate the loss of the model f_θ(·) learned by F_φ(·). The loss values of all meta-tasks are summed to obtain the model loss; according to this loss, the model parameters are updated by backpropagation until convergence, and thus the model is successfully trained. During the meta-test phase, in each episode, the support set is used to adapt the new SV meta-learner, and the query set is used to evaluate how well the meta-learner adapts to unseen SV tasks.
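The enrollment and evaluation phases reduce to a simple decision rule once embeddings are available. The sketch below averages enrollment embeddings into a speaker model and compares a test embedding against it with cosine similarity and a threshold; the embeddings and the 0.5 threshold are illustrative assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def enroll(utterance_embeddings):
    """Average the enrollment embeddings to form the speaker model."""
    dim = len(utterance_embeddings[0])
    return [sum(e[d] for e in utterance_embeddings) / len(utterance_embeddings)
            for d in range(dim)]

def verify(test_embedding, speaker_model, threshold=0.5):
    """Accept the identity claim if the similarity score exceeds the threshold."""
    return cosine(test_embedding, speaker_model) >= threshold

model = enroll([[0.9, 0.1], [1.0, 0.0]])   # two toy enrollment embeddings
accept = verify([0.95, 0.05], model)       # same-speaker trial
reject = verify([0.0, 1.0], model)         # impostor trial
```

In practice the threshold is tuned on a development set to trade off false acceptances against false rejections.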

METHOD

Problem setup
Suppose that D is the entire training set, which is divided into several episodes to mimic the few-shot SV task. In each episode, N speakers are randomly selected from the training set, and K + M samples are randomly selected for each speaker. Meta-tasks include a support set S = {S_1, ..., S_N} and a query set Q = {Q_1, ..., Q_N}. S_n = {(x_{n,i}, y_n)} and Q_n = {(x_{n,i}, y_n)} respectively represent the labeled sample sets of the n-th speaker in the support set and the query set, where K and M are the numbers of utterances in S_n and Q_n, respectively. x_{n,i} denotes the i-th utterance of the n-th speaker, and y_n = n is the corresponding label of x_{n,i}.
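The episode construction above can be sketched as follows; the speaker pool, names, and utterance IDs are illustrative placeholders, not the paper's actual data pipeline.

```python
import random

def sample_episode(pool, n_way, k_shot, m_query, rng=random):
    """pool: {speaker: [utterances]}. Draw N speakers, then K support and
    M query utterances per speaker. Returns (support, query) dicts."""
    speakers = rng.sample(sorted(pool), n_way)
    support, query = {}, {}
    for spk in speakers:
        utts = rng.sample(pool[spk], k_shot + m_query)
        support[spk] = utts[:k_shot]
        query[spk] = utts[k_shot:]
    return support, query

# Toy pool: 20 speakers with 10 utterance IDs each.
pool = {f"spk_{i}": [f"utt_{i}_{j}" for j in range(10)] for i in range(20)}
support, query = sample_episode(pool, n_way=5, k_shot=1, m_query=2)
```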

Learning embedding for few-shot short utterances SV
The key to the metric-based meta-learning approach for the few-shot SV task is to learn meta-task embeddings (Ye et al., 2020), where embeddings from the same speaker are closer than embeddings from different speakers. Therefore, we learn meta-task embeddings to modify the prototypes and make them easier to distinguish. The overall architecture of ETP is shown in Fig. 2. The raw utterances in the support set and query set are pre-processed (pre-emphasis, framing and windowing, short-time Fourier transform, and Mel-filterbank filtering are performed sequentially) to obtain Mel-filterbank (MFB) features (Ohi et al., 2021). One utterance corresponds to one MFB feature matrix with 80 rows and H columns, where 80 is the dimension of a frame of MFB features and H is the number of frames. The MFB feature matrix is used as the input of ETP for feature extraction. We propose ETP, which integrates ECAPA-TDNN into the prototypical network to implement a nonlinear mapping from the original input to the metric space on the meta-tasks; in the metric space, the distance between a query and its prototype is closer than the distance between an unknown speaker and that prototype. ECAPA-TDNN combines the advantages of the x-vector and ResNet architectures, adding residual connections between frame-level layers to enhance speaker characteristics and avoid gradient degradation. The convolution kernel of the CNN has a fixed height equal to the dimension of the speech frame, so convolution is performed along the frame axis. We build three SE-Res2Blocks, using one-dimensional dilated convolution with dilation factors of 2, 3, and 4, and concatenate the outputs of the three SE-Res2Blocks. ASP introduces an attention mechanism into the statistics pooling layer to calculate the importance of each frame; the attentive pooling is combined with the standard deviation for aggregation, which can represent features at any distance in the context and capture the long-term characteristics of speakers more effectively. The output features of ASP are mapped to 256-dimensional features through a fully connected layer (FC).

ECAPA-TDNN
SE-Res2Block consists of two convolutional layers, a Res2 Dilated Conv1D module, and an SE block, which together effectively learn feature information. As shown in Fig. 3, the kernel size of the two convolutional layers is set to c × 1, and the kernel size of Res2 Dilated Conv1D is set to (c/s) × 3. Dilated convolution layers with different dilation factors in Res2 Dilated Conv1D effectively expand the receptive field of the convolution layer without additional computational complexity. We use batch normalization (BN) and the ReLU activation function between layers; in addition, a residual connection is constructed to avoid vanishing or exploding gradients. Res2 Dilated Conv1D with scale dimension s processes multi-scale features through hierarchical residual connections internally, which is beneficial for extracting local and global information (Gao et al., 2019). It also uses one-dimensional dilated convolution to expand the receptive field and obtain more useful information without changing the size of the convolution kernel (Zhang, Wang & Jung, 2018). Time-dilated convolution can be written as:

(X *_l w)(t) = Σ_k X(t + l · k) w(k)

where X represents the speech signal, w the convolution kernel, and l ∈ Z+ the dilation factor, i.e., the interval at which the convolution kernel samples the data. In the Res2 Dilated Conv1D module, the number of frames of x is H, with x_i ∈ R^{c′ × H}. We divide x into s subsets x_i, i ∈ {1, 2, ..., s}, replacing the c-channel convolution kernels with a set of c′-channel convolution kernels (c = s × c′); this changes the number of channels. The convolution kernel group is connected layer by layer: expressed mathematically, except for x_1, each feature subset x_i has its corresponding convolution kernel w_i.
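A minimal pure-Python illustration of the time-dilated convolution defined above: the kernel taps are spaced l samples apart, widening the receptive field at no extra cost. The signal and kernel values are toy numbers.

```python
def dilated_conv1d(x, w, l):
    """Valid-mode dilated convolution of signal x with kernel w, dilation l.
    The kernel reads samples x[t], x[t + l], x[t + 2l], ..."""
    span = (len(w) - 1) * l            # receptive field minus one
    return [sum(w[k] * x[t + k * l] for k in range(len(w)))
            for t in range(len(x) - span)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
out = dilated_conv1d(x, [1.0, 1.0, 1.0], l=2)  # taps at t, t+2, t+4
```

With a length-3 kernel and l = 2, each output covers 5 input samples instead of 3, which is exactly how the SE-Res2Blocks enlarge their temporal context.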
We add the current feature subset x_i to the output y_{i−1} of the previous convolution operation, and then convolve the sum with the current kernel w_i; the output is y_i, and this continues until all feature subsets are processed. y_i is given by formula (6):

y_i = x_i, i = 1;  y_i = w_i ⊛ (x_i + y_{i−1}), 1 < i ≤ s   (6)

All the features y_i are concatenated and sent to a set of convolutional layers with a c × 1 kernel for information fusion to obtain the feature data. Since the convolutional layer does not effectively use the channel information of the features, SE is introduced to capture the channel relationships and improve the performance of the system. First, global average pooling is used to compress global spatial information into channel-level statistics (Hu, Shen & Sun, 2018). The squeeze operation reduces the time dimension to generate statistics z ∈ R^C; the c-th channel of z is given by:

z_c = F_sq(u_c) = (1/H) Σ_{t=1}^{H} u_c(t)   (7)

where u_c represents the c-th channel of the feature map U. Secondly, two FCs are used to capture the interdependencies between the channels and assign a weight to each channel feature. This excitation operation is:

s = F_ex(z) = σ(W_2 δ(W_1 z))   (8)

where σ is the sigmoid function, δ is the ReLU function, and F_ex(·) denotes the excitation operation. Finally, the weight of each feature channel is multiplied by that channel's features, so that the network selectively focuses on important features and suppresses unnecessary ones, achieving adaptive recalibration of the feature channels. The multiplication of feature u_c and scalar s_c is given by formula (9):

ũ_c = F_scale(u_c, s_c) = s_c · u_c   (9)
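The squeeze, excitation, and rescaling steps of formulas (7)-(9) can be traced in a pure-Python sketch; the toy feature map and the weight matrices w1 and w2 are illustrative assumptions, whereas a real SE block learns W_1 and W_2 by backpropagation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(u, w1, w2):
    """u: C x T feature map (list of channels). Squeeze to per-channel means,
    excite through two small layers (ReLU then sigmoid), then rescale each
    channel by its gate value."""
    z = [sum(ch) / len(ch) for ch in u]                                 # (7) squeeze
    h = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]  # ReLU layer
    s = [sigmoid(sum(w * hc for w, hc in zip(row, h))) for row in w2]   # (8) gate
    return [[sc * v for v in ch] for sc, ch in zip(s, u)]               # (9) rescale

u = [[1.0, 3.0], [2.0, 2.0]]   # 2 channels, 2 frames (toy)
w1 = [[0.5, 0.5]]              # bottleneck of size 1 (toy weights)
w2 = [[1.0], [-1.0]]           # per-channel gate weights (toy)
scaled = se_block(u, w1, w2)
```

With these toy weights, channel 0 receives a gate near 0.88 and channel 1 near 0.12, so the second channel is suppressed, which is the adaptive recalibration the text describes.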

Episodic training
We train the prototypical network in an episodic manner. The metric space of the prototypical network is shown in Fig. 4. The prototypical network learns a metric space in which the distance from each test utterance to the prototype center of each speaker is calculated. First, the prototype center p_n of each speaker is computed as the average of all support samples of that speaker, as in formula (10):

p_n = (1/K) Σ_{i=1}^{K} f(x_{n,i})   (10)

where n = 1, 2, ..., N, i = 1, 2, ..., K, and f(·) is the SV model, which takes MFB features as input and extracts speaker features. Then the distance distribution between each query sample and the prototype centers of the N speakers is calculated as in formula (11):

P(y = n | x) = exp(−d(f(x), p_n)) / Σ_{n′=1}^{N} exp(−d(f(x), p_{n′}))   (11)

where d(·) is a cosine distance function measuring the distance between the query sample and the class prototype centers. Finally, the loss of the meta-task is:

L_PN = −(1/(N · M)) Σ_{n=1}^{N} Σ_{x∈Q_n} log P(y = n | x)   (12)

Given a support set containing the target classes, we calculate the prototype center of each target class and classify according to the closest metric distance. However, optimizing only the meta-task model may not be sufficient to distinguish speakers. Therefore, each sample of each meta-task is classified globally against the whole dataset, so that the model can better recognize speakers. Assume that each class has a global prototype, ω = {w_n ∈ R^d | n = 1, ..., N}, where N here denotes the number of speakers in the entire training set and d is the dimension of the speaker feature. Then the probability that utterance x belongs to class y is:

P(y | x) = exp(−d(f(x), w_y)) / Σ_{n=1}^{N} exp(−d(f(x), w_n))   (13)
Then, the global loss is calculated as Eq. (14):

L_GC = −(1/|Q|) Σ_{(x,y)∈Q} log P(y | x)   (14)

Finally, the meta-task loss and the global loss are added:

L = L_PN + L_GC   (15)
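The combined objective of Eq. (15) can be traced numerically. The sketch below evaluates a softmax over negative squared distances for one query, once against the episode's prototypes (the PN loss term) and once against a larger set of global prototypes (the GC term); the embeddings and prototypes are illustrative numbers, not learned quantities, and squared Euclidean distance stands in for the paper's cosine distance.

```python
import math

def log_softmax_neg_dist(query, prototypes, target):
    """-log softmax over negative squared distances to each prototype,
    i.e. the cross-entropy term of formulas (11)-(12)."""
    dists = [sum((q - p) ** 2 for q, p in zip(query, proto))
             for proto in prototypes]
    logits = [-d for d in dists]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[target] - log_z)

episode_protos = [[0.0, 0.0], [1.0, 1.0]]             # prototypes in the episode
global_protos = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # whole training set (toy)
query, label = [0.1, 0.0], 0

pn_loss = log_softmax_neg_dist(query, episode_protos, label)
gc_loss = log_softmax_neg_dist(query, global_protos, label)
total = pn_loss + gc_loss   # Eq. (15)
```

Note that the global term sums over more classes, so the same query incurs a slightly larger loss there: the model must separate the speaker from the whole training population, not only from the episode's speakers.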

Data representation
This article uses 80-dimensional MFB features, with a 25 ms window and a 15 ms frame shift, as the input of the model. We normalize each speech frame by subtracting the mean and dividing by the standard deviation of all frequency components, without performing any voice activity detection (VAD) or data augmentation. During training, each episode is 100-way 1-shot with 2 query samples per speaker, and the utterance length is set to 2 s. If the duration of an utterance is less than 2 s, the segment is repeated until it reaches 2 s.
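The length policy described above amounts to crop-or-tile over the sample axis. A minimal sketch, with integer placeholders standing in for audio samples:

```python
def fix_length(samples, target_len):
    """Crop to target_len, or tile a short utterance up to target_len."""
    if len(samples) >= target_len:
        return samples[:target_len]
    reps = -(-target_len // len(samples))   # ceiling division
    return (samples * reps)[:target_len]

short = [1, 2, 3]
padded = fix_length(short, 8)   # tiled copy of the 3-sample segment
```

In the actual pipeline the list would be a waveform at the audio sample rate (e.g. 2 s at 16 kHz is 32,000 samples), but the crop/repeat logic is the same.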

Implementation details
We implement the model with 512 channels in the convolutional layers using PyTorch. When only the global classification objective is used, the mini-batch size is 256; when PNL and GC jointly optimize the model, the episode size is 100. We use the SGD optimizer with momentum 0.9 and weight decay 2e−4. The initial learning rate is 0.1 and is decayed by a factor of 10 until convergence. The experiments were run on NVIDIA V100 and T4 GPUs.

Baseline models
x-vector. The pre-trained x-vector model, except for the final layer, is used as the initialization model (Kumar et al., 2020). The Adam optimizer is used with an initial learning rate of 1e−3, reduced to 1e−6. Dropout and batch normalization are used at all layers for regularization.
ThinResNet-34. The model is trained using the Adam optimizer. The initial learning rate of 0.001 is reduced by a factor of 10 every 36 epochs until convergence. The mini-batch size is 160.

Evaluation metrics
Equal error rate (EER) and the detection cost function (DCF) are applied to evaluate the performance of speaker verification systems (Xu et al., 2021). Both metrics are based on two rates: the false acceptance rate (FAR) and the false rejection rate (FRR). FAR is the percentage of trials that are accepted but should be rejected; FRR is the percentage of trials that are rejected but should be accepted. The EER is the value at which FAR and FRR are equal (Avila, O'Shaughnessy & Falk, 2021). The lower the EER, the better the system performs.
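The EER can be computed from trial scores by sweeping thresholds and locating the point where FAR and FRR cross. A sketch with toy score lists (genuine = same-speaker trials, impostor = different-speaker trials):

```python
def eer(genuine_scores, impostor_scores):
    """Equal error rate: sweep candidate thresholds, take (FAR + FRR) / 2 at
    the threshold where the two rates are closest."""
    best_gap, best_eer = float("inf"), 1.0
    for thr in sorted(set(genuine_scores + impostor_scores)):
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer

genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.5, 0.3, 0.2, 0.1]
rate = eer(genuine, impostor)   # one genuine/impostor overlap -> 25% EER
```

Production toolkits interpolate the ROC curve for a smoother estimate, but the crossing-point idea is the same.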

RESULTS

The impact of feature dimensions
The ETP is trained on the VoxCeleb1 dataset and tested on the original test set of VoxCeleb1 which contains 37,720 full utterances from 40 speakers. To evaluate the impact of feature dimensionality on the SV task, we select and compare 40-dimensional MFB features and 80-dimensional MFB features.
The experimental results in Table 2 show that the model trained with 80-dimensional MFB features performs slightly better than the one trained with 40-dimensional MFB features, regardless of whether PNL, GC, or their combination is used as the episodic training objective, indicating the effectiveness of increasing the feature dimension. Data of larger dimension contain more speaker information but take up more disk space and require more computation, while the model performance improves only slightly, which may indicate that higher-dimensional data are sparser than lower-dimensional data.

Verification on VoxCeleb1
The model is trained on the VoxCeleb2 dataset and evaluated on three different test lists from the VoxCeleb1 dataset and the eval core-core trial pairs of the SITW dataset: (1) the original test list; (2) the expanded VoxCeleb1-E list, which covers the training and test sets of VoxCeleb1; and (3) the challenging VoxCeleb1-H list. There are a few errors in the VoxCeleb1-E and VoxCeleb1-H lists; Xie et al. cleaned up the errors and publicly released the cleaned test lists. We do not lengthen the test speech, which might otherwise improve performance. Table 3 shows the performance of the models on the original test set of VoxCeleb1. We train our models on short utterances and evaluate them on full utterances. ETP exceeds the ThinResNet-34 (Xie et al., 2019) and ResNet-50 (Chung, Nagrani & Zisserman, 2018) models (EER of 2.36% vs 3.22% and 4.19%). ETP and the x-vector are both meta-learning methods: ETP with the episodic training strategy PNL is comparable to the x-vector (EER of 3.46% vs 3.48%), and when PNL and GC jointly optimize the model, ETP outperforms the x-vector (EER of 2.36% vs 3.48%), indicating the effectiveness of GC. GC enhances information transfer across meta-tasks by classifying each sample of each meta-task against the whole dataset, improving the performance of the model. Similarly, the last two rows of Table 3 show that combining PNL and GC outperforms PNL alone. Table 4 shows the comparison of model performance on the VoxCeleb1-E and VoxCeleb1-H test sets, their cleaned versions, and the SITW eval dataset. VoxCeleb1-E contains a large number of expanded utterances, which can fully test the performance of the models. Evaluation on the VoxCeleb1-H list is difficult because it contains speakers of the same gender and nationality, so the similarity between speakers is high. ETP outperforms ThinResNet-34 and ResNet-50 in all cases.
ETP can be generalized for target tasks and further enhance performance during the testing phase of SV.

Verification based on the length of short utterances
We randomly sample 100 positive sample pairs and 100 negative sample pairs from the VoxCeleb1 dataset to obtain test pairs for evaluating the models. The test speech is randomly cropped to 1 s, 2 s, and 5 s; if a test utterance is shorter than the required length, the segment is repeated to reach the target length. Table 5 shows the effect of the length of short utterances on performance. ETP outperforms the baseline models in all cases. The episodic training manner helps mine novel speaker information from few-shot SV tasks, improving the discriminative ability of the prototypes. Meanwhile, all models achieve their lowest EER at an utterance length of 5 s. There is a strong correlation between model performance and utterance length: as the utterance length increases, more speaker-relevant speech signal is captured, so the EER decreases. To prove the effectiveness of the episodic training strategy of PNL and GC, an ablation experiment is implemented. The experiments on 1 s, 2 s, and 5 s utterances and on full utterances (Table 3) show that the method combining the prototypical network and global classification is more effective than either the prototypical network or global classification alone. The episodic training manner makes the distance between a query and its prototype closer than the distance between an unknown speaker and that prototype in the metric space, effectively distinguishing speakers.

Ablation experiment
To measure the effectiveness of the Res2 Dilated Conv1D module in the few-shot short utterance speaker verification task, ablation experiments are performed in which the Res2 Dilated Conv1D module is replaced by a common one-dimensional convolutional layer.
As shown in Table 6, the Res2 Dilated Conv1D module significantly improves the performance of the model when testing full utterances on the three different test lists of VoxCeleb1. The results in the second column of Table 7 show that replacing the module reduces the number of parameters by 23.2%. The results in the fifth column of Table 7 show that when the prototypical network loss is combined with global classification, the EER of ETP is relatively 14.5% lower than that of NR-ETP; when using only the PNL, the performance of ETP is relatively 0.96% better than that of NR-ETP. This proves that the multi-scale features extracted by Res2 Dilated Conv1D represent the speaker's individual characteristics, improving the performance of the model. ETP and NR-ETP each take around 4 days to train.

CONCLUSION
In this article, we used a meta-learning method to solve the few-shot short utterance SV task. We sampled from the training set to construct a large number of subtasks to mimic the few-shot scenario. ECAPA-TDNN was applied to the prototypical network to learn the embeddings of each meta-task, where embeddings from the same speaker are closer than embeddings from different speakers. We used global classification and the prototypical network in an episodic manner to train the model to obtain discriminative speaker features. The SV task was tested on the VoxCeleb1 dataset, and the experimental results show that the model performs better than the comparison models.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This research work was supported by the National Science Foundation of China (No. 62166025); the Science and Technology Project of Gansu Province (No. 21YF5GA073); and the Gansu Province Department of Education Outstanding Graduate Student ''Innovation Star'' Project (No. 2021CXCX-512, 2021CXCX-511). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: