Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems

This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the average global pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.


Introduction
The performance of speaker verification (SV) systems has improved greatly in recent years thanks to deep learning (DL) advances in signal representations and optimization metrics [1,2,3,4,5] that have been adapted from state-of-the-art face verification, image recognition, or text-modelling systems. In these systems, Convolutional Neural Networks (CNN) or Time Delay Neural Networks (TDNN) [2] are still the most employed approaches to obtain the signal representations or embeddings. Nevertheless, self-attention mechanisms are becoming a dominant approach in many fields beyond text-related tasks. For example, Transformers [6] are spreading to many tasks [7,8,9,10] where large-scale databases are available. In SV, this kind of architecture has started to be successfully applied to text-independent SV [11,12,13,14], where there are no constraints on the uttered phrase and large databases are available. However, in text-dependent SV, there is still room for improvement since the amount of public data is not very large. Besides, text-dependent SV consists of deciding whether a speech sample has been uttered by the correct speaker pronouncing the selected fixed passphrase, so the phonetic information of the signal is relevant to determine the identity. Therefore, keeping the temporal structure is needed to obtain representations that correctly encode both phrase and speaker information.
In the context of text-dependent SV tasks, our previous works [15,16,17] showed the advantages of replacing the traditional pooling mechanism based on averaging the temporal information with an external alignment mechanism to obtain a supervector embedding. This supervector made it possible to keep the temporal structure and represent both phrase and speaker information, but the temporal alignment had to be performed by an external method such as a phone decoder, a Gaussian Mixture Model (GMM) [18,19] or a Hidden Markov Model (HMM) [20]. As an alternative approach, in [21] we introduced Multi-head Self-Attention (MSA) mechanisms [6] combined with memory layers [22] to substitute the alignment mechanisms. The use of MSA allowed the model to focus on the most relevant frames of the sequence to better discriminate among utterances and speakers. However, the proposed architecture based on MSA employed an average pooling mechanism to obtain the final representation embedding.
In this work, to substitute the global average pooling, we introduce a learnable vector known as the Class token, which is inherited from Natural Language Processing (NLP) [7] and, more recently, from many image recognition systems [8]. However, this approach has not yet been applied to SV tasks. To introduce this vector into the system based on a DNN with MSA and memory layers, the class token is concatenated to the input before the first MSA layer, and its state at the output is employed to perform the class prediction. During training, the temporal information is encoded in the token: the token interacts with the whole input sequence through self-attention and learns a global description similar to a supervector approach [16,23], since the multiple heads act as slots of the supervector. A similar mechanism has also been used recently in [10]. Therefore, the average pooling mechanism is not needed to obtain a representation. The multiple heads can encode more details about the sequence order than the average, playing the role of the states and improving the results, as shown in [16,17] with the use of external alignment mechanisms based on HMMs and GMMs. In addition, the information encoded in these multiple heads can be represented and analyzed, which improves the interpretability of the results of this kind of approach. To improve the performance obtained with the class token approach, we also introduce a novel multiple initialization sampling mechanism to reduce possible initialization problems and provide more robustness against the lack of data in the model predictions. This is relevant because developing custom systems with small in-domain datasets is a common industrial use case, and this kind of approach could be a possible solution.
Moreover, this work contributes another approach based on the Transformer architecture and Knowledge Distillation (KD) [24,9]. We propose a teacher-student approach combined with Random Erasing data augmentation [25,26], which allows modelling the uncertainty in the parameters of a teacher model with a compact student model and obtaining more reliable predictions. Following the idea proposed in [9], we have also introduced a Distillation token in the student network to replicate the predictions of the teacher network, while the class token is trained to reproduce the true label, as Fig.3 depicts. Unlike the objective in [9], in our work the distillation process is not intended to compress the teacher model; rather, both models are trained together and the student model learns to better capture the intrinsic variability of the teacher predictions.
To summarize, the main contributions are:

• We replace the global average pooling mechanism with a learnable class token to obtain a global utterance descriptor associated with the concept of supervector in speaker verification.

• We propose a new approach based on a sampling approximation to estimate the class token.

• We introduce a teacher-student architecture with an additional token, known as the distillation token, which is combined with the class token to provide robustness to the learned student model.

This paper is organized as follows. In Section 2, we give an overview of the MSA and memory layers. Section 3 explains the strategy of introducing a learnable class token using sampling. In Section 4, we introduce the approach based on KD combined with the tokens employed to develop our system. Section 5 describes the system used. In Section 6, we present the experimental data, and Section 7 explains the results achieved. Conclusions are presented in Section 8.

Overview of Transformer Encoder
The original transformer architecture [6] is composed of two main parts: the encoder and the decoder. However, in many tasks, the transformer encoder is the only part used to create the DL systems. The core mechanism of each encoder block is the Multi-head Self-Attention (MSA) layer, which is composed of multiple dot-product attention heads. As we only employ the encoder part, the input to this attention mechanism is the same for the query, key and value signals (Q, K, V):

$$Q_h = x W_h^Q, \qquad K_h = x W_h^K, \qquad V_h = x W_h^V, \quad (1)$$

where $x$ is the input to this layer, and $W_h^Q$, $W_h^K$, $W_h^V$ are learnable weight matrices that perform the linear projections. After these projections, a softmax operation is performed over the temporal axis, which allows each head to focus on certain frames of the input sequence. The result of this softmax operation is known as the self-attention matrix of each head and can be defined as:

$$A_h = \mathrm{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right), \quad (2)$$

where $d_k$ is the number of dimensions of the query/key vectors, and $\top$ denotes transpose. This self-attention matrix learns the most relevant information among the different data. Using this information, the value feature vectors $V_h$ are aggregated to obtain the output of each head, which can be calculated as:

$$H_h = A_h V_h. \quad (3)$$

Thus, MSA is defined as the concatenation of the outputs from each head $H_h$:

$$\mathrm{MSA}(X) = [H_1, \ldots, H_{d_{head}}] \, W_{head}, \quad (4)$$

where $X$ is the input to the attention layer, $W_{head}$ is a learnable weight matrix that performs a final linear projection, and $d_{head}$ is the number of attention heads in the $h$-th layer. The transformer encoder alternates the MSA layer with a second layer, the feed-forward (FF) layer. However, in [21], we proposed replacing the FF layers with memory layers as in [22]. In this layer, the input data is compared with all the keys using a product-key attention, and the scores obtained are used to select the closest keys, i.e., those with the highest scores.
After that, the associated weight vectors are computed with the following expression:

$$w = \mathrm{softmax}(x \, U_K^\top), \quad (5)$$

where $x$ is the input to the layer, $U_K$ is the key matrix, and the softmax is computed over the memory index axis to focus on certain contents of the memory that will be used to provide the output. Once these weight vectors are obtained, they are combined with the memory values of the selected keys, and the result is concatenated with the previous attention output:

$$o = [\mathrm{MSA}(X), \; w \, U_V], \quad (6)$$

where $w$ are the weights of the selected keys obtained with (5), and $U_V$ are the memory values associated with the keys. After the encoder blocks are applied, an average pooling mechanism is usually employed to reduce the temporal information and represent variable-length utterances with fixed-length vectors. However, this averaging may neglect the order of the phonetic information, which is relevant for text-dependent SV tasks.
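The attention computation of equations (1)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the system's actual implementation; the dimensions are placeholders:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_head):
    """X: (T, d_model); W_q/W_k/W_v: lists of per-head projections (d_model, d_k)."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))   # self-attention matrix, eq. (2)
        heads.append(A @ V)                    # per-head output H_h, eq. (3)
    return np.concatenate(heads, axis=-1) @ W_head  # concatenation + projection, eq. (4)
```

Each row of `A` sums to one, so every output frame is a convex combination of the value vectors, which is the property the class token exploits later.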

Representation using Class Token
In many NLP and computer vision tasks, the transformer architecture uses a learnable vector called the Class Token ($x_{CLS}$), as in the original BERT model [7] or the Vision Transformer (ViT) [8], instead of a global average pooling. To employ this token in the transformer encoder, the vector is concatenated to the input of the first MSA layer to perform the classification task. With this token, the self-attention is forced to capture the most relevant information in the class token to obtain a representation that acts as a global utterance descriptor, similar to the supervector approach. Instead of mixing all the information with an average pooling mechanism, the temporal structure can be kept, since the attention mechanism acts as a weighted sum of the temporal tokens in each layer. The output vector is the concatenation of different head subvectors, each of which is the result of a different attention outcome. Thus, the mechanism can be seen as similar to those used in our previous work [16], where the heads play the role of the states, and to the supervector in [23]. The supervector mechanism is also similar to [27], but in that case the task was text-independent SV and MSA layers were not used. Besides, this type of mechanism enhances the interpretability of what the neural network learns through the self-attention layers.
In [23], this mechanism to obtain the supervector is defined, similarly to a conventional GMM supervector, with the following expression:

$$s_c = \sum_t \tilde{w}_{tc} \, x_t, \quad (7)$$

where $w_{tc}$ are the weights obtained by a softmax function on the output of a learnable layer, $s_c$ are vectors per state/component $c$ of dimension $D$ that summarize the information associated along the sequence of feature vectors $x_t$ of dimension $D$, and $\tilde{w}_{tc}$ are the normalized weights defined as $w_{tc} / \sum_t w_{tc}$. The final supervector is built by concatenating these vectors, $S = \{s_1, \ldots, s_C\}$, and is used to represent the whole sequence. In this work, the output feature vectors for each head $H_h$ of the MSA layer are obtained with (3) as a weighted sum equivalent to (7), where $\tilde{w}_{tc}$ corresponds to the rows of the self-attention weight matrix $A_h$ obtained with (2). In particular, for the class token, the normalized weights are obtained from the last row of $A_h$. Therefore, the final class token obtained with this mechanism is the concatenation of the different head subvectors corresponding to the class token position, which can be expressed as the supervector presented previously, $S_{CLS} = \{s_{1-CLS}, \ldots, s_{H-CLS}\}$.
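In code, extracting the class-token supervector amounts to taking the class-token row of each head's attention matrix and concatenating the resulting head subvectors. A minimal sketch (head count and dimensions are illustrative, assuming the class token occupies the last position of the sequence):

```python
import numpy as np

def class_token_supervector(A_heads, V_heads):
    """A_heads: list of (T+1, T+1) self-attention matrices, class token last;
    V_heads: list of (T+1, d_k) value matrices. Returns the concatenation
    S_CLS of the per-head class-token outputs."""
    subvectors = []
    for A_h, V_h in zip(A_heads, V_heads):
        w_cls = A_h[-1]                 # normalized weights: last row of A_h
        subvectors.append(w_cls @ V_h)  # weighted sum over the sequence, as in eq. (3)
    return np.concatenate(subvectors)
```

With uniform attention weights this collapses to exactly the global average, which makes the relationship between the class token and average pooling explicit: the token is a learned, per-head reweighting of the same temporal aggregation.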
To introduce the class token in the system, one trainable vector parameter with the dimension of the feature vectors is defined when the network is initialized. For each batch, it is replicated and concatenated at the end of each input feature sequence in the training batch as an additional token. Hence, a single shared vector is trained to learn the final embedding representation.

Figure 1: Evolution of the number of vectors in the token matrix that are available for sampling from the beginning of the training process (iteration 1) to the final iteration (iteration N). In each iteration, the dark vectors represent the enabled class tokens, while the light vectors are the disabled tokens.
In this work, we propose the use of a new sampling approach [28]: instead of having a single class token shared by the whole batch, we assume this sensitive parameter is the result of sampling from a list of several vectors during training. To do that, we define a matrix of R vectors (Token Matrix) and sample one of them for each example in the batch, introducing uncertainty in the class token (CLS Token). However, this approach leads to a complex and slower evaluation process, since a sampling inference would have to be carried out to obtain the representations. For this reason, to avoid sampling at inference time, we schedule a forced reduction of the available vectors in the Token Matrix throughout the training process. Thus, at the end of this process, only one vector remains enabled, and the class token parameter is fixed. This strategy allows us to start the training (Iteration 1) with a matrix of several vectors to sample from and gradually reduce the number of vectors as the training progresses, finishing (Iteration N) with only one vector, as in the original class token, as Fig.1 depicts. Therefore, the training progressively leads the system to concentrate the relevant information in the first vector of the matrix. In addition, using this sampling approach, the system is trained to capture the uncertainty introduced by initially having a Token Matrix with R vectors to combine with the training batch data. Each example in the batch is combined with a random vector from the matrix, whose size is reduced after each epoch until only one vector remains at the end, so more variability has to be modelled, which helps to improve the robustness of the system.
To carry out this process, we define a schedule vector that indicates to the neural network the number of tokens available at each iteration of the training process, where R is the number of tokens defined in the matrix and N is the total number of training iterations. Among the tokens available at each iteration, a random selection of batch-size indices is made. The corresponding vectors are taken from the Token Matrix and used as class tokens (CLS Token) in the batch, concatenated to the input of the first MSA layer. The overall process is described in Algorithm 1. Besides, Fig.2 shows a graphical example of how this sampling is performed on the Token Matrix at iteration n.
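The sampling procedure above can be sketched as follows. The exact reduction schedule is not given in this excerpt, so a linear decay from R enabled tokens at iteration 1 down to 1 at iteration N is assumed here for illustration:

```python
import numpy as np

def tokens_available(n, R, N):
    """Number of enabled tokens at iteration n (1..N): shrinks from R to 1.
    A linear decay is assumed; the exact schedule is a design choice."""
    frac = (N - n) / max(N - 1, 1)
    return max(1, int(round(1 + (R - 1) * frac)))

def sample_class_tokens(token_matrix, batch_size, n, N, rng):
    """Pick one enabled token per batch example from the Token Matrix."""
    r = tokens_available(n, token_matrix.shape[0], N)
    idx = rng.integers(0, r, size=batch_size)  # only the first r tokens are enabled
    return token_matrix[idx]
```

At iteration N only index 0 can be drawn, so evaluation needs no sampling: the first vector of the matrix is the fixed class token.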

Knowledge Distillation with Tokens
Motivated by the benefits obtained with a Teacher-Student architecture based on CNNs when the training databases are not very large [26], we have implemented this architecture using two transformer networks, as Fig.3 depicts. Using a Bayesian approach similar to [29], the teacher-student architecture provides robustness to the system. In this approach, the teacher and student networks are trained at the same time, unlike previous works [30,31] in which the teacher network is usually a pre-trained model to reduce complexity. If the teacher network were a frozen model, negative training examples that obtain high posterior values in the teacher network would be learned as positive examples by the student network. Besides, different sources of distortion are applied to the input signals of each network, so we have employed a data augmentation method called Random Erasing (RE) [25] to provide more variability to the input training data. With this kind of architecture, the teacher network has to predict augmented unseen data, and the student network tries to mimic the label predictions produced by the teacher network using the class token output. This training strategy allows the student network to capture the variability in the predictions produced by the first network and model this uncertainty in its parameters during the training process. Inspired by [9], we have also included an extra learnable token in the student network, known as the Distillation Token (Distill Token). This extra token allows us to implement a multi-objective optimization: the class token is trained to reproduce the true label, while the distillation token is trained to mimic the predictions of the teacher network. To achieve this, the Kullback-Leibler Divergence (KLD) loss between the student and teacher distributions is minimized.
The KLD loss can be formulated as

$$KLD(y^{dist}_S, y^{cls}_T) = \sum_{j} \sum_{i} p_T(y^{cls}_i \mid x_j) \log \frac{p_T(y^{cls}_i \mid x_j)}{p_S(y^{dist}_i \mid x_j)} + const, \quad (8)$$

where $i$ and $j$ are the speaker and utterance indices, $x_j$ is the input signal, $p_T(y^{cls}_i \mid x_j)$ is the output posterior probability of the label $y^{cls}_i$ from the class token of the teacher model, $p_S(y^{dist}_i \mid x_j)$ is the output posterior probability of the label $y^{dist}_i$ from the distillation token of the student network for the same example, and $const$ is defined in [29]. Hence, to train the teacher-student architecture of Fig.3, we employ the following two loss expressions for the teacher and student networks:

$$Loss_T = CE(y^{cls}_T, y), \quad (9)$$

$$Loss_S = KLD(y^{dist}_S, y^{cls}_T) + CE(y^{cls}_S, y), \quad (10)$$

where CE is the cross-entropy loss, $y^{cls}_S$ is the class token output of the student network, and $y$ are the ground-truth labels.
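The two objectives can be sketched in plain NumPy on softmax posteriors. This is an illustrative implementation of the loss expressions above (the additive constant is dropped), not the training code itself:

```python
import numpy as np

def cross_entropy(p, y_onehot, eps=1e-12):
    # CE between predicted posteriors p and one-hot labels, averaged over the batch
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=-1))

def kld(p_teacher, p_student, eps=1e-12):
    # KL(p_T || p_S), averaged over the batch
    return np.mean(np.sum(p_teacher * (np.log(p_teacher + eps)
                                       - np.log(p_student + eps)), axis=-1))

def teacher_student_losses(p_cls_T, p_cls_S, p_dist_S, y_onehot):
    loss_T = cross_entropy(p_cls_T, y_onehot)                      # teacher: true labels
    loss_S = kld(p_cls_T, p_dist_S) + cross_entropy(p_cls_S, y_onehot)  # student: both
    return loss_T, loss_S
```

Note the asymmetry: the student's distillation token chases the teacher's soft posteriors, while its class token is still anchored to the hard labels.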

System Description
In this section, we describe the system architecture used in this work for text-dependent SV. Fig.3 depicts this architecture, where a teacher-student approach is employed. Both networks follow the structure described in [21] with the same backbone and pooling parts. The backbone is based on two Residual Network (RN) [32] blocks with three layers each. Additionally, these architectures need embeddings with positional information to help guide the attention mask in the MSA layers. In this work, these embeddings (e_ph) are extracted by a phonetic classifier network instead of using temporal position information [27]. For the pooling part, two MSA layers of 16 heads combined with two memory layers are employed. Moreover, before the first MSA layer, the class token is concatenated to the input. In the case of the student network, the distillation token is also included. Thanks to the self-attention mechanism, these tokens learn a global representation for each utterance without applying global average pooling. These representations, similar to supervectors, are more convenient for the text-dependent SV task since they do not neglect the sequence order and are obtained automatically by the self-attention mechanism. Thus, external alignment mechanisms are not necessary, unlike in [15,16,17], where GMM or HMM posterior probabilities are needed to align speech frames to supervectors. Besides, the use of memory layers increases the amount of knowledge that the network can store. After training the system, a cosine similarity over the token representations is applied to perform the verification process. Note that this kind of teacher-student system consists of training two architectures at the same time, so the process may involve a higher computational cost. However, during inference, only the student network is employed to extract the embeddings, so there is no extra inference time.
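The verification step over the token embeddings can be sketched as follows; the threshold value here is illustrative, since in practice it is set on a development set:

```python
import numpy as np

def cosine_score(enroll, test, eps=1e-12):
    # cosine similarity between the enrollment and test token embeddings
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test) + eps))

def verify(enroll, test, threshold=0.5):
    # accept the trial if the score reaches the decision threshold
    return cosine_score(enroll, test) >= threshold
```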

Datasets
For the experiments, two text-dependent speaker verification datasets have been employed. The first set of experiments is reported on the RSR2015 text-dependent speaker verification dataset [33]. This dataset comprises recordings from 157 male and 143 female speakers. For each speaker, there are 9 sessions with 30 different phrases. The data is divided into three speaker subsets: background (bkg), development (dev) and evaluation (eval). In this paper, we develop our experiments with Part II, which is composed of short control commands with a strong overlap of lexical content, and we employ only the bkg data for training. The eval part is used for enrollment and trial evaluation. This dataset has three evaluation conditions, but in this work only the most challenging one, the Impostor-Correct case, has been evaluated and employed for text-dependent SV. Note that there are other systems that obtain relevant results on this dataset, similar to those presented below. Nevertheless, such systems are based on traditional models such as Hidden Markov Models (HMMs) [33,34] or on neural network architectures with two different streams for speaker and utterance information [35,36].
The second dataset used is the DeepMine database [37]. This corpus consists of three different parts, of which we employ the files selected for the Short-duration Speaker Verification (SdSV) Challenge 2020 [38] from Part 1. Part 1 is the text-dependent part, which is composed of 5 Persian and 5 English phrases and contains 963 female and male speakers. This data is divided into two subsets: train, with 101,063 audio files, and evaluation, with 69,542 audio files. Finally, the phonetic classification network [27] has been trained using LibriSpeech [39] to extract the phonetic embeddings. Unlike other works presented in the challenge [40,41], we have not used the VoxCeleb 1 and 2 datasets [42,43] in the neural network training process. This choice is motivated by the fact that some situations and applications require building custom systems with the little in-domain data available. For this reason, we have developed systems using only the in-domain data.

Experimental Description
To carry out the experiments with the RSR2015 dataset, a set of features composed of 20-dimensional Mel-Frequency Cepstral Coefficients (MFCC) with their derivatives is employed as input. For the experiments using the DeepMine dataset, we have employed a feature vector based on mel-scale filter banks. With this feature extractor, we obtain two log filter banks of sizes 24 and 32, which are concatenated with the log energy to obtain a final input dimension of 57. Moreover, phonetic embeddings of 256 dimensions have been used as positional information. As the optimizer for the experiments in this work, Adam is employed with a learning rate that increases from $10^{-3}$ to $5 \cdot 10^{-3}$ during the first 60 epochs and then decays from $5 \cdot 10^{-3}$ to $10^{-4}$. In addition, training data is fed into the systems with a minibatch size of 32.
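The learning-rate schedule above can be sketched as a small helper. The exact ramp shape is not specified in the text, so linear warmup and linear decay are assumed here for illustration:

```python
def learning_rate(epoch, total_epochs, warmup_epochs=60,
                  lr_start=1e-3, lr_peak=5e-3, lr_end=1e-4):
    """Ramp from 1e-3 to 5e-3 over the first 60 epochs, then decay to 1e-4.
    Linear segments are assumed; only the endpoints come from the text."""
    if epoch <= warmup_epochs:
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    frac = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return lr_peak + (lr_end - lr_peak) * min(frac, 1.0)
```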

Results
In this paper, two sets of experiments have been carried out to evaluate the proposals on both databases. We compare the different approaches to obtain the representations with a single neural network using the same architecture as the teacher network: the traditional global average pooling (AVG), the attentive pooling (ATT), and the introduction of the learnable class token (CLS). For the class token approach, we evaluate our proposal of sampling a matrix of R vectors and reducing it until a single vector remains (Sampling). This parameter is also swept over different values of R, including R = 1, which corresponds to the original idea of having a single token and repeating it. Moreover, we analyze the effect of using a teacher-student architecture with an extra distillation token (CLS-DIST).
To evaluate these experiments, we have measured the performance using three metrics: the Equal Error Rate (EER), which measures the discrimination ability of the system, and the NIST 2008 and 2010 minimum Detection Cost Functions (DCF08, DCF10) [44,45], which measure the cost of detection errors in terms of a weighted sum of false-alarm and miss probabilities for a decision threshold and a priori probability.
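A minimal EER computation over target and non-target score lists can be sketched as follows; production toolkits interpolate the ROC, but a threshold sweep illustrates the idea:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Sweep every observed score as a threshold and return the operating
    point where false-alarm and miss rates are (approximately) equal."""
    tar = np.asarray(target_scores)
    non = np.asarray(nontarget_scores)
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([tar, non])):
        frr = np.mean(tar < t)    # miss rate at threshold t
        far = np.mean(non >= t)   # false-alarm rate at threshold t
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```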

Class Token Study
A first set of experiments was performed to compare the use of a class token to obtain global utterance descriptors against a global average pooling method and the attentive pooling proposed in [46]. Thus, we study the two approaches to introduce this vector explained in this work and the effect of the number of vectors chosen for the sampling approach. Table 1 presents EER, DCF08 and DCF10 results for the experiments with the RSR2015-Part II dataset. Regardless of the number of vectors used in the sampling for class tokens, applying our proposed strategy of introducing the tokens with a sampling alternative yields better performance. In addition, the results show that employing a learnable token outperforms the use of an average embedding or an attentive pooling embedding. Note that the token is trained through self-attention while keeping the temporal structure to obtain a global utterance representation, whereas the average embedding neglects this information, which is relevant to the SV task. As we can also observe from the sweep of the R value, using several vectors to create the token matrix is better than using a single vector and repeating it for the whole batch (the experiments with R = 1), which corresponds to the original way of applying this approach. However, when the number of available tokens is too large, the performance begins to degrade. This degradation could be caused by the introduction of too much variability that the system is not able to model, as the architectures employed are not so large, which limits the number of different tokens that can be handled during the training process.
Table 2 shows the results obtained on the DeepMine-Part 1 database. Unlike the other dataset, the training data in DeepMine is larger, so the lack of data is not so critical to train a powerful and robust system. Therefore, replacing the average embedding or attentive pooling embedding with a class token improves the performance only slightly. Besides, the sweep of the R value shows that the evolution of the female and male results, taken separately, does not follow the same trend observed in the RSR-Part II results.

Effect of Knowledge Distillation using Tokens
In this section, we analyze the effect of introducing an approach based on the Knowledge Distillation philosophy, which consists of a teacher-student architecture. Furthermore, in this approach, an extra distillation token (CLS-DIST) is incorporated [9]. This approach has been employed to compare the performance obtained with global average pooling as well as with the proposed sampling approach for the class token. In this second case, we have developed the teacher-student architecture using the R value of the best configuration obtained in the previous section, and also the case of R = 1, as it is the usual way to apply this class token approach in the literature.
Results of these experiments on RSR-Part II are shown in Table 3. Regardless of the approach used to obtain the representations, we can observe that an architecture based on a teacher-student approach improves the robustness and achieves better performance for all the alternatives to extract the representations. Moreover, the best performance is obtained by applying our proposed strategy of introducing the tokens with a sampling alternative with more than a single vector.
On the other hand, Table 4 presents the performance of the systems on DeepMine-Part 1. In this case, the results show that applying only the teacher-student architecture does not improve the systems. However, the use of the teacher-student architecture and the extra distillation token (CLS-DIST), combined with the sampling strategy with several token vectors, achieves a more robust system and a significant improvement in the results.

Analysis of Class Token Self-Attention Representations
In view of the relevant results obtained, we have also conducted an analysis to interpret what the self-attention matrix A learns in each system. To perform this analysis, we have employed the system with the best performance for each database, and within these systems, the last MSA layer of the student model has been selected to make the representations. In addition, we have chosen different utterances to analyze in Fig.4 and Fig.5. For each utterance, three figures are plotted: the spectrogram of the utterance, the matrix of attention weights corresponding to the class token for each of the 16 heads of the MSA layer, and the sum of the weights of these class token attentions.
In Fig.4, two examples of utterances of different phrases ("Call sister", "Call brother") pronounced by the same speaker are shown. These examples are taken from the evaluation set of the RSR-Part II database. If we look at the middle and bottom figures, we can observe the relevant information learned by the self-attention weights to correctly determine the phrase and speaker of each utterance using the class token. Note that these two example phrases begin with exactly the same word, Call, so focusing on the beginning of the figures, we observe how the self-attention gives similar relevance in both cases to the areas of the same phonemes. Moreover, we can also see that the weights do not pay attention to the areas at the beginning and end of the utterances that correspond to moments of silence. Fig.5 represents two examples of utterances of the same phrase ("OK Google") pronounced by different speakers. In this case, the examples are taken from the evaluation set of the DeepMine database. Note that since these figures are of the same phrase, self-attention focuses on the same areas, but different relevance is given to some of them. Besides, the effect of not focusing on the beginning and end of the utterance also occurs in these examples.

Conclusion
In this paper, we have presented a novel approach for the SV task. This approach is based on the use of a learnable class token to obtain a global utterance descriptor instead of employing average pooling. Moreover, we have developed an alternative way to create the class token with a sampling strategy that introduces uncertainty, which helps the model generalize better. Apart from the previous approach, we have also employed a teacher-student architecture combined with an extra distillation token to develop a more robust system. Using this architecture, the distillation token in the student network learns to replicate the predictions of the teacher network. Both proposals were evaluated on two text-dependent SV databases. The results achieved on RSR2015-Part II show that each of the approaches introduced to obtain a robust system and reduce potential underperformance due to the lack of data improves the overall performance. However, on DeepMine-Part 1, the results obtained by replacing only the average embedding with the class token present a small improvement, while the use of a teacher-student architecture achieves a great improvement and confirms the power of this kind of approach to train the systems.