Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification

In this paper, we propose self-supervised speaker representation learning strategies, which comprise a bootstrap equilibrium speaker representation learning in the front-end and an uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker, which provides not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy effectively helps learn the speaker representations and outperforms the conventional methods based on contrastive learning. We also demonstrate that the integrated two-stage framework further improves the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.


I. INTRODUCTION
SPEAKER verification (SV) is the task of verifying whether the identity claimed for the given speech samples is true. SV has become one of the key technologies for authentication in e-commerce applications, general business interactions, and forensics [1]. Generally, an SV system consists of two stages: front-end encoding and back-end scoring. In the front-end encoding stage, an utterance-level fixed-dimensional feature vector is extracted by summarizing speech samples with varying frame lengths. The speaker's voice patterns and characteristics are condensed in this feature vector. In the back-end scoring stage, the similarity or likelihood (e.g., cosine similarity and probabilistic linear discriminant analysis) between the feature vectors from the enrollment and test utterances is measured to decide the acceptance or rejection of the claimed identity.
In recent years, inspired by the success in various fields such as vision and natural language processing, various deep learning-based methods for speaker verification have been proposed [2]-[6]. Among them, the most popular approach is to use utterance-level bottleneck features called speaker embeddings. These deep speaker embedding techniques have significantly boosted the performance when a large amount of training data is available [7]-[9]. However, most of these methods are based on fully supervised learning techniques, which require a huge amount of manually labeled data. Furthermore, the construction of a labeled corpus in real-world scenarios is labour-intensive and sometimes limited by privacy issues.
Self-supervised learning (SSL) is a promising alternative approach that reduces the labeling burden. SSL is a branch of unsupervised learning, and it utilizes the input data itself as the target for supervision [10], [11]. Recently, the most prevalent methods in SSL are based on contrastive learning [12]-[14]. The core idea in these methods is to pull together two representations jointly sampled from the same class (i.e., positive pairs) while pushing apart those independently sampled from different classes (i.e., negative pairs) through a contrastive loss such as the InfoNCE [12] or NT-Xent [13]. Also, it has been shown that minimizing the contrastive loss is equivalent to maximizing the lower bound of the mutual information between latent representations [12], [15].
Over the last couple of years, several works have addressed self-supervised speaker verification. Stafylakis et al. [16] exploited a pretext task reconstructing the frames of a target speech segment, given a speaker embedding inferred from another part of the same utterance. The authors in [17], [18] learned speech representations that capture the speaker's identity by maximizing the mutual information between the local embeddings extracted from the same utterance. Furthermore, to generate more robust positive pairs, data augmentation with different additive noises and simulated room impulse responses (RIR) was applied in [19]-[22]. Huh et al. [20] proposed an augmentation adversarial training strategy that penalizes the ability to predict the augmentation types so that the embeddings can be optimized to be channel-invariant. Instead of adversarial training, Zhang et al. [21] adopted a joint training approach using a channel-invariant loss formulated as the distance between the embedding of the augmented segments and its clean version. Xia et al. [22] utilized a queue to maintain many negative pairs as in MoCo [14] and proposed a prototypical memory bank to compensate for the samples wrongly verified as negative.
Although the aforementioned methods have shown successful results, the contrastive learning techniques generally require careful handling of the negative pairs, e.g., large batch sizes [13], additional memory banks [14], [22], etc. [10], [11], [15], [23], [24]. Moreover, the framework of contrastive learning assumes that every utterance in a mini-batch (or memory bank) contains only one speaker's speech. This could lead to a class collision problem [15], [22]; the actual positive sample may be misclassified as a negative. In addition, their performance greatly depends on the augmentation strategies [13].
To reduce the dependency on how the negative samples are chosen, we introduce a bootstrap mechanism for learning speaker representations. The bootstrap approach has shown meaningful outcomes in reinforcement learning [25] and in vision [26], [27], graph [28], [29], sentence [30], and user-item [31] representation learning. These methods learn the representations by predicting the target latent embeddings of a positive pair, where the asymmetric prediction tasks produce a bootstrapping effect in the latent space [25], [26]. In our work, the speaker representations are learned via two distinct networks, namely the online and target networks. The parameters of both networks are asymmetrically updated with positive pairs. The online network is optimized by back-propagated gradients to predict the outputs of the target network, while the target network is updated through an exponential moving average (EMA) of the online network weights.
The algorithmic components for the bootstrap framework, such as the stop-gradient, EMA, and the asymmetric update, help prevent the representations of all samples from being similar, which is known as the problem of collapsed solutions [26], [27]. Nevertheless, we empirically found that learning the speaker embeddings with only a bootstrap prediction task is insufficient to avoid collapsed representations (shown in Section IV-B.1). In order to mitigate this difficulty, we increase the entropy of the nuisance factors via a uniformity regularization term. Optimizing the uniformity regularization term forces the embeddings to be in the equilibrium state (i.e., uniformly distributed over the unit hypersphere), which leads to a maximum entropy in the latent space [32]. This can improve the inter-class separability between speaker representations, which was not considered in the previous bootstrap techniques. To formulate the uniformity regularization loss, we exploit the total pairwise potential based on a Gaussian kernel function, which is closely related to the universally optimal point configurations [33]- [35].
On top of that, we leverage a mutual likelihood score based on an uncertainty-aware probabilistic speaker embedding for the back-end stage. Probabilistic embedding has been introduced in natural language processing [36], computer vision [38]-[40], and other areas [37] to represent the feature (mean) and uncertainty (variance) simultaneously. Instead of a deterministic point embedding, we estimate the distribution of each speaker embedding, i.e., a Gaussian distribution, where the mean represents the identity-salient features, while the variance explains the data uncertainty. We make use of the bootstrap speaker representations for the mean vectors and learn the covariance matrices to quantify the data uncertainty in a self-supervised fashion. To optimize the back-end system, we maximize the mutual likelihood score (MLS) between the probabilistic speaker embeddings. After training, the MLS is used for verification.
Experimental results demonstrated that learning the speaker representations via the bootstrap prediction loss and the uniformity regularization loss could avoid collapsing to a trivial solution and further improve the speaker verification performance. Additionally, we observed that this bootstrap speaker representation is more resilient to the batch size than its contrastive counterpart. Finally, by using the MLS between the estimated probabilistic speaker embeddings in the back-end, the proposed framework outperformed the existing self-supervised speaker verification methods in terms of the equal error rate (EER) and minimum detection cost function (MinDCF) on the VoxCeleb1 test set.
The contributions of this paper are as follows:
• We propose the bootstrap training strategy to learn the speaker representations in a self-supervised manner. The front-end networks are trained via the objective to predict the target embeddings of the positive pairs through the asymmetrical updates. Furthermore, we introduce the uniformity regularization term to prevent the speaker representations from collapsing into a trivial solution. This regularization term increases the entropy of the nuisance factors inherent in the speaker embeddings, which can enhance the inter-speaker separability. By minimizing the combined loss, we learn the bootstrap equilibrium speaker representation in the front-end stage.
• We incorporate the data uncertainty in the verification step of the back-end scoring stage. To represent the data uncertainty of the input speech, we introduce the uncertainty-aware probabilistic speaker embedding approach. Unlike the conventional deterministic point embedding, the probabilistic speaker embedding follows a Gaussian distribution where the mean and variance provide the speaker identity and data uncertainty, respectively. The parameters of the probabilistic speaker embeddings are estimated by maximizing the MLS.
• Experiments investigate the training analysis and the speaker verification performance in terms of EER and MinDCF on the VoxCeleb1 evaluation set. Also, we compare the proposed methods with the conventional techniques. By integrating the two proposed stages, we outperformed the existing techniques in the self-supervised speaker verification task.
The rest of this paper is organized as follows: In Section II, we introduce the conventional self-supervised speaker verification framework. In Section III, the proposed training strategies are described. The experimental results are shown in Section IV. Finally, Section V concludes the paper.

II. CONVENTIONAL SELF-SUPERVISED CONTRASTIVE SPEAKER REPRESENTATION LEARNING
Let $X = \{x_1, \ldots, x_N\}$ represent the N utterances in each mini-batch B randomly sampled from an unlabeled training dataset. In the contrastive speaker representation learning framework shown in FIGURE 1, two non-overlapping segments $x_{i,1}$ and $x_{i,2}$ with the same frame length T are randomly cropped from each utterance $x_i$. Under the assumption that every utterance within a mini-batch B contains a single speaker, $(x_{i,1}, x_{i,2})$ denotes a positive pair, i.e., a pair having the same speaker identity.
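The cropping step above can be sketched as follows. This is only an illustration: the function name `crop_two_segments`, the way the free frames are split, and the example lengths are assumptions made here, not details given in the paper.

```python
import random


def crop_two_segments(num_frames, seg_len, rng=random):
    """Return the start frames of two non-overlapping crops of length seg_len.

    Hypothetical helper: the paper only states that the two segments are
    randomly cropped, non-overlapping, and of equal length T.
    """
    if num_frames < 2 * seg_len:
        raise ValueError("utterance too short for two non-overlapping segments")
    # Frames that are covered by neither segment.
    slack = num_frames - 2 * seg_len
    # Split the slack into a gap before the first crop and a gap between crops.
    gap_before = rng.randint(0, slack)
    gap_between = rng.randint(0, slack - gap_before)
    start_a = gap_before
    start_b = gap_before + seg_len + gap_between
    # Randomly decide which crop plays the role of segment 1.
    if rng.random() < 0.5:
        start_a, start_b = start_b, start_a
    return start_a, start_b


if __name__ == "__main__":
    print(crop_two_segments(num_frames=500, seg_len=180, rng=random.Random(0)))
```

Because the two crops never overlap, the pair shares the speaker identity but not the exact acoustic content, which is what the positive-pair assumption relies on.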
A data augmentation policy is then randomly sampled for each of the segments $x_{i,1}$ and $x_{i,2}$, e.g., $(N', R') \sim \mathcal{T}$ and $(N'', R'') \sim \mathcal{T}$, where $N'$, $N''$ and $R'$, $R''$ are random noises and RIR filters, respectively, and the RIR filters are applied through the convolution operator $*$. $\tilde{x}_{i,1}$ and $\tilde{x}_{i,2}$ denote the two differently augmented segments obtained from $x_{i,1}$ and $x_{i,2}$. $f_\vartheta : \mathbb{R}^{D \times T} \rightarrow \mathbb{R}^{d}$ is a siamese front-end encoder that maps the speech segments of dimension D with the frame length T to the embeddings of dimension d, and $y_{\vartheta;i,1} = f_\vartheta(\tilde{x}_{i,1})$ and $y_{\vartheta;i,2} = f_\vartheta(\tilde{x}_{i,2})$ represent the speaker embeddings. In order to make the embeddings of the positive pairs similar while pushing those of the negative pairs apart, an angular prototypical (AP) loss function [5] can be used. The AP loss serves as a contrastive loss and has been shown to perform well in self-supervised speaker verification tasks [20]-[22]. It is defined as follows:

$$L^{AP}_{\vartheta} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(S(y_{\vartheta;i,1}, y_{\vartheta;i,2})\right)}{\sum_{j=1}^{N} \exp\left(S(y_{\vartheta;i,1}, y_{\vartheta;j,2})\right)}$$

$$S(y_{\vartheta;i,1}, y_{\vartheta;j,2}) = w \cdot \frac{y_{\vartheta;i,1}^{T} \, y_{\vartheta;j,2}}{\|y_{\vartheta;i,1}\|_2 \, \|y_{\vartheta;j,2}\|_2} + b$$

In the above expressions, $S(\cdot,\cdot) : \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$ is the affine transformation of the cosine similarity between two speaker embeddings of dimension d. w and b are trainable parameters for scale and bias, respectively.

FIGURE 1: Self-supervised contrastive speaker representation learning framework.
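As an illustration of the AP loss, a minimal pure-Python sketch is given below. It assumes the usual softmax form over the mini-batch, in which segment 2 of every utterance acts as a prototype, and it fixes w and b to illustrative constants even though they are trainable in the paper. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_prototypical_loss(emb_1, emb_2, w=10.0, b=-5.0):
    """Mean cross-entropy of S(y_i1, y_j2) with the matching index i as target.

    emb_1[i] and emb_2[i] are the embeddings of the two segments of utterance i.
    w and b are fixed here for illustration (assumed values).
    """
    n = len(emb_1)
    total = 0.0
    for i in range(n):
        logits = [w * cosine(emb_1[i], emb_2[j]) + b for j in range(n)]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    seg_1 = [[1.0, 0.0], [0.0, 1.0]]
    seg_2 = [[1.0, 0.0], [0.0, 1.0]]
    print(angular_prototypical_loss(seg_1, seg_2))        # matched pairs: small loss
    print(angular_prototypical_loss(seg_1, seg_2[::-1]))  # mismatched pairs: large loss
```

Every other utterance in the mini-batch acts as a negative here, which is why the quality of this loss depends on the batch size and on the single-speaker assumption.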

III. PROPOSED FRAMEWORK
Our proposed training strategies are composed of two stages: bootstrap equilibrium speaker representation learning and probabilistic speaker embedding training. In the front-end stage, we learn the speaker representations via a bootstrap training strategy. Then, in the back-end stage, the distribution of each speaker embedding is estimated by maximizing a mutual likelihood score (MLS). Enrollment and test utterances are evaluated with the MLS calculated from the estimated distributions. All methods in our proposed framework operate in a self-supervised fashion.

A. FRONT-END: BOOTSTRAP EQUILIBRIUM SPEAKER REPRESENTATION LEARNING

1) Bootstrap training strategy and prediction loss
Unlike the contrastive methods, which require a careful design of informative negative samples, we learn the speaker representations via a bootstrap mechanism with only positive samples. Analogous to the previous works [25], [26], [28], [29], the proposed framework contains two neural networks: the online and the target networks. Let f, g, and q denote the encoder, the projector, and the predictor, respectively. The encoder f maps an input segment x to a representation $y \triangleq f(x)$. The projector g transforms a representation y into a $d_g$-dimensional representation $z \triangleq g(y)$ in a smaller space via non-linear projection layers. The predictor q outputs a $d_q$-dimensional representation $q(z)$, which is used for a regression task.
The online network is defined by a set of trainable parameters θ and includes the online encoder $f_\theta$, the online projector $g_\theta$, and the online predictor $q_\theta$. The target network is parameterized by a set of weights ξ distinct from those of the online network. It contains the target encoder $f_\xi$ and the target projector $g_\xi$, which have the same architectures as $f_\theta$ and $g_\theta$, respectively. The predictor exists only in the online branch, which forms an asymmetrical structure between the online and target pipelines (shown in FIGURE 2).
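The online parameters θ are optimized by following the gradients of the Euclidean distance between the $\ell_2$-normalized online predictions and target projections. We define a bootstrap prediction loss as follows:

$$\ell^{pred}_{\theta,\xi} = \frac{1}{N} \sum_{i=1}^{N} \left\| \bar{q}_\theta(z_{\theta;i,1}) - \bar{z}_{\xi;i,2} \right\|_2^2 \quad (4)$$

where $\bar{q}_\theta(z_{\theta;i,1})$ and $\bar{z}_{\xi;i,2}$ denote the $\ell_2$-normalized online prediction of $x_{i,1}$ and the $\ell_2$-normalized target projection of $x_{i,2}$, respectively. We also symmetrize the loss of equation (4) as $\tilde{\ell}^{pred}_{\theta,\xi}$ by predicting the target projection of the input $x_{i,1}$ using the online prediction of $x_{i,2}$. Finally, the bootstrap prediction loss $L^{pred}_{\theta,\xi}$ is obtained as the sum of the two symmetrical losses:

$$L^{pred}_{\theta,\xi} = \ell^{pred}_{\theta,\xi} + \tilde{\ell}^{pred}_{\theta,\xi} \quad (5)$$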
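A minimal sketch of the bootstrap prediction loss is given below. It assumes that the squared Euclidean distance is used and that it is averaged over the mini-batch; neither detail is guaranteed by the text, and the function names are hypothetical. For unit vectors the squared distance equals 2 − 2·cos, so the loss lies in [0, 4].

```python
import math


def l2_normalize(v):
    """Scale a vector (list of floats) to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def prediction_loss(online_preds, target_projs):
    """Mean squared distance between normalized predictions and projections.

    target_projs are treated as constants (stop-gradient) during training;
    averaging over the batch is an assumption of this sketch.
    """
    total = 0.0
    for q, z in zip(online_preds, target_projs):
        q_bar, z_bar = l2_normalize(q), l2_normalize(z)
        total += sum((a - b) ** 2 for a, b in zip(q_bar, z_bar))
    return total / len(online_preds)


def symmetric_prediction_loss(q_1, z_2, q_2, z_1):
    """Sum of the two directions: predict view 2 from view 1 and vice versa."""
    return prediction_loss(q_1, z_2) + prediction_loss(q_2, z_1)


if __name__ == "__main__":
    print(prediction_loss([[3.0, 0.0]], [[0.0, 2.0]]))  # orthogonal directions
    print(prediction_loss([[1.0, 1.0]], [[2.0, 2.0]]))  # same direction
```

Only the direction of the vectors matters after normalization, so the loss alone can be driven to zero by mapping every input to the same direction, which is the collapse the uniformity term is meant to counter.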
At each training step, the online parameters θ are updated via back-propagated gradients to minimize $L^{pred}_{\theta,\xi}$ with respect to θ only, where the parameters ξ are fixed by the stop-gradient. The target weights ξ are updated via the exponential moving average (EMA) of the online parameters θ. The overall dynamics are as follows:

$$\theta \leftarrow \mathrm{optimizer}(\theta, \nabla_\theta L^{pred}_{\theta,\xi}, \eta) \quad (6)$$

$$\xi \leftarrow \tau \xi + (1 - \tau)\theta \quad (7)$$

where optimizer is an optimizer such as SGD or Adam, and η is the learning rate. τ ∈ [0, 1] is a target decay rate for the momentum-based exponential moving average. By gradually increasing the value of τ at each training step, the target network slowly approximates the online encoder. This momentum-based update makes ξ evolve more smoothly than θ, which allows bootstrapping the representations by providing enhanced but consistent targets to the online network [26]. In our work, we set $\tau \triangleq 1 - (1 - \tau_{base}) \cdot (\cos(\pi k / K) + 1)/2$, where $\tau_{base}$ is an initial base value, k is the current training step, and K is the maximum number of training steps. The coefficient τ increases from $\tau_{base}$ to 1 according to the above schedule during training.
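The decay schedule and the EMA update can be illustrated with a few lines of Python. The parameters are represented as plain lists of floats for simplicity, and the function names are hypothetical; the default `tau_base=0.996` is the value reported in the implementation details.

```python
import math


def target_decay_rate(step, max_steps, tau_base=0.996):
    """Cosine schedule: tau rises from tau_base (step 0) to 1 (step max_steps)."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / max_steps) + 1.0) / 2.0


def ema_update(target_params, online_params, tau):
    """Return the new target parameters xi <- tau * xi + (1 - tau) * theta."""
    return [tau * xi + (1.0 - tau) * theta
            for xi, theta in zip(target_params, online_params)]


if __name__ == "__main__":
    max_steps = 1000
    target, online = [0.0, 0.0], [1.0, -1.0]
    for step in (0, 500, 1000):
        tau = target_decay_rate(step, max_steps)
        target = ema_update(target, online, tau)
        print(step, round(tau, 6), target)
```

At the final step τ equals 1, so the target network stops moving; early in training the smaller τ lets the target follow the online network more closely.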

2) Uniformity regularization loss
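To force the embeddings to reach an equilibrium state, namely a state of minimal energy (i.e., the distribution optimizing this metric should converge to the uniform distribution on the hypersphere), we leverage the pairwise Gaussian potential kernel as follows:

$$G_t(x_i, x_j) \triangleq \exp\left(-t \left\| h(x_i) - h(x_j) \right\|_2^2\right) \quad (8)$$

where $h : \mathbb{R}^{D \times T} \rightarrow \mathbb{R}^{d}$ is an encoder function, and t > 0 is a fixed parameter. Similar to [33]-[35], the uniformity regularization loss is defined as the logarithm of the average pairwise Gaussian potential as follows:

$$L^{unif}(h; t) \triangleq \log \mathbb{E}_{x_i, x_j \sim p_{data}} \left[ G_t(x_i, x_j) \right] \quad (9)$$

where minimizing equation (9) leads the embedding vectors to follow a uniform distribution [32], [34]. We apply the uniformity regularization loss to the predictions and projections of all pairs as follows:

$$\ell^{unif}_{\theta,\xi} \triangleq \log \mathbb{E}_{x_i, x_j \sim p_{data}} \left[ \exp\left(-t \left\| \bar{q}_\theta(z_{\theta;i,1}) - \bar{z}_{\xi;j,2} \right\|_2^2\right) \right] \quad (10)$$

In practice, the uniformity regularization loss within the mini-batch can be calculated as follows:

$$\ell^{unif}_{\theta,\xi} = \log \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\left(-t \left\| \bar{q}_\theta(z_{\theta;i,1}) - \bar{z}_{\xi;j,2} \right\|_2^2\right) \quad (11)$$

As in equation (5) of the bootstrap prediction loss, the symmetrical form of equation (11), i.e., $\tilde{\ell}^{unif}_{\theta,\xi}$, is computed by putting the input $x_{i,1}$ into the target network and $x_{j,2}$ into the online network. The total uniformity regularization loss $L^{unif}_{\theta,\xi}$ is the sum of the two symmetrical losses.

Algorithm 1: Bootstrap equilibrium training strategy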
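The mini-batch form of the uniformity regularization loss can be sketched as below. The sketch assumes that all N × N prediction/projection pairs are averaged and uses t = 2 purely as an illustrative value; the paper does not state either choice in the text above, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector (list of floats) to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def uniformity_loss(online_preds, target_projs, t=2.0):
    """Log of the average pairwise Gaussian potential over all pairs (i, j)."""
    preds = [l2_normalize(v) for v in online_preds]
    projs = [l2_normalize(v) for v in target_projs]
    total = 0.0
    for u in preds:
        for v in projs:
            sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
            total += math.exp(-t * sq_dist)
    return math.log(total / (len(preds) * len(projs)))


if __name__ == "__main__":
    collapsed = [[1.0, 0.0]] * 4
    spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    print(uniformity_loss(collapsed, collapsed))  # collapsed embeddings: maximal value
    print(uniformity_loss(spread, spread))        # spread embeddings: lower value
```

The loss reaches its maximum of zero when every embedding coincides and decreases as the embeddings spread over the hypersphere, so adding it to the prediction loss penalizes collapsed solutions.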

3) Bootstrapped equilibrium speaker embedding
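For the front-end stage, the online and target networks are trained using both the bootstrap prediction loss and the uniformity regularization loss. The total objective and dynamics of the front-end are as follows:

$$L^{front}_{\theta,\xi} = L^{pred}_{\theta,\xi} + \lambda L^{unif}_{\theta,\xi}$$

$$\theta \leftarrow \mathrm{optimizer}(\theta, \nabla_\theta L^{front}_{\theta,\xi}, \eta), \qquad \xi \leftarrow \tau \xi + (1 - \tau)\theta$$

where λ is the weight of the uniformity regularization term. After the training, we only keep the online encoder $f_\theta$ and use it as the front-end encoder. The representations $y_\theta \triangleq f_\theta(x)$ are extracted as the speaker embeddings. Algorithm 1 summarizes the bootstrap equilibrium training strategy.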

B. BACK-END: UNCERTAINTY-AWARE PROBABILISTIC SPEAKER EMBEDDING TRAINING

1) Probabilistic speaker embedding and MLS
The speaker representation in the front-end is learned as a deterministic point embedding for each speech segment. In contrast to the conventional deterministic speaker embedding, we estimate the distribution of each speaker embedding in the back-end stage. In several fields [36]-[40], probabilistic representations have been used to estimate data uncertainty. Similar to [39], [40], we model each speaker embedding as a Gaussian distribution as follows:

$$p(y_i \mid x_i) = \mathcal{N}(y_i; \mu_i, \sigma_i^2 I)$$

where $\mu_i$ and $\sigma_i^2$ are the d-dimensional mean and variance vectors. We consider the diagonal covariance matrix to reduce the complexity. Given the probabilistic speaker embeddings of two speech segments, the mutual likelihood score (MLS) can be measured as follows:

$$p(y_i = y_j) = \int\!\!\int p(y_i \mid x_i) \, p(y_j \mid x_j) \, \delta(y_i - y_j) \, dy_i \, dy_j$$

where $y_i \sim \mathcal{N}(y; \mu_i, \sigma_i^2 I)$ and $y_j \sim \mathcal{N}(y; \mu_j, \sigma_j^2 I)$. The MLS expresses the likelihood of two representations belonging to the same speaker [11]. By taking the log-likelihood form, the MLS can be formulated as follows:

$$\mathrm{MLS}(x_i, x_j) = \log p(y_i = y_j) = -\frac{1}{2} \sum_{l=1}^{d} \left( \frac{\left(\mu_i^{(l)} - \mu_j^{(l)}\right)^2}{\sigma_i^{2(l)} + \sigma_j^{2(l)}} + \log\left(\sigma_i^{2(l)} + \sigma_j^{2(l)}\right) \right) - \mathrm{const}$$

where $\mathrm{const} = \frac{d}{2} \log 2\pi$. $\mu_i^{(l)}$ and $\sigma_i^{(l)}$ denote the l-th dimension element of $\mu_i$ and $\sigma_i$, respectively. $\mu_i$ of the probabilistic speaker embedding represents the identity-salient features, while $\sigma_i$ explains the data uncertainty.
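The log-form MLS for diagonal Gaussians can be computed directly from the mean and variance vectors, as in the sketch below. The function name and the list-based representation are illustrative choices rather than the authors' implementation.

```python
import math


def mutual_likelihood_score(mu_i, var_i, mu_j, var_j):
    """Log-likelihood that two diagonal-Gaussian embeddings share one latent point.

    mu_* are mean vectors and var_* are variance vectors (lists of floats).
    """
    d = len(mu_i)
    acc = 0.0
    for m_i, v_i, m_j, v_j in zip(mu_i, var_i, mu_j, var_j):
        var_sum = v_i + v_j
        # Squared mean difference weighted by the joint uncertainty,
        # plus a penalty on large uncertainty.
        acc += (m_i - m_j) ** 2 / var_sum + math.log(var_sum)
    return -0.5 * acc - 0.5 * d * math.log(2.0 * math.pi)


if __name__ == "__main__":
    same = mutual_likelihood_score([0.0, 0.0], [0.5, 0.5], [0.0, 0.0], [0.5, 0.5])
    far = mutual_likelihood_score([0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [0.5, 0.5])
    print(same, far)
```

The score drops when the means differ, and a large variance in a dimension down-weights the mean difference in that dimension while paying a logarithmic penalty, which is how the uncertainty enters the verification score.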
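For the mean of the probabilistic speaker embedding, we leverage the bootstrap equilibrium speaker representation of the front-end stage, i.e., $\mu_i \triangleq f_\theta(x_i)$. With its parameters θ fixed, we train the auxiliary uncertainty estimator network to quantify the data uncertainty using the MLS loss. The MLS is computed with the following mean and variance, and the MLS loss is formulated over the positive pairs as follows:

$$\mu_i = f_\theta(x_i), \qquad \sigma_i^2 = u_\phi(m_\theta(x_i))$$

$$L^{MLS}_{\phi,\theta} = -\frac{1}{N} \sum_{i=1}^{N} \mathrm{MLS}(x_{i,1}, x_{i,2})$$

where $m_\theta : \mathbb{R}^{D \times T} \rightarrow \mathbb{R}^{d_m}$ is the multi-level representation fusion module with fixed weights θ, and $u_\phi : \mathbb{R}^{d_m} \rightarrow \mathbb{R}^{d}$ is the uncertainty estimator with trainable parameters ϕ.

Algorithm 2: Probabilistic speaker embedding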
In the module $m_\theta$, the outputs of the intermediate layers of the online encoder $f_\theta$ are concatenated into the $d_m$-dimensional multi-level representation to include features of the input speech at different levels. The mean vector of the probabilistic speaker embedding is obtained by feeding the input segment into the fixed front-end encoder $f_\theta$ (shown in FIGURE 2).

2) Uncertainty constraint loss
To prevent the network from being over-confident in estimating the data uncertainty, we constrain the dynamic range of the estimated uncertainty via an uncertainty constraint loss $\ell^{cnst}_{\phi,\theta}$ computed on the segments $x_{i,1}$, where $u_{\phi,avg}(m_\theta(x_1))$ is equal to $\frac{1}{N} \sum_i u_\phi(m_\theta(x_{i,1}))$, and $\ell^{cnst}_{\phi,\theta}(x_{i,2})$ is the corresponding formulation for the segments $x_{i,2}$. This constraint ensures that the uncertainty does not deviate too far from its average, making the estimated uncertainty more reasonable [40].

3) Uncertainty-aware probabilistic speaker embedding
For the back-end stage, the uncertainty estimator network is optimized via a total objective that combines the MLS loss and the uncertainty constraint loss, where only the parameters ϕ are updated while θ remains fixed. In the evaluation, the MLS is used as the verification score between the trial utterances. Algorithm 2 summarizes the training process, in which the division is performed element-wise.

A. EXPERIMENTAL SETTINGS

1) Datasets
To evaluate the performance of the proposed speaker representation learning framework, we conducted experiments based on the VoxCeleb1 and VoxCeleb2 datasets [7]-[9]. The VoxCeleb dataset is one of the most popular corpora for large-scale text-independent speaker verification and is composed of development and test sets with no overlapping speakers. The speech samples were extracted from YouTube video clips and are degraded with real-world noises, including background chatter, laughter, overlapping speech, room acoustics, etc. For learning the proposed speaker representations, we used the development sets of VoxCeleb1 and VoxCeleb2, which consist of 148,642 and 1,092,009 utterances from 1,211 and 5,994 speakers, respectively. All networks were trained in a fully self-supervised manner without using any speaker labels. The evaluation was performed on the test set of VoxCeleb1, which is composed of 4,874 utterances spoken by 40 speakers. We followed the original VoxCeleb1 test list, including 37,720 trial pairs.

2) Model architectures
The proposed framework comprises three networks: the online, the target, and the uncertainty networks. In the front-end stage, the online and target networks were used for learning the bootstrap equilibrium speaker representations. They were asymmetrically updated via the Adam optimizer and EMA, respectively. For the back-end stage, the online and uncertainty networks were leveraged to estimate the probabilistic speaker embeddings (shown in FIGURE 2). The detailed configurations of the networks are as follows:
• Online network: Three modules were included in the online network: the online encoder, the online projector, and the online predictor. For the online encoder, we adopted Fast ResNet34 proposed in [5]. In speaker verification tasks, Fast ResNet34 has shown competitive performance with fewer parameters than the original ResNet of 34 layers. We used 16, 32, 64, and 128 channels in the convolutional layers and 40-dimensional log Mel-filterbanks as the input. The outputs were aggregated through the self-attentive pooling layer, and the 2048-dimensional speaker embeddings were extracted (shown in TABLE 1). In the online projector, the speaker embeddings were projected to the smaller space through the multi-layer perceptron (MLP), consisting of the FC layer with 4096-dimensional output, batch normalization (BN), rectified linear units (ReLU), and the final FC layer with 512-dimensional output (i.e., FC-BN-ReLU-FC). The 512-dimensional outputs were acquired via the final layer. The online predictor performed the regression to predict the target projections. It took the 512-dimensional online projections as the input and had the same architecture as the online projector.
• Target network: The target network contained the target encoder and the target projector. All modules of the target network have the same architecture as those of the online network, and the target network has no predictor. The Fast ResNet34 front-end backbone was used for the target encoder, and the MLP module (i.e., FC-BN-ReLU-FC) was employed for the target projector. The target network was composed of parameters separated from those of the online network.
• Uncertainty network: We leveraged two networks to estimate the data uncertainty: the multi-level representation fusion module and the uncertainty estimator. The uncertainty of the speech samples can contain several factors from the low-level (e.g., frame-level local details) to high-level features (e.g., utterance-level global nuisances). To effectively process these various level features, we con-

3) Implementation details
Our implementation was based on the PyTorch toolkit [41] using a single NVIDIA Tesla M40 GPU with 24GB memory. During training, we randomly cropped an input utterance into two 1.80-sec segments, and then the two crops were differently augmented with MUSAN noises [42] and the room impulse response (RIR) filters. The MUSAN noises consisted of music, noises, and babble: 42 hours of music, 900 hours of noises, and 60 hours of babble speech. The SNR was randomly selected in the range of 0-15dB for the noise, 5-15dB for the music, and 13-20dB for the speech. Also, for the reverberation via RIR filters, we utilized the simulated RIRs [43] where the filter gain was randomly sampled between -3.0 and 7.0. The 40-dimensional log Mel-filterbanks were extracted with a Hamming window of 0.25-sec length and 0.10-sec hop-size with the 512-size FFT. The mean and variance normalization (MVN) was applied to the extracted acoustic features [44]. All the networks in the experiments were trained with a base batch size of 200. The online and uncertainty networks were optimized using the Adam optimizer [45] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and an initial learning rate of 0.001 decreasing by 5% every 10 epochs. The parameters ξ of the target network were fixed by the stop-gradient while optimizing the parameters θ of the online network. Then, they were updated via the momentum-based EMA with the target decay rate $\tau = 1 - (1 - \tau_{base}) \cdot (\cos(\pi k / K) + 1)/2$, where $\tau_{base} = 0.996$ and K is the maximum number of iteration steps. For the verification, the cosine similarity and MLS were used as the score measures, and two performance metrics were evaluated: the equal error rate (EER) and the minimum detection cost function (MinDCF). The EER indicates the error when the false alarm rate (FAR) and the false reject rate (FRR) are the same, and the MinDCF is defined as the minimum value of the weighted sum of the FAR and FRR. The parameters of MinDCF were set as $C_{miss} = 1$, $C_{fa} = 1$, and $P_{target} = 0.05$.
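The two metrics can be illustrated with a small threshold sweep over trial scores. This is a simplified sketch, not the evaluation script used in the paper: the function names are hypothetical, the EER is approximated at the threshold where FAR and FRR are closest, and the detection cost is divided by min(C_miss · P_target, C_fa · (1 − P_target)), a common normalization that is assumed here rather than stated in the text.

```python
def error_rates(scores, labels, threshold):
    """Return (FAR, FRR) when trials with score >= threshold are accepted."""
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    frr = sum(s < threshold for s in targets) / len(targets)
    far = sum(s >= threshold for s in nontargets) / len(nontargets)
    return far, frr


def eer_and_min_dcf(scores, labels, c_miss=1.0, c_fa=1.0, p_target=0.05):
    """Sweep all candidate thresholds; quadratic cost, for illustration only."""
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    best_gap, eer, min_dcf = float("inf"), None, float("inf")
    for threshold in thresholds:
        far, frr = error_rates(scores, labels, threshold)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
        dcf = c_miss * frr * p_target + c_fa * far * (1.0 - p_target)
        min_dcf = min(min_dcf, dcf)
    # Assumed normalization by the cost of the best trivial system.
    return eer, min_dcf / min(c_miss * p_target, c_fa * (1.0 - p_target))


if __name__ == "__main__":
    trial_scores = [0.9, 0.4, 0.6, 0.1]
    trial_labels = [1, 1, 0, 0]  # 1 = same speaker, 0 = different speakers
    print(eer_and_min_dcf(trial_scores, trial_labels))
```

With P_target = 0.05, false alarms carry most of the weight in the cost, so the MinDCF threshold typically sits higher than the EER threshold.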

B. RESULTS: SPEAKER VERIFICATION VIA BOOTSTRAP EQUILIBRIUM TRAINING STRATEGY

1) Training analysis
In order to investigate whether the proposed training strategy prevents speaker representations from collapsing to a trivial solution, we conducted an ablation study on the original VoxCeleb1 test set using the VoxCeleb2 training set. Since we argued that the uniformity regularization loss $L^{unif}_{\theta,\xi}$ can help avoid the collapsed representations by enhancing the missing speaker-variability in the latent space, we validated the effectiveness of the uniformity regularization term by setting its weight λ to 0, 1, 2, and 5.
The training dataset was based on the development set of VoxCeleb2, and the front-end networks were trained via the bootstrap equilibrium training strategy with a batch size of 200 for 300 epochs (1,638K steps in total). The evaluation was done on the original VoxCeleb1 test set with the cosine similarity as the back-end scoring measurement. The front-end objective function was $L^{front}_{\theta,\xi} = L^{pred}_{\theta,\xi} + \lambda L^{unif}_{\theta,\xi}$ where λ ∈ {0, 1, 2, 5}, and the rest of the experimental settings remained the same as in Section IV-A. When only the bootstrap prediction loss was minimized (i.e., λ = 0), the loss naively approached a value of zero (FIGURE 3 (b)). On the other hand, jointly training with the uniformity regularization loss $\lambda L^{unif}_{\theta,\xi}$ prevented the bootstrap prediction loss from naively approaching a value of zero, as shown in the results of FIGURE 3 (b). The results of EER and MinDCF in FIGURE 3 (d) and (e) showed that leveraging a proper uniformity regularization can enhance the speaker verification performance, while minimizing the bootstrap prediction loss alone (i.e., λ = 0) does not escape the performance degradation. The proposed regularization strategy can thus prevent the representations from falling into a trivial solution. Using the proposed bootstrap equilibrium training strategy, we achieved the best performances of 6.75% in EER with λ = 2 and 0.395 in MinDCF with λ = 5, which outperformed the conventional self-supervised speaker verification methods. The comparison of results with conventional methods using the VoxCeleb2 training set will be presented in TABLE 5 of Section IV-C.2.

2) Comparison over the conventional methods
In this section, we compared the performance with the existing self-supervised speaker verification methods using the VoxCeleb1 training dataset. The previous works include NPC [18], MoCoVox [46], and Channel-invariant training (Chnl) [21] with the prototypical (Prot) or angular prototypical (AProt) loss. NPC employed a short-term active-speaker stationarity hypothesis assuming that two temporally close speech segments belong to the same speaker and learned representations to discriminate positive and negative speaker pairs. MoCoVox applied momentum contrast [14] to speaker representation learning using the VoxCeleb1 training set and analyzed the similarity distribution for verification pairs under various augmentations. Channel-invariant training was based on a joint training approach using the conventional angular prototypical objective and a channel-invariant loss formulated as the distance between the embedding of augmented segments and its clean version. The distance for the channel-invariant loss was computed by either the cosine similarity (Chnl_cos) or the mean squared error (Chnl_mse). Furthermore, we included the result of the joint training for the angular prototypical and the uniformity regularization loss (AProt + Unif) with a batch size of 200. Finally, we reported the performance of the typical supervised learning method (supervised AProt [5]) in which the AProt loss and the Fast ResNet34 backbone with 512-dimensional embeddings were employed. In this experiment, the training dataset was based on the VoxCeleb1 development set, and the data augmentation with both noise addition and reverberation (N+R) was used; in TABLE 2, the notation N | R denotes the data augmentation with either noise addition or reverberation, and S is SpecAugment [47].

TABLE 2
Method                   Training set   Augmentation   Scoring   EER (%)   MinDCF
Prot [46]                VoxCeleb1      N | R          COS       17.42     -
AProt [46]               VoxCeleb1      N | R          COS       14.69     -
MoCoVox [46]             VoxCeleb1      N | R          COS       13.48     -
AProt [21]               VoxCeleb1      N+R+S          EUC       11.07     0.700
AProt + Chnl_cos [21]    VoxCeleb1      N+R+S          EUC       9.94      0.683
AProt + Chnl_mse [21]
The speaker representations were learned via the bootstrap equilibrium training strategy with a batch size of 200 for 200 epochs (148.6K steps in total). The evaluation was performed on the original VoxCeleb1 test set, and the cosine similarity (COS) was used as the back-end scoring metric; EUC denotes the Euclidean distance in TABLE 2. The rest of the experimental setups followed those of Section IV-A.
The results are given in TABLE 2. The bootstrap equilibrium speaker representations (Boot+Unif) improved the performance compared with the conventional methods. When the uniformity regularization loss weights were set to λ = 5 and λ = 2, the best performances of 9.20% in EER and 0.485 in MinDCF were achieved, respectively. These results outperformed the conventional methods based on the contrastive learning framework.

3) Effectiveness of batch size changes
By learning the speaker representations through the bootstrap training framework, we could reduce the dependency on the negative pairs, which are an essential factor in contrastive learning methods. To demonstrate this effect, we compared the performance degradation of the proposed method with that of the contrastive learning method as the batch size decreases.
Two models were utilized for the comparison. First, for the bootstrap training, we learned the speaker representations via the bootstrap equilibrium training strategy with uniformity regularization weight λ = 5. The experimental setups were the same as Section IV-A, IV-B.1, and IV-B.2.
Next, as the contrastive learning method, we used the angular prototypical loss [20]-[22] and further added the uniformity regularization term. This model has shown competitive results on both the VoxCeleb1 and VoxCeleb2 (shown in Section IV-C.2) training sets with relatively limited resources (e.g., augmentation types, smaller batch size, and front-end architecture with fewer parameters) compared with the previous works.
We used the VoxCeleb1 training set to train the front-end networks with batch sizes of 100, 200, 300, and 400 for 148.6K iterations. The results were evaluated on the original VoxCeleb1 test set with cosine similarity scoring. The experiments were repeated three times, and we reported the mean and standard deviation. The results in TABLE 3 indicate that the performance of both models gradually deteriorates as the batch size decreases. However, the bootstrap model showed smaller performance drops than the contrastive method in both EER and MinDCF. In particular, its MinDCF remained consistent across the batch sizes. The relative performance deterioration rate and the summary for the batch size changes can be seen in TABLE 4 and FIGURE 4, respectively.
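The reporting procedure can be sketched as follows: each configuration is summarized by its mean and standard deviation over the repeated runs, and a relative deterioration rate is computed against a reference configuration. The choice of the largest batch size as the reference, the use of the sample standard deviation, and all numbers below are assumptions for illustration; they are not values from TABLE 3 or TABLE 4.

```python
import numpy as np


def summarize_runs(values):
    """Mean and sample standard deviation (ddof=1) over repeated runs."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))


def relative_deterioration(reference, value):
    """Relative increase (in %) of an error metric with respect to a reference."""
    return 100.0 * (value - reference) / reference


if __name__ == "__main__":
    # Hypothetical EER values (%) for three runs per batch size.
    eer_runs = {400: [9.0, 9.1, 9.2], 100: [9.9, 10.0, 10.1]}
    ref_mean, ref_std = summarize_runs(eer_runs[400])
    for batch_size, runs in eer_runs.items():
        mean, std = summarize_runs(runs)
        rate = relative_deterioration(ref_mean, mean)
        print(f"batch {batch_size}: {mean:.2f} +/- {std:.2f} ({rate:+.2f}%)")
```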

C. RESULTS: UNCERTAINTY-AWARE PROBABILISTIC SPEAKER EMBEDDING TRAINING STRATEGY

1) Estimation of data uncertainty via the back-end network
The uncertainty estimator is trained to capture the data uncertainty (e.g., ambient noise, reverberant environments, and distortions) in the input speech samples. To check the back-end network's ability to model the data uncertainty, we examined the distribution of the variances output by the uncertainty estimator. Using the trained uncertainty estimator, we inferred the uncertainty of the speech samples from the development sets of VoxCeleb1 and 2. Furthermore, we applied random noises and reverberations at five strength levels to the speech samples: the ranges of 0 to 25 dB SNR for noise, 5 to 30 dB SNR for music, 13 to 28 dB SNR for babble, and -5 to 10 gain for the reverberation were each divided into five sub-ranges on a linear scale. FIGURE 5 shows the distribution of the estimated uncertainty on the VoxCeleb1 and 2 development datasets, where the colors of the distributions indicate the strength of the data augmentation. The outputs estimated from more strongly augmented input speech samples had greater uncertainty values and deviations.
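The division of the augmentation ranges into five strength levels can be sketched as below. The ranges are those stated in the text. The helper names, the uniform sampling within a sub-range, and the level indexing are assumptions; in particular, which end of each range counts as the strongest level is not specified here.

```python
import numpy as np

# Augmentation ranges as stated in the text
# (dB SNR for additive sources, gain for reverberation).
RANGES = {
    "noise": (0.0, 25.0),
    "music": (5.0, 30.0),
    "babble": (13.0, 28.0),
    "reverb_gain": (-5.0, 10.0),
}


def split_range(low, high, n_levels=5):
    """Split [low, high] into n_levels equal-width sub-ranges on a linear scale."""
    edges = np.linspace(low, high, n_levels + 1)
    return [(float(edges[i]), float(edges[i + 1])) for i in range(n_levels)]


def sample_in_level(low, high, level, n_levels=5, rng=None):
    """Draw one value uniformly from the sub-range with index `level` (0-based)."""
    rng = np.random.default_rng() if rng is None else rng
    sub_low, sub_high = split_range(low, high, n_levels)[level]
    return float(rng.uniform(sub_low, sub_high))


if __name__ == "__main__":
    for name, (low, high) in RANGES.items():
        print(name, split_range(low, high))
```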

2) Results of the two-stage framework and comparison with the conventional methods
This section compares the self-supervised speaker verification results on the VoxCeleb2 training dataset between the conventional methods and the proposed two-stage framework. The conventional methods include Disent [48], CDDL [49], GCL [19], I-vector [50], and the AAT [20], Chnl [21], and ProNCE [22] techniques with the corresponding contrastive losses (i.e., Prot, AProt, SimCLR, MoCo, and ACont). Disent and CDDL leveraged the cross-modal synchrony between faces and audio in a video to learn the speaker representations. I-vector was a popular method in speaker recognition before the emergence of deep learning and was commonly used with a probabilistic linear discriminant analysis (PLDA) back-end [51]. In the benchmark reported in [20], the cosine similarity back-end was used for the unsupervised setting. ProNCE [22] introduced a prototypical memory bank into the speaker embedding training to treat the negative samples efficiently. Also, we included the joint training methods using the contrastive losses (AProt and ACont) and the uniformity regularization loss (Unif), where the ACont loss is a symmetrical form of AProt in terms of a query and a support. These models were trained with a batch size of 200 for 300 epochs using the Fast ResNet34.

TABLE 5 (rows for the conventional methods)
Method                 Training set  Augmentation  Scoring  EER (%)  MinDCF
Prot [20]              VoxCeleb2     N | R         EUC      12.42    0.623
Prot [20]              VoxCeleb2     N+R           EUC      10.16    0.524
AProt [20]             VoxCeleb2     N | R         EUC      11.60    0.620
AProt [20]             VoxCeleb2     N+R           EUC      9.56     0.511
Prot + AAT [20]        VoxCeleb2     N | R         EUC      10.54    0.544
Prot + AAT [20]        VoxCeleb2     N+R           EUC      9.36     0.482
AProt + AAT [20]       VoxCeleb2     N | R         EUC      9.03     0.512
AProt + AAT [20]       VoxCeleb2     N+R           EUC      8.65     0.454
AProt [21]             VoxCeleb2     N+R+S         EUC      9.23     0.646
AProt + Chnl cos [21]  VoxCeleb2     N+R+S         EUC      8.31     0.615
AProt + Chnl mse [21]  VoxCeleb2     N+R+S         EUC      8.28     0.610
SimCLR [22]            VoxCeleb2     -             COS      18.14    0.810
MoCo [22]              VoxCeleb2     -             COS      15.11    0.800
MoCo [22]              VoxCeleb2     S             COS      15.50    0.810
MoCo [22]              VoxCeleb2     N+R           COS      8.63     0.640
MoCo (ProNCE) [22]
Finally, we included the performance of the fully supervised method (supervised AProt) reported in [5]. In our front-end stage, the speaker representations were learned via bootstrap equilibrium training with a batch size of 200 for 300 epochs, in the same manner as in Section IV-B.2. Moreover, in the back-end training stage, we fixed the front-end parameters and utilized the learned speaker representations as the mean vectors of the probabilistic speaker embeddings. Then, the uncertainty estimator was trained with a batch size of 200 for 30 epochs (163.8K steps in total). All networks were trained using the VoxCeleb2 training set, and the objective was set as equation (23), i.e., $L^{\mathrm{back}}_{\phi,\theta} = L^{\mathrm{mls}}_{\phi,\theta} + \gamma L^{\mathrm{cnst}}_{\phi,\theta}$, where γ was set to 1. The results were evaluated on the original VoxCeleb1 test set, and the mutual likelihood score (MLS), computed from the means and variances of the probabilistic speaker embeddings learned in the back-end stage, was used as the scoring measure.
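For reference, MLS scoring between two probabilistic embeddings can be sketched as below. The sketch assumes diagonal Gaussian embeddings and uses the form of the MLS commonly given in the probabilistic embedding literature. The definition used in this paper is given in its method section and may differ in constants or notation. The function name and the toy vectors are hypothetical.

```python
import numpy as np


def mutual_likelihood_score(mu1, var1, mu2, var2):
    """MLS between two diagonal-Gaussian embeddings (higher means more similar).

    mu1, mu2: mean vectors (here, the fixed front-end representations).
    var1, var2: per-dimension variances (here, the uncertainty estimator outputs).
    """
    mu1, var1, mu2, var2 = (
        np.asarray(x, dtype=float) for x in (mu1, var1, mu2, var2)
    )
    var_sum = var1 + var2
    dim = mu1.shape[-1]
    distance_term = np.sum((mu1 - mu2) ** 2 / var_sum + np.log(var_sum))
    return float(-0.5 * distance_term - 0.5 * dim * np.log(2.0 * np.pi))


if __name__ == "__main__":
    # Toy embeddings (hypothetical values, not data from the paper).
    mu_a, mu_b = np.array([0.6, 0.8]), np.array([0.8, 0.6])
    confident = np.array([0.01, 0.01])
    uncertain = np.array([1.0, 1.0])
    print(mutual_likelihood_score(mu_a, confident, mu_a, confident))
    print(mutual_likelihood_score(mu_a, confident, mu_b, confident))
    print(mutual_likelihood_score(mu_a, uncertain, mu_b, uncertain))
```

Under this form, the squared difference between the means is weighted by the summed variances, so a mismatch between two confident embeddings is penalized more heavily than the same mismatch between uncertain ones.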
The results are shown in TABLE 5. First, the bootstrap equilibrium speaker representations with the cosine similarity back-end outperformed the conventional self-supervised speaker representation methods. The best-performing models achieved 6.75% in EER with λ = 2 and 0.395 in MinDCF with λ = 5. Next, evaluating with the MLS between the estimated probabilistic speaker embeddings further improved the performance, with the best results of 6.42% in EER and 0.345 in MinDCF at λ = 2.

3) Performance analysis of the back-end stage
To investigate the speaker verification performance of the back-end stage, we conducted an ablation study on the original VoxCeleb1 test set. First, we checked the speaker verification results for the models trained on the VoxCeleb1 and VoxCeleb2 training datasets, respectively. The best-performing bootstrap equilibrium speaker embeddings from the front-end, i.e., λ = 5 for VoxCeleb1 and λ = 2 for VoxCeleb2, were used as the mean vectors of the probabilistic speaker embeddings. Also, we analyzed the performance for different uncertainty constraint loss weights γ ∈ {0, 0.5, 1, 2, 3}. In the front-end stage, the online encoder was trained with a batch size of 200 for 200 epochs on the VoxCeleb1 dataset and for 300 epochs on the VoxCeleb2 dataset. The uncertainty network was trained with the same batch size for 30 epochs on both training sets. As shown in TABLE 6, the speaker verification performance on both the VoxCeleb1 and 2 training sets was further enhanced, with the best results of 8.84% in EER and 0.489 in MinDCF on VoxCeleb1, and 6.38% in EER and 0.345 in MinDCF on VoxCeleb2. Compared with the corresponding cosine similarity back-end, these are relative improvements of 3.91% in EER and 3.93% in MinDCF on VoxCeleb1, and 5.48% in EER and 16.67% in MinDCF on VoxCeleb2.
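The relative EER improvements can be checked against the cosine similarity results reported earlier in this section (9.20% on VoxCeleb1 with λ = 5 and 6.75% on VoxCeleb2 with λ = 2). The cosine-based MinDCF values of these two particular models are not stated in the text, so only the EER figures are worked out here:

```latex
\[
\frac{9.20 - 8.84}{9.20} \times 100 \approx 3.91\%,
\qquad
\frac{6.75 - 6.38}{6.75} \times 100 \approx 5.48\%.
\]
```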

V. CONCLUSION
In this paper, we proposed self-supervised speaker representation learning strategies, consisting of the bootstrap equilibrium speaker representation learning in the front-end and the uncertainty-aware probabilistic speaker embedding training in the back-end. For the front-end stage, we learned the speaker representations via the bootstrap training scheme with the uniformity regularization term. Then, in the back-end stage, the probabilistic speaker embedding was estimated by maximizing the MLS. Finally, we computed the MLS between the estimated probabilistic speaker embeddings and utilized it for the verification. The integrated two-stage framework improved the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF, outperforming the conventional methods based on contrastive learning.