ASVtorch Toolkit: Speaker Verification with Deep Neural Networks

The human voice differs substantially between individuals. This facilitates automatic speaker verification (ASV) — recognizing a person from his/her voice. ASV accuracy has increased substantially throughout the past decade due to recent advances in machine learning, particularly deep learning methods. An unfortunate downside has been the substantially increased complexity of ASV systems. To help non-experts kick-start reproducible ASV development, a state-of-the-art toolkit implementing various ASV pipelines and functionalities is required. To this end, we introduce a new open-source toolkit, ASVtorch, implemented in Python using the widely used PyTorch machine learning framework.


1
Automatic speaker verification (ASV) systems [1] compare a pair of speech utterances (an enrollment and a test utterance) to decide whether or not the same speaker is present in the two. Modern ASV systems involve three broad tasks: (i) extraction of features from short segments of speech (frames); (ii) forming a fixed-dimensional vector representation (speaker embedding) per utterance; and (iii) comparison of the enrollment and test embeddings to assess the degree of speaker similarity.

To help non-experts develop ASV systems and to make research results reproducible, a number of ASV software bundles have been introduced in the literature.

Figure 1 shows the processing pipeline of modern ASV systems. The three main components are the feature extractor, the speaker embedding extractor, and the scoring back-end. A brief description of these components is given below. We refer the interested reader to [17,18,19], and references therein, for further details.

Feature extraction. The function of a feature extractor is to produce a meaningful, compact representation of the input speech signal. The signal is first segmented into overlapping frames of 20 to 30 ms. From each frame, relevant features are extracted, for instance, mel-frequency cepstral coefficients (MFCCs) [20] in ASVtorch.

Speaker embedding. A speaker embedding is a fixed-dimensional representation of a variable-length utterance. Utterances of the same speaker are close to each other in the embedding space. The idea is similar to word embeddings [21,22] in natural language processing, but in our case for acoustic data. Popular examples are x-vector [12] and i-vector [11] embeddings.

Scoring back-end. Given a pair of enrollment and test utterance embeddings, φ_e and φ_t, a similarity score is computed. It might be a simple cosine similarity, or a statistical back-end such as probabilistic linear discriminant analysis (PLDA) [23,24]. It is also common to pre-process the embeddings through dimension reduction with linear discriminant analysis (LDA) [25], whitening, and length normalization [26]. ASVtorch implements both embedding pre-processing and PLDA-based scoring.

The front-end package contains Python wrappers for Kaldi's shell scripts [3] to extract MFCCs [20] and to perform voice activity detection (VAD) and data augmentation. The package also contains a feature loader to load the features into NumPy arrays, to compute delta features, and to perform cepstral mean and variance normalization of the features.

The back-end package consists of an embedding processor, PLDA, and score normalization. The embedding processor can be used to center, whiten, and length-normalize the speaker embeddings, as is common practice; a simplified sketch of these steps is given after this package overview.

The network package contains all deep learning related functionality, including several architectures for speaker embedding extraction, a training loop for DNNs, and many custom neural network components. It also contains two data loaders: one to load short-term features during DNN training and the other to load features of full-length segments for validation and embedding extraction.

The i-vector package implements the i-vector pipeline. This includes fast computation of GMM frame alignments and training of i-vector extractors, both of which utilize GPU acceleration as detailed in [27]. It also contains the feature and Baum-Welch statistics data loaders needed in frame alignment computation, i-vector extraction, and model training.
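As a concrete illustration of the embedding pre-processing and scoring described above, the following is a minimal NumPy sketch of centering, whitening, length normalization, and cosine scoring. It is a simplified stand-in rather than ASVtorch's actual API; the function names are illustrative, and cosine scoring is used here in place of LDA and PLDA for brevity.

```python
import numpy as np

def fit_centering_whitening(train_embeddings):
    """Estimate the mean and a whitening transform from training embeddings (N x D)."""
    mu = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    whitener = eigvec / np.sqrt(eigval + 1e-10)  # scale eigenvectors so covariance becomes identity
    return mu, whitener

def preprocess(embedding, mu, whitener):
    """Center, whiten, and length-normalize a single embedding (D,)."""
    e = (embedding - mu) @ whitener
    return e / np.linalg.norm(e)

def cosine_score(enroll, test, mu, whitener):
    """Similarity score for one trial: cosine of the pre-processed embeddings."""
    return float(preprocess(enroll, mu, whitener) @ preprocess(test, mu, whitener))
```

In practice, the statistics (mu, whitener) would be estimated once from a large training set and then applied unchanged to all enrollment and test embeddings.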
The data loaders supply the data needed to perform the primary computation work. This subsection describes the functionality of each data loader.

The data loader for DNN training crops the training utterances within a minibatch to enforce the same duration before feeding the minibatch to the network. Although the duration within a minibatch is fixed, it may vary across minibatches. As a result of the random sampling of crop positions, no two epochs use exactly the same data. A minimal sketch of such a cropping collate function is given below.
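The sketch below shows one way to implement this kind of minibatch cropping with a standard PyTorch collate function. It is an illustrative assumption, not ASVtorch's actual loader; the function name, frame limits, and the assumption that each dataset item is a (features, label) pair are hypothetical.

```python
import random
import torch

def crop_collate(batch, min_frames=200, max_frames=400):
    """Collate variable-length feature matrices (frames x dim) into a minibatch
    by cropping every utterance to one randomly chosen common duration."""
    shortest = min(feats.shape[0] for feats, _ in batch)
    crop_len = min(shortest, random.randint(min_frames, max_frames))
    features, labels = [], []
    for feats, label in batch:
        start = random.randint(0, feats.shape[0] - crop_len)  # random crop position
        features.append(torch.as_tensor(feats[start:start + crop_len]))
        labels.append(label)
    return torch.stack(features), torch.tensor(labels)

# Hypothetical usage with a standard PyTorch DataLoader:
# loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True,
#                                      collate_fn=crop_collate)
```

Because both the crop length and the crop start positions are drawn at random, consecutive epochs see different portions of the training utterances.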

For accelerated embedding extraction on a GPU, we designed a separate data loader that organizes the enrollment and test utterances into minibatches.

The requirement of constant duration within a minibatch is handled by

In ASVtorch, the training process of deep embedding extractors is monitored in a number of ways. The training and validation losses, as well as the training and validation classification accuracies, are periodically reported. Additionally, ASVtorch reports losses and accuracies on full-duration segments besides the cropped training segments. This allows monitoring potential issues caused by duration variation; a minimal sketch of such an accuracy computation is given below.
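For illustration, the following is a minimal sketch of how classification accuracy could be computed from the network's output layer on both cropped and full-length validation data. It is not ASVtorch's monitoring code; the function name and the loader names in the usage comment are hypothetical.

```python
import torch

@torch.no_grad()
def classification_accuracy(model, batches, device="cpu"):
    """Fraction of segments whose highest-scoring speaker matches the label."""
    correct, total = 0, 0
    for features, labels in batches:            # features: (batch, frames, feat_dim)
        logits = model(features.to(device))     # (batch, n_training_speakers)
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage: the two loaders yield cropped and full-length validation data.
# acc_cropped = classification_accuracy(model, cropped_val_loader)
# acc_full = classification_accuracy(model, full_length_val_loader)
```

Comparing the two accuracies gives a quick indication of whether training on cropped segments degrades performance on full-duration utterances.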

The above metrics are often not enough to reliably determine the quality of the speaker embeddings, as the losses and accuracies are computed from the output layer of the network, whereas the embeddings are extracted from one of the preceding layers. Thus, ASVtorch runs a scaled-down version of an ASV system with a PLDA back-end after every Nth epoch to monitor the progress in terms of the EER and minDCF metrics.

In ASVtorch, the parameters of a DNN are optimized using the minibatch stochastic gradient descent (SGD) algorithm. In SGD, the magnitude of the updates to the network parameters is controlled by the learning rate parameter. In practice, it is beneficial to begin training with a high learning rate and decrease it as training progresses. This allows the network to converge fast at the beginning while getting closer to a minimum of the loss landscape at the final stages of training. To lessen the need for manual tuning of the learning rate schedule, we use a learning rate scheduler based on the training loss. The scheduler operates as follows: if the relative decrease in training loss between two consecutive epochs is less than a predefined percentage, the learning rate is halved. Training stops when the learning rate has been halved twice in a row.

Speaker embedding extraction uses temporal pooling of features to form a fixed-sized representation. This is implemented via an equal-weight averaging (or pooling) of transformed features across all time steps. A fundamental concern is to ensure that the gradient is properly back-propagated through the temporal pooling layer. The authors are unaware of such an analysis being reported in earlier ASV studies. Specifically, we are interested in the backward propagation of the gradient through

a = (1/T) ∑_{t=1}^{T} h_t,    (1)

h_t = g(W f_t + b),    (2)

where h_t is the transformed feature vector at the t-th time step. Here, W, b, and f_t are the weight matrix, bias vector, and input to a feed-forward layer followed by a non-linear activation function g. This could be a rectified linear unit (ReLU), a leaky ReLU, or some other activation function. A minimal PyTorch sketch of such a pooling layer is given below.
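The following toy PyTorch module illustrates the pooling operation in (1)–(2). It is a simplified sketch, not ASVtorch's actual embedding architecture; the class and parameter names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MeanPoolingEmbedder(nn.Module):
    """Frame-level affine transform and activation followed by equal-weight
    temporal pooling, i.e. a = (1/T) * sum_t g(W f_t + b)."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(feat_dim, hidden_dim)  # W and b
        self.activation = nn.ReLU()                    # g

    def forward(self, frames):                    # frames: (batch, T, feat_dim)
        h = self.activation(self.linear(frames))  # h_t = g(W f_t + b)
        return h.mean(dim=1)                      # average over the T time steps

# When a loss computed from this output is back-propagated, autograd applies
# exactly the chain rule analyzed in the text: each time step receives 1/T of
# the upstream gradient, and the per-frame contributions are summed into the
# gradients of W and b.
```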

Let ∂L/∂a be the gradient at the output of the temporal pooling layer, where L denotes the loss function (typically, a multi-class cross-entropy loss) used to optimize the network. Propagating the gradient backward through the averaging operation essentially divides the gradient into T equal parts such that

∂L/∂h_t = (1/T) ∂L/∂a,    t = 1, 2, ..., T.    (3)

Since the same set of weights W is used to process the features f_t, for t = 1, 2, ..., T, the gradients at all the T time steps are summed, as follows:

∂L/∂W = ∑_{t=1}^{T} (∂L/∂h_t) (∂h_t/∂W),    ∂L/∂b = ∑_{t=1}^{T} (∂L/∂h_t) (∂h_t/∂b).    (4)

From (3) and (4), we see that the net effect of back-propagating the gradient through the temporal pooling layer is that the gradient is spread evenly over the frames and then summed to update the weight matrix and bias vector. From (4), it is also clear that a weighted average could be used to give more attention to frames deemed more important for the speaker recognition task. We refer interested readers to [29,30] for further details.

The VoxCeleb1 and VoxCeleb2 datasets used for the VoxCeleb evaluation were collected from YouTube by the authors of [14] and [15]. VoxCeleb1 consists of more than 150 000 utterances from 1251 speakers and VoxCeleb2 consists of over 1.1M utterances from 6112 speakers. The average utterance duration is about eight seconds. In the VoxCeleb recipe, we train the ASV systems using the VoxCeleb2 dataset, whereas VoxCeleb1 is used for testing. For training, the utterances originating from the same YouTube video are concatenated together. Testing is done using three trial lists introduced in [14] and [15]: the original VoxCeleb1 test list, the extended list (VoxCeleb1-E), and the hard list (VoxCeleb1-H).

[Comparison of the systems on the SITW core-core condition at the EER and MinDCF operating points, and in terms of the speech data required to develop (and evaluate) the systems and the number of model parameters.]

Crafting usable ASV systems used to be a specialist activity reserved for experts in audio processing. This often involved combining and interfacing different scripts and tools across programming languages or computing environments; part of the 'art' (as in 'state-of-the-art') was being aware of hidden implementation details (not always transparent in publications) and spending substantial time designing and cleaning up file lists. ASVtorch provides functionalities and recipes aimed at lowering the barrier for non-experts, especially those from other disciplines and industries, to quickly kick-start building ASV systems.

Whereas it took a relatively long time for deep learning to outperform classic modeling approaches in ASV, deep speaker embeddings are now considered the state of the art and are actively studied by many research groups. While providing state-of-the-art components, ASVtorch also implements accelerated variants of classic methods, such as the i-vector. This enables systematic comparison of new and existing algorithms on the same platform, and is consistent with our aim to promote reproducible research. In the example, we showed how to use ASVtorch to train, test, and evaluate a speaker verification system. Users can use the toolkit as-is, build modifications on top of the current functionalities, or even build commercial ASV systems.

We introduced the ASVtorch toolkit for automatic speaker verification (ASV), consisting of functionalities ranging from feature extraction to speaker embedding and scoring. These functionalities have been carefully crafted, fine-tuned, and tested on large-scale ASV tasks. Constructing a complete ASV pipeline is always a major undertaking, as it involves substantial domain knowledge. Our aim is to make the ASVtorch toolkit available to a wide audience, especially those from other fields and industry, and to encourage reproducible ASV research. While providing a complete ASV pipeline, the toolkit follows a library-like design and its components can also be used independently. This allows a flexible, robust way to trial different ASV methods. Additionally, we believe that various functionalities provided in the toolkit are applicable to other audio, speech, and time-series processing tasks, though their efficacy remains to be tested.

Conflict of Interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.