BELMASK—An Audiovisual Dataset of Adversely Produced Speech for Auditory Cognition Research

Abstract: In this article, we introduce the Berlin Dataset of Lombard and Masked Speech (BELMASK), a phonetically controlled audiovisual dataset of speech produced in adverse speaking conditions.


Introduction
The ubiquitous use of medical-grade face masks during the Coronavirus Disease 2019 (COVID-19) pandemic highlighted the challenges face masks pose for both human interlocutors and various technologies, such as automatic speech/speaker recognition (ASR) systems [1,2] and biometric authentication algorithms [3], which rely on unhindered speech and visual signals. In that sense, face masks can be regarded as an adverse condition and a hindrance to communication. Even after the pandemic, medical-grade face masks continue to be used by vulnerable populations and in various scenarios, including medical settings, industrial environments and regions with high levels of air pollution.

Summary of Research Findings on Acoustic and Perceptual Effects of Face Masks
There is a consensus in the literature that face masks affect the acoustic properties of speech by attenuating high-frequency components and thereby altering the spectral characteristics of the spoken signal; see [4] for a review. They also impede voice transmission and radiation due to the physical obstruction they pose [5]. This can lead to reduced intelligibility and speech recognition accuracy in both human listeners and machine-based systems, especially in noisy backgrounds and with increasing spatial distance. However, the extent of these effects varies, depending on the face mask type and on whether the speaker employs adjustment strategies.
For human listeners, the drop in intelligibility performance ranges from 2.8% to 15.4% [6][7][8][9][10], but the effects are not always significant [11] or consistently negative [12]. For machine-based systems, speech recognition accuracy was reported to be, on average, 10% lower with face masks at a signal-to-noise ratio (SNR) of +3 dB [2]. In quiet conditions, the impact of face masks on speech recognition accuracy seems to be negligible for both human listeners and machine-based systems. However, automatic speaker recognition and classification of mask/no-mask speech were found to be highly variable [1], suggesting that face masks may hamper speaker identification in critical practical scenarios that require high precision, e.g., in forensic contexts. Overall, filtering facepiece masks are generally more disruptive than surgical masks due to their material characteristics [13] and efficacy [14].
Face masks also obscure visual cues such as lip movements, which are crucial for audiovisual speech processing and for individuals who rely on lip-reading, namely hearing-impaired listeners and cochlear implant (CI) users [15,16] or non-native speakers [9]. As a consequence, face masks can lead to increased listening effort [7,8,17,18], even when no competing sounds are present; listening effort is defined as "the deliberate allocation of mental resources to overcome obstacles in goal pursuit when carrying out a [listening] task" [19]. This finding is consistent across ages.
Simultaneously, face masks can induce vocal fatigue [20][21][22], especially when speakers adapt their speaking style to compensate for the physical constriction [23]. This is the case when clear speech mechanisms are triggered, e.g., by hyperarticulating or speaking more slowly. Such adaptation strategies also include Lombard speech [24], a speaking style in which speakers increase their volume and pitch to become more understandable in noisy settings [25]. While clear speech has been shown to counteract the barriers imposed by a face mask quite efficiently, improving discrimination scores to a level similar to, or even higher than, that without face masks [9,26,27], upholding such speaking styles can be effortful for speakers and impacts voice quality, leading to hoarseness, volume instability and strain [28]. The implications of voice quality in the context of cognition have received less attention than other acoustic factors, but findings confirm that deviations from typical modal phonation can increase listeners' reaction times, listening effort and perceived annoyance, impair recall of spoken content and influence attitudes towards speakers, as summarized by [29]. In our own earlier work, we observed that Lombard speech negatively impacted recall performance despite the face mask's attenuation, which we hypothesized may be attributed to increased listener annoyance [30].
Besides immediate processing, face masks also appear to affect memory. Studies indicate that wearing face masks can reduce recall performance for audio-visually presented spoken material [9,31,32]. The reasons for the drop in recall performance are not yet fully understood, but the aforementioned studies postulate that the effect may likely be attributed to increased processing demands caused by signal deterioration, which in turn reduces the resources available for encoding speech in memory.
Face masks also seem to disrupt metacognitive processes, affecting the accuracy of confidence judgments [18]. Metacognition refers to the ability to monitor and assess one's own cognitive processes while engaged in a task [33]; an imperative skill during social interactions. Face masks furthermore influence the speed and accuracy of age, gender, identity and emotion categorization [34,35]. The diminished confidence monitoring and the heightened difficulties in identifying emotions are explained by the absence of visual cues, which are crucial for holistic processing. Given that compensatory strategies in adverse conditions are triggered by the subject's ability to correctly assess the quality of the communicative exchange, a disruption of these skills may be particularly problematic.

Datasets of Face-Masked Speech
To facilitate the study of face mask effects on acoustics, cognition and perception, extend the current knowledge base and address the described challenges, it is essential to develop and utilize datasets that include speech recorded with face masks, i.e., face-masked speech, preferably in diverse contexts and languages. Though several of the studies referenced in Section 1 of this article have used human recordings of face-masked speech, only two of them have released their datasets for further exploration. In addition, the overwhelming body of literature on the subject of face-masked speech employs English samples, which raises the problem of over-reliance on a single model language, as explained by, e.g., [36,37]. One of the few datasets available for the German language is the Mask Augsburg Speech Corpus (MASC) [38], which was originally collected as part of the Mask Sub-Challenge (MSC) mask/no-mask classification task of the INTERSPEECH 2020 Computational Paralinguistics Challenge (ComParE) [39]. A second dataset, recorded for similar feature extraction and classification purposes, is the MASCFLICHT Corpus [40].
While these datasets encompass a variety of guided, read and free speech tasks with and without a face mask, they were not originally conceived for research in the field of auditory cognition and therefore have a few shortcomings when applied to this domain. Both the MASC and the MASCFLICHT corpora are restricted to audio-only recordings, which limits their use for research questions aimed at exploring the role of auditory and visual cues, as well as their interactions. In addition, the MASC corpus only contains samples from participants wearing surgical masks, possibly due to the pre-pandemic data collection period. Given that FFP2-type face masks are considered the gold standard and have been shown to impact speech acoustics more heavily, datasets need to be extended to include this mask type. The MASCFLICHT corpus was recorded with a smartphone microphone. The suitability of mobile communication device (MCD) recordings for acoustic voice analysis is an ongoing subject of debate. While some studies report comparable results between high-quality recording systems and MCDs [41], others indicate that the robustness of some voice measures is compromised due to the limited dynamic range and uneven frequency responses of the inbuilt microphones [42,43]. Lastly, the principal limitation of both datasets for cognitive auditory research is the lack of controlled and standardized test material, e.g., matrix sentences, which is a necessary prerequisite for many research questions in this field.
Regarding Lombard speech, although prominent corpora exist for a variety of languages, e.g., the Audio-Visual Lombard Grid Speech Corpus [44] for native English or the Dutch English Lombard Native Non-Native (DELNN) Corpus [45], the authors have only been able to identify two readily available datasets for German. These are the Bavarian Archive for Speech Signals (BAS) Siemens Hörgeräte Corpus (HOESI) [46] and the Lombard Speech Database for German Language [47]. However, neither dataset comprises standardized test material, and both are limited to audio. The HOESI corpus features spontaneous, casual dialogues in diverse noisy environments, whereas the Lombard Speech Database for the German Language contains a collection of read sentences. As evidenced by [48], the Lombard effect is a multimodal phenomenon, characterized by increased face kinematics. Audiovisual datasets of Lombard speech are therefore particularly useful to further explore these aspects.

Data Description
The Berlin Dataset of Lombard and Masked Speech (BELMASK) was collected to facilitate research in the field of auditory cognition and to extend the resources available for the German language. It allows for the analysis of the effects of face masks on specific cognitive domains, such as memory or selective auditory attention, while considering related voice quality changes, i.e., Lombard speech, which commonly occurs when wearing a face mask, especially in ecologically valid and therefore inherently noisy settings. Given the nature of the dataset, the effects of face-masked and Lombard speech can also be studied independently.
The Berlin Dataset of Lombard and Masked Speech (BELMASK) is a phonetically controlled, multimodal dataset containing, in total, 128 min of audio and video recordings of 10 German native speakers uttering matrix sentences in cued, uninstructed speech in four conditions: (i) with an FFP2 mask in silence, (ii) without an FFP2 mask in silence, (iii) with an FFP2 mask while exposed to noise, (iv) without an FFP2 mask while exposed to noise. The noise consisted of mixed-gender six-talker babble played over headphones to the speakers, triggering the Lombard effect. All conditions are readily available in face-and-voice and voice-only formats. The speech material is annotated employing a multi-layer architecture. Due to the nature of the dataset and in accordance with existing regulations, it is stored in a restricted-access Zenodo repository under an academic, non-commercial license, which requires signing an End User License Agreement (EULA). The dataset is summarized in Table 1. The development of the matrix sentence test material and the data collection process are described in the following sections. All abbreviations used in this article are explained in the corresponding section at the end of the article.

Construction of the Test Material
The test material used in the BELMASK dataset is modeled after established matrix tests for the German language, such as the Oldenburg sentence test (OLSA) [49]. Matrix sentence tests were originally developed for adaptive speech-in-noise testing and are primarily used in the context of audiological diagnostics, e.g., to determine speech reception thresholds (SRT), but they are also broadly employed in speech intelligibility experiments. They usually consist of a basic inventory (matrix) of words and have a fixed grammatical structure. Word candidates within each group are interchangeable between sentences, allowing for random combinations, as the sketch below illustrates. Due to the limited alternatives per word group, words are eventually repeated between sentences. Whilst this is not critical, or even desirable, for certain settings, it limits the use of matrix tests for certain memory tasks with multiple testing conditions due to potential learning effects. This motivated the development of novel matrix sentences to be used for the administration of a cued serial recall task.
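To illustrate the matrix principle, the following minimal Python sketch draws one candidate per word group at random; the word lists are invented examples, not the actual BELMASK or OLSA inventory.

```python
# Minimal sketch of the classical matrix-test principle: a fixed grammatical
# frame with interchangeable candidates per word group. The word lists are
# invented examples, not the actual BELMASK or OLSA inventory.
import random

matrix = {
    "subject":   ["Peter", "Nina", "Thomas"],
    "verb":      ["kauft", "sieht", "malt"],
    "numeral":   ["zwei", "fünf", "neun"],
    "adjective": ["grüne", "alte", "nasse"],
    "object":    ["Ringe", "Schuhe", "Steine"],
}

def random_matrix_sentence() -> str:
    # Random combination across groups; with few alternatives per group,
    # words inevitably repeat between sentences, which is the property
    # that limits classical matrix tests for memory tasks.
    slots = ["subject", "verb", "numeral", "adjective", "object"]
    return " ".join(random.choice(matrix[slot]) for slot in slots)

print(random_matrix_sentence())  # e.g., "Nina malt fünf alte Ringe"
```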
The BELMASK test material consists of 96 semantically coherent German sentences. Considering that the memorization of words is influenced by their lexical frequency, with high-frequency words being easier to remember [50], the construction of the sentences took lexical frequency into account. This was done to equalize the level of difficulty, with mainly average- and high-lexical-frequency words being used; see Figure 1. To avoid context-based and linguistic structure bonuses in recall performance [51][52][53], the sentences are designed not to be highly predictable. Predictability was validated using a Large Language Model (LLM) and an optimized version of the pseudo-log-likelihood metric; see Section 3.2 for details. We opted for this type of validation instead of one with human subjects due to its cost-effectiveness and reproducibility, while also circumventing the multitude of confounders typically encountered with human participants. All sentences are syntactically identical and consist of 5-6 words, beginning with a subject, followed by a verb, a numeral, an adjective and an object, e.g., "Timo besitzt neun rosa Fahnen" ("Timo owns nine pink flags"). Subjects are either common German names or a noun with its respective article, which accounts for the variation in the number of words per sentence. The latter two words are always in plural form and serve as the keywords to be recalled. Together, they consist of four or five syllables to balance out the difficulty and prevent word length effects [54]. We opted for two keywords instead of one to facilitate mnemonic processing, encouraging strategies such as visualization or association. Each word, except for the numerals, appears only once within the test material.
To ensure that the test material is representative of the speech sounds contained in everyday communication, the sentences exhibit a phonemic distribution that aligns with the average phoneme distribution found in the German language and is comparable to other matrix tests; see Figure 2. Subjects contain the tense German vowels /a:/, /e:/, /i:/, /o:/, /u:/, equally distributed among the test material. Given that the sentences were initially not conceived to be used as separate test lists, the consonant distribution is not balanced between subsets of sentences. In future versions, we intend to optimize subsets with regard to equal phoneme distribution for all phoneme classes and optimize the sentences for equal mean intelligibility, equal degree of familiarity and equal number of syllables throughout all words. For the complete list of sentences, refer to Appendix A.

Figure 2. Phoneme distribution of the BELMASK test material and the OLSA [49], compared to the average phoneme distribution for written German (red), as reported in [55] (see Table "100,000 sound count"), and conversational German (yellow), based on the extended phone monogram statistics for the Verbmobil 1+2, SmartKom and RVG1 databases (https://www.bas.uni-muenchen.de/forschung/Bas/BasPHONSTATeng.html [56], accessed on 21 July 2024).

Validation of Predictability
The predictability of the matrix sentences was validated using Bidirectional Encoder Representations from Transformers (BERT), a language model which is freely available and can be used out-of-the-box without further fine-tuning [57]. We used a variant of BERT trained on 16 GB of German text data [58]. Masked Language Models (MLMs) like BERT are designed to predict semantic content, such as omitted words within a sentence, based on the context provided by the surrounding words. The process involves masking certain words within the input text and then training the model to predict the masked words using contextual clues from the unmasked words. To do so, the model accesses information bidirectionally from preceding and subsequent tokens. Tokens represent smaller units of words, segmented using the WordPiece subword tokenization method outlined in [59].
Perplexity is a metric used to score the performance of a model in this task, i.e., how surprised it is when predicting a masked word in a sequence. High perplexity values indicate worse model performance and therefore lower predictability. Perplexity for bidirectional models such as BERT can be calculated by means of the pseudo-log-likelihood (PLL) sentence scoring approach proposed by [60]. To score the BELMASK matrix sentences, we used an optimized metric (PLL-whole-word) proposed by [61], which adapts the scorer module of the minicons library [62] and takes into account whether preceding and future tokens belong to the same word. The score of a sentence "is obtained as the sum of the log probabilities of each of the |w| tokens in each of the |S| words in [a sentence] S given the token's context":

$$\mathrm{PLL}(S) = \sum_{w=1}^{|S|} \sum_{t=1}^{|w|} \log P_{\mathrm{MLM}}\!\left(s_{w_t} \mid S_{\setminus s_{w_{t'}},\, t' \geq t}\right),$$

where the current token s_{w_t} and all subsequent tokens belonging to the same word are masked when computing each log probability. The resulting analysis demonstrates that the BELMASK matrix sentences have strongly negative PLL scores (i.e., high absolute values), which reflects model surprisal. Compared to a reference sentence with high predictability, "Die Rakete fliegt ins All" ("The rocket flies into space"), with a PLL score of −9.72, all other sentences exhibit PLL scores in the range of −30 to −92.6; see Figure 3. This verifies that the content of the BELMASK matrix sentences is not highly predictable and that they are thereby suited for memory tasks. The figure also shows an inverse relationship between PLL scores and sentence length, with longer sentences exhibiting higher perplexity. Figure 4 furthermore demonstrates the positive correlation between PLL scores and word frequency, with higher-frequency words resulting in lower perplexity. The variation in word frequency and PLL scores also allows for an evaluation of how these metrics correlate with human retention.
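As a rough sketch of this scoring procedure, the snippet below uses the minicons scorer with a publicly available German BERT checkpoint; the checkpoint name and the PLL_metric argument are assumptions based on the public minicons interface, not the exact configuration used for BELMASK.

```python
# Sketch: PLL-based predictability scoring with a German BERT via minicons.
# The checkpoint "deepset/gbert-base" stands in for the model used in the
# paper; the PLL_metric argument follows the public minicons interface.
from minicons import scorer

mlm = scorer.MaskedLMScorer("deepset/gbert-base", "cpu")

sentences = [
    "Die Rakete fliegt ins All",      # highly predictable reference sentence
    "Timo besitzt neun rosa Fahnen",  # BELMASK-style matrix sentence
]

# Summing token log probabilities yields the PLL score of each sentence;
# "within_word_l2r" masks the current token and the remaining tokens of the
# same word, mirroring the whole-word correction of [61].
scores = mlm.sequence_score(
    sentences,
    reduction=lambda x: x.sum(0).item(),
    PLL_metric="within_word_l2r",
)
for sentence, pll in zip(sentences, scores):
    print(f"{pll:8.2f}  {sentence}")  # more negative = less predictable
```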

Speakers
A total of ten German native speakers (4 female, 6 male; mean age 30.2 years, SD 6.3 years) were recruited for the recording sessions of the matrix sentences. The sample consisted of university students and academic staff. All speakers reported normal hearing and vision and no known reading or speaking impairments. Table 2 summarizes the demographics for each subject. Written consent was obtained from all speakers to process, store and publish the collected sociodemographics, audio and video data. Compensation was offered in the form of trial participant credit.

Audio and Video Recordings
The recording sessions were conducted in a sound-attenuated speaker booth under constant lighting conditions. The experimental setup was a 2 × 2 (physical obstruction × sound background) factorial setting, aimed at simulating adverse speaking conditions. Physical obstruction was either a face mask or no face mask, and the sound background was either quiet ("silence") or babble noise played back over circumaural, acoustically closed Beyerdynamic DT 1770 Pro headphones. Headphones were worn by the speakers throughout the individual sessions to enable communication with the experimenter seated outside the booth. This resulted in the following four recording conditions: nomask_sil, mask_sil, nomask_noise, mask_noise. The face mask used was an unvalved class 2 filtering facepiece (FFP2), type 3M 9320+. This type was chosen because it is certified, unrestrictedly available and frequently used in the field of occupational safety. The transmission properties of the mask are shown in Figure 5, evidencing a dampening effect in the 2-8 kHz frequency band. Audio was recorded in stereo at a sampling rate of 48 kHz in Audacity, combining inputs from two separate channels: the microphone of the speaker inside the booth and that of the experimenter outside the booth. The latter channel was utilized for documentation during post-processing and was subsequently split from the main channel, resulting in mono audio. The recording microphone was a Sennheiser MD421-II cardioid studio microphone, positioned at a 15 cm distance and 45° angle from the speakers' mouths. The bass roll-off filter was activated at its lowest setting (M + 1) to reduce potential proximity effects. The nomask_noise condition was used to adjust the microphone gain level at the beginning of each recording session; see Table 3. This adjustment was necessary to avoid clipping artifacts due to the varying speech volume of the speakers.
Videos were captured by means of a Razer Kiyo Pro HD camera, mounted on a monitor and positioned at a distance of 80 cm in front of the subjects, using the inbuilt Windows 10 camera application. Speakers were instructed to look into the camera while producing the sentences. Audio recordings were exported in .wav format and encoded as signed 24-bit PCM. Prior to exporting the audio files, the communication stream with the experimenter was separated from the main stream. Video recordings were exported in .mp4 format in full HD.
Table 3. The recording settings and condition orders for individual speakers. L_x is the gain level in dB and x the corresponding linear factor, with L_0 = 0 dB and x_0 = 1 as reference values. Example entry: L_x = 1 dB, x = 1.122, condition order m_n → nm_s → nm_n → m_s. Abbreviations: nm_s = no mask/silence, m_s = mask/silence, m_n = mask/noise, nm_n = no mask/noise.
The sound pressure level (SPL) of the produced speech was tracked with an NTi Audio XL2 sound level meter, positioned next to the microphone. All audio recording and playback devices were routed via an RME Fireface UCX-II audio interface. Babble noise was looped and mixed into the same audio channel used to communicate with the experimenter but was only audible to the speakers, who had to selectively focus their attention on the experimenter's voice when receiving instructions. The TotalMix FX software (version 1.95, RME Audio, Haimhausen, Germany) was used to control recording settings. The experimental setup is depicted in Figure 6.
The noise used during two of the recording conditions consisted of mixed-gender, six-talker babble, which was created by superimposing concatenated, read sentences of six individual speakers from the KGGS corpus [64]. The content of these sentences was unrelated to that of the test material. Prior to superimposition, the chains of sentences were trimmed or padded with silence to all have the same length and were then normalized using the Linear Mapping filter in ArtemiS SUITE. This was performed to minimize the effect of single voices standing out and distracting the speaker during sentence production. The playback level for the babble noise was calibrated at ∼67.5 dB(A), using a HEAD acoustics HMS III artificial head. This level was deemed optimal to naturally trigger Lombard speech, while at the same time avoiding leakage during recordings.
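A simplified Python equivalent of this mixing step is sketched below; it pads the talker tracks to equal length and applies peak normalization, which stands in for the ArtemiS Linear Mapping filter. File names are placeholders.

```python
# Sketch: building six-talker babble by superimposing concatenated talker
# tracks. Peak normalization stands in for the ArtemiS Linear Mapping
# filter; file names are placeholders.
import numpy as np
import soundfile as sf

def make_babble(talker_paths, out_path="babble.wav"):
    tracks, rates = zip(*(sf.read(p) for p in talker_paths))
    fs = rates[0]  # assumes all (mono) tracks share one sampling rate
    n = max(len(t) for t in tracks)
    # Pad shorter tracks with trailing silence so all have the same length.
    padded = [np.pad(t, (0, n - len(t))) for t in tracks]
    mix = np.sum(padded, axis=0)
    mix /= np.max(np.abs(mix))  # simple peak normalization
    sf.write(out_path, mix, fs)

make_babble([f"talker{i}.wav" for i in range(1, 7)])
```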
Speakers were not given any instructions regarding articulation speed or speech style in order to preserve naturalness and maintain interspeaker variability. The aim was to accurately record any changes in the speech signal caused by the face mask's physical obstruction and exposure to noise, without biasing speakers towards any particular adaptation strategy. Condition order was randomized for each speaker to account for order and carryover effects; see Table 3. To ensure ecological validity while maintaining a controlled laboratory environment, sentences were not read out. Instead, speakers were cued via slides, duplicated on a screen inside the booth, which showed the last three words of each sentence in their uninflected form underneath a question meant to trigger the full sentence, for instance, "What does Timo own?" ("9, pink, flag"). In the context of a memory task, this was implemented to minimize potential recall boosts produced through read speech, which is characterized by reduced speech rate and clearer articulation [65]. To avoid hesitations, speakers were asked to first mentally form the sentences and then speak them out loud. The slides were controlled by the experimenter, who monitored the correctness of spoken sentences and the speakers' eye contact with the camera, asking participants to repeat sentences in case of errors.

Post-Processing
During post-processing, it became evident that the videos had been recorded with a variable frame rate. They were therefore resampled to a fixed frame rate of 30 fps for further processing, using Kdenlive (version 23.08.04). Given that audio and video recordings did not start simultaneously, the offset was first manually detected to synchronize the two modalities. The audio recordings underwent visual and auditory inspection for each speaker and condition to localize the best utterances, which were then manually segmented using tier boundaries within the Praat software (version 6.4.14) [66]. The start and end times of each sentence were extracted automatically to enable cutting. Utterances were deemed optimal if they lacked any disfluencies or slips of the tongue and exhibited predominantly neutral intonation. Additionally, instances of blinking and averted gazes were taken into account during the selection process.
Audio and video cutting was performed with an automated Matlab (version R2022b, MathWorks, Natick, MA, USA) script. Using the determined start and end times and the offset, individual sentences for each speaker and condition were extracted from the audio and video files. The original video audio was removed and replaced with the high-quality audio recording of the MD421-II microphone. Individual sentences were exported as single .wav and combined audio-video .mkv files. The wave file bit depth was reduced to 16 bit for further processing within the Bavarian Archive for Speech Signals (BAS) Web Service framework [67]. The video files were exported using ffmpeg (version 4.4.2) within Matlab, encoded with the H.264 codec.
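The following Python sketch mirrors this cutting step with ffmpeg directly: it extracts one sentence from the session video, drops the original camera audio, and muxes in the corresponding segment of the high-quality microphone recording. File names, the offset handling and codec choices are illustrative assumptions, not the exact Matlab pipeline.

```python
# Sketch: cutting one sentence and muxing in the high-quality microphone
# audio via ffmpeg. File names, offset handling and codecs are illustrative
# assumptions, not the exact Matlab/ffmpeg pipeline used for BELMASK.
import subprocess

def cut_sentence(video_in, audio_in, start, end, offset, out_mkv):
    """start/end: sentence boundaries from the Praat tiers (seconds);
    offset: audio-video offset determined during synchronization."""
    subprocess.run([
        "ffmpeg",
        "-ss", str(start + offset), "-i", video_in,  # seek into the video
        "-ss", str(start), "-i", audio_in,           # seek into the audio
        "-t", str(end - start),                      # sentence duration
        "-map", "0:v:0", "-map", "1:a:0",  # video from cam, audio from mic
        "-c:v", "libx264", "-c:a", "pcm_s16le",
        out_mkv,
    ], check=True)

cut_sentence("VP01_session.mp4", "VP01_session.wav",
             12.34, 15.01, 0.42, "VP01_mask_noise_s01.mkv")
```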
The high-quality audio recordings of each sentence, together with their corresponding orthographic transcriptions, were uploaded to the BAS Web Services within the Matlab script, using the provided RESTful API to enable automatic segmentation and annotation via the G2P → MAUS → PHO2SYL pipeline. Given that the orthographic transcriptions were readily available as .txt files, and for data protection reasons, we opted for the pipeline without ASR. The output format was specified as 'Praat (TextGrid)' and the output encoding as X-SAMPA (ASCII), with all other options set to their default settings. The 'G2P' module converts the orthographic input transcript into a phonemic one, the 'MAUS' service segments the input into words and phones, and the 'PHO2SYL' service generates phonetic syllable segments based on the phonetic segmentation.
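A minimal sketch of such a pipeline request is shown below, using Python's requests instead of Matlab; the endpoint URL and parameter names follow the publicly documented BAS WebServices interface and should be verified against the current documentation.

```python
# Sketch: one runPipeline request to the BAS WebServices. The endpoint and
# parameter names follow the publicly documented interface and should be
# verified against the current BAS documentation before use.
import requests

URL = ("https://clarin.phonetik.uni-muenchen.de/"
       "BASWebServices/services/runPipeline")

def annotate_sentence(wav_path, txt_path):
    with open(wav_path, "rb") as signal, open(txt_path, "rb") as text:
        response = requests.post(
            URL,
            files={"SIGNAL": signal, "TEXT": text},
            data={
                "PIPE": "G2P_MAUS_PHO2SYL",  # pipeline without ASR
                "LANGUAGE": "deu-DE",
                "OUTFORMAT": "TextGrid",
            },
        )
    response.raise_for_status()
    # The XML response contains a download link to the resulting TextGrid.
    return response.text

print(annotate_sentence("VP01_s01.wav", "VP01_s01.txt"))
```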
The resulting .TextGrid file has five annotation layers; see Figure 7. The ORT-MAU layer contains the sentence in its orthographic form, segmented and tokenized into single words. Non-vocalized segments and pauses are denoted as <p:>. The KAN-MAU layer is the phonemic, i.e., canonical, transcription of the same chain, with the KAS-MAU layer additionally showing syllable boundaries, denoted by a dot. In contrast, the bottom two layers, MAU and MAS, contain the phonetic transcript, which deviates from the standard, canonical transcription and mirrors what the speakers actually said, e.g., fYmf instead of fYnf. The MAU layer contains the individual phoneme segmentation, while the MAS layer contains the syllabified chain. The Munich Automatic Segmentation System (MAUS) algorithm relies on Viterbi alignment, i.e., the process of aligning speech features to the most probable sequence of states, using a set of continuous Hidden Markov Models (HMMs) that take acoustic probabilities into account [68]. Due to this probabilistic framework, some boundaries and annotations may have to be adjusted manually if deviations from the norm are not automatically detected.
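For downstream analyses, the tiers can be read with any TextGrid parser; below is a sketch using the TextGridTools (tgt) package, with a placeholder file name.

```python
# Sketch: reading the phonetic (MAU) tier from a BELMASK TextGrid with the
# TextGridTools (tgt) package. The file name is a placeholder.
import tgt

tg = tgt.read_textgrid("VP01_mask_noise_s01.TextGrid")
mau = tg.get_tier_by_name("MAU")  # phone-level segmentation

for interval in mau.intervals:
    if interval.text != "<p:>":  # skip pauses / non-vocalized segments
        print(f"{interval.start_time:7.3f} {interval.end_time:7.3f} "
              f"{interval.text}")
```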

Linear Mapping
The audio recordings are provided in their raw format and have not been normalized. This maintains the information on absolute and relative level differences between individual speakers and conditions. To recover actual, absolute level values, which is important for the computation of psychoacoustic measures and for determining the range of the Lombard effect, linear mapping has to be applied to the wave files after reading, corresponding to the actual SPL measured. The average measured SPL for the nomask_sil condition of VP02 was 72.6 dB(A). This value includes pauses between sentences. The measured SPL for single sentences without pauses was 79 dB(A) on average. This latter value can be used as a reference to calculate the mapping factor when reading the provided single-sentence audio files. Given that the microphone gain levels were adjusted for every speaker, the gain factors provided in Table 3 have to be considered as well to accurately represent the resulting SPL. They therefore have to be multiplied with the derived linear mapping factors. The relative level differences between sentences and conditions remain intact.
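A minimal sketch of this calibration, assuming mono wave files read with soundfile and ignoring A-weighting for simplicity, is given below; the example gain factor x = 1.122 is the entry shown in Table 3 and must be replaced by the factor of the respective speaker.

```python
# Sketch: restoring absolute level information from the raw recordings.
# Assumes mono files and ignores A-weighting for simplicity; the gain
# factor x = 1.122 is the example entry from Table 3 and must be replaced
# by the factor of the respective speaker.
import numpy as np
import soundfile as sf

P_REF = 20e-6  # reference sound pressure, 20 µPa

def mapping_factor(ref_path, ref_spl_db=79.0):
    # Linear factor mapping digital samples of the reference single-sentence
    # file onto sound pressure in Pa, so that its RMS matches ref_spl_db.
    samples, _ = sf.read(ref_path)
    rms = np.sqrt(np.mean(samples ** 2))
    target_pa = P_REF * 10 ** (ref_spl_db / 20)
    return target_pa / rms

def read_calibrated(path, factor, gain_factor_x=1.122):
    # Multiply the mapping factor with the per-speaker gain factor from
    # Table 3, as described above, to obtain calibrated sound pressure.
    samples, fs = sf.read(path)
    return samples * factor * gain_factor_x, fs
```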

Corrections and Errors
The documentation process revealed that the test material used in the recording sessions included the verb 'erzählt' ('told') and the noun 'Fische' ('fish') on two occasions each. The verb in sentence s40 has therefore been corrected to 'erwähnt' ('mentioned') and the object in sentence s29 to 'Tische' ('tables') in the provided BELMASK matrix sentences; see Appendix A. The video and audio recordings, however, contain the original duplicates. The phonemic distribution in Figure 2 takes these corrections into account. Slips of the tongue during the recordings are summarized in Table 4. Notably, these errors were made almost exclusively during the noisy conditions and consist mostly of misplacements (speakers uttered words they remembered from previous sentences) or arithmetic errors.

File Structure
The following file structure tree demonstrates how the provided files are organized, exemplified for speaker VP01. The complete dataset requires 18 GB of storage.

Conclusions
In this article, we present the Berlin Dataset of Lombard and Masked Speech (BELMASK), a comprehensive dataset of speech produced in adverse conditions. The dataset contains audiovisual recordings of ten German native speakers uttering matrix sentences with and without an FFP2 face mask, both in a quiet setting and while being subjected to noise over headphones, triggering the Lombard effect. The article outlines the test material development, as well as the data collection and post-processing.
In contrast to previous datasets, the main advantage of the BELMASK dataset is that it contains recordings of phonetically controlled and standardized test material, which has been optimized in terms of lexical frequency and predictability. To the authors' best knowledge, it is also the first audiovisual dataset of face-masked and Lombard speech for the German language. The speech tasks were neither guided nor read, but cued, which allows for fairly natural recordings while maintaining the controlled, high-quality setting of a laboratory environment. Given the multimodal nature of the dataset, it is furthermore possible to explore the role of visual versus auditory cues and potential interactions. Lastly, through the provision of linear mapping and gain factors, both absolute and relative information about level differences has been preserved across conditions and speakers. These considerations are important when administering cognitive tasks and deriving psychoacoustic metrics. A limitation of the BELMASK dataset is the relatively small sample of speakers, which may restrict its applicability for certain classification tasks that require large datasets. However, the dataset has been fully annotated and several data formats are provided, which allows for its out-of-the-box use.
The dataset aims to facilitate research in the field of auditory cognition by contributing to a deeper understanding of how cognitive processes are affected by adverse speaking conditions. It furthermore enables the training and evaluation of speech processing models under realistic and varied conditions, and it provides data to enhance the robustness of ASR systems, to improve speaker identification and verification accuracy, and to refine assistive technologies for the hearing impaired. Additionally, it can be used as a resource to broaden audiovisual research, e.g., in the fields of computer vision and simulation or biometric authentication. By providing this dataset, we aim to extend auditory cognitive research and support the development of more resilient speech-processing technologies that can adapt to the ongoing and future needs of masked communication in diverse settings.

Figure 3. Relationship between the pseudo-log-likelihood (PLL) scores of the BELMASK matrix sentences and sentence length (number of tokens), including correlation analysis (Pearson's r = −0.77). The shaded area of the regression line corresponds to the 95% confidence interval. Each dot represents a sentence. The red dot represents the highly predictable reference sentence "The rocket flies into space", not contained in the BELMASK set.

Figure 4. Relationship between the pseudo-log-likelihood (PLL) scores of the BELMASK matrix words and DWDS word log frequency, including correlation analysis (Pearson's r = 0.56). The shaded area of the regression line corresponds to the 95% confidence interval. Each dot represents a unique word.

Figure 5. Frequency response of the face mask used during the recordings, with subsequent 1/12 octave band smoothing, measured reciprocally using a 3D-printed head [63].

Figure 6. Experimental setup of the recording sessions. The display of keywords on the screen in the speaker booth is not depicted.

Figure 7. Example of the annotation layers in the Praat TextGrid object resulting from the G2P → MAUS → PHO2SYL pipeline.

Table 1. Summary of the BELMASK dataset.

Table 4. Utterance errors contained in the recorded audio and video files.