Test-retest evaluation of a notched-noise test using consumer-grade mobile audio equipment

Abstract
Objective: The aim of this study was to investigate whether consumer-grade mobile audio equipment can be reliably used as a platform for the notched-noise test, including when the test is conducted outside the laboratory.
Design: Two studies were conducted. Study 1 was a notched-noise masking experiment with three different setups: in a psychoacoustic test booth with a standard laboratory PC; in a psychoacoustic test booth with a mobile device; and in a quiet office room with a mobile device. Study 2 employed the same task as Study 1, but compared circumaural headphones to insert earphones.
Study sample: Nine and ten young, normal-hearing participants completed Studies 1 and 2, respectively.
Results: The test-retest accuracy of the notched-noise test on the mobile implementation did not differ from that for the laboratory setup. A possible effect of the earphone design was identified in Study 1, which was corroborated by Study 2, where test-retest variability was smallest when comparing results from experiments conducted using identical acoustic transducers.
Conclusions: Results and test-retest repeatability comparable to standard laboratory settings for the notched-noise test can be obtained with mobile equipment outside the laboratory.


Introduction
Within the field of psychoacoustics, many experimental procedures are very time-consuming and some require training to familiarise the participants with the task and stabilise their performance before the data collection. Personal mobile devices show great potential for aiding psychoacoustic experiments, since implementing experiments as standalone mobile applications can allow participants to familiarise themselves with the task (e.g. at home) in advance. Mobile applications can also enable collection of longitudinal data with repeated measures without the need to visit the research site at each timepoint. Thus, personal mobile devices offer an appealing alternative to laboratory experiments (Brungart et al. 2018; Gallun et al. 2018; Swanepoel et al. 2019; van Zyl, Swanepoel, and Myburgh 2018). However, it is not known how well current psychoacoustic methods translate to mobile equipment and acoustically less controlled environments. So far, attention has been mainly directed at automated mobile versions of traditional audiometry, where the calibration of presentation levels and the level of environmental noise play a critical role. Despite suboptimal acoustic conditions and consumer-grade audio equipment at home, direct comparisons between audiometry conducted at a clinic by professional personnel and self-administered audiometry have shown no statistically significant differences in threshold estimates (Derin et al. 2016; Masalski and Kręcicki 2013; Whitton et al. 2016), although care must be taken when setting up the equipment (Barczik and Serpanos 2018; Corry, Sanders, and Searchfield 2017). Additionally, negative effects of background noise can be addressed to some extent with advanced signal processing approaches, such as active noise cancellation (Bromwich et al. 2008; Clark et al. 2017).
Greater tolerance of background noise can be expected for suprathreshold auditory tests, such as the notched-noise test (Glasberg and Moore 1990; Patterson 1976; Weber 1977), due to higher signal presentation levels (and thus higher signal-to-background-noise ratios). The notched-noise test determines detection thresholds for a pure-tone signal presented simultaneously with a broadband noise masker. The masker has a spectral notch centred at the signal frequency, and varying the spectral distance from the signal frequency to the lower and upper notch edges leads to changes in the detection thresholds. The changes of the thresholds can be explained by the concept of auditory filters, i.e. the frequency selectivity of the auditory system. Specifically, the power-spectrum model of masking predicts that the farther away the masker edges are from the signal, the less masker energy falls within the auditory filter centred at the signal's frequency, thus reducing the signal threshold. Expressing the threshold as a function of notch width results in a curve from which the shape of the underlying auditory filter can be derived (Glasberg and Moore 1990). Degraded frequency selectivity of the auditory system has been found to be related to difficulties in understanding speech in a noisy background, even when the auditory thresholds are normal (Badri, Siegel, and Wright 2011; King and Stephens 1992; Strelcyk and Dau 2009), and therefore information about auditory filter shapes can complement standard audiological measures.
The notched-noise test can be time-consuming, and due to the limited time available during a visit to an audiology clinic, additional auditory tests beyond the audiogram are often not feasible (Stone, Glasberg, and Moore 1992). It has been possible to shorten the time needed for conducting a frequency selectivity test (Leeuw and Dreschler 1994; Schlittenlacher, Turner, and Moore 2020; Sek et al. 2005; Shen, Kern, and Richards 2019; Shen, Sivakumar, and Richards 2014; Shen and Richards 2013), but the cost of including more experimental tests in the clinical assessment may still be too high. Automated mobile tests, which can be completed independently by the patient at home, could therefore be used for additional audiological testing, providing the clinician with further information on the patient's hearing abilities.
The main focus of the current work was to determine whether there are any differences in the outcomes of a notched-noise test between a standard laboratory setup and a mobile platform. The same experimental procedure was conducted in a listening booth with laboratory equipment and a mobile phone to investigate possible hardware-related issues. In addition, the mobile device was tested in a regular office setting, in order to identify possible problems related to the various acoustic and non-acoustic effects of less acoustically controlled surroundings. After identifying a possible influence of earphone type on the results, a follow-up study was conducted where circumaural headphones were compared to insert earphones using the standard laboratory setup.

Study design
Study 1 compared the results from a standard laboratory setup to those obtained using a mobile phone. Study 2 compared the results from circumaural headphones to those from insert earphones, both collected using the standard laboratory setup.
All participants provided informed consent and all experiments were approved by the Science-Ethics Committee for the Capital Region of Denmark (reference H-16036391).

Hardware platforms
The two hardware platforms compared in the current study were a standard personal computer equipped for psychoacoustic research (referred to as PC), and an Apple iPhone 8 mobile phone (Phone).
On the PC, stimuli were generated with Matlab, D/A-converted by an external Fireface UCX soundcard, and amplified by a Sound Performance Lab Phonitor mini headphone amplifier. In Study 1, sounds were presented via Sennheiser HD-650 circumaural headphones. In Study 2, condition HD650 used Sennheiser HD-650 headphones, and condition ER2 used Etymotic ER-2 insert earphones.
On the Phone, stimuli were generated with the iOS vDSP library, included in the Accelerate framework, and presented via Apple EarPod earphones connected to the phone's Lightning port. The mode of the AVAudioSession object used for playback of the stimuli was set to measurement in order to avoid any sound processing by the operating system.
On both systems, a 44.1-kHz sampling rate was used. The frequency responses of the two systems were recorded using a B&K head and torso simulator (HATS, type 4128-C) with an artificial pinna and ear canal (DZ-9769) and a 2-cc coupler. The frequency response of each system was compensated by an inverse filter applied to the digital signal prior to the presentation. Spectra of example notched-noise stimuli, presented through the two systems after compensation, are shown in Figure 1.
The main interest of Study 1 was to evaluate whether the hardware platform or the environment had any effect on the results of a notched-noise experiment. Therefore, the following three conditions were included in the study: 1) PC in booth, 2) Phone in booth, and 3) Phone in room, where PC and Phone refer to the hardware platforms (Section 2.2), and booth and room refer to a double-walled acoustically treated listening booth, and a quiet office room, respectively. The level of background noise, expressed as the A-weighted equivalent continuous sound pressure level $L_{\mathrm{Aeq,10s}}$, was 17.9 dB SPL in booth and 24.7 dB SPL in room. The Phone in room condition was repeated on another day to get an estimate of the test-retest variability for the same equipment and environment, and, prior to the actual experiment, all participants completed one PC in booth training block. Thus, in total all participants completed one training block and four experimental blocks.

Study 2
Ten young, normal-hearing participants (5 male, 5 female) took part in Study 2. With the exception of one participant, none of the participants had taken part in Study 1.
The main goal of Study 2 was to clarify whether some results of Study 1 could have been caused by the differences in the acoustic coupling of the earphones used in the PC and Phone conditions.Two conditions were included in Study 2: 1) PC with circumaural headphones, and 2) PC with insert earphones.We refer to these conditions as HD650 and ER2, respectively.The equipment is described in more detail in Section 2.2.Both conditions were repeated on a second day, in order to get an estimate of the test-retest variability for both earphone types for comparison with the results of Study 1. Prior to the actual experiment, all participants completed one training block with circumaural headphones.Thus, all participants completed one training block and four experimental blocks.

Notched-noise test
Detection thresholds for a 2-kHz tone, presented simultaneously with a broadband noise masker, were determined using a two-alternative forced-choice (2-AFC) paradigm. The masker was a white noise band with a frequency range from 0.1 to 10 kHz, with a symmetric notch around the signal frequency $f_S = 2$ kHz. The masker spectrum had a notch between $f_S - \Delta f$ and $f_S + \Delta f$. Following common practice in notched-noise experiments, the notch width is expressed as a value normalised to the signal frequency: $\Delta = \Delta f / f_S$. In the current study, the total notch width corresponds to $2\Delta$, since the notch extends by $\Delta f$ both upwards and downwards in frequency relative to the signal frequency (Glasberg and Moore 1990; Moore and Glasberg 1987; Patterson 1976; Rosen and Baker 1994; Weber 1977). The spectral level of the masker was held constant at 30 dB SPL/Hz. The noise was generated via an inverse Fourier transform, where the frequency components had equal amplitude and uniformly distributed random phase for the frequencies in the two passbands, and zero amplitude for the frequencies within the notch and outside the masker frequency range.
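The masker construction described above can be illustrated with a minimal NumPy sketch. It is not the study's Matlab or vDSP code: the function name `notched_noise`, its parameters, and the unit-RMS normalisation are assumptions made for illustration, and calibration to 30 dB SPL/Hz (which depends on the playback chain) is left out.

```python
import numpy as np


def notched_noise(fs=44100, dur=0.3, f_sig=2000.0, delta=0.2,
                  f_lo=100.0, f_hi=10000.0, rng=None):
    """Return a notched-noise masker built in the frequency domain.

    delta is the normalised half-width of the notch (delta_f / f_sig).
    The output is scaled to unit RMS; absolute level calibration is
    hardware dependent and deliberately not attempted here.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(round(fs * dur))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    delta_f = delta * f_sig

    # Two passbands on either side of the notch, limited to the masker range.
    lower_band = (freqs >= f_lo) & (freqs <= f_sig - delta_f)
    upper_band = (freqs >= f_sig + delta_f) & (freqs <= f_hi)
    passband = lower_band | upper_band

    # Equal amplitude and uniformly distributed random phase in the
    # passbands, zero amplitude everywhere else.
    spectrum = np.zeros(freqs.size, dtype=complex)
    phases = rng.uniform(0.0, 2.0 * np.pi, int(passband.sum()))
    spectrum[passband] = np.exp(1j * phases)

    noise = np.fft.irfft(spectrum, n=n)
    return noise / np.sqrt(np.mean(noise ** 2))


masker = notched_noise(delta=0.2, rng=np.random.default_rng(0))
```

Building the noise in the frequency domain gives a notch with no masker energy at all between the edges, which a time-domain filter of finite order could only approximate.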
The stimuli consisted of either the masker alone (non-signal interval) or the masker and the signal simultaneously (signal interval), in random order. The task of the 2-AFC trial was to indicate the signal interval. Visual feedback (correct/incorrect) was provided after each trial. The length of the stimuli was 300 ms, and the inter-stimulus interval was 400 ms. Sounds were presented monaurally to the left ear for most participants. For two participants in Study 2 the right ear was chosen instead, due to the presence of earwax or discomfort caused by the insert earphone in the left ear.

Grid tracking method
Traditionally, detection thresholds are determined based on individual experimental runs for each masker notch width of interest, using for example a transformed up-down method (Levitt 1971) with a fixed notch width, and varying only the target level. However, in this approach, each experimental run usually starts with the signal level well above the detection threshold. When this procedure is repeated for many notch widths, a considerable proportion of trials are conducted using signal levels far away from the threshold.
To shorten the time needed for estimating detection thresholds at multiple notch widths, the grid method of Fereczkowski (2015) was used, which aims to increase the proportion of trials with signal levels close to the threshold. The grid method estimates the full threshold curve during one experimental run by alternating between adjusting the notch width $\Delta$ and the signal level $L_{\mathrm{sig}}$ (Figure 2). When $L_{\mathrm{sig}}$ is adjusted, $\Delta$ is kept fixed until the detection threshold $L_{\mathrm{sig}}^{\mathrm{th}}(\Delta)$ has been determined. Then, $L_{\mathrm{sig}}$ is fixed at the threshold value and $\Delta$ is adjusted until the detection threshold for $\Delta$ at that $L_{\mathrm{sig}}$ is found. This process is continued until a predefined maximum notch width or minimum presentation level is reached.
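One simple way to picture this alternation is a track in which three consecutive correct responses lower the signal level and a single incorrect response widens the notch. The sketch below simulates such a track. It is an illustration under stated assumptions, not Fereczkowski's implementation: the simulated listener (`masked_threshold_db`, `p_correct`), the starting level, and the notch-width step of 0.05 are all hypothetical.

```python
import math
import random


def masked_threshold_db(delta):
    """Hypothetical listener: threshold falls linearly with notch width."""
    return max(20.0, 50.0 - 100.0 * delta)


def p_correct(level_db, delta, spread_db=2.0):
    """2-AFC psychometric function, from chance (0.5) up to 1.0."""
    x = (level_db - masked_threshold_db(delta)) / spread_db
    return 0.5 + 0.5 / (1.0 + math.exp(-x))


def grid_track(rng, start_level_db=55.0, level_step_db=3.0, delta_step=0.05,
               max_delta=0.5, min_level_db=30.0):
    """Simulate one grid track and return the visited (delta, level) points."""
    level, delta = start_level_db, 0.0
    correct_in_row = 0
    track = [(delta, level)]
    # Stop at the maximum notch width or the minimum presentation level.
    while delta < max_delta and level > min_level_db:
        if rng.random() < p_correct(level, delta):
            correct_in_row += 1
            if correct_in_row == 3:          # three correct: move "down"
                correct_in_row = 0
                level -= level_step_db
                track.append((delta, level))
        else:                                # one incorrect: move "right"
            correct_in_row = 0
            delta = round(delta + delta_step, 10)
            track.append((delta, level))
    return track


track = grid_track(random.Random(0))
```

Because every move is either down or to the right, the track hugs the threshold curve instead of restarting well above threshold for each notch width.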
Just as in a transformed up-down method, different tracking rules can be implemented, such as 3-down-1-right ("right" referring to increasing notch width $\Delta$, see Figure 2), which corresponds to the traditional 3-down-1-up rule, and the choice of parameters determines the point along the psychometric function (e.g. the 79.4% detection threshold for a 3-down-1-up track).
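The 79.4% figure follows from the equilibrium condition of the tracking rule (Levitt 1971): the track converges to the point at which a downward step, which requires three consecutive correct responses, is as likely as an upward step. As a short worked derivation, writing $P_c$ (a symbol introduced here) for the probability of a correct response at the convergence point:

```latex
P_c^{3} = 1 - P_c^{3}
\quad\Longrightarrow\quad
P_c = 2^{-1/3} \approx 0.794
```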
In the current study, the threshold at zero notch width was first determined with a 3-down-1-up staircase procedure, with four reversals using a 6-dB step size, followed by six reversals using a 3-dB step size. The threshold at zero notch width was calculated as the average signal level at the last six reversals. This threshold is referred to as the tone-in-noise-threshold. Then, the run continued with a 3-down-1-right grid procedure from the sixth reversal of the 3-down-1-up track. The run was terminated when either the maximum notch width of 0.5 or the minimum signal level of 30 dB SPL was reached. The minimum level was chosen to further shorten the time needed for testing, allowing more conditions to be investigated.
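The staircase phase at zero notch width can be sketched as follows, using the reversal counts and step sizes given above. The simulated listener (a logistic 2-AFC psychometric function with a hypothetical 50-dB midpoint and 2-dB spread) and the 70-dB starting level are assumptions for illustration only.

```python
import math
import random


def p_correct(level_db, midpoint_db=50.0, spread_db=2.0):
    """Hypothetical 2-AFC listener: chance (0.5) at low levels, 1.0 at high."""
    x = (level_db - midpoint_db) / spread_db
    return 0.5 + 0.5 / (1.0 + math.exp(-x))


def staircase_threshold(rng, start_db=70.0, max_trials=5000):
    """3-down-1-up track: 4 reversals at 6 dB, then 6 reversals at 3 dB.

    Returns (threshold, reversal_levels), where the threshold is the mean
    level at the last six reversals.
    """
    level = start_db
    correct_in_row = 0
    direction = 0            # -1 = last move down, +1 = last move up
    reversals = []
    for _ in range(max_trials):
        if len(reversals) == 10:
            break
        if rng.random() < p_correct(level):
            correct_in_row += 1
            if correct_in_row < 3:
                continue     # no level change until three correct in a row
            correct_in_row = 0
            move = -1
        else:
            correct_in_row = 0
            move = +1
        if direction != 0 and move != direction:
            reversals.append(level)
        direction = move
        step = 6.0 if len(reversals) < 4 else 3.0
        level += move * step
    if len(reversals) < 10:
        raise RuntimeError("track did not reach ten reversals")
    return sum(reversals[-6:]) / 6.0, reversals


threshold, reversals = staircase_threshold(random.Random(1))
```

In the study, the grid procedure then continues from the end of this track rather than restarting, so no trials are spent descending from a high starting level again.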
A single experimental block consisted of three repeated runs. A two-parameter rounded-exponential roex(p, r) filter model (Patterson et al. 1982) was fitted to the thresholds obtained in each run, and the resulting p and r parameters were averaged over each experimental block. The roex(p, r) filter shape is of the form:
$$W(g) = (1 - r)(1 + pg)\,e^{-pg} + r,$$
where g is the deviation from filter centre frequency divided by the centre frequency.
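To make the link between the filter parameters and the measured thresholds concrete, the sketch below evaluates the standard roex(p, r) weighting function and the threshold that the power-spectrum model predicts for a symmetric notch. It is a hedged illustration, not the fitting code used in the study: the function names, the integration limit `g_max = 0.8`, and the example parameter values are assumptions, and the ERB formula $4 f_c / p$ is the usual approximation that neglects r.

```python
import math


def roex_weight(g, p, r):
    """roex(p, r) weighting at normalised frequency deviation g >= 0."""
    return (1.0 - r) * (1.0 + p * g) * math.exp(-p * g) + r


def roex_integral(g_from, g_to, p, r):
    """Closed-form integral of roex_weight between two deviations."""
    def antiderivative(g):
        return -(1.0 - r) * (2.0 + p * g) * math.exp(-p * g) / p + r * g
    return antiderivative(g_to) - antiderivative(g_from)


def predicted_threshold_db(delta, p, r, k_db, n0_db=30.0, fc=2000.0,
                           g_max=0.8):
    """Power-spectrum model threshold for a symmetric notch of width delta.

    k_db is the detection-efficiency constant and n0_db the masker
    spectrum level. g_max is an assumed upper integration limit.
    """
    # Masker power passed by a symmetric filter, relative to the spectrum
    # level: both sides of the notch contribute equally, hence the factor 2.
    passed = 2.0 * fc * roex_integral(delta, g_max, p, r)
    return k_db + n0_db + 10.0 * math.log10(passed)


def erb_hz(p, fc=2000.0):
    """Approximate equivalent rectangular bandwidth, ignoring r."""
    return 4.0 * fc / p


# Example with illustrative (not fitted) parameter values.
curve = [predicted_threshold_db(d, p=25.0, r=1e-4, k_db=-3.0)
         for d in (0.0, 0.1, 0.2, 0.3, 0.4)]
```

A larger p gives a sharper filter and therefore a steeper drop in predicted threshold with increasing notch width, while r sets the floor that the curve approaches at wide notches.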

Study 1
Figure 3 shows the threshold curves for each participant. By visual inspection, it is clear that the curves are very similar in all four cases. For comparison, each panel shows the data from Rosen and Baker (1994) and Weber (1977), for the same masker type, i.e. symmetric notch and 30 dB SPL/Hz spectral level. Up to notch widths of about 0.2 the slopes of the curves collected here seem similar to those from the literature, and in particular from Rosen and Baker (1994), as will be shown later.
The thresholds obtained here for $\Delta = 0$ are slightly below those from previous studies, and at larger notch widths the curves from the current study appear to flatten out because the lower limit for the signal level was 30 dB SPL. While the reason for the differences in threshold curve shapes between the current study and earlier results is not clear, it is possible that feedback given during the experiments played a role. Whereas Rosen and Baker (1994) did not report whether feedback was given, in our study feedback was given after each answer. This can have the effect of lowering detection thresholds at $\Delta = 0$, as found by Lukaszewski and Elliott (1962), who reported that auditory thresholds were on average 3.4 dB lower when feedback was given, compared to no feedback. Looking at the tone-in-noise thresholds in the current study, the difference to thresholds in Rosen and Baker (1994) is between −2.9 dB and −4.3 dB, for Phone in room retest and Phone in room conditions respectively, consistent with the finding of Lukaszewski and Elliott (1962). In a later study by Baker and Rosen (2006) with identical methodology, it was explicitly stated that feedback was used, but the masked thresholds at 2 kHz were very similar to the thresholds from the earlier study (Rosen and Baker 1994). Thus, it remains unclear whether providing feedback could explain the differences between the current study and earlier results. It should also be noted that both Weber (1977) and Rosen and Baker (1994) only had three participants in their studies. In the current study, a total of 18 participants took part in the listening experiments and in each condition the individual tone-in-noise thresholds were spread across approximately a 6-9 dB range. Expressed as standard deviations, the variability of threshold estimates for $\Delta = 0$ was between 1.9 dB (PC in booth) and 2.0 dB (Phone in room retest). This is in line with Moore (1987), who reported a standard deviation of 2.0 dB for a notch width of 0.0. Given that the individual thresholds span such a wide range, it is possible that the differences between our study and those of Weber (1977) and Rosen and Baker (1994) reflect individual differences.
To investigate the differences between platforms and environments more systematically, the four experimental cases were compared in a pairwise manner with a Bland-Altman plot (Bland and Altman 1986), which is a method for assessing the agreement between two measures. Figure 4 shows the pairwise Bland-Altman plots for the tone-in-noise-thresholds. The largest mean difference between two conditions was 1.4 dB between the Phone in room and the Phone in room retest. The 95% limits of agreement (±1.96 × standard deviation) indicate the estimated range within which 95% of the individual test-retest differences are expected to lie. The widest limits of agreement (±4.6 dB) were between the Phone in booth and the Phone in room conditions.
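For readers unfamiliar with the method, the quantities plotted in a Bland-Altman comparison can be computed as in the following minimal sketch. The function name and the example thresholds are hypothetical and are not data from the study.

```python
import statistics


def bland_altman(cond_a, cond_b):
    """Return (mean difference, lower limit, upper limit) for paired data.

    The 95% limits of agreement are the mean difference plus and minus
    1.96 times the sample standard deviation of the differences.
    """
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    mean_diff = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)
    return mean_diff, mean_diff - half_width, mean_diff + half_width


# Hypothetical tone-in-noise thresholds (dB SPL) for four listeners
# measured in two conditions.
mean_diff, lower, upper = bland_altman([50.0, 52.0, 49.0, 51.0],
                                       [49.0, 53.0, 50.0, 50.0])
```

A mean difference near zero indicates no systematic bias between the two conditions, while the width of the limits reflects how much individual listeners vary between them.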
There are no clear patterns in the data, and in fact the test-retest accuracy is within expected limits for all conditions, as was verified by a Monte-Carlo simulation (1000 rounds) of the tone-in-noise-threshold determination. Simulations were run using the same experimental parameters as in the current study and reference values for the estimators from Schlauch and Rose (1990). Assuming the threshold estimate to be a normally distributed variable $X \sim N(\mu, \sigma^2)$, the simulated estimate of the standard deviation of a single threshold estimate was found to be 2 dB. Using this estimate, the distribution of the tone-in-noise-thresholds was calculated as the average over three repeated runs, $\bar{X} = \tfrac{1}{3}(X_1 + X_2 + X_3)$, and so the variance of the averaged tone-in-noise-threshold is expected to be $\sigma^2/3 = \tfrac{4}{3}\ \mathrm{dB}^2$. For two conditions a and b, the difference in thresholds is expected to be also a normally distributed variable: $\bar{X}_a - \bar{X}_b \sim N\left(0, \tfrac{2}{3}\sigma^2\right) = N\left(0, \tfrac{8}{3}\ \mathrm{dB}^2\right)$. If a sample ($n = 9$, the number of participants in Study 1) is drawn from this distribution, the 95% confidence interval for the limits of agreement would be 2.2 to 6.1 dB. This follows from the fact that the limits of agreement correspond to $1.96\sigma$ of the difference, and the confidence interval for $\sigma$ can be derived from the $\chi^2$ distribution. Thus, it is expected that with the current experimental design, the observed spread is not limited by the equipment or the environment, as in all cases the limits of agreement are smaller than those suggested by the simulations; the largest interval for the 95% limits of agreement for tone-in-noise threshold test-retest variability is for Phone in booth vs Phone in room, where the limits spread ±4.6 dB from the average. This is within the aforementioned confidence interval for the estimate.

Figure 5. Pairwise Bland-Altman plots visualising test-retest repeatability for the p parameter of the roex filter in Study 1. The horizontal line shows the average difference in p, and the shaded area illustrates the 95% limits of agreement (mean ± 1.96 SD) between the two conditions.
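The interval of 2.2 to 6.1 dB quoted for the limits of agreement can be reproduced with a few lines of arithmetic. The sketch below assumes the 2-dB single-run standard deviation, three runs per block, and nine participants stated in the text; the chi-square quantiles for 8 degrees of freedom are hard-coded tabulated values, and the function name is an assumption.

```python
import math

# Tabulated chi-square quantiles for 8 degrees of freedom (n = 9).
# These must be replaced if the number of participants changes.
CHI2_8_Q025 = 2.1797
CHI2_8_Q975 = 17.5345


def loa_confidence_interval(sigma_single=2.0, n_runs=3, n_participants=9,
                            chi2_low=CHI2_8_Q025, chi2_high=CHI2_8_Q975):
    """Expected limits of agreement and their 95% confidence interval."""
    var_block = sigma_single ** 2 / n_runs     # variance of a 3-run average
    sd_diff = math.sqrt(2.0 * var_block)       # SD of a difference of two
    loa = 1.96 * sd_diff                       # expected half-width
    dof = n_participants - 1
    # Confidence interval for a standard deviation from the chi-square
    # distribution of the scaled sample variance.
    lower = loa * math.sqrt(dof / chi2_high)
    upper = loa * math.sqrt(dof / chi2_low)
    return loa, lower, upper


loa, lower, upper = loa_confidence_interval()
print(f"expected limits: +/-{loa:.1f} dB, 95% CI {lower:.1f} to {upper:.1f} dB")
```

With only nine participants the interval is wide, which is why observed limits as large as ±4.6 dB are still compatible with the procedure's intrinsic variability.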
The obtained thresholds were then used to fit a roex(p, r) auditory filter model (see Table 1 for model parameter mean values and Figure 5 for parameter p test-retest repeatability). Here, the results were more mixed. The absolute values of p were slightly lower than those predicted at 2 kHz by the equations in Glasberg and Moore (1990; $p_l = 32.1$, $p_u = 33.6$), but similar to the values from Rosen and Baker (1994; $p_l = 25.5$, $p_u = 26.7$). The p values are also close to those reported at 2 kHz by Moore (1987; $p = 26.3$), although that study used a masker spectrum level of 45 dB SPL, where the higher level is likely to result in wider auditory filters, and lower p values. The k parameter values are lower than in Rosen and Baker (1994; $k = -2.9$ dB) and Moore (1987; $k = -0.7$ dB), which also reflects the lower thresholds at $\Delta = 0$.
Overall, the test-retest repeatability of the p parameter is relatively good across all conditions. Shen, Kern, and Richards (2019) assessed the reliability of a Bayesian adaptive procedure for estimating roex model parameters, and found the 95% limits of agreement for p to be −0.4 ± 18.7, which is slightly wider than for PC in booth vs Phone in room retest (−2.5 ± 14.8) in the current study. However, the phone conditions show even better agreement with each other than with the PC in booth. The source of this discrepancy between PC and phone conditions is not clear from the data in Study 1. If the poorer agreement was due to differences in, for example, the frequency responses of the two systems, the test-retest differences should show a systematic error. Since the average difference is close to zero, it appears that the differences are driven by individual variability. One factor that could play a role is the difference in headphones. For the PC, the headphones were circumaural, whereas the EarPods are in-ear earbuds. Thus, although both systems were calibrated with the same HATS setup, it is plausible that individual differences in the coupling between headphones/earbud and the ear affected the spectral content and thereby the shape of the threshold curve. For example, changes in the ear canal resonance caused by the partial insertion of the EarPod might explain the observed differences. This hypothesis was tested in Study 2.

Study 2
Figure 6 shows the individual threshold curves for the two repeated conditions investigated in Study 2. The results were very similar to those of Study 1 and there were again no dramatic differences between conditions. Just as in Study 1, the thresholds for $\Delta = 0$ were lower than those found by Rosen and Baker (1994); the offset at zero notch width was −2.6 dB in the ER2 retest and −3.1 dB in the HD650 retest.
The Bland-Altman plots for the tone-in-noise thresholds in Figure 7 show that there were no systematic differences between different types of headphones, and that the individual variability was within the expected range for the experimental paradigm (see Section 3.1).
The roex model fits gave similar results to those found in Study 1 (Table 2); p values were similar to those from Rosen and Baker (1994) and lower than those from Glasberg and Moore (1990), and k values were slightly lower than in Rosen and Baker (1994) apart from ER2, where $k = -2.8$ was very close to Rosen and Baker's value of $k = -2.9$. Figure 8 shows the Bland-Altman plots for the roex p parameter. Comparing the results to those in Figure 5, the same patterns can be seen here: the test-retest variability is smallest when using insert earphones (ER2 vs ER2 retest; Phone in room vs Phone in room retest in Study 1), and greatest when comparing results from two different types of headphones (e.g. HD650 retest vs ER2 retest; PC in booth vs Phone in booth in Study 1). The within-participant test-retest variability indicates that it is indeed an individual effect that cancels out at a group level, since there is no systematic bias in the p parameter values. These findings provide some support for the hypothesis that the acoustic characteristics of the coupling between headphone/earbud and the ear can have an effect on the thresholds. However, the test-retest repeatability is also very good for the circumaural headphones (HD650 vs HD650 retest). It therefore does not seem that insert earphones provide considerably more accurate results than circumaural headphones; rather, when comparing notched-noise test results obtained at different timepoints, it is important to keep the experimental equipment as similar as possible.
It is remarkable that the consumer-grade earphones used with the phone gave similar test-retest repeatability to the research-grade insert earphones. Of course, this does not indicate that the electroacoustic characteristics are similar, only that the characteristics as well as earphone placement/coupling are stable across multiple sessions and over days. This was an unexpected finding, since it was assumed that results obtained with the phone would show larger variability due to imprecise positioning of the earphone. Based on the results of Study 1, it seems that participants were able to position the earphone in a consistent manner, despite receiving no specific instructions about earphone placement. They were however asked to take off and reinsert the earphone after each experimental run; this was not done for the HD-650s or the ER2s. The stability of results obtained using consumer-grade earphones in a self-administered experiment is a promising finding for future at-home studies.

Conclusions
Conducting the notched-noise test on a mobile phone and outside a sound-insulated listening booth did not influence the detection thresholds in the notched-noise test. Tone-in-noise-threshold estimation was not affected by differences in equipment or surroundings, as there were no differences in test-retest results between conditions. For the estimates of auditory frequency resolution, the larger spread (but no systematic bias) of test-retest differences between PC and phone was probably caused by different earphone designs, which was confirmed by a follow-up study comparing circumaural headphones and insert earphones.
The current results demonstrate that the notched-noise test can be performed with a mobile device and consumer-grade earphones, and in acoustically less controlled environments, without introducing bias or loss of test-retest accuracy in the results. Thus, there is potential for a mobile version of the notched-noise test, which can be self-administered, for example, as part of a field study.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Two spectra of a noise band (0.01-10 kHz) with a spectral notch of ±0.2 kHz around 2 kHz, presented via the two different playback systems after compensating for differences in frequency responses by inverse filtering.

Figure 2. Illustration of one run of the grid method. The track starts with a moderately high signal level at zero notch width. Level is decreased until the signal detection threshold ($L_{\mathrm{sig}}^{\mathrm{th}}(\Delta)$, shown as a grey line) is reached, after which the notch is increased until there is again a correct response.

Figure 3. Detection thresholds as a function of notch width for Study 1. Thin lines show the individual threshold curves. The round and triangular symbols are the same in all figures and show data from Rosen and Baker (1994) and Weber (1977), respectively.

Figure 4. Pairwise Bland-Altman plots visualising test-retest repeatability for tone-in-noise-thresholds (threshold at zero notch) between two conditions in Study 1. The horizontal line shows the average difference in thresholds between two conditions, and the shaded area illustrates the 95% limits of agreement (mean ±1.96 SD) between the two conditions. The vertical grey dashed lines show the tone-in-noise-thresholds from Rosen and Baker (1994) and Weber (1977).

Figure 6. Detection thresholds as a function of notch width for Study 2. The thin lines show the individual threshold curves. The round and triangular symbols are the same in all figures and show data from Rosen and Baker (1994) and Weber (1977), respectively.

Figure 7. Pairwise Bland-Altman plots visualising test-retest repeatability for tone-in-noise-thresholds (threshold at zero notch) between the two conditions in Study 2. The horizontal line shows the average difference in thresholds between two conditions, and the shaded area illustrates the 95% limits of agreement (mean ±1.96 SD) between the two conditions. The vertical grey dashed lines show the tone-in-noise-thresholds from Rosen and Baker (1994) and Weber (1977).

Table 1. The mean (standard deviation) of the p parameter and the corresponding equivalent rectangular bandwidth (ERB) of the roex filter for each condition of Study 1.

Table 2. The mean (standard deviation) of the p parameter and the corresponding equivalent rectangular bandwidth (ERB) of the roex filter for each condition of Study 2.