Auditory modelling of the perceptual similarity between piano sounds



Introduction
In the context of acoustics, similarity assessments are used in sound quality evaluations [1] and in the study of specific sound types [2], among other applications. Since sound objects are unlikely to be repeated in exactly the same way in everyday listening [3], the concept of similarity is relevant for labelling sound objects as being the same or not.
In psychology, the similarity between objects is normally assessed experimentally. Often, the goal is to relate the physical properties of the test stimuli to the dimensions of an abstract psychological space [3]. A popular method to construct such a space is the method of triadic comparisons [2,4]. In a previous study, we proposed an alternative method that assesses the similarity between piano sounds by measuring discrimination thresholds in background noise, expressed as signal-to-noise ratios (SNRs) [5]. The focus of the current study is to model such similarity data in noise using an existing computational framework, which has been successfully used to simulate the results of several experimental tasks in noise [6,7].

The auditory model
The family of models that describes the "effective" processing of the auditory system provides a unified framework to simulate a number of auditory phenomena such as masking and modulation-detection tasks [6,7]. The structure of the adopted auditory model is as described in [7] but with configuration parameters taken from more recent model versions. The block diagram of the model is shown in Figure 1. The peripheral stages of the model (stages 1-6) deliver the internal representation of a sound. The central processor is a back-end stage that compares two or more internal representations with the aim of deciding whether those representations are distinct enough to be judged as different by an artificial listener.

Received 4 March 2018, accepted 5 September 2018.

Peripheral stages
Stage 1. Outer- and middle-ear filtering: This stage is implemented as two cascaded filters whose combined response can be roughly approximated as a bandpass filter centred at f_c = 800 Hz (slopes of 6 dB/octave; for details see [6]).

Stage 2. Gammatone filter bank: This stage represents an approximation to a critical-band filter bank. The filter bank consists of 30 bands ranging from 101 to 7324 Hz (i.e., 3.4 to 32.4 ERB_N). The bands are spaced at 1 ERB and one of the bands matches the frequency of 554 Hz (11.4 ERB_N). The model uses only the real part of the complex-valued all-pole implementation that is described in [8].
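The band layout of Stage 2 can be illustrated with a short sketch. The text does not state which ERB-number formula is used, so the conversion below is an assumption: it takes the form commonly found in auditory toolboxes, ERB_N(f) = 9.2645 ln(1 + 0.004368 f). The function names and the band counts below and above 554 Hz (8 and 21, inferred from the stated range of 3.4 to 32.4 ERB_N) are illustrative only.

```python
import math

# Assumption: ERB-number scale in the form ERB_N(f) = 9.2645 * ln(1 + 0.004368 * f).
# The source text does not state the formula; this is one common choice.
def hz_to_erb(f_hz):
    return 9.2645 * math.log(1.0 + 0.004368 * f_hz)

def erb_to_hz(erb):
    return (math.exp(erb / 9.2645) - 1.0) / 0.004368

def centre_frequencies(f_base=554.0, n_below=8, n_above=21, spacing_erb=1.0):
    """Centre frequencies (Hz) on a grid with 1-ERB spacing, aligned to f_base."""
    base = hz_to_erb(f_base)
    return [erb_to_hz(base + k * spacing_erb)
            for k in range(-n_below, n_above + 1)]

fcs = centre_frequencies()
print(len(fcs), round(fcs[0]), round(fcs[-1]))
```

Under this assumed formula the grid contains 30 bands whose lowest and highest centre frequencies fall close to the 101 Hz and 7324 Hz quoted in the text.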
Stages 3 and 4. Hair-cell transduction: This stage simulates the transformation from mechanical oscillations in the basilar membrane into receptor potentials in the inner hair cells. The signals are first half-wave rectified and then filtered (5th-order IIR LPF, f_cut-off = 770 Hz). This stage is implemented as in [9].

Stage 5. Adaptation loops: This stage simulates the adaptive properties of the auditory system. Adaptation refers to changes in the gain of the system when the level of the input signal changes. When the input level changes rapidly (relative to five time constants between 5 and 500 ms), the level is transformed linearly. For slower variations, the input level is compressed. This stage is implemented as in [6], but using a stronger overshoot limitation (factor of 5).

Stage 6. Modulation filter bank: In this stage the incoming signal is analysed according to its envelope changes. Each auditory band is split into a maximum of 12 modulation filters with frequencies between 2.5 and 1000 Hz. This stage is implemented as in [6].
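The hair-cell transduction stage (Stages 3 and 4) can be sketched as half-wave rectification followed by a low-pass filter. The text only specifies a 5th-order IIR low-pass filter with a cut-off of 770 Hz; the sketch below assumes, purely for illustration, a cascade of five identical one-pole sections whose combined −3 dB point is placed at 770 Hz using the analogue approximation. The function names are hypothetical and this is not the implementation of [9].

```python
import math

def first_order_lowpass(x, fc_hz, fs_hz):
    """One-pole low-pass with unit DC gain: y[n] = (1 - a) * x[n] + a * y[n-1]."""
    a = math.exp(-2.0 * math.pi * fc_hz / fs_hz)
    y, prev = [], 0.0
    for sample in x:
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y

def hair_cell(x, fs_hz, f_cutoff_hz=770.0, order=5):
    """Half-wave rectification followed by a cascaded low-pass filter."""
    # Half-wave rectification: negative samples are set to zero.
    y = [max(s, 0.0) for s in x]
    # Assumption: `order` identical first-order sections. For such a cascade the
    # overall -3 dB frequency equals fc_stage * sqrt(2**(1/order) - 1)
    # (analogue approximation), so each section gets a higher cut-off.
    fc_stage = f_cutoff_hz / math.sqrt(2.0 ** (1.0 / order) - 1.0)
    for _ in range(order):
        y = first_order_lowpass(y, fc_stage, fs_hz)
    return y

fs = 44100.0
tone = [math.sin(2.0 * math.pi * 554.0 * n / fs) for n in range(2000)]
envelope_like = hair_cell(tone, fs)
```

Because the rectified signal is non-negative and the filter coefficients are positive, the output never drops below zero, which is the property later stages rely on.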

Central processor
In this stage, the information received from the modulation filter bank is compared with a reference representation (template) that is stored in the model. Inspired by the concept of an optimal detector (e.g., part II of [10]), the model can be seen as an artificial listener and the template can be seen as an expected sound representation that gives a clear indication about "what to listen to". The model with the adopted central processor was evaluated for the piano similarity data and its backward compatibility was established for a number of psychophysical intensity-discrimination and masking tasks (shown in [11]).
Template: The use of a template assumes that when discriminating a signal among others, some type of awareness about that signal is expected. This corresponds to a top-down process within the auditory model. This approach is widely used in the field of vision where there is evidence of brain activity in response to features of the expected signal (e.g., [12]). In the model the template is derived (learned) by the artificial listener in the course of a specific experimental paradigm in a condition where the sounds can be easily discriminated (low-noise condition). This is in line with other central processors used in the literature (e.g., [6,7]). The adopted template derivation is further described in Section 3.1.1.

Materials and methods
This study simulates the results of a listening experiment which is similar to the task described in [5]. The same stimuli were adopted in both studies, but reverberation was added to the sounds used in this study.

Procedure: 3-AFC similarity task
The similarity between sounds was quantified using a 3-AFC discrimination task. Within each experimental trial, one of the sounds served as a reference and was presented in two intervals while the other sound was used as the target and was presented in one interval. The task of the listener was to identify the target interval. A background noise was added to the piano waveforms to change the difficulty of the discrimination task. The level of the noise was adjusted using a two-down one-up rule, which tracks the noise level at which the discrimination score equals 70.7%. The experiment continued until 8 reversals were reached. The starting SNR (SNR_start) was set to 16 dB. The noise level was varied in steps of 4, 2, and 1 dB. The discrimination threshold thres_sim was assessed as the median noise level of the last 4 reversals. The presentation level of each interval was varied (roved) between ±4 dB, drawn from a uniform distribution.
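The adaptive rule described above can be sketched as follows. Several details are assumptions made only for illustration: the sketch tracks the SNR directly (equivalent to adjusting the noise level for a fixed signal level), the step size is reduced after each of the first two reversals, level roving is left out, and `is_correct` is a hypothetical callable standing in for the listener (human or model).

```python
from statistics import median

def run_staircase(is_correct, snr_start=16.0, steps=(4.0, 2.0, 1.0),
                  n_reversals=8, n_last=4, max_trials=1000):
    """Two-down one-up staircase; returns (threshold, list of reversal SNRs)."""
    snr = snr_start
    step_idx = 0          # index into `steps`; assumed to advance at each reversal
    n_correct = 0         # consecutive correct responses at the current SNR
    last_direction = 0    # -1: SNR was lowered, +1: SNR was raised
    reversals = []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if is_correct(snr):
            n_correct += 1
            if n_correct < 2:
                continue              # need two correct responses in a row
            direction = -1            # make the task harder (lower SNR)
        else:
            direction = +1            # make the task easier (higher SNR)
        n_correct = 0
        if last_direction != 0 and direction != last_direction:
            reversals.append(snr)
            step_idx = min(step_idx + 1, len(steps) - 1)
        snr += direction * steps[step_idx]
        last_direction = direction
    if len(reversals) < n_reversals:
        raise RuntimeError("staircase did not reach the requested reversals")
    return median(reversals[-n_last:]), reversals

# Toy deterministic listener: always correct at or above 5 dB SNR.
threshold, reversal_snrs = run_staircase(lambda snr: snr >= 5.0)
```

With a real listener the responses are probabilistic, and the track converges around the SNR at which the probability of a correct response is 70.7%.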

Template in the 3-AFC similarity task
In the 3-AFC similarity task, the discriminability of the target sound depends on how different its representation R_t is from that of the reference sound R_r. For this reason our approach uses two templates: T_{p,t} for the target sound, and T_{p,r} for the reference sound. The templates are obtained as average internal representations at a highly discriminable condition (at SNR_supra) and they are normalised to unit energy [7]. We used an SNR_supra of 21 dB (i.e., SNR_start + 5 dB).
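A template derivation of this kind can be sketched in a few lines. The sketch assumes that each internal representation has been flattened into a one-dimensional list of samples and that "unit energy" means a unit sum of squares; both are illustrative assumptions, and `make_template` is a hypothetical name.

```python
import math

def make_template(representations):
    """Average several internal representations and scale to unit energy."""
    n = len(representations)
    length = len(representations[0])
    # Sample-by-sample average across the supra-threshold representations.
    mean_rep = [sum(rep[i] for rep in representations) / n
                for i in range(length)]
    # Assumption: energy is the sum of squared samples.
    energy = sum(v * v for v in mean_rep)
    if energy == 0.0:
        raise ValueError("cannot normalise an all-zero representation")
    scale = 1.0 / math.sqrt(energy)
    return [v * scale for v in mean_rep]

template = make_template([[3.0, 4.0], [3.0, 4.0]])
```

The same function would be applied once to representations of the target sound (giving T_{p,t}) and once to representations of the reference sound (giving T_{p,r}).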
Within a trial, the internal representation of each of the three intervals is cross-correlated with the templates T_{p,t} and T_{p,r}, leading to two decision criteria. Before the actual assessment, the corresponding noise representation R_{N,x} in the interval x is subtracted from the piano-plus-noise representation R_x, so that the cross-correlation is applied to R_x − R_{N,x}. Two sets of three cross-correlation values (CCV) are used in order to determine the target interval. The artificial listener looks for the interval with the highest CCV_{x,t} and for the interval with the lowest CCV_{x,r}, i.e., the interval with the highest similarity to T_{p,t} and the interval with the lowest similarity to T_{p,r}, respectively. A correct decision is obtained if both criteria point to the same interval. The decision is limited by an additive internal noise.
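The two-criterion decision can be sketched as below. The exact form of the cross-correlation is not given in the text, so the sketch assumes a zero-lag inner product between the noise-subtracted representation and each template, with Gaussian internal noise added to every CCV. The names `decide` and `sigma_internal` are hypothetical, and returning `None` when the criteria disagree is a choice made for this illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decide(intervals, noises, template_target, template_ref,
           sigma_internal=0.0, rng=None):
    """Return the index of the chosen interval, or None if the criteria disagree."""
    rng = rng or random.Random()
    ccv_t, ccv_r = [], []
    for rep, noise in zip(intervals, noises):
        # Subtract the noise-alone representation before the comparison.
        clean = [r - n for r, n in zip(rep, noise)]
        # Assumption: zero-lag inner product plus additive internal noise.
        ccv_t.append(dot(clean, template_target) + rng.gauss(0.0, sigma_internal))
        ccv_r.append(dot(clean, template_ref) + rng.gauss(0.0, sigma_internal))
    pick_t = max(range(len(ccv_t)), key=ccv_t.__getitem__)  # most target-like
    pick_r = min(range(len(ccv_r)), key=ccv_r.__getitem__)  # least reference-like
    return pick_t if pick_t == pick_r else None

# Toy trial: intervals 0 and 1 hold the reference, interval 2 holds the target.
zeros = [[0.0, 0.0]] * 3
choice = decide([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]], zeros,
                template_target=[1.0, 0.0], template_ref=[0.0, 1.0])
```

A trial would then count as correct when the returned index equals the index of the target interval.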

Piano sounds
Recordings of one note (C#5, f_0 = 554 Hz) played on seven Viennese pianos were used. Some information about the piano recordings is shown in Table I. The recordings were auralised using the binaural impulse response of a room having an early decay time (EDT_mid) of 3 s. The duration of each sound was set to 2 s, with the note onset occurring at time 0.1 s. The sounds were ramped down using a 300-ms linear ramp. The piano sounds were loudness balanced to have a maximum value of about 18 sone, resulting in sounds with maximum levels between 73.1 and 81 dB SPL. With a set of 7 sounds, the number of possible piano pairs is 21 (see the abscissa of Figure 3a).

Piano-weighted noises
Individual piano noise: To generate noises with spectro-temporal properties similar to those of the piano sounds, an algorithm based on the ICRA noise algorithm [13] was used. As a consequence of a series of interleaved filtering and randomising stages, the algorithm is able to keep the spectro-temporal properties of the input (piano) sounds [5]. This modified algorithm produces noises that are similar to applying a 30-channel noise vocoder to the piano sounds. One realisation of noise N1, generated from piano P1, is shown in Figure 2. The spectrum of N1 is shown as band levels per ERB band. To better visualise the tonal components of P1, its spectrum was obtained using a peak detection algorithm instead.
Paired piano noise: The individual piano noises were not used directly in the 3-AFC task. For a given piano pair, e.g., 13 (piano P1 being compared with P3), one realisation of N1 and one realisation of N3 were combined by averaging the waveforms of the two selected noise realisations. This operation can be seen as a trade-off in the property weighting of, in this example, pianos P1 and P3. For each piano pair a set of twelve random ("running") noises was generated.
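The pairing scheme and the averaging step can be illustrated with a minimal sketch. It assumes waveforms stored as equal-length lists of samples; the two-digit pair labels follow the "13" and "47" convention used in the text, and the function names are hypothetical.

```python
from itertools import combinations

def piano_pairs(n_pianos=7):
    """Labels of all unordered piano pairs, e.g. '13' for P1 versus P3."""
    return [f"{a}{b}" for a, b in combinations(range(1, n_pianos + 1), 2)]

def paired_noise(noise_a, noise_b):
    """Sample-by-sample average of two individual piano-noise realisations."""
    if len(noise_a) != len(noise_b):
        raise ValueError("noise realisations must have the same length")
    return [0.5 * (a + b) for a, b in zip(noise_a, noise_b)]

pairs = piano_pairs()
mixed = paired_noise([1.0, -1.0], [3.0, 1.0])
```

With seven pianos, `piano_pairs()` yields the 21 pairs mentioned in the previous section.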

Reference data
The reference data were collected from 20 participants using the stimuli described in this section, providing 210 thresholds. After applying predetermined exclusion criteria, median thresholds thres_exp were obtained from 6 to 10 data points per pair. The thres_exp values are shown as red triangles in Figure 3a. The procedure only differed from the description in Section 3.1 in that 12 reversals were used for the staircases.

Simulation results
The simulations were run using monaural (left-channel) sounds. Each threshold estimation was repeated 6 times due to the presence of sources of external variability in the 3-AFC task (level roving, running noises) and internal variability of the model (internal noise). The median of the 6 estimated thresholds was used to obtain a single threshold thres_sim per pair.

The initial simulations delivered low SNR thresholds (thres_sim of −2.75 dB or lower, see Table II, column t_obs = 2 s), meaning that the artificial listener had access to more information than the actual participants. While the model is capable of making point-by-point comparisons between the reference and target intervals over the 2-s (whole duration) piano sounds, it seems unlikely that humans have access to that same amount of synchronised information. We hypothesised that the simplest way to reduce the available information is by focusing on the onset part of the sounds, which can be seen as an attentional trigger. To incorporate this idea we introduced a reduced observation period t_obs which allows a synchronised comparison of the initial information of the sound representations (t ≤ t_obs) and a non-useful comparison for information available after that moment (t > t_obs). The tested t_obs durations ranged from 0.16 to 2 s.

The simulation results for different observation durations t_obs are summarised in Table II. The best agreement between thres_sim and thres_exp values was found for t_obs = 0.2 s. The simulations at that duration are not only highly correlated with the thres_exp data but they also reach a comparable dynamic range (DR) of more than 20 dB (see Table II). The thres_sim and thres_exp values are correlated with a Pearson r_p of 0.58 and a rank-order (Spearman) r_s of 0.61, as indicated in the scatter plots of Figure 3b-c. As indicated by the regression line in panel (b), one data point (pair 47, with thres_exp > 16 dB) was omitted from the analysis. Its inclusion would have led to a biased r_p of 0.80.
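The summary statistics used in this section (Pearson r_p, rank-order r_s and the dynamic range DR) can be computed as in the following sketch. It is a generic implementation with hypothetical function names, not the analysis code of the study, and the numbers in the usage lines are toy values rather than the reported thresholds.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r_p between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = 0.5 * (i + j) + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Rank-order correlation r_s: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def dynamic_range(thresholds):
    """DR = thres_max - thres_min, in dB."""
    return max(thresholds) - min(thresholds)

# Toy values only: a monotonic but non-linear relation.
toy_sim = [1.0, 2.0, 3.0, 4.0]
toy_exp = [1.0, 4.0, 9.0, 16.0]
r_p, r_s = pearson(toy_sim, toy_exp), spearman(toy_sim, toy_exp)
```

Because r_s depends only on the ordering of the thresholds, it is less sensitive than r_p to a single extreme data point such as the omitted pair.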

Discussion and conclusion
In this paper an auditory model was used to simulate the discrimination thresholds between 7 piano sounds. In order to compare two internal representations, two templates were required to allow the artificial listener to distinguish one piano from the other. The need of the model to access the representation of the sounds being compared can be interpreted as an approach that resembles a recognition rather than a discrimination task. The obtained thres_sim values were highly correlated with the thres_exp values when only the initial part of the waveforms was used. The best agreement was found for a t_obs of 0.2 s.

In this context, the success of the simulations might be interpreted in the following way: (1) Using longer t_obs durations, the artificial listener has access to more cues than the actual participants. The use of shorter t_obs durations is a simple way to reduce the optimally-integrated information in the central processor stage; (2) Although not shown in this paper, the use of shorter t_obs brings the interval-CCV values to a range where the source of internal noise limits the performance of the model [11]. It is important to emphasise that the amount of internal noise is not a free parameter of the model but is calibrated in an independent level JND task.

The results presented in this paper support the idea that the unified framework of the auditory model can be used to evaluate perceptual tasks using piano sounds. This can be seen as an extension of the use of the model with a central processor that is backward compatible. The choice of model parameters was based on existing literature, and we used only one actual free parameter: the duration t_obs. An alternative and more elaborate form of information reduction using an additional source of internal noise is described in [14]. Such an approach is compatible with the adopted auditory model and might have replaced the use of t_obs. The rationale is, however, the same in both approaches.

Figure 1. Block diagram of the auditory model. Each of its stages is briefly explained in the text.

Figure 2. (Colour online) (a, b) Waveforms of piano P1 and noise N1 (SNR of 0 dB) converted to SPL. The black lines indicate their Hilbert envelope. (c) Spectra of P1 (in blue) and N1 (in black) averaged over their first 0.6 s.

Figure 3. (Colour online) (a) Simulated thres_sim (magenta circles) and experimental thres_exp (red triangles) thresholds together with their interquartile ranges for each of the 21 piano pairs. The thres_sim values consider an observation period t_obs of 0.20 s. The thres_exp values range between −4 dB and 24 dB (range of 28 dB). (b, c) Scatter plots between the thres_sim and thres_exp values. The results are correlated with r_p = 0.58 and rank-order r_s = 0.61.

Table I. List of pianos and information about the recording levels as used in this study.

Table II. Simulation results for various t_obs durations. Minimum (thres_min) and maximum (thres_max) simulated thresholds are indicated together with their dynamic range (DR = thres_max − thres_min). Correlation values with the corresponding experimental data are indicated by r_p and r_s.