Analyzing time-varying spectral characteristics of speech with function-on-scalar regression

The acoustic characteristics of noise from fricatives and stop releases are dif ﬁ cult to analyze. The spectral characteristics of such noise are multi-dimensional, and popular methods for analyzing them typically rely on reducing this complex information to one or a few discrete numbers, such as spectral moments or coef ﬁ cients of discrete cosine transformations. In this paper, I propose using function-on-scalar regression models as a method for analyzing and mass-comparing spectra with minimal reduction of the complexity in the signal. The method is further useful for analyzing how spectra change as a function of time. The usefulness of this method is demonstrated with a corpus analysis of Danish aspirated stop releases, using the DanPASS corpus. The analysis ﬁ nds that /t/ releases are invariably affricated; /k/ releases are highly affected by coarticulatory context; and /p/ releases are almost always dominated by aspiration in the latter half of the release, but are affricated in the ﬁ rst half in certain contexts. (cid:1) 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http:// creativecommons.org/licenses/by/4.0/).


Introduction
When analyzing the dynamics of spectral characteristics, researchers usually resort to using a small number of discrete measurements aimed at capturing as much of the relevant spectral information as possible.For vowels and sonorant consonants, an example would be formants; for obstruent consonants, examples would be spectral moments and coefficients of discrete cosine transformations.The goal of this paper is to demonstrate function-on-scalar regression (FOSR; Reiss et al., 2010;Greven & Scheipl, 2017a;Bauer et al., 2018) as a method for taking the entire spectrum into account when analyzing sources of variance in the acoustic signal.Rather than relying on discrete measurements, FOSR allows for the use of complete spectra as dependent variables.FOSR gives a clear and easily interpretable overview of the influence of various factors on time-varying spectral characteristics, and does so with minimal reduction of the information in the acoustic signal.Other recent studies have compared full (temporally static) spectra in order to illuminate differences between palatalized and non-palatalized consonants using smoothing spline ANOVA (Iskarous & Kavitskaya, 2018) and generalized additive models (Nance & Kirkham, 2020), and functional regression models have been used in the analysis of phonetic data previously (e.g.Pouplier et al., 2014Pouplier et al., , 2017;;Cederbaum et al., 2016;Carignan et al., 2020;Volkmann et al., 2021).However, to the extent of my knowledge, this is the first study to use FOSR to analyze characteristics of speech spectra. 1s a case study, I focus on the spectral characteristics of Danish stop releases, how they change over time, and how they are affected by various contextual variables.It is wellestablished that the aspirated alveolar stop /t/ in Standard Danish is usually strongly affricated.This was pointed out early on by Otto Jespersen (1899: 335), who noted that foreigners would often perceive Danish /t/ as an affricate [ts]particularly before high front vowels.He maintained that /t/ was still an aspirated stop, but assumed that Danish was undergoing a sound change whereby all aspirated stops would eventually become affricates, as had happened in some varieties of German a millennium earlier with the Second Consonant Shift.He assumed that /t/ was most advanced in this sound change, followed by /k/, and finally /p/.This paper uses function-on-scalar regression models to explore the outcome of Jespersen's prediction more than a century later.This is not straightforward: the boundary between an aspirated stop and an affricated one is fuzzy, as is the boundary between an affricated stop and a proper affricate.I approach this question by looking holistically and dynamically at time-varying spectral characteristics throughout stop releases, and how they vary, in a corpus of spontaneous spoken Danish.
The results of the case study show that /t/ releases are invariably affricated, but that aspiration also plays an important role in /t/ releases.Affrication is also found in /p k/ releases, but only plays an important role in certain phonetic contexts, which vary by place of articulation.The state of affairs does not seem to have changed much since Otto Jespersen's time: /t/ is still heavily affricated, and affrication remains less prominent in /p k/.
In the following subsection, I introduce a general problem in acoustic phonetic research: there are multiple competing methods for measuring frication noise, which all rely on boiling down the information available in the spectrum to a small number of variables.Subsequently, I argue that smoothing-based approaches to dynamic data analysis, and FOSR in particular, may be solutions to this problem.In Section 2, I introduce the case, viz. the releases of aspirated Danish stops and their position on the blurry aspirated-affricated continuum, as well as the data used to explore this case.In Section 3, I describe three statistical models, one for each of /p t k/, and the results of these models are presented and interpreted.In Section 4, I discuss the advantages and challenges of using FOSR for analyzing spectral variance, and in Chapter 5, a brief conclusion is given.

Measuring frication
It has long been established that frication at different places of articulation (whether in fricatives, stop releases, or otherwise), has distinct spectral properties (see Kopp & Green, 1946).A classic method for differentiating places of articulation in frication is locating peaks and valleys in spectral energy distribution, essentially by 'eyeballing' spectrograms (e.g.Hughes & Halle, 1956;Strevens, 1960).Forrest et al. (1988) popularized treating the complex spectrum as a probability mass function, and analyzing it by calculating four moments: 1) the 'mean frequency', or center of gravity (COG); 2) standard deviation (SD), 3) skewness, and 4) kurtosis.COG reflects the mean distribution of energy across the spectrum; SD reflects how much the energy deviates from the mean; skewness reflects how much the energy distribution is skewed relative to the mean, and in which direction; kurtosis reflects the peakedness of the energy distribution.Forrest et al. found that spectral moments distinguished fairly well between places of articulation in stop bursts, and that particularly COG, skewness, and kurtosis distinguished fairly well between places of articulation in alveolar and postalveolar fricatives; Stoel-Gammon et al. (1994) on the other hand found that SD is particularly stable in determining the difference between dental and alveolar stop bursts.Subsequent studies testing this have not been particularly stable (see e.g.Shadle & Mair, 1996), but COG in particular has remained a very popular measure in the analysis of spectral properties of fricatives often without taking into account higher moments.This is problematic, since spectra often correspond to func-tions that are far from normally distributed.The mean value from a non-normal distribution does not give a very full picture of the shape of the distribution, and spectra with quite different shapes may have the same COG.As an example, Gordon et al. (2002: 152ff.)find nearly identical mean COG in Western Aleut [x v], but the spectral shapes are otherwise quite distinct.
A number of other candidate measures have been proposed in the literature for analyzing frication, mainly for determining the precise place of articulation of fricatives.Jongman et al. (2000) find that the different places of articulation in English fricatives are distinguished fairly well using the average location of the spectral peak.Koenig et al. (2013) show that the mid-frequency spectral peakthe frequency with the highest amplitude within a 3000-7000 Hz bandcaptures the fairly subtle difference between labialized and non-labialized /s/ in adolescents.
Another proposed method is using cepstral coefficients derived from a discrete cosine transform of the spectrum (DCT; Watson & Harrington, 1999).DCT reduces the high dimensionality of the spectrum to (typically) four discrete values, corresponding to the amplitude of half-cycle cosine waves derived from the spectrum.DCT0 reflects the mean amplitude of the spectrum; DCT1 reflects the linear slope; DCT2 reflects the curvature; and DCT3 reflects the amplitude at higher frequencies.In a comparison of /ʃ ç/ in different varieties of German, Jannedy and Weirich (2017) show that DCT-based classification more closely approximates the perception of these sounds than classification based on spectral moments, and DCT coefficients have been shown to yield more correct classifications of place of articulation than spectral moments in both voiceless stops (Bunnell et al., 2004) and fricatives (Spinu & Lilley, 2016).While DCT coefficients give a fuller picture of spectral shape than spectral moments, they are also more difficult to interpret.
Measurements such as the ones discussed above have often been taken at fixed points in time, such as the midpoint or some pre-determined range around the midpoint of fricatives or affricates; Mücke et al. (2014) refer to these points in time as 'magic moments'.Magic moments give us a limited picture of the acoustic nature of these sounds; affricates are inherently dynamic, and Reidy (2016a) shows even sibilant coronal fricatives vary dynamically throughout their time course in language-specific ways.Spectral properties of stop releases are usually measured only at the burst, which in aspirated stops is a relatively small initial portion of the release (see e.g.Chodroff & Wilson, 2014).
Summing up, most existing approaches to quantifying frication reduce the complex time-varying information from spectra to something more manageable.This is very reasonable, because 1) many statistical methods frequently used in linguistics cannot necessarily handle variables with high dimensionality, and 2) it is a goal in itself to propose the simplest possible model of how language works with the highest possible explanatory value.With regards to 1), statistical models which can take into account complex dynamic information are increasingly being used, as discussed below, and this paper demonstrates how FOSR can be used to model time-varying spectral information with little reduction of dimensionality.With regards to 2), deciding on a model of language which balances simplicity and explanatory value can simply not be done if we have not tested complex models.Studies mentioned above have shown how some patterns can only be uncovered by increasing dimensionality.For example, Reidy (2016a) shows that the language-specific nature of spectral dynamics in fricatives only becomes apparent when measuring spectral properties at several timepoints, and Jannedy and Weirich (2017) show that the spectral differences between [ç ʃ] in German (which are currently undergoing a merger) are more readily apparent when using a measure which takes more of the spectrum into account (i.e., using DCT coefficients rather than moments).
All the above-mentioned measures are derived from the spectrum rather than directly from the waveform, so it follows that the method used for spectral estimation may have an influence on the results.The most common method is fast Fourier transformation (FFT).FFT is very efficient, but in some ways not particularly suitable for acoustic data.The Fourier basis is periodic, making FFT highly suitable for periodic data, such as voiced portions of speech, but less suitable for noisy, randomly generated data, such as voiceless portions of speech (Ramsay & Silverman, 2005: ch. 3.4).For noisy data, it is theoretically preferable to use a lower variance spectral estimate such as that provided by multitaper spectral estimation (Blacklock, 2004).Note however that Reidy (2015) showed that the spectral moments derived from FFT spectra and multitaper spectra may be practically equivalent.

Smoothing approaches to the analysis of dynamic data
In the past years, following Baayen's (2008) popularization of mixed-effects regression models in linguistics, there has been a rapid increase in the use of sophisticated statistical techniques in linguistics.A general problem in the field has been the analysis of dynamically varying data, in particular data from time series.If some measuresay, COGvaries as a function of time, then a linear model by necessity assumes that the variation follows a straight line.As Sóskuthy (2017) demonstrates for formants, this is a poor assumption; variation as a function of time is often nonlinear.A solution to this problem is the use of smoothed curves.Given a number of data points associated with e.g. a time series, a smoothing function (see de Boor, 2001;Wood et al., 2016) can be used to approximate a continuous curve corresponding to the data's non-linear variation as a function of time.Smoothing involves reducing the observations to a number of basis functions (or 'knots'), and using a penalizing smoothing parameter to determine the smoothness or wiggliness of the resulting curve (see Gubian et al., 2015).Combining too many basis functions with a low smoothing penalty will lead to overfitting, resulting in curves that include irrelevant information in the signal; conversely, combining too few basis functions with a high smoothing penalty will likely lead to underfitting, resulting in curves that omit relevant information in the signal.
Functional data analysis (FDA; Ramsay & Silverman, 2005;Ramsay et al., 2009;Gubian et al., 2015;Pouplier et al., 2017) has overall had less influence on statistical modeling in linguistics.FDA is a family of statistical methods which extend existing methods to account for functional data.In practice, this means that curves can be used as input variables rather than discrete values.An example is the functional extension of principal components analysis (FPCA), which can be used to determine the primary sources of variation in curves.Gubian et al. (2015) use FPCA to jointly analyze how F1 and F2 pattern in the realization of diphthongs and hiatuses in Spanish, respectively, and Puggaard-Rode (forthc.: ch. 6) combines GAMMs and FPCA to analyze how speech spectra of stop release midpoints vary geographically in traditional regional varieties of Danish.

Function-on-scalar regression
Functional regression models are suitable when one or more variables are of a functional nature (Bauer et al., 2018).If an independent variable is functional and the response variable is constant over the functional domain, this can be modeled with scalar-on-function regression.This could e.g.be useful for researchers seeking to predict reaction times from pitch contours (e.g.Cutler, 1976); pitch contours consist of complex functional data, which will otherwise have to be either simplified or tightly controlled in the experimental set-up.If the response variable is functional and all independent variables are constant over the functional domain, this is suitably modeled with function-on-scalar regression.This, in contrast, could be useful for researchers seeking to predict the shape of pitch contours from a range of predictor variables, such as pragmatic context or duration.It is likewise useful for modeling how the spectrum of a speech sound is affected by e.g.contextual variables, as I will show below.
ɡ(Á) is a pre-specified link function mapping the predictor to the functional domain; in the case of a Gaussian model, this is simply the identity.The expected value E(Á) of each observation i = 1, . .., n of the response variable Y as a function of t conditional on a set of covariates v and residual functional error E(t) corresponds to the sum of 1) the functional intercept b 0 (t), 2) R covariate effects f r (Á), each of which form a subset v r of the full covariate set and may vary over the functional domain t, and 3) residual functional error E(t).
Functional regression models and GAMMs are conceptually very similar.GAMMs are often fitted using the R package mgcv (Wood, 2017a(Wood, , 2021)), which allows for significant flexibility in the selection of spline bases (Wood, 2017a: ch. 5), residual error distributions (Wood et al., 2016), and smoothing parameter estimation methods (Wood, 2011;Wood et al., 2015), as well as handling of autocorrelated residuals (Baayen et al., 2018), and which can handle very large data sets (Wood et al., 2017).Wood (2017a: 290ff.)gives a number of examples of how functional regression models can be implemented in mgcv.Perhaps for this reason, the framework for functional regression modeling I adhere to here is sometimes referred to as (generalized) functional additive mixed modeling (Scheipl et al., 2015(Scheipl et al., , 2016)).
Functional additive regression models are implemented in the pffr function of the R package refund (Goldsmith et al., 2021).This function uses the mgcv computation engine, and inherits the same flexibility as GAMMs fitted with mgcv.The syntax is also similar to mgcv, except there are several more term constructors for including various kinds of variables; most of these are not discussed here.An advantage of using refund rather than mgcv to fit FOSR models is that dependent and independent functional variables can be explicitly included in model formulas, allowing the user to conceptualize a problem as FOSR.In a model of spectral variance formalized in mgcv, the response variable would have to be an amplitude measure, whereas in a model formalized in refund, the response variable can be the spectral shape, which is conceptually more satisfactory.Note that this has no influence on how the models are fitted 'under the hood'; from a computational perspective, they are the same (e.g.Morris, 2017).In Appendix A, I show how a simplified version of the models presented in Section 4.1 are fitted with pffr, and compare this to how that same model would be fitted with the bam function in mgcv.Furthermore, the model fitting and selection procedure is fully documented in annotated form in Puggaard-Rode (2022).
Functional regression models are usually high-dimensional and the number of underlying data points is often very high.This can make traditional significance tests unreliable, as these are highly affected by sample size (see e.g.Kühberger et al., 2015 and references therein).Wood (2013) proposes an F-test for calculating significance of nonlinear variables in GAMMs, and the results of this test are also reported in the output of pffr; however, researchers should exercise caution in interpreting the resulting p-values, as even tiny effects will appear highly significant if the sample size is sufficiently large.This is also the case for likelihood ratio tests of nested models.For this reason, I do not report p-values in this chapter.This issue is not specific to FOSR models, but holds for essentially all frequentist models with very high sample sizes.
In any case, p-values and associated measures of nonlinear effects can only tell us if there is an effect, they cannot tell us much about the nature of that effect.A more suitable way to explore non-linear effects in exploratory studies such as this one is to visualize them.If the goal is hypothesis testing, Bauer et al. (2018) propose several different solutions.Marra and Wood (2012) propose a method for calculating confidence intervals of non-linear effects; this method can be used to quantify and visualize the uncertainty associated with nonlinear fitted effects along the functional grid.Bauer et al. (2018) propose a more precise bootstrap-based method for calculating confidence intervals, but this precision comes at a significant computational cost.

The case: Stop releases in Danish
It is not a goal of this paper to determine whether /p t k/ are phonetic affricates in Danish.The boundary between an aspirated stop and an affricated one is fuzzy, as is the boundary between an affricated stop and a proper affricate.In the end, a decision can only be made with targeted articulatory studies comparing Danish with other languages with clear-cut stop-affricate distinctions.This is rather an exploratory study aimed at better understanding the distribution of spectral properties in Danish stop releases.I focus on the following broad questions, which are more readily answerable: How do the spectral characteristics of Danish stop releases vary across time?How are their time-varying characteristics affected by different phonetic contexts?An example could be coarticulation effects following from features of the following vowel, like backness, height, and rounding, all of which affect the size and shape of the vocal tract.
In the section below, I discuss the aspirate-affricate continuum from a theoretical perspective.Subsequently, I discuss the Danish stops /p t k/ and their position on this continuum.Finally, I introduce the corpus used for the study (DanPASS; Grønnum, 2009), and introduce the acoustic analysis of the data.

Aspirated stops, affricated stops, and affricates
The production of both stop consonants and affricates has been modeled thoroughly in the work of Fant (1960) and Stevens (e.g. 1993aStevens (e.g. , 1993bStevens (e.g. , 1998: chs. 7-8): chs. 7-8).A shared component of both types of sound is a complete occlusion somewhere in the oral cavity, which allows intraoral air pressure to build up.Another shared component is a release phase, in which this pressure is released, resulting in a rapid sequence of acoustic events, including an initial brief transient followed by frication.The transient shows a fairly even distribution of noise throughout the spectrum.Frication noise is subsequently generated at or near the point of occlusion; due to the high pressure behind the constriction and the narrow gap in the oral cavity, the escaping air becomes turbulent and excites the area around the constriction.The nature of this noise gradually changes as the approximation gradually widens.In aspirated stops, air will continue to escape through the open glottis for some time after the release, and turbulence noise generated at the area around the vocal folds continually excites the vocal tract.The different stages of an aspirated stop release are shown on a spectrogram in Fig. 1; visualizations of stops in this section are all from the corpus used for the case study (see Section 2.3.1).
The energy distribution of the turbulent frication noise depends on the nature of the obstruction (Shadle, 1991).In labials, since there is no cavity in front of the obstruction, the frication noise is generated directly at the lips, causing a fairly even distribution of noise throughout the spectrum, with a slight linear drop in amplitude at increasing frequencies.In alveolars, the turbulent air stream impinges on the teeth immediately in front of the constriction, meaning there is only a very small cavity anterior to the constriction, causing high resonance frequencies around 5000 Hz to be excited.In velars, the turbulent air stream impinges on the hard palate at an oblique angle, before being filtered through a sizeable front cavity, causing relatively low resonance frequencies somewhat below 2000 Hz; note, however, that the exact point of occlusion in velars is variable and depends on surrounding vowel(s), since the tongue body is less precisely controlled than the tip and blade (Ouni, 2014), and the tongue body is itself more directly involved in the production of vowels than the tip and blade.A more fronted obstruction will cause the air stream to more directly impinge on the hard palate, causing higher resonance frequencies.Examples of aspirated bilabial, alveolar, and velar stops are shown in Fig. 2; /k/ is shown before a high front vowel and a low back vowel to visualize the clearly variable spectral characteristics in these environments.
During aspiration, low-frequency noise is generated as the airstream passing through the glottis impinges on the vocal folds, epiglottis, and surfaces directly above the glottis; this turbulence noise further excites the natural resonances of the oral cavity, which (as in vowels) largely depend on e.g. the position of the tongue.The aspiration noise is present throughout the release, but is initially dominated by frication.As the obstruction above the glottis opens, aspiration noise will gradually overtake frication noise in prominence (Hanson & Stevens, 2003).
In voiceless unaspirated stops, the frication phase is very brief, but it is an important cue to place of articulation.There are two primary place cues in stops: the spectral characteristics of the initial frication phase (e.g.Stevens, 1971;Stevens & Blumstein, 1978;Blumstein & Stevens, 1979, 1980), and the transitions of formants as the articulators move from occlusion to vowel (Kewley-Port, 1982, 1983;Kewley-Port et al., 1983;Stevens et al., 1999).In aspirated stops, formant transitions are relatively weak, because movement of the articulators typically happens before the onset of voicing, making frication all the more important as a place cue.Frication is also usually more acoustically salient in aspirated stops relative to unaspirated stops: since the glottis is spread during at least part of the closure, there is a greater build-up of supraglottal air pressure, causing quicker releases and greater burst intensities than in unaspirated stops (see e.g.Löfqvist, 1975Löfqvist, , 1980;;Jaeger, 1983).Long voicing lag can in itself lead to affrication in certain environments: when devoiced, high front vowels can be acoustically similar to fricatives (Mortensen, 2012).This can lead to the common sound change whereby /k/ ?/tʃ/ before /i/ (Hock, 1991; Ohala, 1992), as observed in e.g.Slavic, Indo-Iranian, and Middle Chinese (Guion, 1998 and references therein), and the common phonological process where /t/ is realized as an affricate or fricative before /i/, as observed in e.g.Finnish and Korean (Kim, 2001;Hall & Hamann, 2006;Hall et al., 2006).
There are no clear heuristics to decide whether a particular speech sound is an affricated aspirated stop or an affricateat least not from the acoustic signal alone.In phonology, a decision may be reached on the basis of behavior.Affricates are often assumed to contain a feature like [stop] as well as one usually used in the representation of fricatives, such as [strident] (e.g.Jakobson et al., 1951) or [continuant] (e.g.Lombardi, 1990) 2 ; see Lin (2011) for an overview of how affricates have been modeled in phonological theory.If an occlusive with a lot of frication behaves like an aspirated stop to all extents and purposes, it should probably be considered an aspirated stop at the phonological level; there will be no need to posit a [continuant] feature.If it patterns with fricatives, or shows other forms of exceptional behavior, those would be grounds for considering it an affricate at the phonological level.
From a phonetic perspective, Stevens (1993a) defines affricates as sounds which have two separate constrictions formed by the primary articulator.The anterior constriction forms a complete closure, while the posterior one forms a close approximation.In affricates, frication noise is generated at this posterior constriction, while in stops, frication noise is generated directly at the point of occlusion.This distinction is difficult to extend to acoustics or to gauge impressionistically.In practice, most decisions about stop-affricate category membership is likely based on intuition; a sound is categorized as an affricate if frication lasts for more than a certain proportion of the release.

Danish aspirated stops
In Section 1, I mentioned Otto Jespersenhis early observation that /p t k/ were becoming increasingly affricated, and his claim that they would eventually develop into affricates.Today, more than a century after Jespersen's observations, the affrication of /t/ has been established several times over, has been shown to be exceptionless, and is taken for granted in the literature.Fischer-Jørgensen (1954: 50) reports that the frequency range of the aspiration noise of /t/ is "exactly the same" as for /s/ throughout most of the release.While it is common for the initial burst noise of stops to have a similar frequency range to homorganic fricatives, this usually makes up a smaller portion of aspirated releases.Brink and Lund (1975) tracked the development of affrication in /t/ across more than a century of recordings from Copenhagen, and showed that it went from a widespread phenomenon in the mid-19th century to an exceptionless phenomenon in the mid-20th century.While earlier sources would often transcribe /t/ with both affrication and aspiration (e.g.Brink & Lund, 1975;Basbøll & Wagner, 1985), [t s ] has emerged as the standard transcription in the last few decades (e.g.Grønnum, 1998).Recently, Schachtenhaufen (2022) has proposed acknowledging the sound as a pure affricate and transcribing it as simply [ts].However, both frication and aspiration are often clearly present in /t/ releases, as exemplified in Fig. 3.
While there is consensus about the affricated status of /t/, possible affrication patterns in /p k/ have never been investigated.The most straightforward explanation for this is that no one ever noticed salient affrication in /p k/.This could either be because there truly is no affrication in /p k/, or because labial and dorsal frication are simply less salient than coronal frication.On the one hand, since /p t k/ show class behavior in other matters (e.g.phonotactics; Vestergaard, 1967), we might also expect them to show class behavior in phonetic implementation; on the other hand, Chodroff and Wilson (2018) recently found only moderate signs of class behavior in the realization of place cues in a large-scale study of American English /p t k/.
The timing of gestures in Danish aspirated stops is seemingly different from comparable Germanic languages.In Icelandic and Swedish, peak glottal opening is achieved relatively early during the closure of aspirated stops (Pétursson, 1976;Löfqvist, 1980); also in English and German, the glottis is typically fully spread sometime before the stop release (Sawashima, 1970;Hoole et al., 1984).Furthermore, closures in aspirated stops are typically longer than in unaspirated stops (Lisker, 1957;Löfqvist, 1976;Stathopoulos & Weismer, 1983;Braunschweiler, 1997).This ensures that supraglottal air pressure is high at the time of the release.In Danish, however, peak glottal opening typically falls just after the release of the stop (Frøkjaer-Jensen et al., 1971), and closure duration is shortest in unaspirated stops (Fischer-Jørgensen, 1969, 1972).Taken together, these two facts about Danish aspirated stopslate peak glottal opening, and relatively short closure durationmean that there are fewer mechanisms in place to ensure high supraglottal air pressure at the time of release, and accordingly, less guarantee of a prominent burst.This can partially explain why a constriction is retained for relatively long during Danish stop releases, since higher air pressure at the time of release in itself causes quicker movement of the articulators as the muscular tension forming the constriction is released.Functionally, it can also explain the 'need' for affricated releases in Danish: if the place cues of the burst are not otherwise so prominent, they can be strengthened by retaining constriction following the release.
From a phonological perspective, Danish /p t k/ show similar behavior.Their phonotactic behavior is similar to that of unaspirated stops (Vestergaard, 1967), and they all show the same patterns of positional allophony, with loss of aspiration (or affrication) after /s/ and medially before schwa, and loss of release syllable-finally (although they are optionally released phrase-finally; Grønnum, 2005).When loan words with alveolar affricates are nativized and adapted to Danish phonology, the affricate is generally reanalyzed as /s/ rather than /t/, as in the following examples; etymologies are from DSL (2018).In a study of Danish speakers' productive acquisition of Standard Chinese coronal obstruents (Puggaard, 2020), it was further shown that the most common error in the production of (non-aspirated) /ts/ is realizing it with no closure phase, i.e. similar or identical to /s/.Native speakers of Danish do not map Standard Chinese /ts/ to their native /t/ phoneme.They do, however, tend to map Standard Chinese /tsʰ/ to their native /t/ phoneme, further cementing that both affrication and aspiration are crucial cues to Danish /t/.

The DanPASS corpus
The data for this study comes from the Danish Phonetically Annotated Spontaneous Speech (DanPASS) corpus (Grønnum, 2009(Grønnum, , 2016)).This corpus was established to obtain high-quality recordings of unscripted Danish speech.The recordings are of single speakers or pairs of speakers solving unscripted tasks.The corpus has already served as the basis for studies on plosive reduction (Pharao, 2011), the relationship between vowel height and voice onset time (Mortensen & Tøndering, 2013), and intervocalic stop voicing (Puggaard-Rode et al., 2022a), as well as a number of other studies; see Grønnum (2016) for a full list.
The corpus consists of monologues recorded in 1996, and dialogues recorded in 2004.Only the monologues are used in this study, since these are more simple to analyze.These consist of 171 minutes of speech from 18 speakers.13 were men and 5 were women, and they were between 20 and 68 years old at the time of recording (mean = 29 years).Each speaker contributed a mean of 9m27s (range 6m13s -15m49s).Age is not taken into account in the statistical modeling of the data, so the heterogeneity in age across speakers should not be a problem, especially since we have no reason to assume that there was significant change in the realization of /p t k/ across the speakers' age span.The gender imbalance should also not be a problem, since most mixed modeling frameworks (including the one used here) do not assume balanced variables (e.g.Wood, 2017: ch. 2).
The recordings are segmented at the levels of prosodic phrase, word, and syllable, and transcribed both orthographically, phonemically, phonetically, and prosodically, as well as coded for morphology and parts-of-speech.Technical details about the recordings can be found in Grønnum (2009).The monologues consist of speakers performing three different tasks.In the network task (Swerts & Collier, 1992), they describe various shapes and colors.In the city task (Swerts, 1994), they describe routes drawn on a city map.In the house task (Terken, 1984), they describe how to build a house model using provided building blocks.

Acoustic analysis
The initial acoustic analysis was done using Praat (Boersma, 2001;Boersma & Weenink, 2019).An automated script was used to locate all aspirated stops (i.e.members of /p t k/) in simple onsets in the DanPASS monologues, and combine them into a single sound file with a subset of the (relevant) original annotations.This was an edited version of the script used by Puggaard-Rode et al. (2022b) written by Dirk Jan Vet.This located a total of 2,539 stops.The release phases of the stops were segmented primarily on the basis of the waveform, with the burst taken as the beginning of the release and the first signs of periodicity taken as the end (following Francis et al., 2003).If multiple bursts were present, the final one was chosen (following Cho & Ladefoged, 1999).This process was partially automatized by a Praat script searching for sudden increases in amplitude, but required extensive manual correction.An example of a segmented /t/ release can be seen in Fig. 4. 205 tokens were excluded during this process if there was no discernible closure.The distribution of stops by phonemic category is shown in Table 1, along with the mean duration of release for stressed and unstressed tokens.This is equivalent to positive voice onset time (VOT).Note that in some cases, the mean VOT values differ quite dramatically from those reported by Mortensen and Tøndering (2013), also on the basis of the DanPASS corpus; this is likely because they follow Fischer-Jørgensen and Hutters (1981) in going by the onset of higher formants rather 3 A counterexample is tzatziki, which is nativized as [tʰaetˈsiki] (DSL, 2018); here, the first /ts/ is reanalyzed as /t/, and the second as ambisyllabic /t.s/.(1981) recommend this landmark because it yields relatively stable VOT values across vowel types; however, voicing is clearly and consistently present in the waveform and spectrogram before the onset of higher formants, meaning that this landmark is not suitable for an analysis of spectral variance where we would like to avoid the presence of intrusive voicing and (in this case) the first formant.Subsequently, a Praat script (available in Puggaard-Rode, 2022) was used to extract the release duration and information about the phonetic context from each stop.The phonetic context in question is four binary variables describing the height, backness, and rounding of the following vowel, as well as stress.For this purpose, [i y u ɪ ʏ ʊ e ø o] are all defined as high vowels.Danish has a very complex vowel system, and these vowels all have a mean F1 below 400 Hz in modern Standard Danish (Grønnum, 1995;Juul et al., 2016).[u ʊ o ʌ ɔ ɑ ɒ] are the relevant back vowels, and [y u ʊ ø o oe ɔ ɶ ɒ] are the relevant rounded vowels.
Each stop was split into 20 equally long time steps.This is too coarse-grained to tease apart very dynamic sequences, such as the segue from initial transient to frication, but should be fine-grained enough to capture gross changes in affrication.The recordings are filtered to include a frequency range of 500-12,000 Hz.Frequencies below 500 Hz were filtered away to avoid a potential influence of intrusive voicing or low frequency ambient noise.Frequencies above 12,000 Hz were filtered because they rarely play a significant role in speech.In fact, 12,000 Hz is a relatively high cut-off point compared to similar studies, but was chosen due to a study by Pharao and Maegaard (2017) on sociolinguistic variation in Danish /t/ showing that mean COG for fronted realizations of /t/ can go above 6000 Hz, suggesting that very high frequencies may occasionally play a role.For each time step, the four first spectral moments were also extracted; the spectral moments are not used for this analysis, but are published alongside the other data used for the analysis (Puggaard-Rode, 2022).
Multitaper spectra were generated for each time step using R (R Core Team, 2020; RStudio Team, 2021). 4The study relies on multitaper spectra for both theoretical reasons (outlined in Section 1.1) and practical reasons.From a practical perspective, multitaper spectra consist of fewer frequency bins than FFT spectra, making their use in statistical models less computationally expensive.The parameters used for spectral estimation (the number of tapers K = 8 and the time-bandwidth parameter nW = 4) were set to match previous studies by Romeo et al. (2013) and Reidy (2015).
Three tokens of /k/ were excluded because their total release duration was below 10 ms, and the algorithm used to generate the spectra does not function for sound files shorter than 0.5 ms.The multitaper spectra are the dependent variable in the statistical analysis; each consists of a vector of amplitude values by frequency ranges.The study uses raw amplitude on the W/m 2 scale rather than the more commonly used transformed amplitude on the decibel scale; models were tested on both scales and yielded practically similar results, but the results are somewhat easier to interpret when using raw values.This is discussed in more detail in Puggaard-Rode (forthc.: chs.5-6).The frequency ranges differ in size depending on the duration of the time step; longer time steps have more fine-grained spectra.For each spectrum, the amplitude values were standardized, 5 since plenty of non-linguistic factors can lead to deviations in overall amplitude level.Only the frequency range between 500-10,000 Hz was used for the statistical analysis of /t/ spectra, and 500-8000 Hz for /p k/, since the minor activity above these limits seemed to be essentially random noise, which interfered with the clarity of the results.

The study
This section covers the statistical analysis and interpretation of the results.I first discuss the model specification, present the results for each stop in turn, and link the results to their presumed underlying articulatory mechanisms.

Model specification
All statistics were calculated using R (R Core Team, 2020; RStudio Team, 2021) and a number of add-on packages. 6All code is freely available in annotated form online (Puggaard-Rode, 2022); this includes various residual and autocorrelation plots.Separate FOSR models were fitted for each stop with multitaper spectra as the dependent variables.The spectra are smoothed using P-splines with the number of basis functions for the global intercept set as the mean number of amplitude observations per spectrum (corresponding to 32 for /t/, 19 for /k/, and 17 for /p/), which seems to strike a good balance between signal and noise.For the functional responses, 6 basis functions were used for the /t/ model and 5 for the /k/ and /p/ models, guided by the selection procedure proposed by Pya and Wood (2016).P-splines are useful for sparsely distributed data, i.e. when the number of observations per function differs (Wood, 2017b).Time step is included as a non-linear independent variable, smoothed with thin plate regression splines (Wood, 2003) with 16 basis functions to ensure quite high granularity in the temporal dimension.Smoothing penalization parameters were automatically selected using fast restricted maximum likelihood estimation (Wood, 2011).The residuals for the models are reasonably normally distributed, although for the /p/ model, they are somewhat leptokurtic (kurtosis = 5.45); however, Gaussian models with a high number of observations should be quite robust to violations of normality (e.g.Knief & Forstmeier, 2021).
A major advantage of GAMMs is the ability to account for autocorrelated residual error (Baayen et al., 2018;Wieling, 2018); for example, measurements taken at adjacent steps in a time series are likely to be correlated simply because they (Rahim, 2014;Rahim & Burr, 2020), with convenience functions published with the package nzilbb.labbcat(Fromont, 2021) based on code from Reidy (2013Reidy ( , 2016b)).The code is published in Puggaard-Rode (2022). 5The values were standardized by subtracting the mean and dividing by two standard deviations, following Gelman and Hill (2006). 6As mentioned above, refund (Goldsmith et al., 2021) was used to fit FOSR models and for various health checks, and mgcv (Wood, 2017a(Wood, , 2021)) are adjacent, which adds unwanted structure to the model residuals.This also applies to adjacent amplitude values in the frequency domain.One way to correct for this is by setting a q-parameter, often corresponding to the autocorrelation at 'lag-1', i.e. the mean correlation between adjacent measurements.This correction, called an AR(1) model, can also be included in FOSR models.AR(1) models are included in all models with q set at 0.1 above the lag-1 autocorrelation in a corresponding model with no correction. 7Autocorrelation along the frequency domain in the AR(1)-corrected models is negligible and short-range (between 0.06-0.16at lag-1).Note that all models display moderate negative autocorrelation at higher lags, which is stable across different values for q (between 0.14-0.39 at lag-8).This is demonstrated in Fig. 5, which shows autocorrelation plots for the model of /t/ releases from four different models with different parameter settings for q.Note that the models for both /k/ and /p/ show less autocorrelation (both positive and negative) than the model for /t/.
Another method for accounting for autocorrelated errors is the use of functional random intercepts, potentially with smoothing parameters set using splines based on functional principal components (Scheipl et al., 2015;Greven & Scheipl, 2017a;Bauer et al., 2018).Pouplier et al. (2017) argue in favor of the latter approach because 1) the influence of random effects can then be more readily decomposed, and 2) the basis for the correction is computed directly from the data, while the parameter setting used for AR(1)-correction is necessarily somewhat ad hoc.The latter approach can also be implemented in pffr, but at a significant computational cost.It is less computationally heavy to account for autocorrelation using another spline basis, such as thin plate regression splines.Recall that these are used to model variation along the time domain; Scheipl et al. (2015: appendix C.1) show that this approach can be used to simultaneously model a predictor variable and minimize autocorrelation along a functional domain (in this case the time domain).They further suggest that remaining residual structure can be accounted for by including scalar random intercepts, in this case corresponding to an intercept for each individual stop token.These are not included in the final models here, but I show in the accompanying code that adding such a scalar random intercept adds a significant computational load while accounting for very little variance in this case (Puggaard-Rode, 2022).
The model further includes by-category smooths for a number of independent binary variables: speaker sex, following vowel height, backness, and rounding, as well as stress.The influence of speaker sex on the spectral profile has not been discussed much above, but is also included here, since previous studies have shown a gender effect on the spectral profile of fricatives (e.g.Stuart-Smith, 2007).I am interested only in how these variables affect the time-varying characteristics of spectra, so no main effects were included for these variables.These are contrast coded (see Schad et al., 2020), such that absence of the feature in question is numerically coded as À½ and presence as +½; the speaker sex variable was (randomly) coded as À½ female, +½ male.Contrast coding of categorical variables is similar to centralizing continuous variables, and ensures that the global intercept corresponds to a weighted global mean, which makes the final results much easier to interpret.For each of these effects, by-speaker random slopes were also included (except for speaker sex, which logically cannot vary by-speaker). 8Including interaction effects (e.g.backness Â rounding) would be possible, but I have opted against doing so here, since it would make the results unnecessarily complicated for an exploratory study such as this one.The model formulae can be expressed as follows, where all variables are standardized (or pseudo-normalized via contrast-coding): where i indexes each observation, j indexes each speaker, F is frequency, t is time, a(F) denotes the smooth functional intercept, c(t ij , F) denotes the non-linear time variable, and E ij (F) denotes the functional residual error.The individual error of an observation e i is further modulated by the error of the preceding observation e i-1 by a factor of q; this is the AR(1) process described above (Baayen et al., 2018).In Appendix A, I describe how a simplified model is fitted with pffr, and compare this to how a similar model would be fitted as a GAMM with mgcv.
As discussed in Section 5.4, I do not report p-values for the FOSR models, as they likely reflect the very large number of observations (almost 550,000 normalized amplitude values for the largest model) rather than practical significance.I do report the rest of the model summary, which is similar to summaries of GAMMs fitted with mgcv.I do not include random effects in the summaries, but they are included in the accompanying code (Puggaard-Rode, 2022).Model summaries include estimated degrees of freedom (edf), which reflect the linearity of a variable, with a low edf near 1 indicating that the variable is near-linear; reference degrees of freedom (ref. df), which reflect the complexity of fitting a variable; and F-values, which reflect how much variation in the data is accounted for by a variable.As such, ref.df and F combined reflect the fitting-complexity tradeoff of including a variable, and these are usually used to calculate p-values, following the procedure described by Wood (2013;2017: ch. 6.12).As this study is largely exploratory, I take F-values as a proxy for the influence of individual variables, and do not otherwise touch on statistical significance in the traditional sense.This may not be satisfactory for all kinds of studies, and I return to the issue of null-hypothesis significance testing in Section 4.
I primarily explore the model fits through two types of plots: 1) Spectrum intercepts, which visualize the functional intercepts of the models, corresponding to an average release spectrum when all other variables (including variation along the time domain) are kept at zero.These are not very telling 7 Autocorrelation remained somewhat too high when q corresponded exactly to lag-1, which is why q is increased here. 8Using factor smooths instead of random slopes would have given a more thorough estimation of the by-speaker variation in the data (Baayen et al., 2018;Wieling, 2018;Sóskuthy, 2021), but unfortunately these cannot currently be fitted with sparsely distributed data.
in themselves, but are important for interpreting other effects.The spectrum intercepts are plotted with 95% confidence intervals, computed in the manner proposed by Marra and Wood (2012).2) Spectro-temporal fits, which visualize variation in the spectrum across time.The interpretation of these is similar to spectrograms; they are 'flipped' spectra, with normalized time along the x-axes, frequency along the y-axes, and greyscale shading indicating differences in fitted amplitude along the time-frequency domains, with darker shading indicating higher energy.These visualizations reflect the effect size of different variables.The plots of the main effect of time are computed by combining the functional intercept with the fitted effect of time; the plots of other variables are computed by combining the functional intercept, the fitted (main) effect of time, and the fitted time-varying effect of the variable in question.This means that if the model finds no noticeable effect of time, there will be no noticeable change along the horizontal dimension; if there is no noticeable effect of a particular variable, the plot associated with this variable will be similar or identical to the plotted main effect of time.Since these plots are two-dimensional, visualizing 95% confidence intervals require separate plots for the upper and lower limits.Plotted 95% confidence intervals for the multidimensional variables are included in Appendix B. These plots demonstrate the uncertainty associated with each fitted effect, and I will refer to these plots throughout the paper.

Results
The results of the three different models will be presented in separate sections below, starting with the model for /t/.

/t/
The model of /t/ releases has a high effect size of R 2 = 0.53.The functional intercept (see Fig. 6) shows an energy peak around 3500-5000 Hz, with comparatively little energy elsewhere, particularly above 8000 Hz.Since all the binary variables in the statistical model are contrast coded, the intercept reflects a grand weighted mean across time with all contextual variables kept at zero.Since the intercept summarizes a dynamic series of events, it is not in itself very meaningful.In the spectro-temporal fits (Figs.7-8), any changes on the horizontal dimension are a result of spectral characteristics changing as a function of time.
The /t/ model shows a strong main effect of time in the expected direction, as shown in Fig. 7. Initially, energy is skewed towards the higher end of the spectrum, with a fairly strong energy peak in the 3500-5500 Hz range (as in the intercept spectrum, Fig. 6), but also reasonably equal distribution of energy in the 5500-8000 Hz range.Increased energy above the main peak gradually tapers off, and in the final threefourths of releases, energy is broadly distributed below 5000 Hz, including at the lowest frequencies visualized (500 Hz); this is comparable to the concrete example of a /t/ release shown in Fig. 3, which shows a similar development over time.The main effect of time is quite robust, as visible from 95% confidence intervals (see Appendix B.1).
Spectro-temporal fits for each direction of the individual variables are shown in Fig. 8. Table 2 shows the model summary.Fig. 8 reflects a residual issue with this modeling technique.In contexts where we expect reduced affrication and earlier onset of aspiration, as in e.g.non-high vowels relative to high vowels, the figures show a relatively early increase in energy at low Fig. 5. Autocorrelation along the frequency domain in models of /t/ with no correction (top left), AR(1) model with q = lag-1 (top right), AR(1) model with q = 0.1 below lag-1 (bottom left), AR(1) model with q = 0.1 above lag-1 (bottom right).
frequencies, but also tend to show a sudden final increase in energy at higher frequencies.There is no linguistic reason to expect this, and it is consistent across models; I assume that this is a technical issue that does not reflect the data or the linguistic reality.
Overall, men show relatively little energy above the peak in the intercept spectrum, and lower frequencies (indicative of aspiration) 9 begin dominating relatively early.Women show strong initial energy in frequencies above 5000 Hz, and although lower frequencies come into play late in the release, frequencies up to 5000 Hz are excited throughout the release.The effect of sex is robust (see Appendix B.1) and associated with a large Fscore.
Lower frequencies start dominating towards the end of the release in unstressed syllables, and much earlier in stressed syllables; this core effect is quite stable, but there is significant uncertainty associated with the size of the effect (see Appendix B.1). Lower frequencies also dominate relatively late before high vowels, and frequencies above 6000 Hz are also more excited at the beginning of the release in this context.This is a strong effect associated with a high F-score.Again, these two effects are stable, but the size of the effects is relatively uncertain, and there is significant uncertainty associated with how vowel height otherwise affects the release.Lower fre-quencies dominate relatively early before back vowels and round vowels.In both of these contexts, there is also a coarticulatory effect at the start of the release: relatively high frequencies are excited before round and non-back vowels.These variables are less influential, with a particularly low F-score for the roundness variable, and they are both associated with significant uncertainty.
It is interesting that none of these variables are particularly influential around the middle portion of the release; they may affect whether particularly high frequencies are excited around the start of the release, and whether/when lower frequencies begin to dominate near the end of the release, but high energy in frequencies 3500-5000 Hz in the middle of the release is a consistent feature across all variables.

/k/
The model of /k/ releases has a high effect size of R 2 = 0.56.Recall that figures here do not extend above 8000 Hz.The intercept spectrum (see Fig. 9) shows almost evenly distributed energy below 4000 Hz, with small peaks around 500 Hz and just below 4000 Hz, and overall decreasing energy above 4000 Hz.
There is no strong main effect of time; there is little variance across the time domain in Fig. 10, and the variance that we do see is associated with significant uncertainty (see Appendix B.2).The associated F-score is also relatively small.Spectro-temporal fits for each direction of the individual variables are shown in Fig. 11.Table 3 shows the model summary.
There is a noticeable sex difference.There is little energy at lower frequencies during the first half of the release for female speakers, and more activity at frequencies above 4000 Hz.Lower frequencies become dominant in the last quarter of the release for female speakers, whereas for male speakers, they are seemingly dominant throughout the release.The Fscore for this variable is high, and the patterns are quite robust (see Appendix B.2).
As expected, phonetic context effects have a clear influence on the /k/ spectral trajectory, particularly those effects that reflect properties of the following vowel.Stressed syllables have somewhat more energy at the lower band around 500-1000 Hz, while unstressed syllables have more energy at the higher band around 3500-4000 Hz, although lower frequencies gradually become dominant in the latter half of the release.Note, however, that the associated F-score is modest, and the variable is associated with significant uncertainty (see Appendix B.2).
Before high vowels, there is a lot of high frequency energy between 3000-5000 Hz during the first half of the release, with more diffuse distribution of energy before the onset of lowfrequency noise towards the end of the release; low frequency energy overall dominates releases before non-high vowels.This variable is associated with a large F-score.Non-back vowels and non-round vowels show roughly the same patterns as high vowels, although with slightly varying temporal alignment.The backness variable in particular is associated with a very large F-score.High frequency noise lasts somewhat longer for non-round vowels than non-back vowels.The Fscore associated with the roundness variable is also large.All effects associated with features of the following vowel are robust. 9As mentioned in Section 5.2, during aspiration, low-frequency noise is generated at or near the glottis, and the turbulent airstream excites the resonant frequencies of the oral cavity.The dominance relationship between these sources may differ, but in both cases, the primary frequencies being excited are well below those excited during alveolar frication.

/p/
The model of /p/ releases has a very high effect size of R 2 = 0.7.The intercept spectrum (see Fig. 12) shows most energy in the lowest frequencies, with energy gradually reducing at higher frequencies.Assuming that the more diffuse distribution of noise towards the end of the release is not linguistically substantial, there is only a very marginal main effect of time (see Fig. 13), although what we do see is quite robust (see Appendix B.3).Some of the by-variable time-varying characteristics of /p/ are clearer, as shown in Fig. 14.Table 4 shows the model summary.
There are modest signs of higher frequencies being excited more in the first half of releases produced by women, but not by men.The sex effect is, however, quite weak; the overall pattern is relatively robust, but the magnitude of the pattern is associated with significant uncertainty (see Appendix B.3).During the first portion of the release, unstressed tokens have a broader distribution of energy throughout the spectrum, and more energy at higher frequencies (above approx.5000 Hz).The stress variable is quite strong and robust.A similar pattern is found before high vowels, with lower frequencies dominating relatively late in the release.The height variable has a relatively high F-score, and is also quite robust.To a lesser extent, the same pattern is found before non-back vowels.There is no obvious influence of round vowels, and this variable is associated with significant uncertainty (see Appendix B.3).

Linking the results to underlying articulatory mechanisms
In the preceding section, I described the patterns of energy distribution that are visible in the spectro-temporal fits in prose.In this section, I aim to provide a link between those representations and the articulatory mechanisms that presumably underlie them.This discussion is necessarily somewhat speculative, but relies on established knowledge about the articulation-acoustics link, and about the articulation of Danish specifically.
While all stops show diffuse patterns of energy distribution towards the end of the release, only /t/ clearly shows a strong main effect of time, with a gradual downward trend in energy distribution over time.During the first half of the release, high frequencies are excited, often above and beyond what is necessarily expected for an alveolar constriction.During the second half of the release, lower frequency energy consistent with a glottal noise source gradually becomes dominant.As mentioned in Section 2.2 above, there is reason to assume that oral air pressure is not particularly high at the time of release in Danish aspirated stops, which provides both an aerodynamic reason and a functional-phonological motivation for why the constriction is maintained somewhat longer than in comparable 'aspiration languages': there is no high air pressure to ensure that the constriction is quickly released, and to ensure a salient burst.Nevertheless, contrary to the general conception in literature, alveolar constriction usually does not dominate the entire release.
The relative timing of the shift in dominance from an alveolar noise source to a glottal one is partially determined by contextual factors like stress and vowel height.Speaker sex also plays a role.Stop releases in stressed syllables show a larger proportion of aspiration.In other words, phonetic reduction mainly targets the aspiration in /t/ releases, not the frication.Features of the following vowel affect the relative timing of the dominance shift much more than they affect the distribution of energy during the first half of the release, although high and round vowels do show coarticulatory effects lasting throughout the release.The linguistic upshot is that lengthy alveolar frication is an invariant feature of /t/ releases in Modern Standard Danish, but the proportion of alveolar frication varies; some degree of aspiration is almost always observed.
Stevens' (1998) model of velar stop releases suggested that the velar frication excites low resonance frequencies mostly below 2000 Hz.The results here, however, show two primary patterns of energy distribution: much higher resonance frequencies around 4000 Hz, or resonance frequencies centered around the lower end of the spectrum.I presume that   and non-back vowels in particular, noise at higher frequencies is dominant during the first part of the release.If the following vowel is high, the tongue dorsum logically remains fairly close to the velum throughout the release, causing a dominant dorsal noise source, the characteristics of which vary on the basis of other vowel features.The point of occlusion varies by backness of the following vowel, such that the outgoing air impinges more directly on the hard palate before front vowels, causing more salient noise at higher frequencies.The energy from the palatal noise source is dampened by lip rounding, which increases the size of the oral cavity.The linguistic upshot is that coarticulation has a major influence on spectral characteristics throughout /k/ releases; this is in line with the general observation that the point of occlusion in velar stops is prone to coarticulatory variation (e.g.Ouni, 2014)./p/ releases also vary in whether there is a primary glottal noise source (a strong energy peak at lower frequencies), or whether there is a primary labial noise source (no strong energy peak at lower frequencies).There is no strong main effect of time.In unstressed syllables, before high vowels, and to some extent before non-back vowels, energy is more broadly distributed throughout the spectrum, indicating a dominant labial noise source./p/ releases vary relatively little compared to /t k/.
These results confirm the observation that /t/-affrication in Modern Standard Danish is invariant.Generally, however, /t/ affrication does not last throughout the release; aspiration is also an important component of /t/ releases, especially in stressed position.There is also a frication component in /p k/ releases, but under many conditions, these releases are dominated by a glottal noise source.During a /t/ release, the outgoing air impinges on a hard surfacethe teethimmediately downstream of the preceding occlusion.This is not the case for either /p/ or /k/; the lips constitute a soft surface, and the hard palate is further removed from the velar occlusion.As such, it is well-understood why an alveolar noise source dominates a glottal one more readily than corresponding bilabial or velar noise sources.

Discussion: Function-on-scalar regression and the spectrum
This paper has introduced the use of FOSR in the analysis of speech spectra and their variance as a function of time.This method shows a lot of promise.It allows us to get around the problem of choosing one or a few discrete measures to represent the spectrum, all of which come with their own set of methodological problems.In a sense, analyzing these models is similar to the classical technique of 'eyeballing' spectrograms, but in a way that allows the user to more efficiently and reliably find systematic patterns of variation in the data, to tease apart various influences on the results and compute the uncertainty associated with each, and to filter out byspeaker variation.Some lingering issues remain with the method; some specific to this study, and some inherent to the field.I will briefly address a few of these.
As with any kind of quantitative phonetic study, there are significant researcher degrees of freedom involved in FOSR modeling of spectra (see Roettger, 2019).Token selection, spectral estimation, smoothing procedure, low-level software implementation, as well as several other factors all have a potentially non-trivial influence on the results.There is no easy remedy to this, but transparent reporting and motivation of all these choices goes a long way.I have aimed to do that here, and the actual code used to implement the analysis is available in annotated form (Puggaard-Rode, 2022).
FOSR modeling of spectra will generally involve highly multidimensional data, especially if the temporal dimension is also taken into account.This makes the use of traditional methods for significance testing problematic.I do not consider this to be an issue in the current study.For one, the study is largely exploratory, and the research questions are not necessarily suitable for null hypothesis significance testing.With that said, there are methods for testing the stability of the results.This includes the 95% confidence intervals proposed by Marra and Wood (2012), which are provided in Appendix B and which I have referred to throughout.These are computationally efficient, and visualize the uncertainty associated with fitted   effects along the functional domain(s).This method is implemented for FOSR visualization in the FoSIntro package in R (Bauer, 2021).Additionally, there are functional implementa-tions of discriminant analysis and regression trees which may be used to explore the generalizability of results, and fully Bayesian implementation of the analysis would make it possi- ble to readily quantify the uncertainty related to the results (see e.g.Vasishth et al., 2018).This will hopefully be explored in future research, but is beyond the scope of the current study.
The prospects of hypothesis testing in FOSR models is further explored in a recent dissertation by Biswas (2022).
The implementation of FOSR in this study shares a problem with analyses based on e.g.spectral moments, mid-frequency peaks, and DCT: the Hz-based frequency scale and the W/m 2 -based amplitude scale are 'physicalist' in nature, in that they represent the behavior of vibrations in the air, and not how these vibrations are perceived by the human ear (Plummer & Reidy, 2018).I use the Hz scale here because it results in a model output which is more immediately interpretable for readers with experience with analyzing spectrograms; I use the W/m 2 scale because it results in more clearly interpretable patterns in the fitted models.It is, however, worth exploring in future studies how the results would be affected by combining perceptually motivated scales, such as the decibel scale for amplitude, and the equivalent rectangular bandwidth (ERB) scale, Bark scale, or mel scale for frequency. 10Note that the latter two scales have been profitably used in combination with DCT in previous studies (Bukmaier & Harrington, 2016;Jannedy & Weirich, 2017).
The most serious lingering issue is the diffuse patterns sometimes seen in the final time steps of the spectrotemporal visualizations.These cannot be considered linguistically meaningful; there is no linguistic reason why high frequencies above 4000 Hz would suddenly be excited immediately before the onset of voicing in a stop-vowel sequence, regardless of phonetic context.I can see three possible explanations for this: 1) the spectral characteristics of aspiration are highly variable, making it impossible for the model to make precise predictions, 2) the pseudocentralization (contrast coding) of categorical variables sometimes causes the model to infer patterns that are not meaningful for one pole of variables, or 3) it is caused by phase variation.Regarding 2), consider /k/ before high and non-high vowels: the model finds a strong increase in low frequency energy in the final time steps before high vowels, which is linguistically meaningful, as the glottal noise source becomes dominant immediately before the onset of voicing.The model finds a corresponding increase in high frequencies and decrease in low frequencies in the final time steps before non-high vowels, which is not linguistically meaningful, but is the direct opposite of the meaningful finding before high vowels.A possible solution would be to fit the model without contrast-coded categorical variables, but this would make it impossible to interpret models' intercepts and main effects of time, which I believe would seriously harm the interpretability of the findings.Regarding 3), phase variation is a practical problem in functional data analysis, where lateral displacement in input curves can cause results to be blurred and distorted.Managing phase variation in the analysis of functional data is an area of active research (Marron et al., 2015;Bauer et al., 2021).

Conclusion
This paper has introduced function-on-scalar regression as a method for analyzing speech spectra and how they vary over time.This method forgoes the need to boil down the complex, multi-dimensional information in the spectrum to a few discrete values, and it forgoes the need to rely on 'magic moments' in time.By plotting the fit of a FOSR model, we can explore the systematic influences of different variables on the spectrum with visualizations that should be intuitively familiar to anyone already used to working with spectrograms.I showed how this tool can be fruitfully applied in the analysis of Danish stop releases, how they vary over time, and how they are affected by their phonetic environments.
The analysis finds that /t/, as expected from the literature, is invariably affricatedbut also that the spectrum is very dynamic throughout /t/ releases, with affrication gradually turning to aspiration.Affrication dominates the majority of the spectrum, and much of the aspiration is lost in unstressed syllables.Coarticulatory context effects may affect the entirety of /t/ releases, and not just the final portion.Coarticulatory context effects greatly influence the spectra of /k/ releases.As the precise point of occlusion in velar stops is known to be largely determined by the following vowel, these also have a great influence on release noise in /k/, particularly in the first portion of the release.The acoustic characteristics of /p/ releases show less prominent coarticulatory context effects, which mainly affect the first half of the release.

Fig. 1 .
Fig. 1.Different stages of an aspirated stop release, exemplified in a token of /k/ before a low back vowel.'t' is short for 'transient'.The spectrogram shows frequencies from 0 to 8000 Hz.

Fig. 2 .
Fig. 2. Examples of aspirated stops at different places of articulation.The spectrograms show frequencies from 0 to 8000 Hz.

Fig. 8 .
Fig. 8. Spectro-temporal fits of /t/ for each direction of the individual variables.

Fig. 14 .
Fig. 14.Spectro-temporal fits of /p/ for each direction of the individual variables.

Table 1
Number of aspirated stops included in the study, along with mean VOT values.the first signs of periodicity, which leads to higher overall values, particularly for /k/.Fischer-Jørgensen and Hutters than

Table 2
Summary of /t/ model.

Table 3
Summary of /k/ model.