Pitch changes as the marker of a phrase edge in standard Lithuanian

Abstract


Introduction
The sets of phrases used in the two parts of the study differed, but they were both drawn from the 36-hour phoneme-level annotated speech corpus (created as part of the Lithuanian Research Council-funded project "Complex study on prosody in Lithuanian text: intonation, rhythm, and narrow focus," 2012-2014, contract No. LIT-5-4) to which the authors have access.
For the first part of the study, the material consists of 1458 phrases taken from the recordings of 9 speakers (4 men and 5 women). The recordings containing the phrases were chosen in random order. The phrases come from both spontaneous (TV talk shows and private conversations) and read (news) speech recordings.
For the second part of the study, the material consists of 8274 phrases taken from the annotated speech corpus. It covers recordings of various speakers and speech types: read and spontaneous speech, radio plays, monologues and dialogues, and public and private speech.
This study focuses on analysing the pitch changes at the beginning and end of both intonational and intermediate phrases. The study consists of two parts, and therefore the aim is twofold:
1 The first part of the study is intended to be a pilot study and aims to analyse whether the F0 difference is a reliable measure to determine the F0 rise or fall at the end of declarative phrases (both intermediate and intonational) in standard Lithuanian and whether the F0 difference depends on the speaker's sex.
2 The second part takes into account the results of the pilot study and provides a speech corpus analysis on the tendencies of pitch changes at both the beginning and end of intonational and intermediate phrases in standard Lithuanian.
Such an analysis is important not only for fundamental research but for applied research as well. Examining the phonetic features that signal the boundaries of prosodic units is relevant for speech synthesis and recognition. Correct phrasing is also essential for learning public speaking.

Material and Methods
Nusi4prausė <p> po_stipria4 / du4šo <lg>srove4</lg>.//
<imitation type="Montegas"> Šian3dien da9rbas <lg>bai3gtas</lg>.//
After the revision of the time-aligned textual annotations, the automatic re-alignment of the recordings (i.e., the alignment with the text according to five different levels: intonational phrases, intermediate phrases, prosodic words, words, and syllables) was conducted. Both the re-alignment and the initial alignment were conducted according to the methodology of Vytautas Magnus University researchers (Kazlauskienė et al., 2010). First, the textual content of the corpus was converted into phoneme sequences. Using the Hidden Markov Models (HMM) methodology (Baum & Petrie, 1966), the phoneme sequences were aligned to the audio recordings. Phonemes were modelled by 3-5 state HMMs. HMM parameters were estimated from the speech corpus by the Baum-Welch algorithm. The start and end times of individual phonemes were determined by the Viterbi algorithm (Forney, 1973). The HTK software package (Young et al., 2001) was used at this stage, and special scripts were developed to make the annotation process automatic. Information about the start and end times of individual phonemes made the inference about the start and end times of higher-level lexical units straightforward.
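The last step, deriving the boundaries of higher-level units from phoneme times, can be sketched as follows. This is only an illustration under stated assumptions, not the project's scripts: the `Phoneme` record, its `word` field, the `unit_spans` helper, and all timing values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phoneme:
    label: str
    start: float  # start time in seconds
    end: float    # end time in seconds
    word: int     # hypothetical field: index of the word the phoneme belongs to


def unit_spans(phonemes, key):
    """Return {unit_id: (start, end)} spanning all phonemes of each unit."""
    spans = {}
    for p in phonemes:
        unit = key(p)
        if unit in spans:
            start, end = spans[unit]
            spans[unit] = (min(start, p.start), max(end, p.end))
        else:
            spans[unit] = (p.start, p.end)
    return spans


# Hypothetical aligner output for two words separated by a short pause.
aligned = [
    Phoneme("d", 0.00, 0.06, 0),
    Phoneme("a", 0.06, 0.15, 0),
    Phoneme("r", 0.15, 0.21, 0),
    Phoneme("b", 0.30, 0.36, 1),
    Phoneme("a", 0.36, 0.48, 1),
]
word_spans = unit_spans(aligned, key=lambda p: p.word)
```

The same helper would apply to syllables, prosodic words, or phrases by passing a different `key`, provided each phoneme carries the corresponding unit index.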
After the automatic re-alignment, different procedures were applied to the research materials for the first and second parts of the study, and a description of them is provided in the sections below.

Procedure
After the revision of the annotated recordings, in the material of the first part of the study, the pitch accents related to the last stressed syllable in a phrase and the boundary tones related to the end of the same phrase were marked following the methodology of autosegmental-metrical phonology (which started with the work of Pierrehumbert, 1980) and the ToBI tone annotation system (Jun, 2005; 2014; Frota & Prieto, 2015; for the adaptation of ToBI to Lithuanian, see Sabonytė, 2022). Different tone levels were marked (see Fig. 1), taking into account each speaker's F0 range and span in the recordings. L marks low tone, H marks high tone, LH marks rising tone, and HL marks falling tone. The % sign marks the beginning (e.g., %H, %L, %LH) and the end (e.g., H%, L%, HL%) of an intonational phrase; the hyphen (-) marks the beginning (e.g., -H, -L, -LH) and the end (e.g., H-, L-, HL-) of an intermediate phrase; and the asterisk (*) marks the pitch accent related to a stressed syllable.
The pitch accents and boundary tones made it possible to observe whether the tone level at the last stressed syllable differed from the level at the end of each intermediate phrase (which sometimes coincided with the end of an intonational phrase).
According to the marked tone levels, the phrases were grouped into three different clusters:
1 The phrases with rising F0 at the end;
2 The phrases with falling F0 at the end;
3 The phrases in which the F0 level did not change at the end.
Out of the 1458 phrases examined, there were 1327 phrases in which the levels of F0 did not change. Thus, it was interpreted that in these phrases, some other phonetic features (such as pauses, intensity changes, lengthening of sounds, etc.) rather than F0 are the markers of the phrase end (the possibility of a phrase boundary being marked by some kind of F0 change at the beginning of a subsequent phrase is not rejected in this pilot study but rather left for future research). Therefore, in this pilot study, these phrases were discarded from further analysis. In the next stage of this analysis, 45 phrases with a rising F0 and 86 phrases with a falling F0 were examined. As could be hypothesized from the phenomenon of F0 declination, the number of phrases with falling F0 was nearly double the number of phrases with rising F0 at the end.
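The grouping can be sketched as a comparison of two tone levels per phrase. The numeric encoding of tone levels and the `classify_phrase_end` helper are assumptions made for illustration; only the three counts are taken from the text.

```python
def classify_phrase_end(accent_level, boundary_level):
    """Compare the level at the last stressed syllable with the level at the phrase end.

    Levels are assumed to be numbers where a larger value means a higher tone.
    """
    if boundary_level > accent_level:
        return "rising"
    if boundary_level < accent_level:
        return "falling"
    return "unchanged"


# Counts reported in the text for the pilot material.
counts = {"rising": 45, "falling": 86, "unchanged": 1327}
total = sum(counts.values())
# Share of each cluster in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

With these counts, roughly nine out of ten phrases fall into the unchanged cluster, which is why only a small subset remains for the F0 comparison.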
Rising F0 at the end of the phrase was found both in read and spontaneous speech recordings. However, its relation to grammar and semantics differs between the two types of recordings. In the read-speech recordings, rising F0 was almost always related to the beginning of a new sentence. The first intermediate phrase ended with a rising F0 in the following cases:
a After an adverbial or an object at the start of a sentence, especially if the adverbial or object is a compound of several words: in this case a physiological pause is needed, and the listener perceives it as the boundary of an intermediate phrase. This could also be influenced by the lack of experience of those who wrote the texts for reading, as they might not have realised that a written text and the same text in recorded form are perceived differently; the word order of texts prepared for reading should therefore be well thought out in terms of phrasing;
b Between the subject and verb groups, if they were compounds of several words;
c After a parenthetical expression;
d In complex sentences, before an object subordinate clause.
In spontaneous speech, such phrases with a rising F0 at the end were found before pausing to think about what to say next, as well as after parenthetical words.
Both groups of phrases with rising and falling F0 were analysed as follows:
1 Two types of F0 data were extracted using Praat (Boersma & Weenink, 2019, version 6.0.56): the F0 mean of the last stressed syllable in a phrase and the absolute F0 measure at the end of the phrase.
2 The extracted data were divided into male and female speakers' groups. There were 19 male and 27 female speakers' phrases with a rising F0 and 45 male and 41 female speakers' phrases with a falling F0.
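The two measurements in step 1 can be illustrated on an F0 track that has already been exported as (time, F0) pairs. This sketch does not use Praat or its scripting interface; the track format, the helper names, and the reading of "F0 at the end of the phrase" as the last voiced frame are all assumptions.

```python
from statistics import mean


def mean_f0(track, start, end):
    """Mean F0 (Hz) over voiced frames whose time lies within [start, end]."""
    voiced = [f0 for t, f0 in track if start <= t <= end and f0 is not None]
    return mean(voiced) if voiced else None


def final_f0(track, phrase_end):
    """F0 (Hz) of the last voiced frame at or before the phrase end."""
    voiced = [(t, f0) for t, f0 in track if t <= phrase_end and f0 is not None]
    return max(voiced, key=lambda pair: pair[0])[1] if voiced else None


# Hypothetical track: None marks unvoiced frames.
track = [(0.50, 140.0), (0.51, 142.0), (0.52, 144.0),
         (0.60, None), (0.70, 180.0), (0.71, None)]
stressed_mean = mean_f0(track, 0.50, 0.52)  # last stressed syllable interval
end_value = final_f0(track, 0.72)           # phrase ends at 0.72 s
```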

Rising F0 at the End of a Phrase
In this part of the pilot study, 19 male speakers' phrases and 27 female speakers' phrases were analysed. For the male phrases, the minimum F0 at the end of a phrase was 134 Hz, the maximum was 215 Hz, and the average was 178.3 Hz. The minimum F0 mean of the last stressed syllable was 84 Hz, the maximum was 176 Hz, and the average was 139.4 Hz.
For the female phrases, the minimum F0 at the end of a phrase was 229 Hz, the maximum was 357 Hz, and the average was 281.3 Hz. The minimum F0 mean of the last stressed syllable was 160 Hz, the maximum was 272 Hz, and the average was 231.6 Hz.
The difference between F0 at the end of a phrase and the F0 mean of the last stressed syllable was statistically significant for both male (T < 0.00001, pairing was statistically significant, T = 0.00042) and female (T < 0.00001, although pairing was not statistically significant, T = 0.13890) speakers' phrases (see Fig. 2). For male phrases, the minimum difference was 12 Hz, the maximum difference was 76 Hz, and the average difference was 38.9 Hz. In the female speakers' recordings, the variety among the differences was greater than for male speakers: the minimum difference was 7 Hz, the maximum difference was 125 Hz, and the average difference was 49.7 Hz.
3 In order to determine the difference between F0 of the last stressed syllable and F0 at the end of the phrase, the subtraction was performed as follows:
a For the rising F0 at the end of a phrase: F0 difference = F0 at the end of a phrase − F0 mean of the last stressed syllable
b For the falling F0 at the end of a phrase: F0 difference = F0 mean of the last stressed syllable − F0 at the end of a phrase
4 After determining the F0 differences, the F0 data for both rising and falling F0 phrases were compared as follows:
a F0 mean of the last stressed syllable compared with F0 at the end of phrases, separately for male and female speakers' phrases;
b Male F0 differences compared with female F0 differences.
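The two subtraction rules in step 3 amount to one small function. The name `f0_difference` and its argument names are illustrative; the example values are the male averages reported for rising phrases.

```python
def f0_difference(stressed_mean, phrase_end, contour):
    """F0 difference in Hz, signed so that the expected direction is positive."""
    if contour == "rising":
        return phrase_end - stressed_mean
    if contour == "falling":
        return stressed_mean - phrase_end
    raise ValueError(f"unknown contour: {contour!r}")


# Male averages for rising phrases: 139.4 Hz (stressed syllable), 178.3 Hz (phrase end).
male_rising = round(f0_difference(139.4, 178.3, "rising"), 1)
```

Because the difference is always taken in the expected direction, a negative value signals a phrase whose measured F0 moves against its annotated contour.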
Paired and unpaired t-tests for the different group comparisons were performed, and graphs were produced using GraphPad Prism 8.
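For orientation, the test statistics behind these two comparisons can be written with the standard library alone. This is a sketch, not a reproduction of the GraphPad Prism analysis: the data are hypothetical, the unpaired version assumes equal variances (whether a correction was applied in the study is not stated), and p-values are omitted because they require the t distribution.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(x, y):
    """Paired t statistic: mean of the pairwise differences over its standard error."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def unpaired_t(x, y):
    """Student's t statistic for two independent samples with a pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))


# Hypothetical values in Hz, four phrases per group.
phrase_end = [150, 160, 170, 180]
stressed = [120, 135, 140, 145]
t_paired = paired_t(phrase_end, stressed)        # comparison 4a

male_diffs = [30, 25, 30, 35]
female_diffs = [40, 20, 45, 35]
t_unpaired = unpaired_t(male_diffs, female_diffs)  # comparison 4b
```

The paired statistic has n − 1 degrees of freedom and the pooled unpaired statistic has n_x + n_y − 2.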

Results
Fig. 2 Rising F0 at the end of a phrase: comparison of F0 at the end of a phrase and F0 mean of the last stressed syllable in male and female speakers' data
The variety among the F0 differences between F0 at the end of a phrase and the F0 mean of the last stressed syllable was greater in female speakers' phrases. However, after comparing the same F0 differences (as absolute measures) between the male and female groups (see Fig. 3), it was determined that despite the natural distinction in voice pitch, the groups of men and women were not significantly distinct (T = 0.25).
This leads to the conclusion that in the case of rising F0 at the end of a phrase, the difference between F0 at the end of a phrase and the F0 mean of the last stressed syllable is a reliable measure for determining the F0 rise, and it does not depend on whether the speaker is a man or a woman.

Falling F0 at the End of a Phrase
In this part of the study, 45 male speakers' phrases and 41 female speakers' phrases were analysed. For the male phrases, the minimal F0 mean of the last stressed syllable was 108 Hz, the maximum F0 mean of the last stressed syllable was 73 Hz, and the average was 154.8 Hz. The minimum F0 at the end of a phrase was 76 Hz, the maximum was 186 Hz, and the average was 113.3 Hz.
For the female phrases, the minimum F0 mean of the last stressed syllable was 181 Hz, the maximum was 347 Hz, and the average was 261 Hz. The minimum F0 at the end of a phrase was 147 Hz, the maximum was 291 Hz, and the average was 215.3 Hz.
The difference between the F0 mean of the last stressed syllable and F0 at the end of a phrase was statistically significant for both male (T < 0.00001, pairing was statistically significant, T = 0.00094) and female (T < 0.00001, although pairing was not statistically significant, T = 0.3) speakers' phrases (see Fig. 4). For male phrases, the minimum difference was −13 Hz, the maximum difference was 136 Hz, and the average difference was 41.5 Hz. In the female speakers' recordings, the variety among the differences was greater than for male speakers: the minimum difference was −69 Hz, the maximum difference was 190 Hz, and the average difference was 45.8 Hz. A few of the differences were negative (only in 1 phrase of the male speakers' data and in 3 phrases of the female speakers' data) because the F0 mean of the last stressed syllable was sometimes lower than the F0 measure at the end of the phrase, but these exceptions do not seem to significantly influence the results of the comparison.
Fig. 4 Falling F0 at the end of a phrase: comparison of the F0 mean of the last stressed syllable and F0 at the end of a phrase in male and female speakers' data

Fig. 3 F0 differences in male and female speakers' data with a rising F0 at the end of a phrase
The variety among the F0 differences between F0 at the end of a phrase and the F0 mean of the last stressed syllable was greater in female speakers' phrases. However, after comparing the F0 differences between the groups of men and women (see Fig. 5), it was determined that, as in the case of rising F0, the groups of men and women were not significantly distinct (T = 0.66).
This leads to the conclusion that when F0 is falling at the end of a phrase, the difference between the F0 mean of the last stressed syllable and F0 at the end of a phrase is a reliable measure for determining the F0 fall, and it does not depend on whether the speaker is a man or a woman.
Fig. 5 F0 differences in male and female speakers' data with falling F0 at the end of a phrase

Procedure
For this part of the study, phrase edge tones were annotated in 8274 phrases from the speech corpus. Tone levels were determined individually for each speaker based on the F0 data (range and span) and perception. The edge tones were marked based on the pitch change from the beginning of the phrase to the first stressed syllable and from the last stressed syllable to the end of the phrase. The tone labels used for edge tones were the same as in the first part of the analysis and were marked in the upper TextGrid tier. The labels of the two edge tones were separated by the # sign in the annotations (see Figs. 6-11). Flat tones are indicated at the beginning of an intonational or intermediate phrase (%L / %H; -L / -H) if the pitch remains unchanged from the beginning of the phrase to the end of the first stressed syllable (see Fig. 6). A flat tone at the end of an intonational or intermediate phrase (L% / H%; L- / H-) indicates that the pitch level remains the same from the last stressed syllable to the end of the phrase (see Fig. 6).
A rising tone at the beginning of an intonational or intermediate phrase (see Fig. 7) indicates a pitch rise from the beginning of the phrase to the end of the first stressed syllable (%LH; -LH), while at the end of the phrase (see Fig. 8) it indicates a pitch rise from the last stressed syllable to the end of the phrase (LH%; LH-).
A falling tone at the beginning of an intonational or intermediate phrase (see Fig. 9) indicates a pitch fall from the beginning of the phrase to the end of the first stressed syllable (%HL; -HL), and at the end of the phrase (see Fig. 10) indicates a pitch fall from the last stressed syllable to the end of the phrase (HL%; HL-).
It is important to note that when the beginning or end of an intermediate phrase coincides with the beginning or end of an intonational phrase, the edge tones of the two types of phrases are also the same, the only difference being the edge sign written together with the tone (% or -). In the material under study, only the edge tone of the intonational phrase (%) was marked in such cases (see Fig. 11).
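The label scheme described above is regular enough to be decoded mechanically. The following sketch assumes that a tier label is two edge tone labels joined by `#` (for example `%LH#L%`); that exact string layout, and the names `parse_edge_tone` and `parse_tier_label`, are assumptions rather than the corpus format.

```python
SHAPES = {"L": "flat low", "H": "flat high", "LH": "rising", "HL": "falling"}
PHRASE_TYPES = {"%": "intonational", "-": "intermediate"}


def parse_edge_tone(label):
    """Split one edge tone label into (phrase type, edge, tone shape)."""
    label = label.strip()
    if len(label) < 2:
        raise ValueError(f"not an edge tone label: {label!r}")
    if label[0] in PHRASE_TYPES:
        marker, tone, edge = label[0], label[1:], "start"
    elif label[-1] in PHRASE_TYPES:
        marker, tone, edge = label[-1], label[:-1], "end"
    else:
        raise ValueError(f"no edge sign in label: {label!r}")
    if tone not in SHAPES:
        raise ValueError(f"unknown tone in label: {label!r}")
    return PHRASE_TYPES[marker], edge, SHAPES[tone]


def parse_tier_label(text):
    """Parse a '#'-separated pair of edge tone labels (assumed layout)."""
    return [parse_edge_tone(part) for part in text.split("#")]


example = parse_tier_label("%LH#L%")
```

A parser of this kind also gives a cheap consistency check: any label that raises `ValueError` is a candidate annotation error.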
After the edge tones were annotated, the composition of phrases was analysed.The analysis showed that 2445 of 8274, i.e., 30% of the phrases, consisted of only one word.They were not suitable for the analysis of edge tone frequency because the stressed syllable in such a case is both the first and last stressed syllable of the phrase; therefore, one-word phrases were discarded.It was expected that general patterns would emerge from phrases that consist of more than one word.After discarding the one-word phrases, the research material consisted of 1962 intonational phrases and 3867 intermediate phrases.
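The filtering step and the reported counts can be checked with a few lines. Counting words by whitespace is an assumption about how a phrase is stored, and the one-word example phrase is hypothetical.

```python
def multiword(phrases):
    """Keep only phrases that contain more than one whitespace-separated word."""
    return [p for p in phrases if len(p.split()) > 1]


# Counts reported in the text.
total_phrases = 8274
one_word_phrases = 2445
one_word_share = round(100 * one_word_phrases / total_phrases)  # percent, rounded
remaining = total_phrases - one_word_phrases

kept = multiword(["arba jį šalina", "taip"])  # "taip" is a hypothetical one-word phrase
```

The remaining count equals the sum of the 1962 intonational and 3867 intermediate phrases that make up the research material.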
Validation of the phrase edge tone labels was then conducted, i.e., the speech corpus material was automatically reviewed and a list of different labels of each type was collected.The inaccuracies found in this list were corrected manually in the text annotations.
For each label type, a list of intermediate phrases in which that label was found was then compiled. Each list was divided into two groups: intermediate phrases (a) at the end of an intonational phrase and (b) in the middle of an intonational phrase. The phrases (recordings and annotations) of the two groups were merged into three files in a sequential manner to facilitate the review and analysis of these groups. Software developed specifically for this purpose was used for grouping and merging. The groups of phrases were manually reviewed, the interface between labels and phrase boundaries was refined, and the frequency of labels at phrase edges was calculated.
Fig. 9 Falling tone at the beginning of a phrase. Phrase "arba jį šalina" [ɐrbɐ_²ˈjiː ²ˈʃɑːlʲɪnɐ] ('either they expel him')
The phrase frequency data (see Table 1) show that intonational phrases most often start (58%) and end (75%) with a low tone (%L/L%). The second most frequent tone on the left edge (28%) is a rising tone (%LH), and the second most frequent tone on the right edge (19%) is a falling tone (HL%). The high tone (%H) at the beginning of intonational phrases is rare (11%), and the falling tone (%HL) is even less frequent (3%). At the end of intonational phrases, both the high (H%) and the rising (LH%) tones are rare (3% each).
The edge tones of intermediate phrases show tendencies quite similar to those of intonational phrases. Intermediate phrases usually start (70%) and end (63%) with a low tone (-L/L-). The second most frequent tone on the left edge is rising (-LH), and on the right edge, falling (HL-). Both account for one-fifth of the samples in their group. A high (-H) or falling (-HL) tone at the beginning of intermediate phrases is rare (both together accounting for 9%). High (H-) and rising (LH-) tones at the end of intermediate phrases are more frequent than in intonational phrases (both together accounting for 17%).
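The intonational-phrase percentages reported above can be combined into the shares of static and dynamic edge tones. The sketch takes the two 3% right-edge tones to be the high and rising ones, which is an interpretation based on the low (75%) and falling (19%) shares already being accounted for; the dictionary layout and the `share` helper are illustrative.

```python
# Edge tone shares of intonational phrases in percent, as reported in the text.
start = {"L": 58, "LH": 28, "H": 11, "HL": 3}
end = {"L": 75, "HL": 19, "H": 3, "LH": 3}  # H and LH at 3% each: interpretation


def share(distribution, tones):
    """Sum the percentage shares of the given tones."""
    return sum(distribution[tone] for tone in tones)


low_or_rising_start = share(start, ("L", "LH"))
low_or_falling_end = share(end, ("L", "HL"))
# Dynamic tones (rising or falling) are the ones that show a pitch movement.
dynamic_start = share(start, ("LH", "HL"))
dynamic_end = share(end, ("LH", "HL"))
```

The same tally cannot be completed for intermediate phrases from the text alone, because the high and falling left-edge tones and the high and rising right-edge tones are only reported as combined shares.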
To sum up, the analysis of phrase edge tones in the annotated speech corpus showed that intonational and intermediate phrases most often (86% and 90%, respectively) started with a low or rising tone and ended (94% and 83%, respectively) with a low or falling tone. A low or high tone at the beginning and end of a phrase is hardly a reliable indicator of phrase juncture if the process of falling or rising is not visible. Thus, flat edge tones (especially low L) are static and do not in themselves signal the end of a phrase. The phrase juncture can be identified by a rising or falling tone at the beginning or end of a phrase. Such a tone occurred at the beginning of less than a third (31%) of the intonational phrases and a quarter (24%) of the intermediate phrases, and at the end of about a fifth (22%) of the intonational phrases and almost a third (30%) of the intermediate phrases. This means that in standard Lithuanian, only about a quarter of the phrase junctures can be identified on the basis of a change in pitch.

Table 1 Distribution of edge tones at the beginning and end of intonational and intermediate phrases (%)

Conclusions
The results of the pilot study showed that the mean F0 of the last stressed syllable in both male and female phrases with a rising F0 is significantly lower than the F0 at the end of a phrase. In phrases with a falling F0, on the other hand, in both male and female samples, the mean F0 of the last stressed syllable is significantly higher than the F0 at the end of a phrase. In both the falling and rising tone phrases, the differences in the F0 of the last stressed syllable and the end of the phrase between the male and female groups are not statistically significant. Hence, the difference is a reliable indicator for identifying the rising or falling tone at the end of a phrase and is independent of the speaker's sex. This conclusion allowed for the analysis of the samples in an annotated speech corpus without differentiating male and female speakers' samples.
The annotated speech corpus analysis revealed that both intonational and intermediate phrases usually start and end with a low tone. Hence, the high and rising tones at the beginning or end of a phrase are the marked members of the phrase edge tone system, and they can help to identify phrase junctures. However, the analysis showed that in standard Lithuanian, only about a quarter of the phrase junctures can be identified based on a pitch change. Thus, in seeking to identify phrase junctures, it is important to pay attention not only to changes in pitch but also to other acoustic features such as pauses and intensity, as well as grammatical and semantic aspects of phrase composition.

Acknowledgement
This research was funded by a grant (No. S-LIP-21-5) from the Research Council of Lithuania.

Fig. 1 Tone annotations. Phrase "Kartojame dienos klausimą." [kɐr¹ˈtoːjɛmʲɛ dʲiɛ²ˈnoːs ¹ˈklɑˑʊsʲɪmɑː] ('We are repeating the question of the day.')
