Identifying Speech Quality Dimensions in a Telephone Conversation

Speech telecommunication services are traditionally used for communication between two interlocutors interacting in a conversation. Thus, the quality of transmitted speech in a conversational situation, as perceived by the end-users, is an important indicator for service providers to evaluate their systems. In this context, it is not enough to provide information about the overall quality alone; the reasons and sources for quality losses must also be indicated. In this article, we present an approach towards analyzing speech quality in a conversational situation by dividing a conversation into three separate phases and identifying the corresponding quality-relevant perceptual dimensions, as perceived by the system users. The identified dimensions can be combined for the overall quality assessment and may separately be used to diagnose the technical reasons for quality degradations. For this, four separate subjective experiments are conducted to uncover the underlying dimensions in each conversational phase. The resulting quality-profile, consisting of seven perceptual dimensions, is then validated in an extensive conversational experiment triggering all three phases of a conversation using a newly proposed test-paradigm. This allows an in-depth analysis of conversational speech quality for diagnosis and optimization of telecommunication systems and provides the fundamentals for instrumental diagnostic conversational speech quality measures.


Introduction
Vocal human-to-human communication is the main purpose for using speech telephony services. Technological development within traditional and modern packet-based (Voice-over-IP) telephony networks can affect, and possibly also impair, the transmitted speech signal. The network and terminal device elements that are responsible for this (referred to as quality elements [1]) are codecs, bandwidth limitation (narrowband (300-3400 Hz) and wideband (50-7000 Hz)), linear and non-linear filters, delay, packet loss, echo, and noise [2].
It is therefore of high priority for telecommunication providers to find out how end-users perceive and experience degradations. For this, assessing the quality of transmitted speech over telecommunication systems allows the providers to improve their services and counter possible issues. In this context, the quality of transmitted speech is also related to the so-called Quality of Experience (QoE), which "is the degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user's personality and current state" [3].
In telephony services, passive subjective experiments with human participants in a laboratory context (so-called listening-only tests, LOTs) are a common means to study and understand QoE. In these experiments, overall (or integral) quality ratings on five-point Absolute Category Rating (ACR) scales are gathered [4]. The experiments yield a Mean Opinion Score (MOS) [5], representing the average integral quality rating of an average person.
Since subjective experiments are time and money consuming, the demand of telecommunication service providers for instrumental models to predict the overall quality of transmitted speech, as gathered in LOTs, rose. Research led to the development of multiple types and approaches (parametric and signal-based) for instrumental models [6]. Nevertheless, as described in [7], the aforementioned LOTs and the instrumental models have two main limitations:
• Integral quality: Only the integral quality is taken into account; reasons for underlying sub-optimum quality are not uncovered.
• Non-interactive settings: The methods refer to the passive listening situation, but conversational and interactive aspects are not considered.
The first limitation (integral quality) points out that two dissimilar speech samples impaired by different degradations, for example one by a bandwidth limitation and one by background noise, can be rated with the same low MOS value. Having only the MOS value at hand, system providers cannot identify the reason for a possible quality loss, and therefore do not know how to improve their services. In LOTs, experimenters can of course directly ask for specific degradations, but in that case they have to be certain about the presence of these degradations beforehand. Thus, traditional methods do not provide diagnostic information. To counter this problem, new subjective [8] as well as new instrumental [9] diagnostic methods have been developed. They identify and assess quality-relevant perceptual dimensions to obtain diagnostic information.
The definition and the underlying idea of perceptual quality dimensions is the following: The output of a transmission system, a speech signal possibly degraded by the aforementioned quality elements, is perceived by the system user as a composition of explicit features that are orthogonal (and thus independent) and represent recognizable and nameable characteristics of the speech sound [1,2]. These features are perceptual dimensions in a multidimensional perceptual space. When the user judges quality, s/he makes use of these perceptual dimensions to determine a perceptual difference to an optimum, degradation-free situation. Overall quality can thus be determined on the basis of perceptual features. In turn, the features allow identifying reasons for quality losses. For example, two speech samples showing the same integral quality rating may exhibit different perceptual dimension judgments that are connected to specific quality elements.
In [10], Wältermann identified four perceptual quality dimensions for narrowband and wideband speech transmission in a listening-only situation. The benefit of quality dimensions is also pointed out by recent development within the International Telecommunication Union (ITU). The recently started work item Perceptual Approaches for Multi-Dimensional Analysis (P.AMD) [11] aims at developing a signal-based quality predictor that provides diagnostic information on the basis of the perceptual quality dimensions identified by Wältermann.
The second limitation (non-interactive settings) reveals that the aforementioned traditional methods only consider the unrealistic passive listening-only situation. Quality elements that affect the interaction or the speaking (for example echo or delay) cannot be determined in LOTs. To fill this gap, conversational tests [5,12] and speaking tests [13] have been designed.
Feasible solutions to both limitations have only been developed separately. This leads to the trade-off for an experimenter to either extract diagnostic information or to address different conversational phases in an experiment. This article presents one approach to address this trade-off by formulating and answering the following question: What are the quality-relevant perceptual dimensions that an interactive conversational situation is composed of? To answer this question, we follow the approach of combining the advantages of both solutions. More specifically, we identify quality-relevant perceptual dimensions in each conversational phase, namely in the Listening, the Speaking, and the Interaction Phase. Thus, this article has four main contributions:
• Multidimensional analysis of a conversation: The results of four experiments yielding the perceptual dimensions in the Speaking and the Interaction Phase.
• A new quality-profile for conversational speech quality: Together with the work conducted by Wältermann [10], the multidimensional analysis reveals seven perceptual dimensions underlying the conversational speech quality.
• A new conversational test-paradigm: For the direct quantification of the perceptual dimensions and for the validation of the quality-profile, a new subjective conversational test-paradigm that separately addresses each phase of a conversation is established.
• Validation of the proposed quality-profile: Together with the new test-paradigm, the quality-profile is validated in a final conversational experiment.
The new quality-profile and the work presented allow assessing and diagnosing conversational speech quality in future work. In addition, this work is the direct follow-up of the studies conducted by Wältermann [10] and serves as a fundamental framework for developing diagnostic instrumental models to predict the quality of transmitted speech in a conversational situation, as demanded in the current ITU-T work item Objective Conversational Voice Quality Assessment Model (P.CQO) [14].
The rest of the paper is organized as follows. In Section 2, a review of speech quality in a conversational situation is given. The perceptual dimensions are identified by auditory experiments with subsequent multidimensional analyses. Section 3 gives an understanding of the paradigms used for the experiments. The actual experiments conducted to uncover the underlying perceptual dimensions in a conversation and their results are presented in Section 4. The identified dimensions are validated in a separate extensive conversation experiment using the new test-paradigm. The results and a discussion are presented in Section 5. Conclusions are drawn and an outlook towards future work is given in Section 6.

Speech Quality in a Conversational Situation
To provide information about a complete conversational situation (see limitation two (non-interactive settings) in Section 1), a typical conversation has to be investigated with respect to all situations that can occur. In a conversation, the interlocutors alternately adopt the roles of listener and talker, which introduces interaction between the participants. In [15] and [16], a conversational process is described as a four-state model: while having a conversation, the participants either listen to what is said (01) or speak (10) while exchanging information. Additionally, the participants can also both speak (11) or remain silent (00) at the same time.
Table I. Overview of the so far identified perceptual quality dimensions in a conversational situation (see [10]).
According to [17], this leads to three phases of a conversation: the Speaking Phase (10), the Listening Phase (01), and the Interaction Phase describing the alternation of the states (10) and (01). The frequency of changes describes the degree of interaction and, as a side-effect, the states (00) and (11) can occur. The three phases, as perceived by one participant, are illustrated in a state diagram in Figure 1.
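The four-state description can be illustrated with a minimal Python sketch. It assumes that per-frame voice-activity flags are available for both interlocutors; the function names and the example flags are hypothetical and not taken from [15], [16], or [17]. The first digit of each code marks whether the participant under consideration speaks, the second whether the interlocutor speaks.

```python
from collections import Counter

def conversation_states(own_active, other_active):
    """Label each frame with a two-digit state code from one participant's view.

    First digit: this participant speaks; second digit: the interlocutor speaks.
    """
    return [f"{int(bool(own))}{int(bool(other))}"
            for own, other in zip(own_active, other_active)]

def state_shares(states):
    """Relative share of each of the four states (00, 01, 10, 11)."""
    counts = Counter(states)
    return {code: counts[code] / len(states) for code in ("00", "01", "10", "11")}

def count_alternations(states):
    """Count changes between speaking (10) and listening (01).

    Frames in state 00 or 11 are skipped, so a change is counted even if
    mutual silence or double talk lies between the two states.
    """
    changes, last = 0, None
    for state in states:
        if state in ("10", "01"):
            if last is not None and state != last:
                changes += 1
            last = state
    return changes

# Hypothetical voice-activity flags per frame (1 = speech present).
own = [1, 1, 0, 0, 0, 1, 1, 0]
other = [0, 0, 0, 1, 1, 1, 0, 0]
states = conversation_states(own, other)
print(states, state_shares(states), count_alternations(states))
```

In this sketch, the number of alternations per unit of time would be one simple way to express the degree of interaction; it is not claimed to match any standardized measure.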
Thus, from a speech-quality point of view, a conversation is affected by the quality elements encountered in the Listening Phase (codecs or filters), in the Speaking Phase (echo or sidetone), and those affecting the interactivity of the conversation in the Interaction Phase (delay, double talk and mutual silence impaired by signal processing in the devices or the network) [15,17]. In the following, the three phases also describe the possible situations of the user during a conversation. To obtain diagnostic information on conversational speech quality (addressing the trade-off described in Section 1), the three phases will be analyzed in detail in the following subsections.

Listening Phase
For the Listening Phase, subjective and instrumental methods to assess the listening quality are standardized and recommended by the ITU [5,18]. To obtain diagnostic information (see the first limitation (integral quality) in Section 1) in the Listening Phase, the approach of identifying perceptual dimensions related to impairments, such as degraded colouration or noisiness, is used.
As mentioned in Section 1, perceptual dimensions are features of the multidimensional space formed by a perceptual event inside a listener. Typically, two methodologies, which are described in Section 3, are used for the identification of perceptual dimensions. Using both methodologies, Wältermann uncovered four perceptual dimensions for the Listening Phase [10]: colouration, noisiness, discontinuity, and sub-optimal loudness. Each of these perceptual dimensions can directly be connected to the aforementioned quality elements (see Table I). Also, Wältermann showed that it is possible to directly quantify the identified perceptual dimensions in a subjective test [8].
The proposed subjective method is similar to what is recommended for noisy speech signals in [19] and is the origin of the current ITU-T work item P.AMD.
Other proposals, e.g. by Sen, who used the Diagnostic Acceptability Measure (DAM [20]), identified four to seven dimensions, which are subdimensions (and thus not orthogonal) of the dimensions identified by Wältermann [21].
In sum, the multidimensional space inside a listener in the Listening Phase is composed of four perceptual dimensions (colouration, noisiness, discontinuity and sub-optimal loudness) that lead to the development of diagnostic subjective and instrumental methods.

Speaking Phase
Technically, the Speaking Phase is usually distorted by degradations due to talker-echoes and sidetones (see [22] and [2]). Thus, separate Talking and Listening Tests [13] are conducted to assess the quality in the presence of those distortions. In these subjective tests, participants are asked to speak into a transmission system and afterwards rate how the system affects their own speaking [12]. However, simultaneously speaking and listening can cause considerable fatigue in test participants. Therefore, so-called 3rd-Party-Listening-Tests have been developed. In these tests, the spoken voice and the back-coupled version of a participant's own voice are recorded, and afterwards both are rated by a third person [13]. However, these methods only determine an integral quality value, without diagnostic information.

Interaction Phase
The Interaction Phase covers not only the change from state (01) to (10) and the change from state (10) to (01), but also the states (00) and (11) (see Figure 1). Technically, increased amounts of the states (00) and (11) occur especially due to transmission delay [23], which is particularly noticeable as a shift of the usual rhythm of conversation, leading to passive interruptions (which occur when a speaker is interrupted by the delayed arrival of a counterpart's utterance) and active interruptions (which occur when a speaker starts to speak while still hearing the counterpart talking) [24].
For the subjective assessment of the interaction quality, conversational tests have to be conducted that, unlike for the Listening or Speaking Phase, require two participants. To simulate a natural conversation, the participants usually follow particular scenarios. For example, the ITU-T recommends so-called short-conversation tests (SCTs), in which the participants are asked to solve tasks in role-plays (ordering a plane ticket or pizza), or so-called interactive tests, in which the participants are asked to align numbers or addresses as fast as possible [25]. One of these interactive tests is the so-called Random Number Verification Test (RNVT), in which the participants are asked to alternately compare a predefined series of numbers (each participant has one series at hand, while one number is different) [2].
Following these guidelines, the participants are generally asked to rate the overall quality or the interruption effort. To analyze the interactivity in more depth, values like the Speaker Alternation Rate or the Conversational Temperature have been introduced [26]. However, these values focus more on the alternation and turn-taking of the speakers, and less on the diagnosis or the perceptual space of the interlocutors.

Summary
The approach of the presented work is to combine the advantages of considering all possible user situations in a conversation, and of diagnosing the quality of transmitted speech on the basis of perceptual dimensions. Table I gives an overview of the currently known perceptual dimensions in a conversational situation. As can be seen, except for the Listening Phase no perceptual dimensions have so far been identified. This leads to the research question already stated in Section 1: what perceptual dimensions is an interactive conversational situation composed of?
To answer this question, perceptual dimensions in the Speaking and the Interaction Phase have to be uncovered. The identification of the perceptual dimensions and the underlying experiments are presented in Sections 3 and 4.

Experimental paradigms to uncover perceptual dimensions in a telephone conversation
For each of the two remaining phases of a conversation (Speaking and Interaction Phase), two experiments with two different experimental paradigms were conducted. Both paradigms follow different approaches to transform data into a low-dimensional space, with particular advantages and drawbacks. In the field of audio research, the methods described have been part of numerous studies, see for example [27], [28], or [29], to name just a few. Because of that, we decided to use the same approaches for our studies.
Section 3.1 describes the method of Multidimensional Scaling (MDS) of dissimilarity or preference ratings gathered in a pairwise comparison experiment. The method of analyzing attribute ratings of a Semantic Differential (SD) experiment with a Principal Component Analysis (PCA) is introduced in Section 3.2.
Using and comparing both methods a) leads to a more distinct interpretation of the resulting dimensions and b) helps to verify the validity of the results. Thus, the two paradigms in combination provide a solid statement about the actual nature of the underlying dimensions for the phase under investigation.

Multidimensional Scaling
In general, MDS is used as a multivariate technique and is mainly applied to find the number of dimensions required to represent perceptual attributes of stimulus objects in a low-dimensional space [30].
The approach is to gather the dissimilarity between two stimuli presented as a pair. This results in a dissimilarity matrix for each participant. The MDS maps the (average) dissimilarities into distances. It is assumed, and it has been verified, that the psychological dissimilarities correspond to Euclidean distances (the higher the dissimilarity, the larger the distance) [8,30,31].
In the context of the presented work, we are interested in the quality of perceptual events happening either during speaking or during interaction. Thus, our stimuli are obtained in an active or interactive instead of a passive situation, and instead of asking the participants for a dissimilarity rating, we gathered preference values. The two different approaches of gathering dissimilarities and preferences have been analyzed and compared in different studies and experiments and revealed a high degree of correlation [32,27].
Since we are not interested in individual preferences but in group tendencies, we are looking for a multidimensional solution for an average person, and the preference ratings are averaged over the individuals, resulting in a single preference matrix.
However, the gathered preference data cannot be used in a classic MDS that uses dissimilarity data. Therefore, a so-called non-metric MDS, also called ordinal MDS, is applied [33]. While a classic MDS is metric, that is, the model represents various properties of the data related to algebraic operations, non-metric MDS represents only the ordinal properties of the data [30].
The preference matrix serves as input for the non-metric MDS, where the mapping is restricted to be a monotone function. ALSCAL is employed as a method for computing the non-metric MDS [34].
Following [30], to determine the resulting dimensionality, both statistical fit parameters and the ability to interpret the resulting dimensions are considered. One important statistical fit parameter is the so-called Stress. It is actually a badness-of-fit parameter specifying how badly the resulting distances match the given data. A reasonable dimensionality is found if the Stress value does not decrease significantly when the number of dimensions is increased further. Looking at a Scree plot (see for example Figure 6), ideally a sharp "elbow" marks the adequate dimensionality [30].
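As an illustration of such a badness-of-fit value, the following Python sketch computes Kruskal's Stress-1 for a given set of input ratings and the distances of a candidate configuration. It is only a sketch under stated assumptions: the monotone mapping of the ordinal MDS is approximated by a pool-adjacent-violators regression, the function names are hypothetical, and the fit measure reported by ALSCAL may be defined differently.

```python
import math

def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block holds [sum of values, number of values]
    for value in values:
        blocks.append([value, 1])
        # Merge neighbouring blocks while their means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def stress1(dissimilarities, distances):
    """Kruskal's Stress-1 for paired lists (one entry per stimulus pair)."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    d = [distances[i] for i in order]   # distances ordered by the input ratings
    d_hat = pava(d)                      # best monotone approximation (disparities)
    numerator = sum((x - y) ** 2 for x, y in zip(d, d_hat))
    denominator = sum(x ** 2 for x in d)
    return math.sqrt(numerator / denominator)

# Hypothetical ratings and distances for four stimulus pairs.
perfect = stress1([1, 2, 3, 4], [0.5, 1.0, 1.5, 2.0])   # order preserved
violated = stress1([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])  # one order violation
```

Computing this value for configurations with one, two, three, and more dimensions and plotting it against the number of dimensions would give the kind of Scree plot in which the "elbow" is sought.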
The MDS paradigm has the advantage that the task for participants is practicable. No complex instructions are required, and comparing two stimuli presented as a pair is straightforward. However, the interpretation of the resulting dimensionality and of the resulting dimensions is only possible on the basis of the known differences between the stimuli used. This may lead to intuitive and speculative interpretations. To arrive at a valid interpretation, the results of an MDS should be compared with those of other methods for reducing dimensionality.

Semantic Differential
In an SD experiment, a previously determined set of attributes is given to the participants in terms of bipolar scales. The extremities of each scale are labeled with a pair of opposite attributes, so-called antonym-pairs (APs) (for example loud vs. quiet), each describing a one-dimensional feature. The intensity of each feature within a given condition has to be judged by the test participants.
Using the Principal Component Analysis (PCA) on the average ratings of the participants, only the components with eigenvalues above one are kept. The columns of the resulting matrix are the principal components (PCs) and correspond to the coordinates of the points representing the APs in the dimension-reduced space. Finally, the result is rotated so that it satisfies the VARIMAX criterion [35]. As a consequence of the rotation, correlating scales are summarized by one axis, which leads to a simpler structure. Detailed information about the SD and the PCA can be found in [8] or [36], for example.
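The eigenvalue rule can be sketched in a few lines of Python. The sketch assumes that the eigenvalues of the correlation matrix of the AP ratings have already been computed; the function name and the eigenvalues in the example are purely hypothetical and are not results of the experiments described here.

```python
def keep_components(eigenvalues, threshold=1.0):
    """Apply the eigenvalue-above-one rule to a list of PCA eigenvalues.

    Returns the number of retained components and the share of the total
    variance they cover (sum of kept eigenvalues / sum of all eigenvalues).
    """
    ordered = sorted(eigenvalues, reverse=True)
    kept = [value for value in ordered if value > threshold]
    return len(kept), sum(kept) / sum(ordered)

# Hypothetical eigenvalues for five rating scales (illustration only).
n_kept, covered = keep_components([6.0, 3.0, 0.5, 0.3, 0.2])
print(n_kept, covered)
```

For a PCA on a correlation matrix the eigenvalues sum to the number of scales, so an eigenvalue above one means that a component explains more variance than a single original scale.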
Compared to the MDS paradigm, the interpretation of the resulting dimensions is supposed to be easier because it is assumed that each dimension is represented by a cluster of APs, giving the experimenter direct hints on which aspects are covered. Nevertheless, to get a valid interpretation of the dimensions it is recommended to conduct both an MDS and an SD experiment. The disadvantage of the SD paradigm is that significant effort has to be invested to determine the APs beforehand (see Section 4.1.3 and Section 4.2.3).

Uncovering perceptual dimensions in the Speaking and Interaction Phase
As described in Section 2, the Listening Phase has been the subject of research towards understanding and uncovering the perceptual space of a listener. In this section we present studies that have been conducted to investigate the perceptual space of participants in a conversational task.
To do this, a conversation is split into three phases according to Section 2, and experiments analogous to those for the Listening Phase (following the paradigms presented in Section 3) are conducted for the Speaking and the Interaction Phase. Part of the work presented in this section is based on the data presented in a former publication [37].

Speaking Phase
To uncover the perceptual dimensions of the Speaking Phase, both methodologies (MDS and SD) are applied. Since speaking is usually impaired by sidetone and talker-echo (see Section 2.2), for both experiments a passive speaking-only test with these two degradations was carried out with the goal of investigating how hearing one's own voice while speaking influences the speaking, and how the participants perceive their own voice.

Technical setup
The test system for the two tests conducted for the Speaking Phase is implemented with the help of the graphical programming language tool for modeling and simulating dynamic systems [38]. The system was developed to simulate sidetone and talker-echo. For the sidetone distortion, the direct back coupling of the spoken voice with different levels of attenuation is used. For the talker-echo, the delayed, back-coupled and attenuated spoken voice with varying delay values is used. The conditions used can be seen in Tables II and IV. The direct back coupling had a delay of < 10 ms and is recorded as 0 ms delay. The attenuation level is simulated in relation to the input speech level. Note that some conditions simulate degradations with strong characteristics to guarantee that all naïve participants perceive the effects of sidetone and echo. The conditions were presented in randomized order. An EDIROL USB AudioCapture UA-25EX soundcard was used, together with a Sennheiser HMD 46 ATC 300 headset. The back coupling was presented diotically. The participants were seated in a test room which meets the requirements according to [5].
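The two degradations can be pictured as a delayed and attenuated copy of the talker's own signal. The following Python sketch is only an illustration of this idea, not the test system described above: the function name, the sample-based delay line, and the convention that a positive attenuation in dB lowers the level are assumptions.

```python
def back_coupled_signal(speech, sample_rate_hz, delay_ms, attenuation_db):
    """Return a delayed and attenuated copy of the talker's own speech.

    delay_ms = 0 models sidetone (direct back coupling); larger delays
    model talker-echo. The output has the same length as the input.
    """
    delay_samples = round(sample_rate_hz * delay_ms / 1000.0)
    gain = 10.0 ** (-attenuation_db / 20.0)       # dB to linear amplitude factor
    delayed = [0.0] * delay_samples + list(speech)
    return [gain * sample for sample in delayed[:len(speech)]]

# Hypothetical four-sample signal at a 1 kHz sampling rate (illustration only).
speech = [1.0, 0.5, -0.5, 0.0]
sidetone = back_coupled_signal(speech, 1000, delay_ms=0, attenuation_db=0.0)
echo = back_coupled_signal(speech, 1000, delay_ms=2, attenuation_db=20.0)
```

In a real-time system the back-coupled signal would be mixed into the headset output while the participant speaks; the sketch only shows the offline relation between the spoken and the returned signal.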

Test design
For both test-paradigms (SD and MDS), basically the same task had to be conducted by the participants. For each presented condition or comparison, the participants were asked to read aloud a text that appeared on the test screen. Each piece of text consisted of two to three sentences, and altogether 27 randomly presented text pieces were used. One text piece could for example look like this (translated from German): "Can you please give me the best connection between Munich and Duisburg. I have to arrive on Saturday at 12.30 pm latest." To prevent the participants from paying too much attention to reading the text, they were asked to learn the text by rereading it at least three times. Thus, it was ensured that the participants could speak the text as freely as possible, simulating a real Speaking Phase.

SD Experiment
As mentioned in Section 3.2, in an SD experiment a predefined set of attributes (APs) is given to the test participants in terms of bipolar scales. In order to find proper attributes, two pre-tests were conducted.
In a first test, participants were asked to freely use the degraded test setup. Their task was to gather as many descriptions of the degraded test setup as possible. In sum, a list of 25 APs was collected by 3 experts. Experts were chosen because it was assumed that they can describe the system adequately. However, since they are very experienced with telecommunication degradations, they might also be biased. Thus, in the second test, 10 naïve participants, according to the definition in [39], were asked to use the degraded test setup and select 5 of the 25 APs they think describe the system best.
The actual test was carried out by 16 naïve participants (4 female, 12 male) aged between 21 and 36 years (mean age 26.4). For each condition (see Table II), the participants were asked to fulfill the task described in Section 4.1.2. After each task for each condition, the participants were asked for their subjective rating of the overall quality (MOS) for a validity check, and of the APs introduced before.
The scale shown in Figure 2 was used for the overall quality ratings (taken from [12]). This scale is particularly useful because it avoids scale-end effects and is more sensitive in comparison to the classical ACR scale [40]. A similar scale (see Figure 3) with only two labels was used for the AP ratings. We used a within-subject test design.

Results of the SD Experiment
The results of the conducted SD experiment for the Speaking Phase are structured in two groups: first, we analyze the results of the overall quality; second, the results of the PCA on the AP ratings stemming from the SD experiment are presented.
The results of the overall quality ratings are presented in Table II. The ratings are similar to those of the studies in [41]. The standard deviations lie within the range typically obtained in standard ACR experiments [12]. Additionally, a repeated-measures ANalysis Of VAriance (ANOVA) [42] with the conditions as independent variable and the overall quality ratings as dependent variable was carried out. The results show that the conditions have a significant impact on the overall quality judgments of the test subjects (F(4.04, 60.64) = 33.86, p < .01). These data show that the different degradation levels worked as intended (falling quality, lower ratings; rising quality, higher ratings).
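The fractional degrees of freedom in the reported F statistic can be related to the design with a short calculation. The sketch below assumes the 16 conditions and 16 participants mentioned for the SD experiment; that the fractional values stem from a sphericity correction (for example Greenhouse-Geisser) is an assumption, since the correction is not named in the text, and the function name is hypothetical.

```python
def rm_anova_df(n_conditions, n_participants, epsilon=1.0):
    """Degrees of freedom of a one-way repeated-measures ANOVA.

    epsilon is a sphericity correction factor (1.0 = no correction).
    """
    df_effect = (n_conditions - 1) * epsilon
    df_error = (n_conditions - 1) * (n_participants - 1) * epsilon
    return df_effect, df_error

uncorrected = rm_anova_df(16, 16)                 # 15 and 225 without correction
epsilon = 4.04 / uncorrected[0]                   # factor implied by the reported effect df
corrected = rm_anova_df(16, 16, epsilon=epsilon)  # close to the reported (4.04, 60.64)
print(uncorrected, round(epsilon, 3), corrected)
```

Under this assumption, the reported error degrees of freedom agree with the design up to rounding of the published values.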
To analyze the results of the PCA, first, the number of resulting perceptual dimensions has to be identified. As described in Section 3.2, the number of dimensions is found by keeping only components with eigenvalues above one. The corresponding Scree plot is shown in Figure 4.
The figure shows that two components have eigenvalues above one, resulting in two dimensions. The determined two dimensions cover 95.3% of the variance of the eleven APs. Table III shows the factor loadings of each of the eleven features on the determined two dimensions.
It can be seen that the first dimension (Dim 1) covers seven of the eleven features with loadings above 0.8. These seven features (concentration, loud, fluent, distracting, exhausting, irritating, helpful) describe how hearing one's own voice is perceived by the speaker and what impact or effect hearing one's own voice could trigger inside the listener. More precisely, the results for the first dimension show that hearing one's own voice can, for example, be very irritating and can handicap the fluency of the speaking.
The second dimension (Dim 2) covers three features (distorted, clear, reverberant) with loadings above 0.7 and two features (helpful, thick) with loadings slightly below 0.4. The dimension seems to be descriptive in terms of representing the degree of degradation and impairment that the speaker perceives when hearing his or her own voice.
In other words, the resulting dimension describes possible frequency distortions of the sidetone and the echo path. This is mostly determined by the loadings for the features "distorted", "clear", "reverberant", and "thick". The low number of features and the inconsistent loading values show that the second dimension seems to be weak. However, a final reflection (see Section 4.1.7) on the resulting dimensions is only possible when the results from the MDS experiment are also at hand.

MDS Experiment
As mentioned in Section 3.1, in an MDS experiment the preference between two stimuli presented as a pair is judged by the participants. Having N conditions, this leads to N(N − 1) comparisons. Assuming that the preference between stimulus A and stimulus B is the same as the preference between stimulus B and stimulus A, this leads to (N(N − 1))/2 comparisons [43]. Using the 16 conditions of the SD experiment, this would lead to 120 comparisons.
For a feasible experiment conducted in approximately one hour, this would take too long. Therefore, only 9 randomized conditions (see Table IV) were used for the test, leading to 36 comparisons. Conditions eight and one are alike and serve as anchor-conditions for a validity check.
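The comparison counts given above can be checked with a few lines of Python. The condition labels are placeholders (hypothetical), not the condition names of Tables II and IV.

```python
from itertools import combinations, permutations

def unordered_pairs(conditions):
    """All comparisons when A vs. B is treated the same as B vs. A: N(N - 1)/2."""
    return list(combinations(conditions, 2))

def ordered_pairs(conditions):
    """All comparisons when presentation order matters: N(N - 1)."""
    return list(permutations(conditions, 2))

sd_conditions = [f"C{i}" for i in range(1, 17)]   # 16 placeholder labels
mds_conditions = [f"C{i}" for i in range(1, 10)]  # 9 placeholder labels

print(len(ordered_pairs(sd_conditions)))     # 240
print(len(unordered_pairs(sd_conditions)))   # 120
print(len(unordered_pairs(mds_conditions)))  # 36
```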
To create the complete distance matrix for the ordinal MDS, one half of the participants judged the preference between stimulus A and stimulus B and the other half the preference between stimulus B and stimulus A. For each comparison, the participants were asked to speak the text piece (see Section 4.1.2) once for condition A and once for condition B. They could redo the comparison as often as desired.
Afterwards, the participants had to judge whether they preferred stimulus A over stimulus B (and vice-versa) on the scale presented in Figure 5. The conditions were presented in randomized order, and the MDS experiment was carried out by 22 naïve participants (14 female, 8 male) aged between 18 and 36 years (mean age 25.9), who were different from those of the SD experiment.

Results of the MDS Experiment
The adequate dimensionality is found if the badness-of-fit parameter Stress does not decrease significantly with a further increase of the number of dimensions. The corresponding Scree plot is shown in Figure 6.
The figure shows that the sharp "elbow" is located at the second dimension; thus, two dimensions are extracted for the MDS experiment. This result is similar to the result of the SD experiment.
To analyze and compare the dimensions, the resulting space of the MDS (see Figure 7) has to be inspected. Looking at the two anchor-conditions (S0 and S02), the resulting space of the MDS shows that these two conditions are positioned at a short distance from each other, indicating that the different quality levels worked as intended.
Dimension one shows that from left to right the conditions start with strong characteristics (strong echo or loud sidetone: S10E150, E250, S20) and end with rather weaker characteristics (quiet sidetone, e.g., Sminus10, Sminus25). The anchor-conditions are located in the middle of the scale. A strong echo or a loud sidetone results in a high impact on the speaking abilities of the speaker. In turn, a quiet sidetone does not have an impact on the speaking. A popular example for this effect is when a speaker is confronted with a loud background noise. In this case, the speaker automatically raises the voice to mask the noise. This effect is called the Lombard Effect [44, 45]. The same effect, but in the opposite direction, can be observed when a speaker is confronted with a loud copy of his or her own voice over a headset, i.e., a loud sidetone. In this case, the speaker automatically lowers the voice [46]. In [41], the term self-listening comfort is introduced to describe this influence. These effects of the used conditions are reflected in the results for the first identified dimension. Looking again at this result, the scale of dimension one (from right-low, strong echo or loud sidetone, to left-high, weak echo or quiet sidetone) describes the impact on the speaker of hearing one's own voice while speaking.
For dimension two, the scale starts with the anchor-condition S0 and then covers stepwise the conditions with stronger degradations (the higher, the stronger the degradation). In general, if the sidetone is delayed, the speaker starts to feel uncomfortable. For delays below 30 ms (considered as sidetone) and high levels, the direct signal and the delayed version interfere at the speaker's ears, which leads to a comb-filtered version of the signal [47]. The user will perceive this as a colouration in the sound of his or her own voice [41]. If the delay exceeds 30 ms (considered as echo) and the sound level is high, the speaker will experience difficulties in talking. This is expressed in slower speaking in terms of the speaking rate and pauses between words [48]. On the other hand, if the level is low, even a highly delayed echo hardly gives any degradation. Thus, a back-coupled and delayed version of the own voice is perceived by the speaker as a coloured and thus degraded version of the own voice. Transferring this to the results of the MDS experiment, the identified dimension shows that stronger degradations lead to a more degraded perception of the own voice than weaker degradations. Hence, the scale of dimension two (from bottom-low to top-high) seems to describe the degree of degradation of the own voice that the speaker perceives when hearing his or her own voice.

Conclusion
The results of the SD (see Section 4.1.4) and the MDS (see Section 4.1.6) experiment reveal a high degree of similarity.
In the SD experiment, the first resulting dimension covers APs that describe the impact of the own heard voice on the speaker while speaking. The same properties can be seen in the results of the MDS experiment, where the first dimension describes from low to high the characteristics (weak to strong echo/sidetone) of the conditions. In both cases, the resulting dimensions seem to represent the impact of the degraded transmission system on the speaker while speaking.
The second resulting dimension in the SD experiment covers attributes that describe the amount of degradation of the conditions ("distorted - undistorted", "unclear - clear", "reverberant - anechoic"). In the MDS experiment, the second identified dimension also describes the same effects, starting with the reference conditions and ending with highly degraded conditions (strong echo/sidetone). Following from this, in both experiments the second dimension seems to portray the degradation of one's own voice perceived by the speaker.
In sum, the multidimensional analysis in terms of two subjective tests identified two perceptual dimensions. It was mentioned that a loud sidetone might lower the voice of a speaker and that a back-coupled and delayed version of one's own voice is perceived by the user as a colouration in the sound of the own voice. These two effects match the two dimensions identified in the multidimensional analysis. One dimension describes the impact a back coupling might have on the speaker (for example, lowering the voice), and the other dimension describes the degraded perception of the own voice (for example, a coloured sound).
However, it has to be mentioned again that the two identified dimensions might be dependent on each other in terms of their presence. While a degradation of one's own voice is only perceived when the own voice also has an impact on the speaking, a back coupling of the own voice might have an impact on the speaking without a degradation of the own voice being perceived. Until now, this is just an assumption and has to be verified in an additional experiment.
Following from these results, we would like to propose to call the two perceptual dimensions of the Speaking Phase a) the impact of one's own voice on speaking (scaled from "no impact on speaking" (−1) to "high impact on speaking" (1)) and b) the degradation of one's own voice (scaled from "own voice not degraded" (−1) to "own voice degraded" (1)).

Interaction Phase
To uncover the perceptual dimensions of the Interaction Phase, again both methodologies (MDS and SD) are applied. Interactive experiments are especially sensitive to the quality element delay (see Section 2.3), which impairs the interaction of two interlocutors. So, for both experiments a conversation test was carried out to investigate how the user's interaction in a call is affected by varying amounts of transmission delay.

Technical setup
For the experiments, a test system based on PureData (PD [49]), a graphical programming language for signal processing, was used. It allows manipulating audio effects in real-time and thus makes it possible to simulate acoustical degradations like echo, as well as non-stationary degradations. Additionally, the system was extended with multiple speech codecs, such as G.711 or LPC-10, using open-source implementations. The codec components can also introduce effects like packet loss on request and were used in the validation experiment of Section 5.
The sound signal was presented via a Beyer Dynamic DT770 stereo headset. In both setups, the participants were located in two sound-insulated test rooms which met the requirements according to [5].

Test design
For the conversational tasks, SCTs (see Section 2.3) were used and modified by updating dates and currencies. The SCTs were selected because their tasks represent everyday-life situations and provide a reasonable degree of interaction while being limited to an acceptable test duration.
In both experiments, each pair of participants first conducted one introduction SCT scenario to get familiar with the test design.In the SD experiment the participants were asked to give their rating on the APs for each condition and each SCT.
In the MDS experiment, only one of the two participants was able to switch between two conditions. This participant was asked to rate the comparison of two conditions with regard to the interaction between both interlocutors.

SD Experiments
Again, to conduct the SD experiment, a predefined set of APs has to be found. To find suitable attributes, two pretests were conducted (similar to the SD experiment of the Speaking Phase).
In the first test, as many descriptions as possible were collected from 6 experts, resulting in a list of 42 different antonyms. In the second test, 15 naïve participants were asked to select 5 of the 42 attributes they thought described the system best. Based on the overall frequency of selection, a set of 10 antonym-pairs was finally selected: not exhausting - exhausting; easy - hard; unpleasant - pleasant; not frustrating - frustrating; effective - ineffective; does not require concentration - requires concentration; lazy - agile; clear - confusing; relaxing - annoying; distracting - not distracting.
For each condition, the participants were asked to play through one SCT scenario, rate the overall quality for a validity check, and then score on the APs introduced before.
Again, the same scales as in the SD experiment for the Speaking Phase were used (compare Figure 2 and Figure 3). We used a within-subject test design.

Results of the SD Experiments
The results of the conducted SD experiment are structured in two groups: first we analyze the results of the overall quality as a validity check, then the results of the SD experiment.
After averaging the ratings of the overall interaction quality over the conditions, a repeated-measures ANOVA with the conditions as independent variable and the overall quality ratings as dependent variable was carried out. The result shows that the amount of delay has a significant impact on the judgment of the test subjects (F(4.93, 152.75) = 17.19, p < .01). This indicates that the different degradation levels worked as intended (low delay - high overall quality / high delay - low overall quality).
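For readers unfamiliar with the test, the F statistic of a one-way repeated-measures ANOVA can be sketched as follows. This is a minimal illustration on made-up toy ratings, not the analysis code of the study; the function name `rm_anova_f` is hypothetical. Fractional degrees of freedom such as those reported above typically arise from a sphericity correction (e.g., Greenhouse-Geisser); which correction was used is not stated in the text, and none is implemented here.

```python
import numpy as np


def rm_anova_f(ratings):
    """Uncorrected one-way repeated-measures ANOVA.

    ratings: array of shape (n_subjects, n_conditions), one rating per cell.
    Returns (F, df_conditions, df_error).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    # Variation between conditions, between subjects, and in total.
    ss_cond = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_subj = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_total = np.sum((x - grand_mean) ** 2)
    # What remains is the condition-by-subject error term.
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return float(f_value), df_cond, df_error


# Toy data: 3 hypothetical subjects rating 2 hypothetical conditions.
toy_ratings = [[1, 2], [2, 4], [3, 6]]
print(rm_anova_f(toy_ratings))
```

The p-value would then be obtained from the F distribution with the returned degrees of freedom, for instance with a statistics package.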
The judgments show that the addressed 10 attributes highly correlate with each other (average r ≈ 0.9). The results of the following PCA indicate that the 10 features can be described by one dimension, covering 96.12 % of the variance of the 10 one-dimensional features. The resulting factor loadings for each of the 10 features can be seen in Table V.
The outcome shows that all features are covered by one dimension with high loadings above 0.9. Regarding the ten features, the resulting dimension seems to describe the convenience or the challenge of interacting. However, a final interpretation (see Section 4.2.7) of the dimension is again only possible after analyzing the MDS experiment.
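Why highly correlated attributes collapse into a single principal component can be illustrated with synthetic data. The sketch below is an assumption-laden toy: the ratings are randomly generated from one latent factor plus noise, the function name `explained_variance_ratio` is illustrative, and the PCA is an unrotated eigendecomposition of the correlation matrix rather than the VARIMAX-rotated solution reported in Table V.

```python
import numpy as np


def explained_variance_ratio(ratings):
    """Share of total variance carried by each principal component.

    ratings: array of shape (n_observations, n_attributes). The PCA is
    carried out on the correlation matrix of the attributes.
    """
    corr = np.corrcoef(np.asarray(ratings, dtype=float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # largest first
    return eigenvalues / eigenvalues.sum()


# Synthetic ratings: 10 attributes driven by one latent factor plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
synthetic = latent + 0.2 * rng.normal(size=(200, 10))
ratio = explained_variance_ratio(synthetic)
print(ratio.round(3))
```

In this toy setting the first component carries nearly all of the variance, which mirrors the situation described above where ten strongly intercorrelated attributes are summarized by one dimension.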

MDS Experiments
In the case of the Interaction Phase, the task in the MDS experiment is to judge the preference between two pairwise presented amounts of transmission delay. The eight conditions used in the SD experiment would lead to 28 comparisons and thus 28 SCTs. Again, this would be too much for a feasible experiment. Therefore, only five randomized conditions (0, 500, 1000, 1500, and 2000 ms) were used, leading to 10 comparisons.
As done for the MDS experiment in the Speaking Phase, one half of the participants judged the preference between condition A and condition B and the other half the preference between condition B and condition A to create the complete distance matrix for the ordinal MDS. As an exception for this experiment, only one of the two participants in each pair was asked to judge whether he or she prefers condition A over B; the other participant acted as a dummy. This procedure was followed because only one of the participants was able to change the condition and thus was able to judge the preference. The rating was again done on the scale shown in Figure 5.
The conditions were presented in randomized order, and the MDS experiment was carried out by 52 naïve participants grouped in 26 pairs. Thus, the results are based on the ratings of 26 participants (10 female, 16 male) aged between 20 and 32 years (mean age 24.6), who were different from those of the SD experiment.

Results of the MDS Experiment
The MDS reveals a Stress below 0.5, showing that the resulting space is one-dimensional. The space can be seen in Figure 8.
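To make the badness-of-fit measure more tangible, the following sketch computes Kruskal's Stress-1 for a given configuration. It is a simplification under stated assumptions: the raw dissimilarities are used directly as target distances (a metric variant), whereas the ordinal MDS of the study would first replace them by monotonically transformed disparities. The function name `stress_1`, the coordinates, and the dissimilarities are all hypothetical.

```python
import numpy as np


def stress_1(dissimilarities, coords):
    """Stress-1 between target dissimilarities and configuration distances.

    dissimilarities: symmetric (n, n) matrix of target values.
    coords: (n,) or (n, d) coordinates of the n conditions in the MDS space.
    """
    target = np.asarray(dissimilarities, dtype=float)
    x = np.asarray(coords, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a flat vector as a one-dimensional space
    distances = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    upper = np.triu_indices(len(x), k=1)  # count each pair once
    numerator = np.sum((distances[upper] - target[upper]) ** 2)
    denominator = np.sum(distances[upper] ** 2)
    return float(np.sqrt(numerator / denominator))


# Hypothetical 1-D configuration: five delay conditions placed by delay in s.
positions = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
perfect = np.abs(positions[:, None] - positions[None, :])
noisy = perfect.copy()
noisy[0, 4] = noisy[4, 0] = 1.0  # one pair judged closer than its distance
print(stress_1(perfect, positions), stress_1(noisy, positions))
```

A configuration that reproduces the dissimilarities exactly yields a Stress of zero, and any mismatch between judged and configured distances increases it.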
Looking at the figure, it can be seen that the resulting dimension starts with the highest delay (2000 ms) and then covers stepwise the conditions with lower delay until reaching the lowest value (0 ms).
In the literature, it is described that a transmission delay may lead to three effects [50]. First, the delay leads to interruptions. Interruptions are divided into active and passive interruptions. Active interruptions occur when one interlocutor starts to speak while he or she still hears the other interlocutor speaking. Passive interruptions occur when one interlocutor gets interrupted by the delayed arrival of a statement of the other interlocutor. Second, due to the transmission delay, the perception of a conversation, in terms of structure and pattern, may be considerably different from one interlocutor to the other, while both are participating in the same conversation. Third, if the test subjects perceive an unnatural rhythm of the conversational flow, they adapt their behavior. The result of the MDS experiment thus seems to merge these three effects into one dimension. The resulting scale of the dimension (from bottom-high to top-low) seems to describe the effort or difficulty to interact with the interlocutor as described in [50].

Conclusion
Again, the results of the SD (see Section 4.2.4) and the MDS (see Section 4.2.6) experiment reveal a high degree of similarity.
In the SD experiment, the resulting dimension covers APs that describe the convenience or the difficulty of interacting. The same characteristics can be seen in the results of the MDS experiment, where the resulting dimension describes from low to high the effort or difficulty to interact (high to no delay). Thus, in both cases the resulting dimension seems to represent the degree of ease or difficulty of interacting.
It was mentioned earlier that a transmission delay may lead to passive and active interruptions that shift the natural interactive rhythm in a conversation. These interruptions also lead to a different perception (in terms of the two interlocutors) of the conversational structure. In addition, too high amounts of delay are related to rising user dissatisfaction [51]. The results of the two conducted multidimensional analyses combine these findings on user perception, as the identified dimension seems to cover the effects of a delayed speech transmission. The resulting dimension can be described with the used APs (see Table V), and the characteristics of the dimension depend on the amount of transmission delay.
Following from these results, we would like to propose to call the resulting perceptual dimension of the Interaction Phase the interactivity (scaled from "easy to interact" (−1) to "hard to interact" (1)).

Resulting Quality Dimensions in a Conversational Situation
Recalling the aforementioned research question (see Section 2.4) and the two limitations (see Section 1), we now have a set of seven proposed dimensions for an entire conversation.
While the Listening Phase was already part of different studies and revealed four perceptual dimensions, two additional perceptual dimensions for the Speaking Phase and one perceptual dimension for the Interaction Phase were identified. An overview of the perceptual quality spaces resulting from the multidimensional analysis can be seen in Table VI. The seven perceptual dimensions are proposed to be called:
• Coloration,
• Noisiness,
• Discontinuity,
• Loudness,
• Impact of one's own voice on speaking,
• Degradation of one's own voice,
• Interactivity.
The two identified dimensions for the Speaking Phase, the impact of one's own voice on speaking and the degradation of one's own voice, seem to cover the space spanned by the degradations sidetone and echo. However, other degradations (e.g., loud background noise) might affect not only the Listening Phase but also the Speaking Phase.
For the Interaction Phase, the perceptual dimension interactivity was identified. We see mainly two explanations for this result: firstly, we identified the perceptual dimension with the help of an SD experiment that is based on prior determination of antonyms. In our case, we conducted two pre-tests, one with naïve participants and one with experts. However, the high correlation of the attributes suggests that the attributes only cover a certain limited space. This is due to the fact that the stimuli that we presented varied only with respect to delay. This brings us to our second explanation: the only quality element we varied was the delay. We did not consider quality elements of the Listening Phase or the Speaking Phase, which might have provoked other dimensions. So far, the three phases were treated mostly independently. It is not known and has to be analyzed whether the results of the multidimensional analysis for the Speaking Phase and Interaction Phase would be different when quality elements of all phases are considered in one single test. In particular, it has to be verified whether the separately identified dimensions can still be uncovered in a real conversational situation. Also, it is not yet known how the presence of multiple degradations affects the characteristics of the seven perceptual dimensions. For example, in [17] or [52] it was reported that the conversational quality is rated more critically for echo than for transmission delay. Whether this carries over to the identified dimensions is not yet known. For this, additional studies to investigate and identify the conversational quality profile are necessary.
The multidimensional analysis revealed the perceptual quality spaces for each phase of a conversation, which in sum are composed of seven perceptual dimensions. This set of perceptual dimensions allows diagnosing conversational speech quality in future work. However, this set of perceptual dimensions still has to be validated, and their characteristics in a conversational test (and not in separate SOTs or LOTs) have to be investigated. For this, at first a new subjective test-paradigm that allows considering all three conversational phases and their perceptual dimensions has to be developed. Using the developed test-paradigm then makes it possible to verify the proposed perceptual spaces. The proposed paradigm and the subsequent study are presented in the next section.

Validation Experiment
To verify the new set of quality dimensions, we created a new test-paradigm that separately addresses each phase of a conversation as well as a short structured conversation scenario. The proposed test-paradigm is presented in Section 5.1. The approach of the validation experiment is based on the hypothesis that the resulting dimensions of the separately conducted listening, speaking, and interaction experiments can also be identified using the new paradigm. We decided to conduct an additional SD experiment (see Section 3.2) to analyze the identification of the dimensions; however, in future the paradigm will be used to directly quantify the seven dimensions. In the following, the paradigm and the results of the experiment are explained in detail. Parts of the work presented in this section are based on the data presented in the publication [53].

Test design and new test-paradigm
Since all of the possible phases of a conversation should be addressed, the new test-paradigm consists of 3 sections:
(I) In the first section, the task of the two participants is to conduct an SCT. This section represents a regular everyday-life conversational scenario of about two to four minutes length. After each SCT, the participants first have to judge the overall quality and second the 28 APs representing (and used in) all phases of a conversation.
(II) The second section addresses the Listening and Speaking Phases. One of the participants is asked to read out a text while the other participant listens to what is read out. The sentences and procedures of the speaking part are similar to the previously conducted studies in Section 4.1. The listening part is analogous to [10]. After the first sequence, the participants change roles, so that each participant has to speak and listen. For each sequence, the participants are asked to rate the 11 APs for the Speaking Phase and 14 APs for the Listening Phase [10].
(III) The third section addresses the Interaction Phase. This task is supposed to be sensitive to possible delay in the transmission system. Therefore, RNVTs are used. Accordingly, the participants are asked to alternately verify a set of numbers. The participants are asked to rate the 10 APs representing the Interaction Phase.
The experiment was carried out by 40 naïve participants (23 female, 17 male) grouped into 20 pairs, aged between 18 and 53 years (mean age 28.7). For all three sections, the participants were asked to communicate using a transmission system (see Section 4.2.1) that was distorted by 11 different degradations in randomized order (see Table VII), which were analogous to those of the previously conducted tests.
Each pair of participants first conducted one introduction session to get familiar with the test, and afterwards 11 sessions, one for each degradation, consisting of all 3 sections. The order of degradations was randomized between participants. Table VIII describes the experimental procedure and structure. Keep in mind that the rating of all APs takes up to 10 minutes per condition. Therefore, the experiment was split into two sessions of 60 minutes each to avoid participant fatigue. In addition, the participants were allowed to have extra pauses when required. Again, we used a within-subject test design.

Attributes for the SD
In the test, the same APs as in the previous separate listening, speaking, and interaction tests were used. For Section I, all APs were used. For Sections II and III, the corresponding APs for each phase of a conversation were rated.

Results
The results of the conducted experiment are structured in five groups: first we analyze the results of the overall quality, second the results of the third section (Interaction Phase), third and fourth the results of the second section (Listening Phase as well as Speaking Phase), and finally the results of the first section (Conversation Test) of the SD experiment.

Overall quality
After averaging the ratings of the overall conversational quality over the conditions, a repeated-measures ANOVA with the conditions as independent variable and the overall conversational quality ratings as dependent variable was carried out, showing that the conditions have a significant impact on the judgment of the test subjects (F(7.01, 224.14) = 45.88, p < .01). This indicates that the different degradation levels worked as intended (falling quality - lower ratings / rising quality - higher ratings).

Section III - Interaction Phase
The results of the PCA indicate that the 10 attributes can be described by one dimension, covering 85.4 % of the variance of the 10 one-dimensional features.
The resulting factor loadings can be seen in Table IX. This result is similar to the one of the previously conducted separate interaction experiment, and shows that the proposed dimension works as intended.

Section II - Listening Phase
The Scree Plot (see Figure 9a) of the PCA shows that only three potential dimensions result for the Listening Phase in Section II. The three dimensions are determined, covering 96.9 % of the variance of the 14 APs. In separate LOTs, however, four dimensions were proposed. An explanation for this can be found by analyzing the factor loadings of each feature on the determined three dimensions in Table IX. Dim 3 describes the dimension Loudness ('loud - quiet' (0.972)) and Dim 2 describes the dimension Noisiness (hissing (0.831), noisy (0.862), and [...] Additionally, for each of the dimensions Discontinuity and Coloration, one of the two conditions is combined with a different degradation that might mask the Discontinuity and Coloration degradation. Also, in [54] it was observed that in a diagnostic listening experiment subjects reflect in the colouration scale distortions that cannot be clearly assigned to any of the other three dimensions. These facts could be the reason for the result that one dimension covers the APs for Discontinuity and Coloration.
Thus, we think that the reduction of the dimensionality of the Listening Phase space from 4 (found in the identification experiments) to 3 (found in the validation experiment) is due to the limited number of conditions which could trigger these perceptual dimensions.

Section II - Speaking Phase
The Scree Plot (see Figure 9b) of the PCA shows that two potential dimensions result for the Speaking Phase in Section II. These two dimensions are determined, covering 96.5 % of the variance of the 11 one-dimensional features.
Two dimensions have also been discovered in the separate speaking test, termed Impact of one's own voice on speaking (covering features like helpful, irritating, exhausting, distracting, or fluent) and Degradation of one's own voice (covering features like reverberant, clear, thin, and distorted). Looking at the factor loadings for the Speaking Phase (see Table IX), it can be seen that Dim 1 covers the same features as in the previous tests. Dim 2 explicitly only covers the feature "thin" and, with lower values, "clear" (0.401) and "distorted" (0.280). These two features are also covered by Dim 1.
Additionally, the feature "reverberant", intended for Dim 2, is only reflected by Dim 1. We explain this result with condition 9, where the echo is mixed with noise. In the perception of the participants, the noise seems to mask the echo degradation. Thus, only condition 4 covers pure reverberation, which potentially led to the presented outcome. We think that the limited coverage of the 2 dimensions (this experiment) in comparison to the interpretation of the two proposed dimensions (previous experiment) is again due to the number of conditions triggering the dimensions.

Section I - Conversation Test
The Scree Plot (see Figure 9c) of the PCA shows that three potential dimensions result for the Conversation Test in Section I. These three dimensions are determined, covering 96.6 % of the variance of the 28 one-dimensional AP space. It was intended that the results of the PCA show that all seven dimensions are perceived in the Conversation Test.
However, it seems that only a limited number of dimensions can be perceived in a test-paradigm like the SCTs, which require the full attention of the test participants on the flow of the conversation and not on the rating task.
As mentioned before, we assume that this result is due to the limited cognitive resources test participants could dedicate to the rating task, as these resources were bound by the conversation task of the SCT. However, we argue that the results of Sections II and III of the experiment show that the seven proposed dimensions are still valid for a proper diagnosis of the quality of transmitted speech in a conversational situation.

Discussion
The results of the validation experiment show that the proposed dimensions are difficult to identify in a realistic conversation situation, where the attention of the test participants is rather on the content of the conversation and on the dialogue flow. It seems that too many cognitive resources are bound by this task, reducing the number of separately perceivable dimensions in this phase. Thus, in subsequent experiments the presented test-paradigm (see Section 5), which specifically allows the participants to perceive each phase separately, should be used in addition to a natural conversation paradigm.
Additionally, the results of Section II (Listening Phase) and Section I show that the two dimensions Coloration and Discontinuity seem to merge. We explain this finding with the peculiarities of the conducted experiment. In two conditions, the degradations triggering both dimensions might be masked, and the size of the experiment did not allow for more than one additional condition for each dimension. This finding has to be investigated in follow-up studies. More precisely, when designing test conditions, care should be taken that each expected perceptual dimension is separately covered by a sufficient number of technical conditions.

Conclusions and Outlook
The work presented in this contribution analyzed a conversational situation for the purpose of diagnostic quality assessment. The target of the work was to identify underlying quality-relevant perceptual dimensions of a conversational situation.
For this, we analyzed a conversation based on a separation into three phases: the Listening Phase, Speaking Phase, and Interaction Phase. While the Listening Phase had already been the object of multidimensional analyses in related research work, perceptual dimensions characterizing the Speaking Phase and the Interaction Phase are still not well explored. Thus, we presented four initial experiments that enabled us to identify three new perceptual dimensions: interactivity (Interaction Phase), the impact of one's own voice on speaking, and the degradation of one's own voice (Speaking Phase). To analyze a conversation with respect to both the user's situation and the diagnostic information, we now have a set of seven perceptual dimensions:
• Coloration,
• Noisiness,
• Discontinuity,
• Loudness,
• Impact of one's own voice on speaking,
• Degradation of one's own voice,
• Interactivity.
However, these dimensions have only been analyzed in separate studies for each phase. Therefore, a global conversation test using a new test-paradigm addressing all three phases of a conversation and potentially triggering all seven dimensions was conducted. The experiment was divided into three sections I, II, and III. While Sections II and III address the three phases and their underlying dimensions, Section I was supposed to simulate a conversation approaching all phases and dimensions in a realistic way, and with the test participants' attention on the conversation task. The results revealed that too many cognitive resources are bound in a conversational task, and thus the proposed dimensions are difficult to identify. Therefore, the test-paradigm used in the experiment, which specifically allows the participants to perceive each phase separately (and thus their underlying perceptual dimensions), is proposed for diagnostic quality assessments in a conversational situation.
However, the results of the experiments also showed that particular issues should be analyzed in future experiments. The dependency of the two listening dimensions Coloration and Discontinuity as well as of the two speaking dimensions Degradation of one's own voice and Impact of one's own voice should be addressed in future studies.
The named issues are not considered to be a problem of the identified perceptual quality space or the test-paradigm, but rather of the particularities of the conducted experiments. In future tests, care should be taken that each expected perceptual dimension is separately covered by a sufficient number of technical conditions. For example, for the perceptual dimensions that should be further analyzed, three distinct conditions with three different characteristics of a technical degradation should be applied. In addition, training should be applied in future experiments using the new test-paradigm.
In future experiments, it would be interesting to identify weights of the individual perceptual dimensions for the overall quality rating. We expect that the weighting of the dimensions will depend on the conversation task and on the conversation structure induced by this task. For example, in a highly interactive setting emphasis might be given to the dimensions in the Speaking and Interaction Phase, whereas in less interactive settings the perceptual dimensions of the Listening Phase might dominate.
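One conceivable way to obtain such weights is an ordinary least-squares fit of the overall quality rating on the seven dimension scores. The sketch below is only an illustration of that idea: the linear model is an assumption rather than a result of this work, the function name `fit_dimension_weights` is hypothetical, and all scores and weights are synthetic.

```python
import numpy as np


def fit_dimension_weights(dimension_scores, overall_quality):
    """Least-squares weights of the dimensions for the overall rating.

    dimension_scores: (n_ratings, n_dimensions) scores, e.g. in [-1, 1].
    overall_quality: (n_ratings,) overall quality ratings.
    Returns (weights, intercept) of the assumed linear model.
    """
    x = np.asarray(dimension_scores, dtype=float)
    y = np.asarray(overall_quality, dtype=float)
    design = np.column_stack([x, np.ones(len(x))])  # add intercept column
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients[:-1], float(coefficients[-1])


# Synthetic example: 50 ratings of 7 dimensions with made-up true weights.
rng = np.random.default_rng(1)
scores = rng.uniform(-1.0, 1.0, size=(50, 7))
true_weights = np.array([-0.4, -0.6, -0.8, -0.2, -0.5, -0.3, -0.7])
overall = scores @ true_weights + 3.0
weights, intercept = fit_dimension_weights(scores, overall)
print(weights.round(3), round(intercept, 3))
```

Fitting such a model separately per conversation task would be one way to examine the expected task dependency of the weights.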
In sum, the contribution shows that a test-paradigm for a diagnosis of conversational quality should cover both phases of realistic task-driven conversation structures and phases where Listening, Speaking, and Interaction can be analyzed separately, without putting too much cognitive load on the test participants. Otherwise, perceptual dimensions which might be important for overall quality may remain unidentified.
In addition, independent laboratories should conduct experiments to further validate the identified perceptual quality space and the presented test-paradigm. Having additional ratings at hand could allow further analysis and comparison of the results. These results might make it possible to push a standardization process for a new subjective conversational test-paradigm at ITU-T.
The ultimate aim of this work is that the conducted studies as well as the proposed test-paradigm and dimensions form a fundamental framework for the development of an instrumental conversational speech quality measure that is based on perceptual quality dimensions.

Figure 4. Scree Plot for the PCA on the SD experiment in the Speaking Phase.

Figure 6. Scree Plot for the MDS on the comparison judgments in the Speaking Phase.

Figure 8. Results of the MDS experiment in the Interaction Phase.

Table II. Conditions and Overall Quality Results for the SD Experiment in the Speaking Phase. α: Attenuation [dB], β: Roundtrip Delay [ms], σ: Standard Deviation.

Table V. Factor loadings (> 0.3) of the PCA on the SD experiment in the Interaction Phase (VARIMAX rotated; Dim: Dimension).

Table VI. Overview of the seven identified and proposed perceptual quality dimensions for a conversational situation.

Table VII. Conditions used for the validation experiment.

Table VIII describes the experimental procedure.