Enough is enough: how much intonation is needed in the vocal delivery of audio description?

ABSTRACT In this article we report on an experiment examining audience experience when two different audio description (AD) intonation types are used. Following the classification put forward by Cabeza-Cáceres (2013), we tested twenty participants with vision loss who were asked to listen to twenty film clips with AD voiced with either adapted or emphatic intonation. All clips were rated as evoking emotions, either positive or negative. Participants’ experience was measured through heart rate variability, a self-assessment measure of emotional response (Self-Assessment Manikin), a self-assessment presence questionnaire (ITC-SOPI) and an interview. Results show that participants strongly preferred adapted intonation, particularly when the clips evoked negative emotions. A higher rating of the intonation was also linked to greater intensity of the emotional experience and immersion, both self-reported and measured as psychophysiological reaction. The qualitative analysis of the participants’ reports on their experience indicated that adapted intonation is considered a golden mean between emotionless and melodramatic intonation. Film genre, voice type and emotional valence can be influencing factors. Emphatic intonation could be beneficial for certain genres, but a poorly matched voice can distract the audience.


Introduction
Audio description (AD) for films is an additional audio track woven into the soundtrack (Szarkowska, 2011). To prepare this track, a script is first written and then voiced by a human or synthetic voice; this second step is referred to here as AD voicing (ADV).
The consensus is that ADV is an important aspect of AD reception (e.g., Fryer, 2016; Machuca et al., 2020) that has so far received insufficient attention from academia and practitioners. While advice on ADV is provided in most AD guidelines (e.g., Bittner, 2010; Independent Television Commission, 2000; Remael et al., 2015; Snyder, 2010; Szymańska & Strzymiński, 2010), it is limited to general instructions that the voicing should be neutral and clear, and at the same time not monotonous, adequate to the tone of the original, and delivered at a rate that can be understood. Additionally, some guidelines contain specific reading speed recommendations (e.g., Netflix; Snyder, 2010). So far there have only been a few experimental studies that explore the influence of ADV on audience reception (Cabeza-Cáceres, 2013; Fernández i Torné & Matamala, 2015; Walczak & Fryer, 2018; Walczak & Szarkowska, 2012). ADV research has so far concentrated mainly on testing the reception of synthetic voices vs. human voices. Studies were conducted in Poland (Drożdż-Kubik, 2011; Mączyńska, 2011; Szarkowska & Jankowska, 2012; Walczak & Fryer, 2017; Walczak & Szarkowska, 2011), Japan and the USA (Kobayashi et al., 2010), as well as Spain (Fernández i Torné & Matamala, 2015).
Audience reception, which nowadays is an important line of AD research, has facilitated a thematic and methodological shift in the field (Jankowska, 2019). Initially, when investigating audience reception, researchers concentrated on preferences and comprehension, measured with subjective methods such as self-assessment questionnaires (Jankowska, 2019). Now we are witnessing a growing interest in the area of presence and emotional response (e.g., Fryer & Freeman, 2014; Ramos Caro, 2015; Walczak & Fryer, 2017). Researchers have also started to employ psychophysiological measures of emotional and cognitive processes associated with AD reception (e.g., Fryer, 2013; Matamala et al., 2020). In this article we follow these new methodological developments in AD audience reception studies to look into the effect that ADV, and in particular intonation, might have on the AD audience.

Audio description intonation
Intonation, which lies at the core of this article, arouses particular interest in the context of voice neutrality. Fryer (2016, p. 88) notices that 'a neutral delivery has come to be recognized as "the norm"' since describers have been traditionally 'encouraged to use a particularly neutral way of speaking'. This has been reflected in the available guidelines; however, as Machuca et al. (2020, p. 59) notice, guidelines lack clarity in this aspect as they 'seem to promote a neutral AD while advising to take into account the nature of the material' which makes them open to interpretation. This interpretation was made visible by Cabeza-Cáceres (2013) who analyzed AD guidelines and recordings from Spain, Germany, the UK and the USA, and put forward three intonation types: uniform, adapted or emphatic. He describes the first one as neutral, the second as adapted to the content of the original, and the third as emphasizing the emotional context of the original content (Cabeza-Cáceres, 2013).
Four studies have been of particular importance in this context. Kobayashi et al. (2010) measured audience preference for human, synthetic neutral, and synthetic emotional (happy and sad) voices applied to two videos: a comic cartoon and tragic drama. The authors point out a significant interaction between video type and voice type. In the case of the cartoon, human and synthetic neutral voices were rated significantly better than emotional synthetic voices, both happy and sad. In the case of the drama, the human voice was rated significantly better than synthetic voices and the synthetic happy voice was rated significantly worse than other voices. The authors conclude that inappropriate use of emotional voices may have a negative effect on video experience.
A similar conclusion was reached by Cabeza-Cáceres (2013), who tested the preference for three conditions of human intonation: uniform, adapted, and emphatic. Results show that while intonation does not affect comprehension, uniform intonation and emphatic intonation are associated with lower enjoyment, as both intonation types provoke rejection to a similar extent. This is especially interesting given the fact that Spanish audiences are used to uniform intonation (Cabeza-Cáceres, 2013). Fryer and Freeman (2014) found that, unlike a synthetic voice, a human voice can actively enhance presence and emotional response for some emotions. They concluded that prosody is an essential part of audio description as emotions can be 'effectively conveyed via the paralinguistic content of the describer's voice rather than the semantic content of the AD script' (p. 105). This finding was corroborated by Walczak and Fryer (2018) in terms of presence, which was rated higher for drama with human-narrated AD.
In summary, these studies show that intonation is an influencing factor in audience experience, whether it is measured as preference, presence, or emotional reaction.

Measuring audience experience: presence, emotional response, and psychophysiological reaction
Audience studies, which are now an important research avenue in AD, are experiencing a visible shift from audience reception to audience experience studies that measure aspects such as presence (Fryer & Freeman, 2013; Walczak & Fryer, 2017), emotional response (e.g., Fryer & Freeman, 2014), and psychophysiological response (e.g., Ramos Caro, 2016; Rojo et al., 2014).
Presence is a theoretical concept defined as 'perceptual illusion of non-mediation' (Lombard & Ditton, 1997) or in other words 'a subjective experience of being in one environment, while physically situated in other' (Walczak & Fryer, 2017, p. 8). It is one of the quality measures used in advanced broadcast and virtual environments (Lessiter et al., 2001). Both subjective (e.g., presence questionnaires, continuous assessment, qualitative measures, psychophysical measures, and subjective corroborative measures) and objective measures (e.g., psychophysiological measures, neural correlates, behavioral measures, and task performance measures) can be used to evaluate presence (van Baren & Ijsselsteijn, 2004). It is, however, argued that since presence is a subjective sensation, it should be primarily assessed using subjective measures (Sheridan, 1992). Currently, the most frequently used measures of presence are self-assessment questionnaires, applied post-test (van Baren & Ijsselsteijn, 2004).
Another important aspect of audience experience is the emotional reaction to a presented film. Here, we adopt the dimensional concept of emotions, following many other studies involving the presentation of emotion-inducing material, be it emotional images, film clips, or audio recordings (Bradley & Lang, 2007). According to the dimensional concept of emotions (Russell & Barrett, 1999), each emotional episode can be described on two basic dimensions: valence and arousal. Valence determines if the stimulus is pleasant or unpleasant, while arousal determines the intensity of the emotional response it evokes. For example, highly positive stimuli might be calming, therefore evoking low arousal, e.g., forest scenery, or energizing, therefore evoking high arousal, such as extreme sports. Both arousal and valence dimensions of a stimulus, be it an image, sound, or word, can be conveniently assessed using the Self-Assessment Manikin (SAM) Scale, a graphical 9-point rating scale devised by Bradley and Lang (1994) which became a standard tool for the evaluation of emotional stimuli.
In this work, we also examine a psychophysiological measure of the cognitive and emotional state of a person, namely heart rate variability (HRV), i.e., the irregularities in the time that passes between consecutive heart beats (for review see Shaffer & Ginsberg, 2017). In healthy people, beat-to-beat intervals are not even, but are constantly changing within an optimal range, and can be described as complex, chaotic fluctuations (Goldberger, 1991). HRV provides information not only on physical health, but also on the psychophysiological state of the individual. Stress and other negative emotional states may result in reduced HRV, while high HRV values indicate higher self-regulatory capacity, including emotional regulation (for reviews see Appelhans & Luecken, 2006; McCraty & Shaffer, 2015). The reduction of HRV can be observed when participants are exposed to films inducing negative emotions. Specifically, viewing fear-evoking film clips is linked to decreased HRV (for review see Kreibig et al., 2007). However, studies on sadness-inducing clips provided mixed results, with one showing reduced HRV, one showing increased HRV, and five others showing no effect (for review see Kreibig et al., 2007).
Another psychological factor linked to HRV involves cognitive functions. The relation between attention and HRV has been well documented across the entire lifespan. A study on infants showed that directing attention to the presented stimuli is related to a decrease in HRV (Richards & Casey, 1991). In children and adolescents, higher HRV is related to poorer performance in a sustained attention task (Griffiths et al., 2017). In adult populations, too, suppressed HRV was observed during engaging tasks with a high working memory load (Aasman et al., 1987; Hansen et al., 2003; Tattersall & Hockey, 1995). Together, these findings show that low HRV might be indicative of enhanced attentional engagement with the task.
In summary, HRV may provide insight into, among other things, emotional and cognitive states and engagement with stimuli such as emotion-evoking pictures or films, which involves both directing attention and an emotional response. Moreover, it can be measured over longer periods of time, for example lasting 5 min (Shaffer & Ginsberg, 2017), which makes it an appropriate psychophysiological index in studies using film clips, like the one presented here.

Overview of the current study
In the present study, we analyze how intonation might affect participants' evaluation of audio description. The primary goal of the current study was to explore audience reaction to two different intonation types put forward by Cabeza-Cáceres (2013), that is adapted and emphatic intonation.
From a validated database of emotional films (Schaefer et al., 2010) we selected 20 short clips: 10 clips evoking positive emotions and 10 clips evoking negative emotions (see the Materials section for more information). Clips were presented with either adapted or emphatic intonation. We expected that the emphatic intonation would be rated higher, since it would correspond better with the emotional content of the clips. We also aimed to determine which intonation type would be perceived as more appropriate depending on whether the clip was negative or positive. We expected that more intense emotional experience and higher presence would result in a higher evaluation of AD. To this end, after each clip, we included a presence questionnaire. We also measured heart rate variability, as an index of emotional and attentional engagement, during each clip. To assess the impact of all those factors on AD ratings, we used mixed models as a statistical tool to check the direction and significance of the relation between each factor (intonation type, emotional valence, presence, heart rate variability) and AD evaluation. Additionally, after the entire experimental session, we conducted short interviews with the participants, as they could provide valuable information, allowing us to better understand the obtained results.

Participants
Overall 20 participants (6 female and 15 male), aged between 21 and 60 (M = 33.5), took part in the study. All participants had self-declared vision loss and were native speakers of Polish. We did not collect data on the type of vision loss as we follow the current market practice: AD is produced in one version for all its audiences. The study followed the ethical rules of empirical research with human participants and was approved by the Universitat Autònoma de Barcelona Ethics Committee. Participants were recruited by two NGOs based in Poland through advertising on their social media and newsletters. Each participant gave their informed consent prior to the experiment. All data collected during the study have been anonymized.

Materials
For the sake of the ecological validity of the experiment, we used excerpts from feature films that contain both non-verbal and verbal soundtrack. Clips for the experiment were selected from a validated database of emotional films (Schaefer et al., 2010). All clips in the database were divided into seven emotional categories (neutral state, amusement, anger, disgust, fear, sadness, tenderness) and rated for arousal on a 9-point SAM scale (Schaefer et al., 2010). For our study, we chose twenty clips, including ten evoking positive emotions (e.g., a moment of tenderness between lovers) and ten evoking negative emotions (e.g., a murder scene) (see Table 1). Duration of the selected clips varied between 1 m 09s and 5 m 45s. Total duration of negative and positive clips was very similar (M = 26 m 30s and M = 26 m 13s, respectively). While choosing the clips, we ensured that each of them had enough space for AD, that is to say, that there was enough space to insert AD between dialogues and/or important sounds. Regarding arousal, the clips selected for the experiment had ratings between 3.55 and 5.66 (M = 4.7, SD = 0.64) on a 9-point SAM (Bradley & Lang, 1994) scale ranging from calm (1) to excited (9). The arousal ratings did not differ significantly between valence categories; t(18) = 0.71, p = 0.48.
AD for all clips was drafted by a professional Polish describer. Sixteen of the selected clips contained foreign language dialogue. The remaining four clips did not contain any dialogue. For the clips that contained dialogue, a voice-over (VO) was prepared by a professional Polish audiovisual translator. Both AD and VO were voiced in a professional recording studio. Following national AD guidelines and research findings (Szarkowska & Jankowska, 2015; Szymańska & Strzymiński, 2010; Żórawska et al., 2011), ADs in all clips were voiced by a female voice-talent and VO by a male voice-talent. Regarding the AD intonation types, the studio was instructed to record two ADs for each clip: one following the standard Polish AD intonation and one acting out emotions, following the intonation used in dubbing. The standard AD voicing in Poland draws heavily from voice-over, which assumes a certain degree of adaptation of intonation to the original content (see Jankowska et al., 2017 for a detailed description of AD voicing in Poland). In our experiment, following the classification of Cabeza-Cáceres (2013), this type of intonation will be referred to as adapted intonation. The second intonation type used in our experiment will be referred to as emphatic intonation as it emphasized the emotional context of the original content (Cabeza-Cáceres, 2013). In total we prepared 40 clips: 20 following the adapted intonation and 20 following the emphatic intonation.
After the recording, the clips were mastered so that the original soundtrack and the additional soundtracks (VO and AD) had the same mean volume level. Since our sample included both blind and partially sighted participants, we presented only the audio track of the clips to minimize the possible variables and equalize experimental conditions for all participants. This makes the results easier to interpret in terms of the reception of ADV itself, without a potential interaction between access to the video track and ADV.

Procedure
All participants were emailed information regarding the procedure and an informed consent form. Immediately prior to the experiment, a researcher read out loud the detailed information about the study, presented the device for measuring psychophysiological reactions and, upon request, provided additional explanations. Then, the informed consent form was read, and participants' verbal responses were recorded. The experimental procedure lasted approximately one hour and involved the presentation of 20 clips and a set of questions following each clip. Each participant listened to a given clip once, in only one, pseudo-randomly chosen, AD version (adapted or emphatic intonation). Out of 10 positive and 10 negative clips, half were presented with the emphatic intonation and half with the adapted intonation. The pseudo-random choice also ensured that the summed durations of the emphatic and the adapted clips in each experimental session did not differ by more than 5 min (on average 2 m 7s), and hence all participants spent approximately half of the experimental session listening to the emphatic intonation, and the other half to the adapted intonation. Moreover, each intonation version of every clip was presented to approximately half of the participants (no less than 8 and no more than 12). The pseudo-random choice of voicing version for each clip and each participant was performed using an in-house MATLAB (MathWorks, Inc., Natick, MA) script, while randomization of the clips' order during the experimental sessions was controlled by PsychoPy (Peirce, 2007).
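The constrained pseudo-random assignment described above can be illustrated with a short sketch. This is not the authors' MATLAB script: it is a minimal Python illustration based on simple rejection sampling, and the clip identifiers, the clip durations in `CLIPS`, and the function names are hypothetical placeholders. Only the three constraints (half of the clips per valence in each intonation, a summed-duration difference of at most 5 min per session, and 8 to 12 participants per intonation version of every clip) come from the text.

```python
import random

# Hypothetical clip list: (clip_id, valence, duration in seconds).
# The durations are made up and only fall within the reported 1 m 09s - 5 m 45s range.
CLIPS = [(f"pos{i:02d}", "positive", 69 + 28 * i) for i in range(10)] + \
        [(f"neg{i:02d}", "negative", 75 + 28 * i) for i in range(10)]


def assign_session(clips, rng, max_diff_s=300):
    """Choose an intonation version for every clip in one participant's session."""
    while True:
        version = {}
        for valence in ("positive", "negative"):
            ids = [cid for cid, v, _ in clips if v == valence]
            # Half of the clips of each valence get the emphatic version.
            emphatic = set(rng.sample(ids, len(ids) // 2))
            for cid in ids:
                version[cid] = "emphatic" if cid in emphatic else "adapted"
        total = {"emphatic": 0, "adapted": 0}
        for cid, _, duration in clips:
            total[version[cid]] += duration
        # Accept only if listening time is split roughly evenly (within 5 min).
        if abs(total["emphatic"] - total["adapted"]) <= max_diff_s:
            return version


def assign_all(clips, n_participants=20, lo=8, hi=12, seed=0, max_tries=100_000):
    """Redraw the whole design until every clip version is heard by lo..hi participants."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        sessions = [assign_session(clips, rng) for _ in range(n_participants)]
        counts = [sum(s[cid] == "emphatic" for s in sessions) for cid, _, _ in clips]
        if all(lo <= n <= hi for n in counts):
            return sessions
    raise RuntimeError("no balanced design found within max_tries")


if __name__ == "__main__":
    design = assign_all(CLIPS)
    print(design[0])  # intonation version per clip for the first participant
```

Rejection sampling is chosen here only because it mirrors the constraints one by one; a stricter counterbalancing scheme would also satisfy them but would fix every count at exactly half of the participants.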
After each clip, participants were asked 10 questions which concerned the rating of valence and arousal elicited by the given clip, the feeling of presence, and preferences regarding intonation type. Emotional experience was rated using a tactile version of the SAM (Bradley & Lang, 1994), that is, a SAM scale prepared using the relief printing technique. Participants rated emotional valence and arousal.
As the last step of the procedure, we collected demographic data (age, gender, experience in watching films with AD) and conducted a short interview to see whether the participants spotted any difference between the clips. Finally, we asked participants to state their preference for emphatic or adapted intonation and encouraged them to justify their choice. Sessions ended with debriefing.

Measurements
To measure the sense of presence, we used a modification of the short version of the ITC-Sense of Presence Inventory (ITC-SOPI). The short version of the ITC-SOPI had previously been used to measure presence in AD-related research (Fryer & Freeman, 2013; Walczak & Fryer, 2017). The questionnaire was shortened to limit the fatigue of the participants. We chose two out of three positive subscales: Sense of Physical Space ('I felt I was visiting the places in the scenes'; 'During the clip I had a sense of being in the scenes'; 'I felt surrounded by the scenes') and Engagement ('I felt myself being drawn in'; 'I lost track of time'; 'I paid more attention to the scenes than to my own thoughts'). The ITC-SOPI uses a 5-point Likert-type scale (1 = strongly disagree and 5 = strongly agree). Intonation type ('I liked the way AD was read') was rated on a 5-point Likert-type scale (1 = strongly disagree and 5 = strongly agree).
Heart rate was measured with the Biosignalsplux Explorer device (Biosignalsplux, Lisbon, Portugal) using the ECG sensor supplied with the device. Disposable AgCl electrodes were placed on the left side of the thorax. The signal was digitized at a sampling frequency of 2 kHz using 16-bit analogue-to-digital conversion, with a gain of 1000 and a 0.5-100 Hz bandwidth. R peaks of the QRS complex (Einthoven, 1912) were detected online using the Pan-Tompkins algorithm implemented in custom-written LabVIEW recording software (Lascu & Lascu, 2007). Time intervals between two consecutive R peaks (RR intervals) were converted to heart rate and stored for offline analysis. Then, the heart rate (in beats per minute, bpm) was cleaned of artifacts. Firstly, we compared consecutive values and excluded data points for which the difference in value between one data point and the previous one exceeded the set threshold (more than 20 units). Secondly, we excluded values that were too large (heart rate over 115 bpm) or too small (heart rate under 40 bpm). The heart rate variability (HRV) was assessed using the standard deviation of the interbeat intervals of normal sinus beats, called the SDNN measure (Shaffer & Ginsberg, 2017), calculated in milliseconds. The heart rate data were preprocessed using MATLAB.
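The two cleaning rules and the SDNN computation described above can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB preprocessing. The function names are hypothetical, and three details are assumptions not stated in the text: the 20-unit threshold is read as 20 bpm, each sample is compared with the previous raw sample (rather than the previous accepted one), and the sample standard deviation (n - 1 denominator) is used.

```python
from statistics import stdev


def clean_heart_rate(hr_bpm, max_jump=20.0, hr_min=40.0, hr_max=115.0):
    """Return the heart-rate samples (bpm) that survive both exclusion rules."""
    kept = []
    for i, hr in enumerate(hr_bpm):
        # Rule 1: drop a sample that differs from the previous raw sample
        # by more than max_jump (assumed to be in bpm).
        if i > 0 and abs(hr - hr_bpm[i - 1]) > max_jump:
            continue
        # Rule 2: drop values outside the plausible range (under 40 or over 115 bpm).
        if hr < hr_min or hr > hr_max:
            continue
        kept.append(hr)
    return kept


def sdnn_ms(hr_bpm):
    """SDNN: standard deviation of the interbeat intervals, in milliseconds."""
    ibi_ms = [60000.0 / hr for hr in hr_bpm]  # convert bpm back to intervals in ms
    return stdev(ibi_ms)  # sample standard deviation (assumption)


if __name__ == "__main__":
    rr_ms = [800, 820, 780, 400, 810, 790]    # made-up RR intervals; 400 ms is an artifact
    hr = [60000.0 / rr for rr in rr_ms]       # RR intervals converted to heart rate
    clean = clean_heart_rate(hr)
    print(len(clean), round(sdnn_ms(clean), 2))
```

Note that under the raw-neighbour reading of the first rule, the valid beat immediately following an artifact is discarded as well, because it also differs from its predecessor by more than the threshold.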

Statistical analysis
We chose a mixed model to analyze our data as it makes it possible to account for individual differences in responses to experimental conditions by explicit specification of random effects in addition to fixed ones. In other words, the mixed model is well suited to the situation where idiosyncrasy in individual responsiveness to experimental conditions plays a significant role. Indeed, we expected that ratings of AD would not only be different for emphatic and adapted intonation (fixed effect), but also that the participants would exhibit unique patterns in their preferences for ADV (random effect). Moreover, the mixed model, unlike repeated measures ANOVA, has the advantage of allowing for the seamless integration of nominal and continuous predictors, both present in our analysis, into one statistical model. Consequently, using SPSS version 24 (Armonk, NY: IBM Corp.), we fitted a linear mixed model with the AD rating as the dependent variable and the following four predictors (fixed factors): intonation type (adapted, emphatic) and valence (negative, positive), treated as categorical repeated measures factors, as well as the continuous factors of presence scale score and SDNN. We included a random intercept for each participant and a random slope for each participant for the factor of intonation (accommodating individual differences in response magnitude to voicing type). We assumed a variance components covariance matrix structure for the random factors and a diagonal covariance matrix structure for the repeated measures factors. Effect sizes for each of the fixed effects were estimated with partial R² (Edwards et al., 2008). Selected post hoc comparisons were conducted using Bonferroni-corrected t-tests.
AD ratings depended on the ADV type within both emotional valence categories, i.e., positive (p = .019) and negative (p < .001).
The post hoc comparisons within the intonation category, however, indicated that AD-rating differed between positive and negative clips only in the case of the emphatic intonation (p < .001). There was no significant difference for the adapted intonation (p = .27). These results indicate that the significant main effect of valence in terms of AD-rating can be entirely attributed to differences observed in emphatic intonation. When adapted intonation was used, participants rated it as equally adequate in both positive and negative clips.
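For reference, the linear mixed model described under Statistical analysis can be written out explicitly. The notation below is ours and is only a sketch of the verbal description: i indexes participants, j indexes clips, and c(i, j) denotes the intonation-by-valence condition of a given observation. Reading the diagonal repeated-measures structure as one residual variance per condition is an assumption, and a possible intonation-by-valence interaction term, which the post hoc comparisons suggest but the model description does not list, is omitted.

```latex
\begin{aligned}
\text{Rating}_{ij} &= \beta_0 + \beta_1\,\text{Intonation}_{ij} + \beta_2\,\text{Valence}_{ij}
                     + \beta_3\,\text{Presence}_{ij} + \beta_4\,\text{SDNN}_{ij} \\
                   &\quad + u_{0i} + u_{1i}\,\text{Intonation}_{ij} + \varepsilon_{ij}, \\
u_{0i} &\sim \mathcal{N}(0, \sigma_{0}^{2}), \qquad
u_{1i} \sim \mathcal{N}(0, \sigma_{1}^{2}), \qquad
\operatorname{Cov}(u_{0i}, u_{1i}) = 0, \\
\varepsilon_{ij} &\sim \mathcal{N}\bigl(0, \sigma_{c(i,j)}^{2}\bigr).
\end{aligned}
```

Here u_{0i} is the random intercept and u_{1i} the random slope for intonation; the zero covariance between them corresponds to the variance components structure.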

Mixed model
Also, two continuous covariates were linked to the AD rating, that is, presence and HRV (measured as the SDNN index). The presence scale score was positively related to the AD rating, F(1, 234.0) = 25.43, p < .001, R² = .09. The model predicted that with each additional point on the presence scale, AD ratings increase by 0.035 points (Figure 3). That means that, in line with our expectations, higher presence resulted in higher evaluation of AD. Additionally, the SDNN index was significantly negatively related to the AD ratings, F(1, 253.4) = 8.9, p = .003, R² = .03. According to the obtained statistical model, the AD ratings decreased by 0.007 points for each unit increase in the SDNN (Figure 4). This means that the lower the heart rate variability, the better the AD ratings.
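To give a sense of scale for the two reported slopes, the short sketch below converts them into predicted changes in AD rating. It is illustrative arithmetic only: it assumes the linear relation holds with all other predictors fixed, the function name is hypothetical, and the example differences (10 presence points, 50 SDNN units) are arbitrary values chosen for illustration.

```python
PRESENCE_SLOPE = 0.035  # AD-rating points per presence-scale point (reported above)
SDNN_SLOPE = -0.007     # AD-rating points per unit of SDNN (reported above)


def predicted_rating_change(delta_presence=0.0, delta_sdnn=0.0):
    """Predicted change in AD rating for given changes in the two covariates."""
    return PRESENCE_SLOPE * delta_presence + SDNN_SLOPE * delta_sdnn


if __name__ == "__main__":
    # A 10-point higher presence score corresponds to a rating about 0.35 points higher.
    print(round(predicted_rating_change(delta_presence=10), 3))
    # A 50-unit higher SDNN corresponds to a rating about 0.35 points lower.
    print(round(predicted_rating_change(delta_sdnn=50), 3))
```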

Interviews
All participants were able to detect a difference between the clips and attribute it to the intonation type. During the discussion, participants noticed that as a general rule, there seem to be three AD intonation types in Poland. They described the first one as 'too emotional', 'exaggerated' or 'acted', and the second as 'flat', 'synthetic' and 'robot-like'. The third one was compared to voice-over, which, according to the participants, shows some interpretation, but does not reveal emotions. As one of the participants said: 'I really like when AD is read with a little bit of interpretation. I mean I don't like it when it's acted, but I also don't like it when it's completely flat. Lektor-like is the best way.'
We believe that the intuitive categories put forward by the participants of our study can be fitted into the threefold classification put forward by Cabeza-Cáceres (2013), i.e., emphatic, uniform and adapted.
Participants were very clear and passionate about their preference for the AD intonation type. Most of them pointed to the adapted intonation as their preferred option since they found it unobtrusive and engaging at the same time, e.g.: 'I like it when AD is slightly interpreted. Not exaggerated, not actor-like but also not completely flat […]. Exaggerated acting draws too much attention, but with some acting and some intonation the AD is nicer to listen to.' Some participants literally referred to the adapted intonation as the golden mean. As two of them said: 'If I had to choose between flat, slightly interpreted or acted, I would definitely choose the slightly interpreted one. I noticed this style in some of the clips I saw today. It's a good golden mean.'
'I would say that we should aim for the golden mean. It will fit with every movie. Both acted and flat reading can really destroy the film. It's irritating and discouraging.'
While some adapted intonation in ADV is desirable, most participants noticed that the voice should be well matched with the film, and the intonation type should match the character of a scene. As one of the participants put it: 'The voice should be well suited to the film and also to the genre. To give you an example of such adaptation: I think that a more cheerful voice in romantic comedies would be good. Or in an action or military film, the voice could be more soldiery.'
At the same time, all participants were very critical about the emphatic intonation. As some of them explained, it is disturbing and draws their attention away from the film: 'I noticed that the female voice-talent was getting too emotional sometimes. The script was excellent, but the reading was flustering. It was taking up my entire attention.'
An important issue raised by many participants was the fact that emphatic intonation does not allow them to experience the film in their own way: 'I want to concentrate on what is happening and live it my way. I don't want the voice talent to tell me what I am supposed to feel. If AD is well written, we do not need voice-talent's emotions. Some interpretation in AD voicing is good, but it shouldn't be overdone.'

Discussion
Results obtained in the experiment show that participants prefer the adapted intonation, which is considered to be the golden mean. What is more, our results show that film genre, voice type, and emotional valence of a scene can affect the preference for a particular ADV. The interviews with the participants suggest that emphatic intonation could be beneficial for certain genres (e.g., romantic comedies, war films), but a poorly matched voice (e.g., a cheerful or comedy voice in a horror film) can distract the audience. When it comes to emotional valence, it seems that the issue of intonation type is especially sensitive when dealing with negative emotions. Participants judged emphatic intonation in negative clips as the least adequate. This could be due to the fact that emphatic intonation unnecessarily enhances emotions that are already very clearly expressed through the soundtrack or dialogues. As stressed by participants, in this particular case, emphatic intonation not only distracts them but also makes them feel as if they were being told what they should feel.
All in all, we might conclude that similarly to scripting, the choice of intonation type should not be considered to be a 'binary opposition of objective vs. subjective' (Mazur & Chmiel, 2012); these are two extremes of a continuum.
Another finding of our study is the link established between presence, HRV, and AD rating. The decrease in HRV with higher AD rating can be interpreted as engagement of attention in the presented clip, following previous studies that showed a lower HRV associated with sustained attention during a task (Griffiths et al., 2017; Richards & Casey, 1991). The attentional engagement in a clip might have also resulted in a deeper emotional engagement with the plot, leading to lower HRV. For example, studies showed that fear-evoking clips are associated with lower HRV (for review see Kreibig et al., 2007). The relationship between attentional and emotional engagement and its link to both physiological and self-reported indices of immersion should be addressed in further studies on audience reception of emotional clips. It seems that the more involved (higher presence scores) and focused (lower HRV) the audience was, the higher the AD rating. This observation is supported by the interviews with the study participants, who repeatedly referred to the emphatic intonation as distracting, which might make presence more difficult to achieve.
Further research on ADV is needed since our results may be biased by some limitations. First of all, Polish audiences are used to voice-over and thus might prefer the lektor-like intonation, which, as study participants noticed, is per se neither flat nor emotionless. Different results might be obtained in countries with different reading traditions. However, results obtained by Cabeza-Cáceres (2013) may suggest that adapted intonation is the preferred option even for audiences used to uniform intonation. Another issue is that, in order to limit potential variables, all our clips were recorded by the same female voice-talent. As many of the participants noticed, the voice used was not well suited for all the clips. Participants perceived the voice as rather cheerful and claimed that it did not go well with the negative clips. An important limitation of our study is the choice of participants. As we already mentioned, we did not factor in the degree of sight loss or age of the participants, which varied quite considerably. This, however, is a common issue in studies involving participants with vision loss, a group that is relatively hard to reach. The last potential limitation is the duration of the clips used in the experiment; however, we believe that the clip length is justified by the aims of our study. We chose short clips for a number of reasons. Firstly, in an experimental paradigm with multiple clips, it is necessary to expose the participants to a variety of stimuli while maintaining a reasonable duration of the experiment. The use of longer clips could reduce the number of stimuli employed and thus the generalizability of the results. Secondly, short clips make it easier to control the emotions to be elicited and to avoid co-elicitation of opposing emotions (e.g., fear vs. amusement). Thirdly, they reduce the impact of random variables, e.g., a peculiar reaction of a participant to a given clip.
Short clips have been successfully used in psychophysiology and neuroscience (Bos et al., 2013; Schaefer et al., 2010), where it is claimed that clips that last between one and two minutes are long enough to provide the viewer with an understanding of the plot, engage their attention and change their affective state, while clips longer than three minutes may prompt 'carryover effects on the following excerpts' (Maffei & Angrilli, 2019, p. 3). As for the optimal duration of a presence stimulus, the relevant literature offers no clear guidance: authors use stimuli of 100 s (Freeman et al., 2000), 10 min (Rigby et al., 2016), and even 45 min (Troscianko et al., 2012). This issue has also not been addressed in the field of audio description, where authors have used varying numbers of clips lasting between 2 and 12 min (Fryer & Freeman, 2014; Walczak & Fryer, 2017). Our results show that the stimuli used in the experiment resulted in a certain level of presence. Of course, it could have been higher if we had used longer clips, but this does not invalidate the findings of our study.

Conclusions
The current article examined the effects of different AD intonation types on audience experience. The most important result is that the less expressive, adapted intonation was rated as more appropriate, regardless of the emotional content of the film clip. Even though participants were more willing to accept emphatic intonation for positive clips and argued that, in some cases, emphatic intonation could make a given scene more attractive, on the whole, they gave higher ratings to those positive clips that were read with adapted intonation. The emphatic intonation proved to be particularly distracting and was evaluated as inappropriate in the case of clips evoking negative emotions. This might be linked to the poor matching of the voice to the emotional content of the negative clips, corroborating the findings of Kobayashi et al. (2010), who concluded that inappropriate use of emotional voices may have a negative effect on video experience.
Moreover, we observed a link between the individual experience (i.e., the sense of physical space, engagement, emotional reaction, the focus of attention) and the evaluation of the intonation type. It seems that participants gave higher ratings to the intonation type which allowed them to focus more on the emotional content of the film. This pattern of results is confirmed by the interviews with the participants, who stated that they preferred non-intrusive intonation. Interestingly, our study clearly shows that intonation does not need to mirror the emotions of the film but rather should allow listeners to make their own interpretation. However, an important point to remember is that the study participants made it very clear that adapted intonation is not synonymous with flat and emotionless reading. This is in line with some of the AD guidelines (e.g., ITC, 2000).
It is also clear that further research is needed to fully understand the nature of different AD intonation types. Their current definitions put forward by Cabeza-Cáceres (2013) are rooted in the available AD guidelines, which, as already mentioned, are vague. We agree with Matamala et al. (2020, p. 71) that 'we need to use linguistic tools to analyze prosodic values if we want to go beyond impressionistic suggestions and make research-based recommendations'. Other issues that need to be addressed are the relationship between intonation type, type of voice, and film genre as well as replication of our experiment on speakers of different languages who are used to different ADV traditions. Another avenue to explore is the relationship between AD intonation type and the cognitive effort required to process it.
Last but by no means least is the burning issue of methodology in audiovisual translation and media accessibility. The common consensus is that while there has been a rise in experimental research in audiovisual translation and media accessibility, it is still not consolidated (Díaz Cintas & Szarkowska, 2020; Orero et al., 2018). One of the ways towards consolidation is inviting researchers from other domains (O'Brien, 2013). This study brought together experts from psychophysiology and audiovisual translation. What at times seemed like a methodological clash has been an invaluable experience which hopefully will contribute to shaping methodological paradigms of audiovisual translation.

Notes
1. Voice-over is the predominant AVT modality on Polish television (Szarkowska, 2009).
2. Even though voice-over is usually described as monotonous and flat (Bogucki, 2010), some researchers and practitioners underline that intonation and modulation are important aspects of quality voicing in voice-over (Chłopicki in personal communication, October 15, 2020; Woźniak, 2012).
3. Lektor is the name used in Poland for voice-talents who specialize in voice-over reading.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work has been supported by the Spanish Ministry of Economy, Industry and Competitiveness (FFI2015-64038-P, MINECO/FEDER, UE) within the research group Transmedia Catalonia.

Notes on contributors
Anna Jankowska, PhD, is Research Professor at the University of Antwerp and former Assistant Lecturer in the Chair for Translation Studies and Intercultural Communication at the Jagiellonian University in Krakow (Poland) and visiting scholar at the Universitat Autònoma de Barcelona within the Mobility Plus program of the Polish Ministry of Science and Higher Education (2016-2020). Her recent research projects include studies on accessibility technologies, user experience, the viability of translating audio description scripts from foreign languages and multiculturalism in audio description. She is also the founder and president of the Seventh Sense Foundation which provides audio description and subtitles for the deaf and hard of hearing. Anna is a member of ESIST and Editor-in-Chief of the Journal of Audiovisual Translation.
Joanna Pilarczyk, PhD, is a Post-doc at the Emotion and Perception Lab at the Institute of Psychology, the Jagiellonian University in Krakow, Poland. Her PhD thesis concerned the impact of image features on the viewers' emotional response. Her current projects focus on the role of beliefs and motivations in the processing of social and emotional stimuli. In her projects, she uses eye-tracking, pupillometry, electrocardiography, and neuroimaging.
Kinga Wołoszyn, MA, obtained her Master's degree in psychology at the Jagiellonian University (JU) in Kraków. Her PhD thesis at the Psychophysiology Lab concerns the processing of emotional material within the framework of embodied cognition. With the team, she has been conducting research on the physiological and attentional aspects of the processing of emotional natural scenes and its neural bases. In her experiments, she uses various methods, including electroencephalography (EEG), functional magnetic resonance imaging (fMRI), eye-tracking, pupillary response, electromyography (EMG), and electrocardiography (ECG). She is also a collaborator in the New Approaches to Accessibility project.
Michal Kuniecki, PhD, is an Assistant Professor at the Institute of Psychology at the Jagiellonian University in Krakow (Poland). He studies emotion and visual perception. His interests include the role of formal features of visual stimuli and their meaning in engaging spatial attention and eliciting emotional responses. In his work, he utilizes a whole array of psychophysiological methods, such as EEG, fMRI, eye-tracking, pupillary response, and heart rate.