A methodological investigation of the Intermodal Preferential Looking paradigm: Methods of analyses, picture selection and data rejection criteria

The Intermodal Preferential Looking paradigm provides a sensitive measure of a child’s online word comprehension. To complement existing recommendations (Fernald, Zangl, Portillo, & Marchman, 2008), the present study evaluates the impact of experimental noise generated by two aspects of the visual stimuli on the robustness of familiar word recognition with and without mispronunciations: the presence of a central fixation point and the level of visual noise in the pictures (as measured by luminance saliency). Twenty-month-old infants were presented with a classic word recognition IPL procedure in 3 conditions: without a fixation stimulus (No Fixation – noisiest condition), with a fixation stimulus before trial onset (Fixation, intermediate), and with a fixation stimulus, a neutral background and equally salient images (Fixation Plus – least noisy). Data were systematically analyzed considering a range of data selection criteria and dependent variables (proportion of looking time towards the target, longest look, and time-course analysis). Critically, the expected pronunciation and naming interaction was only found in the Fixation Plus condition. We discuss the impact of data selection criteria and the dependent variable choice on the modulation of these effects across the different conditions. © 2015 The Authors. Published by Elsevier Inc. This is an open access article under the


Introduction
Over the last four decades, a considerable amount of energy and creativity has been devoted to designing and testing numerous experimental methods to investigate early speech perception and language comprehension in young children. One of the most popular methods is the head-turn preference paradigm (Polka & Bohn, 1996; Werker, Polka, & Pegg, 1997; Werker & Tees, 1983, 1984), which is primarily used with infants aged 5 to 16 months to investigate listening preferences and discrimination. The study of word recognition or word learning from the age of 12 months (Schafer & Plunkett, 1998) relies on two paradigms, the Switch task (Stager & Werker, 1997; Werker, Fennell, Corcoran, & Stager, 2002) and the Intermodal Preferential Looking paradigm (IPL; Bailey & Plunkett, 2002; Golinkoff, Hirsh-Pasek, Cauley, & Gordon, 1987; Swingley & Aslin, 2000), also called the looking-while-listening procedure (Fernald, Zangl, Portillo, & Marchman, 2008).
The standard procedure consists of presenting pairs of images horizontally on a screen for several seconds and, mid-trial, playing a target word or a carrier sentence. A trial is thus divided into a pre-naming and a post-naming phase (for longer post-naming windows, see Zangl, Klarman, Thal, Fernald, & Bates, 2005). For the duration of the experiment, eye movements are recorded by cameras mounted above each image (or by eye-tracking when available). These eye movements are then time-locked to each trial and traditionally coded manually, frame by frame. With the growing use of eye trackers, gaze coding is now often automatic. Fernald and colleagues (2008) provided a comprehensive review of the procedure and of the improvements added progressively since the paradigm was first introduced, and listed factors that need to be controlled: both images have to be matched for size and visual salience; auditory stimuli have to be controlled for duration across items; target side has to be counterbalanced overall. One of their recommendations is that, across all participants, both objects in a given trial should be used as target and as distracter, since this is the best control against any preference for one stimulus over another. Although desirable, such a control is not always possible given the restricted choice of stimulus items for young children and the need for a sufficient number of trials per participant. Some experiments have controlled for this possible preference effect (Mani & Plunkett, 2007; Swingley & Aslin, 2000, 2002), presenting the same visual stimuli at least twice, while others have not (Durrant, Delle Luche, Cattani, & Floccia, 2014; Floccia, Delle Luche, Durrant, Butler, & Goslin, 2012; Mani et al., 2008) but still found comparable results. One possible compromise, as suggested by Fernald et al. (2008), is to ensure an equal preference for both pictures during the pre-naming phase by monitoring looking times in silence during a pilot experiment. However, to our knowledge, such a pretest or control for the absence of a pre-naming image bias is not reliably reported in the literature (with the exception of the results section in Swingley, Pinto, & Fernald, 1999). Another way of controlling for pre-naming visual preferences is to take looking behaviour in the pre-naming phase into account in the statistical analyses, which is frequently reported (e.g., Mani & Plunkett, 2007; Meints, Plunkett, & Harris, 1999).
Finally, Fernald et al. argue that, contrary to adult visual experiments, a central fixation point right before naming is not necessary as children would not follow such an implicit instruction (note that White & Morgan, 2008, used a centering stimulus before the pre- and the post-naming phases, while Gurteen et al., 2011, used a centering light before post-naming).
Despite the excellent review by Fernald et al. (2008), the relatively recent addition of the IPL paradigm to the field of developmental psycholinguistics means that researchers often face choices regarding the procedure itself and the methods of analysis, all of which can have important consequences for the observation of an experimental effect. For example, as demonstrated by Arias-Trejo and Plunkett (2010), choosing a distracter image which is perceptually close to the target image (e.g., a balloon paired with an egg) can result in unwanted interference effects, so that 18- to 24-month-olds fail to identify the target image. The consequences of other methodological choices (such as the use of a fixation point) for the robustness of the experimental effects are largely unknown. As we will show below, a review of the recent literature reveals a great deal of variation in many aspects of the procedure, as well as in the selection of the dependent variables used for the analysis of looking times. Appendix A provides details on methodological aspects such as the presence of a central fixation stimulus or the duration of the pre- and post-naming phases across a range of studies that have used the IPL methodology. The aim of the current study is to complement and extend Fernald et al.'s review by examining how the different methodological choices and the different methods of looking time analysis affect the observation of significant results. In three experiments testing familiar word recognition with 20-month-olds, we manipulated the level of visual noise (with saliency as measured by luminance and the presence of a central fixation point), and provided a systematic and thorough analysis of looking times using those methods most representative of the current literature. The main objective of this paper is to provide researchers with some data-grounded recommendations about best practices when using the IPL procedure.

Central fixation point
A review of the literature using the preferential looking paradigm shows that around half the experiments use a centering stimulus, visual or auditory, at the beginning of each trial (Curtin, 2010; Dittmar, Abbot-Smith, Lieven, & Tomasello, 2008; Meints et al., 1999; Meints, Plunkett, Harris, & Dimmock, 2002; Meints, Plunkett, Harris, & Dimmock, 2004; Schafer & Plunkett, 1998; see Appendix A), while the other half do not report such a practice (Bailey & Plunkett, 2002; Ballem & Plunkett, 2005; Mani et al., 2008; Mani & Plunkett, 2007, 2008). This stands in contrast with ERP studies, and adult experiments generally, where a fixation stimulus is systematically presented to centre the participant's attention before trial onset (Kuipers & Thierry, 2011). When a central fixation point is used, trials are always triggered by the experimenter; when no such cue is used, this is not systematically the case, and trials are sometimes automatically interspaced (Ramon-Casas et al., 2009; Swingley, 2003, 2007).
Although it is not possible, from these studies, to draw direct comparisons between results obtained with and without fixation points given the variety of investigated topics, one can expect that centering attention, even briefly, should benefit the procedure and the quality of the data, especially since it ensures that the child is attentive to the screen immediately before trial onset. The presentation of an image in the centre of the screen after termination of a trial has multiple advantages: (i) since the trial is triggered only if the child is looking at the centre, it ensures the child is attentive and active; (ii) attention is attracted back to the middle of the screen, giving the same weight to the probability that the first look will be at the target or the distracter once the trial begins; (iii) looking at the centre right before trial onset should encourage the child to explore all new stimuli that appear in her peripheral vision, which is of particular importance given that trials where the child does not look at both images in the pre-naming phase can be discarded from the statistical analyses (e.g., Mani & Plunkett, 2007); and (iv) by having something to look at for the whole duration of the experiment, the entire procedure becomes dynamic and eventful, maintaining the child's interest. One of the aims of this study will be to verify whether looking behaviour is affected by the potential noise reduction provided by a fixation point in a classic IPL task.

Fernald et al. (2008) recommended controlling the visual stimuli for size, animacy and salience (in the sense of visually engaging images, especially by matching objects for animacy). Regarding saliency, experimenters decide on the pictures without any objective measures. Visual stimuli are often static colour photographs of objects on a white or grey background (e.g., Mani & Plunkett, 2007; Swingley & Aslin, 2000), sometimes a mix of realistic drawings and photographs (Swingley, Pinto, & Fernald, 1999), or quite exceptionally line drawings, coloured (White & Morgan, 2008; White, Morgan, & Wier, 2005) or plain (with made up animals, Mather & Plunkett, 2011). So as to enhance interest in the visual stimuli, pictures sometimes move in synchrony on a vertical axis (Swingley, 2003; Swingley & Aslin, 2002, 2007). In word learning studies, made up objects (obtained by editing colour photographs as in Schafer & Plunkett, 1998) are visually comparable to real objects. To our knowledge, only Gurteen et al. (2011) have presented real objects to the participating infants.
Such variability in the selection of visual stimuli in the literature has been enabled by the possibility of retrieving images and photographs from the internet, departing from the perceptually controlled line drawings of Snodgrass and Vanderwart (1980) or their coloured version (Rossion & Pourtois, 2004) to achieve more naturalistic representations. One way to control for the quantity of information provided by photographs is to remove any background, even though the percept then looks less naturalistic. This practice is supported by Meints et al. (2004), who showed that background affected word recognition. Younger children (15 months) do not recognize a sheep when it is pictured with a naturalistic background (e.g., a sheep on grass), while they do when the sheep is presented without background or on an unusual (or less rich) background. Older children recognize the sheep regardless of background, although the distracting effect of the typical background was still observed to a certain extent.
Perhaps estimating the basic visual salience of the stimuli would be another, quantifiable, step towards ensuring that target images are not more attractive than their corresponding distracters, to complement experimenter judgments. Note that this is a purely visual control of salience, distinct from the subjective salience discussed by Fernald et al. (2008) or the more cognitive saliency maps (for a review, see Althaus & Mareschal, 2012). This will be achieved in the current study by performing cross-correlations of the pairs of images presented (Chinga & Syverud, 2007). Images are transformed into a matrix containing the luminance of each pixel, then into a vector. The cross-correlation then compares the two vectorized images: a high correlation score means that the two images are similar in salience. By contrasting a more and a less visually noisy set of pictures, we will examine the effect of visual noise on looking behaviour.
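The luminance-based comparison described above can be sketched as follows. This is a minimal illustration and not the procedure actually run for the study: the Rec. 601 weights used to convert RGB to luminance, the function names, and the requirement that both images have identical pixel dimensions are all assumptions made for the example.

```python
import numpy as np

# Assumption: luminance is approximated with Rec. 601 luma weights.
# The source does not state which luminance formula was used.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def luminance(rgb):
    """Turn an (height, width, 3) RGB array into a luminance matrix."""
    return np.asarray(rgb, dtype=float) @ LUMA_WEIGHTS


def luminance_correlation(img_a, img_b):
    """Pearson correlation between two vectorized luminance matrices.

    Both images must have the same pixel dimensions (an assumption of
    this sketch; resizing is left to the caller).
    """
    a = luminance(img_a).ravel()  # matrix -> vector
    b = luminance(img_b).ravel()
    return float(np.corrcoef(a, b)[0, 1])


def mean_abs_correlation(pairs):
    """Average of |r| over a list of (target, distracter) image pairs."""
    return float(np.mean([abs(luminance_correlation(t, d)) for t, d in pairs]))
```

A set of target-distracter pairs with a higher `mean_abs_correlation` would, under these assumptions, count as better matched for luminance saliency.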

Methods of analysis
Regardless of the task the participant is engaged in, all experiments in the IPL literature divide an experimental trial into a pre- and a post-naming phase, usually based on the onset of the target word (with the exception of priming studies in which there is no pre-naming phase, e.g., Arias-Trejo & Plunkett, 2009). The selection of analysable data as well as the choice of dependent variable varies according to research groups and, on occasion, differs for a single experiment (see Appendix A). The most typical time window of interest starts 367 ms after word onset (and less frequently word offset): indeed it has been established that a minimum of 233 ms is necessary to obtain stimulus-related saccades, and that this latency depends on age, vocabulary size or task complexity (Fernald et al., 2008; Fernald, Perfors, & Marchman, 2006; Zangl & Fernald, 2007; Zangl et al., 2005). This time window usually ends at 2000 ms, as it is generally considered that later looking behaviour is no longer related to the processing of the auditory stimulus. When plots of the time course for proportions of looks to the target are included, usually at the end of the results section (Arias-Trejo & Plunkett, 2010; Fernald et al., 2008; Swingley & Aslin, 2000), the time window can then be justified a posteriori, the visual inspection confirming that roughly 2000 ms after word onset, looking behaviour returns to chance level (that is, equal looks to the target and the distracter). However, since latency is a function of at least age (Zangl & Fernald, 2007) and vocabulary size (Fernald et al., 2006), the whole looking behaviour can also be influenced by task difficulty. This is the case for example when mispronunciations of words are minor (Mani et al., 2008; Mani & Plunkett, 2011a; Swingley, 2007; Swingley & Aslin, 2002). A systematic time window, regardless of data distribution in the post-naming phase, may overlook meaningful looks if children are still looking at the target after 2000 ms. It seems recommendable (see Fernald et al., 2008) that the first, and not the last, step in data analysis should be the systematic plot of the unfolding looking behaviour, so as to ascertain that the window of analysis comprises all the word processing and task related looks. This is common practice in EEG or MEG experiments, since electrophysiological markers can vary in location and time period (e.g., Bastiaansen, van der Linden, ter Keurs, Dijkstra, & Hagoort, 2005; Hagoort & Brown, 2000). Perhaps a way to enhance the precision of the results would be to determine statistically the exact time window when the two conditions differ (target vs. distracter in simple naming tasks, correct vs. incorrect pronunciation in mispronunciation experiments). To our knowledge, with the IPL paradigm, only one instance of such an analysis has been published so far (von Holzen & Mani, 2012; see Maris & Oostenveld, 2007, for a detailed explanation); the authors showed that in addition to standard comparisons of looking times averaged across pre- and post-naming trials, it was possible to identify accurately a specific time window where performance between conditions differed (in their case, 1140-1580 ms after target word onset). There are two advantages to this method: first, representing time course plots gives a dynamic evaluation of visual/linguistic processing; secondly, this data driven method is objective and prevents a priori judgement of the data.
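To make the idea of statistically locating the window where two conditions differ more concrete, here is a simplified sketch in the spirit of a cluster-based permutation test. It is not the implementation used by von Holzen and Mani (2012) or described by Maris and Oostenveld (2007): the input layout (one row per participant, one column per time bin, holding the difference between conditions), the fixed t threshold, the sign-flipping scheme and all function names are assumptions chosen for illustration.

```python
import numpy as np


def t_values(diff):
    """One-sample t statistic per time bin (rows = participants)."""
    n = diff.shape[0]
    return diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))


def find_clusters(t, threshold):
    """Return (first_bin, last_bin, summed_t) for each run of adjacent
    bins whose t value exceeds the threshold with the same sign."""
    clusters = []
    for sign in (1, -1):
        above = sign * t > threshold
        start = None
        for i, flag in enumerate(above):
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                clusters.append((start, i - 1, float(t[start:i].sum())))
                start = None
        if start is not None:
            clusters.append((start, len(t) - 1, float(t[start:].sum())))
    return clusters


def cluster_permutation_test(diff, threshold, n_perm=1000, seed=0):
    """Attach a permutation p value to each observed cluster.

    The null distribution is built by randomly flipping the sign of each
    participant's difference curve and keeping the largest cluster mass.
    """
    rng = np.random.default_rng(seed)
    observed = find_clusters(t_values(diff), threshold)
    null_max = np.zeros(n_perm)
    for p in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        permuted = find_clusters(t_values(diff * signs), threshold)
        null_max[p] = max((abs(m) for _, _, m in permuted), default=0.0)
    return [
        (s, e, m, float((np.sum(null_max >= abs(m)) + 1) / (n_perm + 1)))
        for s, e, m in observed
    ]
```

With 40 ms frames, a cluster spanning bins `s` to `e` would correspond to the interval from `s * 40` ms to `(e + 1) * 40` ms after the start of the analysed period; the threshold would typically be the critical t value for the sample size at hand.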
In relation to the looking behaviour, two types of dependent variables are usually considered: proportion of target looking (taking into account, or not, the pre-naming phase, see Appendix A), and latency of shifts to the target. They are respectively assimilated to a correct response and a reaction time (see Fernald et al., 2008). Other measures have also been used, and usually provide comparable direction of results, such as total looking time (that is, the sum of looks to the target during the post-naming phase minus those to the distracter), or the longest look (longest single fixation to the target).
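The measures listed above can be illustrated with a short sketch operating on frame-by-frame codes for a single trial. The 40 ms frame duration, the 2500 ms word onset and the 367-2000 ms window are taken from the text; the code labels ("T" for target, "D" for distracter, anything else for centre or away), the choice of target plus distracter frames as the denominator of the proportion, and the function names are assumptions made for the example.

```python
FRAME_MS = 40          # duration of one coded video frame
WORD_ONSET_MS = 2500   # onset of the target word within the trial
WINDOW_MS = (367, 2000)  # analysis window relative to word onset


def frames_in_window(frames, start_ms, end_ms):
    """Keep the frames whose start time falls in [start_ms, end_ms)."""
    return [c for i, c in enumerate(frames) if start_ms <= i * FRAME_MS < end_ms]


def proportion_target(frames):
    """Target frames over target + distracter frames (None if neither)."""
    t, d = frames.count("T"), frames.count("D")
    return t / (t + d) if t + d else None


def longest_look(frames, code="T"):
    """Duration in ms of the longest uninterrupted run of `code`."""
    best = run = 0
    for c in frames:
        run = run + 1 if c == code else 0
        best = max(best, run)
    return best * FRAME_MS


def trial_measures(frames):
    """Compute the dependent variables for one trial."""
    pre = frames_in_window(frames, 0, WORD_ONSET_MS)
    post = frames_in_window(
        frames, WORD_ONSET_MS + WINDOW_MS[0], WORD_ONSET_MS + WINDOW_MS[1]
    )
    pre_ptl, post_ptl = proportion_target(pre), proportion_target(post)
    both = pre_ptl is not None and post_ptl is not None
    return {
        "pre_ptl": pre_ptl,
        "post_ptl": post_ptl,
        # Naming effect: post-naming minus pre-naming proportion.
        "naming_effect": post_ptl - pre_ptl if both else None,
        "longest_look_ms": longest_look(post, "T"),
        # Target minus distracter looking time in the post-naming window.
        "looking_time_diff_ms": (post.count("T") - post.count("D")) * FRAME_MS,
    }
```

Subtracting the pre-naming proportion from the post-naming one is one way of taking the pre-naming phase into account; latency of the first shift to the target is not covered by this sketch.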
Data filtering or pre-processing is where the greatest variation across experiments is observed (see Appendix A). In word recognition tasks, with or without a mispronunciation element, words are selected so that they are likely to be known a priori by all participants according to standardized norms (thus reducing considerably the number of potential stimuli; see Ramon-Casas et al., 2009; Swingley et al., 1999), or by at least 50% of children of the corresponding age (e.g., Styles & Plunkett, 2008). Then, some authors further filter the data by analyzing only trials where parents report the words as known by the child (e.g., Mani & Plunkett, 2011b). Whether or not it is necessary to check for infants' knowledge of the distracter will be addressed here. On the one hand, if the child does not know the distracter, she might be looking more at the target once it has been named simply because the target object is the only object for which she has a name, and not because she recognizes the link between the label and this object; this would artificially inflate the target looking time. On the other hand, a child who does not know the distracter's name would look longer at its picture in mispronunciation trials, in the spirit of the study by White and Morgan (2008) whereby unknown objects were presented as distracters. This may be an advantage for strengthening mispronunciation effects. The impact of filtering out trials where the child does not know both the target word and the distracter will be evaluated in the current study.
The criteria used to select the attended trials are also variable, with some considering only long enough fixations to the images (1500 ms in each phase, Bailey & Plunkett, 2002), or fixation to both images in the pre-naming phase or, at least, at some point throughout the trial (Mani & Plunkett, 2007).
Finally, data cleaning is achieved by excluding participants not contributing to all experimental conditions (Fernald et al., 2006; second analysis in Styles & Plunkett, 2008), or whose data points fall outside normality (Fernald et al., 2006; see also Mani & Plunkett, 2007). Although all these types of pre-processing or filtering allow for cleaner datasets, comparability across experiments would benefit from consistent practice, preferably on the measures that are the most conservative. Reaching an agreement on exclusion criteria or on the best-suited age-specific time window would be desirable. Indeed, some analyses only include trials where the child is looking at a picture (final analysis in Fernald et al., 1998), while others reject children not contributing to all conditions (e.g., Fernald et al., 2006; second analysis in Styles & Plunkett, 2008), set a minimum looking time (e.g., Bailey & Plunkett, 2002; Ballem & Plunkett, 2005) or only include trials where both pictures are fixated (Mani & Plunkett, 2007, 2008).
The goal of the present research is to examine how resistant the different methods of data analysis are to methodological alterations, or noise, such as image quality and the presence of a central fixation stimulus. For this purpose, we ran three versions of a classic IPL procedure testing the detection of mispronunciations of familiar words, varying the pictures' saliency and background, and the presence of a fixation point. For each experiment, we evaluated how the degree of visual noise, together with the different criteria for data selection, modified the different dependent variables.
Here, the stimuli were comparable to those in Mani and Plunkett (2007), in which 15- to 24-month-olds were presented with two images on both sides of a screen and heard, mid-trial, the target word either correctly pronounced in a carrier sentence such as "Look, dog!" for half of the trials, or mispronounced ("Look, bog!") in the other half of trials. If children recognize lexical entries of familiar words only if they are pronounced correctly (as is expected from the age of 18 months, e.g., Mani & Plunkett, 2007), a naming effect should be observed in correctly pronounced trials, but not (or significantly less) in mispronunciation trials, resulting in an interaction between naming and pronunciation, which is the key result in these studies. The participants in the current study are 20-month-olds, and so results comparable to those of 18-month-olds can be expected, if not stronger, because their lexical repertoire increases steadily and their phonological sensitivity seems stable around these ages (for a developmental Switch task, see Werker et al., 2002).
As presented earlier, we hypothesize that adding a fixation stimulus between trials should enhance the quality of the data. Indeed, since experimental trials are then only triggered when the child is attentive to the screen, post-hoc measures of attentiveness (by checking the videos, see Fernald et al., 2008; or excluding trials with looks <1500 ms, see Bailey & Plunkett, 2002) are less critical, yet still desirable. The added advantage of having a centred fixation stimulus is that it should entice the participants into looking at the two images that appear in their peripheral vision at trial onset.
The second manipulated methodological choice is the uniformity of picture background and the choice of visual stimuli. As tested by Meints et al. (2004), a typical background is distracting and reduces the naming effect in younger children, and to a certain extent in 18-month-olds. To capitalize on these findings and fully appreciate the extent of any distraction relating to a naturalistic background on the naming effect, we compared two sets of images, one where the content was naturalistic (mostly with a background), and one where there was no background, leaving the stimuli in isolation. In addition, we examined the effect of image saliency on looking behaviour, especially in the pre-naming phase, which, to our knowledge, has never been investigated experimentally in infants, despite recommendations to control for it (Fernald et al., 2008).
To sum up, we manipulated the presence/absence of a central fixation point together with the uniformity of picture background and picture salience, to evaluate the effect of experimental noise in an auditory word recognition task. In the first condition (No Fixation), no central fixation stimulus was used and no attempt was made to control for the picture background colour or salience. This is the noisiest condition. In the second condition (Fixation, intermediate noise condition), a central fixation point was added and the same images were used as in the No Fixation condition. In the third condition (Fixation Plus), the central fixation point was augmented by a systematic absence of background colour for pictures and target-distracter pairs were selected so that their salience was highly correlated. This is the least noisy condition.
We expect that adding the fixation stimulus should encourage more looks to both images in the pre-naming phase, leading to cleaner data in terms of trial rejection, and more balanced looks towards target and distracter (as seen for example by shorter "longest look" measures). The addition of better controlled image background and saliency should also help enhance the identification of objects, reducing the "back and forth" between target and distracter, especially during the post-naming phase. This would translate into longer "longest look" measures but also into a larger naming effect in correct trials.
To fully estimate the impact of experimental stimuli manipulations on data quality, and therefore get a clear picture of the robustness of the method, we will provide different types of analyses. First, we will evaluate results for the classic 367-2000 ms post-naming time window and will examine effects of conditions on the different dependent variables used in the literature. We will also look at the impact of the different criteria used for trial rejection on the results. Then we will present a newer type of analysis that takes the time course of looking behaviour into consideration and thus avoids averaging over the whole post-naming window (von Holzen & Mani, 2012). While time course is seemingly the crucial aspect of the IPL paradigm and other visual paradigms, it has rarely been exploited in IPL experiments, yet it should provide informative and complementary results.

Methods
In this experiment the classic mispronunciation IPL paradigm is used: two objects are presented side by side on a screen and one is named halfway through the trial. Pronunciation is correct for half the trials (e.g., "Look! Bed!"), and incorrect for the other half ("Look! Bud!", for bed). We developed three versions. In the No Fixation condition, no central fixation stimulus was used and images were simply controlled for suitability and size but not for background, colour or perceptual saliency (they mostly had a naturalistic background). In the second, Fixation condition, a fixation stimulus was added before the start of every trial, and in the last condition, Fixation Plus, this was augmented by controlling the background colour of the pictures and their relative saliency.

Participants
Participants from the three conditions were matched for gender, age and vocabulary scores as measured by the Oxford Communicative Development Inventory (Hamilton, Plunkett, & Schafer, 2000).
The population in the Fixation condition served as the baseline for the selection of participants in the two other conditions as they were matched on their OCDI scores in comprehension.

Stimuli
The stimuli were 32 monosyllabic consonant-initial words taken from the OCDI (see Appendix B), half targets and half distracters. All are imageable nouns. Target words were judged as known by at least 40% of 20-month-olds (database from the Oxford Babylab), and they were paired with a distracter sharing the same onset consonant and phonemic structure (e.g., dog and duck). Out of the 16 targets, participants heard 8 labels that were correctly pronounced and 8 that were mispronounced. Following Mani and Plunkett (2007), mispronunciations were obtained by changing one phoneme on one or more dimensions, either on the onset consonant or the vowel (4 trials each), leading to a pseudoword or a very rare word in infant-directed speech (e.g., bud).
The speech stimuli were recorded in an enthusiastic, child-friendly manner by a female native speaker of British English. Recordings were conducted in a sound-attenuated booth, digitized at a rate of 44.1 kHz and a resolution of 16 bits. The recorded tokens were matched so that the duration, amplitude and F0 of the correctly pronounced labels and their mispronunciations did not differ significantly.
The tokens were then spliced onto a carrier sentence "Look! Target word!", with onset of the test word starting 2500 ms into the trial. Note that the auditory stimuli are identical across conditions.
The visual stimuli were photographs of the targets and distracters retrieved from the web and judged by the experimenters as good exemplars of the chosen categories. For the Fixation and Fixation Plus conditions, a smiley face was presented between trials at the centre of the screen until fixated by the participant, and followed by the next trial (triggered by the experimenter). For the Fixation Plus condition, image quality was manipulated: as exemplified in Appendix C, we systematically removed background colour and matched pictures (pairs of target and distracter) for visual saliency. This was checked with Pearson's correlation conducted on the vectorized image pairs (Chinga & Syverud, 2007). The saliency of target-distracter pairs presented in the No Fixation and Fixation conditions was not well correlated (0.13 on average, for the absolute value of the Rs), whereas a reasonable correlation was found for the Fixation Plus condition (0.35 on average, which is a good correlation magnitude, Hemphill, 2003).
Images were projected onto a screen 1.20 m away from the child. Each stimulus image measured 52 cm diagonally and the two images were separated by 43 cm, so that both images were comprised within 48° of visual angle with a gap of 10° between them. The smiley face was centred and measured 14 cm diagonally.

Procedure
After written consent was obtained, both the participant and the parent were invited into the room set up for the IPL. An image was presented on the screen in the dimly lit room so as to entice the child into sitting in a high chair. A short animated cartoon was played on the screen to keep her entertained and looking at the screen while the experimenter adjusted the cameras on her face. The parent sat behind the child and was asked not to intervene in any way so as not to influence the child. The experimenter could hear the stimuli being presented (so that she could also hear possible parental intervention) but could not see the screen.
Each of the 16 trials (plus two for training) was started manually by the experimenter when the child was looking at the screen (anywhere for the No Fixation condition) or at the centre of the screen, that is, at the smiley face (for the Fixation and Fixation Plus conditions). Participants were presented with 8 correct labels and 8 mispronounced labels. Order of presentation, pronunciation type and side of the target were counterbalanced across participants.

Scoring
The digital scoring system developed by Meints and Woodford (2008) was used to synchronize videos with trial onsets. Eye movements (left image, right image, middle or away) were scored frame by frame (40 ms) by skilled coders trained by the first author and naïve to the items being presented to the participants. Each eye fixation was coded, and an independent skilled coder scored 10% of the pool of data randomly for each group. Agreement between coders was high with an intraclass correlation coefficient of 0.936 (Shrout & Fleiss, 1979).
Looking times on each image were then automatically extracted for the pre- and post-naming phases, providing the proportions of looks to the target and the distracter in the pre- and post-naming phases, as well as longest look measures and frame by frame eye position for the time-course analysis.

Time course plot and windows of analysis
Following recommendations by Fernald et al. (2008), proportions of looks to the target as a function of time and pronunciation type were plotted in Fig. 1 for each condition (No Fixation, Fixation and Fixation Plus). Visual inspection of the plots suggests that the usual 367-2000 ms post-naming window was adequate, although the naming effect appears to extend beyond 2000 ms post-naming for the No Fixation and the Fixation conditions. The final time course analysis (as in von Holzen & Mani, 2012) will be the best test to determine when pronunciation effects occur.
It can be seen from the plots that at the end of the pre-naming phase, looks to the target were, on average, below the expected 50% (corresponding to no preference for targets or distracters) for the No Fixation and the Fixation conditions, while the proportion was more balanced for the Fixation Plus condition. While Fernald et al. (2006) recommend an average of 50% of looks to the target in the pre-naming phase to avoid any bias, looking behaviour is often normalized by subtracting pre-naming measures from post-naming measures (see, among others, Mani & Plunkett, 2007) or by computing a salience score as in Swingley and Aslin (2007). We will return to this observation in the discussion.

Selection of trials
First of all, we only analyzed trials where the target was known to the participant. As children were matched for vocabulary knowledge across the three conditions, we expected the proportions retained for each condition to be comparable: 95.0% for the No Fixation condition (304/320 trials), 87.8% for the Fixation condition (281/320), and 93.4% for the Fixation Plus condition (299/320). However, a chi-square test run on the raw scores revealed a significant difference between conditions (χ²(2) = 12.55, p = .002). A general linear model of the data with vocabulary (target known vs. unknown) and condition (No Fixation, Fixation, Fixation Plus) as factors confirms that more words were known in the No Fixation condition than in the Fixation condition (z = 3.001, p = .003), whereas the No Fixation and the Fixation Plus conditions did not differ (z < 1, n.s.). This was very likely due to a sampling effect, since the Fixation group is the only one where two participants did not know 6/16 words, while in the other groups the maximum number of unknown words did not exceed 4/16. The χ² test excluding these two participants shows that the three groups then no longer differed (χ²(2) < 1, n.s.).
The second step in data selection often involves an inspection of the looking time distribution trial by trial. The procedure presented by Fernald et al. (2008) includes a pre-screening of the recorded videos. Another way of retaining attended trials is to select trials where the child is looking at both pictures, either necessarily in the pre-naming phase (strict criterion), or at some point throughout the trial (lax criterion). The strict criterion is likely to result in cleaner data; however, it could be that the lax criterion is preferable: training trials should be sufficient for the child to understand that pictures appear in the two corners of the screen. Consequently, upon hearing "Look! Dog!", shifts to the target in the post-naming phase when the child did not look at the target picture in the pre-naming phase can still be considered as a sign of word recognition (re. mutual exclusivity in monolinguals, Houston-Price et al., 2010). In what follows, we present data based on the lax criterion, but we also provide in Appendix C.1-3 the analyses based on the strict criterion.
With the lax criterion (looks at the target and distracter at some point throughout the trial), we retained 98.7% of the known trials in the No Fixation condition (300/304), 95.4% in the Fixation condition (268/281) and 97.0% in the Fixation Plus condition (290/299). A Pearson's chi-square test with Yates correction shows that the rejection rate did not differ across conditions (χ²(2) = 4.52, p = .10). The strict criterion (looks to both target and distracter during the pre-naming phase) retained 88.8% of the trials in the No Fixation condition (270/304), 90.4% of the Fixation trials (254/281) and 88.0% of the Fixation Plus trials (263/299). Again, a chi-square test revealed no main effect of condition (χ²(2) < 1, n.s.). Note that all children were included in the analysis as they contributed to all conditions (re. Fernald et al., 2006).
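As an illustration only (not the software used for the reported analysis), the Pearson chi-square on the known vs. unknown trial counts given above can be recomputed with a short Python sketch. The function names are hypothetical, and the closed-form p value used here is valid only for 2 degrees of freedom.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

def p_value_df2(chi2):
    """Upper-tail p value of a chi-square statistic; valid for df = 2 only."""
    return math.exp(-chi2 / 2)

# Known vs. unknown target trials per condition, from the counts reported above.
counts = [
    [304, 16],  # No Fixation
    [281, 39],  # Fixation
    [299, 21],  # Fixation Plus
]
chi2, df = pearson_chi_square(counts)
p = p_value_df2(chi2)
```

On these counts the statistic should agree with the reported value up to rounding.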

Dependent variables
IPL studies in the literature have presented a whole range of different dependent variables (see Table 1), and included pre-naming as a factor (Mani & Plunkett, 2007; Styles & Plunkett, 2008) or as a salience score (Swingley & Aslin, 2007; White & Morgan, 2008). In the first case, the two resulting independent variables are naming (pre- vs. post-naming) and pronunciation (correct vs. incorrect), and in the second case only pronunciation. Results from the pronunciation effect with the salience score are then identical to the post-hoc analysis of the naming × pronunciation interaction. As such, we will not present the salience score analysis. It is worth noting that sometimes the pre-naming phase is not explicitly analyzed (Fernald et al., 2008; Ramon-Casas et al., 2009; Schafer & Plunkett, 1998; Swingley, 2003).
The most widely used dependent variables are either the longest look measure (LLK) or the proportion of total looks towards the target (PTL). The PTL is usually calculated by dividing the total looking time to the target (T) by the total looking time to the target and distracter, that is T/(T + D), in the pre- and post-naming phases. A significant increase in the PTL in the post-naming compared to the pre-naming phase is taken as evidence of a naming effect. A significant decrease or absence of difference will show an absence of naming effect, that is, no evidence of word recognition. The LLK, on the other hand, represents the longest single fixation on the target (and also on the distracter). Successful word recognition, or the naming effect, should lead to an increase of LLK in the post-naming compared to the pre-naming phase.
Descriptive statistics are presented for the analysis where children looked at both images at some point during the whole trial (lax criterion), with proportion of looks to the target (PTL, Fig. 2) and longest look (LLK, Table 1) measures for the target and distracter, in the pre- and post-naming phases.
Statistics are reported for the overall LLK and PTL as dependent variables, with pronunciation (correct, incorrect) and naming (pre-, post-naming) as within-participant factors, and condition (No Fixation, Fixation, and Fixation Plus) as between-participant factor. We also report effects and interactions of pronunciation and naming for each condition separately, as each condition could potentially constitute a stand-alone experiment.
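To make the two measures concrete, the following minimal Python sketch derives PTL and LLK from a frame-by-frame coding of one phase of one trial. It assumes 40 ms frames, as in the scoring procedure; the frame codes ("L", "R", "M" for middle, "A" for away) and the function name are hypothetical and chosen for illustration only.

```python
from itertools import groupby

FRAME_MS = 40  # duration of one coded video frame

def looking_measures(frames, target_side):
    """Compute PTL and longest looks (ms) from a list of frame codes.

    frames: list of "L", "R", "M" or "A" codes, one per frame (hypothetical coding).
    target_side: "L" or "R", the side on which the target was displayed.
    """
    distracter_side = "R" if target_side == "L" else "L"
    t_ms = frames.count(target_side) * FRAME_MS
    d_ms = frames.count(distracter_side) * FRAME_MS
    # PTL = T / (T + D); undefined when neither picture was fixated.
    ptl = t_ms / (t_ms + d_ms) if (t_ms + d_ms) > 0 else None
    longest = {"L": 0, "R": 0}
    for code, run in groupby(frames):  # runs of identical consecutive codes
        if code in longest:
            longest[code] = max(longest[code], len(list(run)) * FRAME_MS)
    return {
        "ptl": ptl,
        "llk_target": longest[target_side],
        "llk_distracter": longest[distracter_side],
    }
```

Under this sketch, a naming effect for a trial would correspond to a higher `ptl` (or `llk_target`) when the function is applied to the post-naming frames than when it is applied to the pre-naming frames.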
Since the p values appear to be comparable for both measures (with the exception of the triple interaction, significant with LLK, and driven by the interaction between naming and pronunciation for the Fixation Plus condition), in the first following section we only report values for the PTL measure.

Table 1
Mean PTL to the target (in %), longest looks to the target and the distracter in the pre- and post-naming phases, for all conditions (standard error in brackets). The lax criterion was applied. Detailed descriptive statistics for the other measure (LLK) can be found in Table 1, and the corresponding ANOVAs are described in the following sections.

Comparisons of measures will be discussed in the next section.

Comparison between dependent measures
So far, we ran a first series of analyses with condition, pronunciation and naming as factors, followed by a second series of analyses broken down by condition, with pronunciation and naming as factors. For the two dependent variables, we observed an overall agreement in the statistical tests, in particular for the naming variable, where the largest effect was expected overall. The more sensitive result, namely the interaction between pronunciation and naming, is clearly absent in the No Fixation and Fixation conditions, and robust in the Fixation Plus condition.
A rerun of all these analyses on trials retained with the strict attention criterion (that is, trials where children look at both pictures in the pre-naming phase) produced a comparable pattern of results (see Appendix D), except for the main effect of pronunciation that became weaker (from marginal to non-significant for the No Fixation condition, and from significant to marginal for the Fixation Plus condition).

Time course analysis
Time course plots allowed us initially to visualize the time window where the naming effect was most likely taking place, ensuring that using the classic 367-2000 ms window would not miss interesting data points. The following tests will provide us with a more precise measure as to when the two pronunciations elicit quantitatively different looking time behaviour. In order to analyze the time course of the proportions of looks, we followed the method advocated by von Holzen and Mani (2012), a non-parametric random permutation analysis (Maris & Oostenveld, 2007) that tests the effect of pronunciation across time, so as to identify the time period when looking times were significantly different. With this method however, and contrary to the classic PTL analysis including naming as a factor, pre-naming behaviour is not taken into consideration in relation to post-naming. This is why we considered another analysis of the time course, focusing on post-naming data corrected by the pre-naming data (PTL during the post-naming phase minus the average PTL in the whole pre-naming phase). In short, we ran two analyses, one similar to von Holzen and Mani (2012), and one where the average PTL to the target in the pre-naming phase (between 367 and 2000 ms) is subtracted from each data point. The latter will reveal any potential pronunciation effect that would have been masked by a pre-naming bias.
The procedure, best described in Maris and Oostenveld (2007), identifies the time period where looking behaviour differs between the correct and incorrect pronunciation, that is, the naming effect. In the first step of the procedure, individual paired sample t-tests were performed at each time sample, and used to identify significant (p < .05) t-values. In step two, clusters were identified by finding significant t-values that were contiguous across time. For each such cluster, a cluster-level t-value was calculated as the sum of all single sample t-values within the cluster. Analysis thereafter was based on these clusters and their associated cluster-level t-value, rather than the individual (and highly non-independent) t-values. Since cluster-level t-values could not be tested for significance against a standard t distribution, in step three of the procedure, the significance of each cluster was calculated by comparing its cluster-level t-value to a Monte Carlo distribution of cluster-level t-values generated from the cluster with the largest cluster-level t-value. To do this, each of the original paired sample t-tests that were used to generate this cluster was repeated, but with the data items of each pair randomly assigned between the two conditions. This was performed 1000 times to generate a Monte Carlo distribution of 1000 summed t-values corresponding to the null hypothesis. The summed t-values of these randomized tests provided a null distribution against which the actual cluster-level t statistic of each of the observed clusters could be compared. Thus, for each observed cluster, a Monte Carlo p-value was calculated as the proportion of the null distribution which had a cluster-level t statistic that exceeded the actual cluster-level t statistic.
The first series of analyses was conducted on the whole trials (pre- and post-naming phases) and revealed no significant difference between correct and incorrect trials for the No Fixation condition (identified cluster: 900-1060 ms after target word onset; cluster t statistic = 12.13, Monte Carlo p = .50) and the Fixation condition (here, the two pronunciation conditions do not differ significantly from each other in any time window, therefore no Monte Carlo estimate was calculated) (Fig. 1a and b). For the Fixation Plus condition however (Fig. 1c), the two types of pronunciation differed significantly between 900 and 1580 ms post stimulus onset (cluster t statistic = 48.73, Monte Carlo p = .04), with an increase in looks at the target when it was correctly named.
The second series of analyses was conducted on the post-naming phase only, after subtracting the overall PTL to the target in the pre-naming phase. This is a similar approach to the inclusion of the salience score by Swingley and Aslin (2007), where they analyze PTL in the post-naming phase, minus PTL in the pre-naming phase; here, however, the correction is applied to each time frame. Again, no significant difference was obtained in the No Fixation condition (all t-tests reveal ps > .05). In the Fixation condition a cluster between 2420 and 2460 ms post word onset was identified but was, however, non-significant (cluster t statistic = 4.33, Monte Carlo p = .96). For the Fixation Plus condition, we found a significant difference window which was comparable to the first series of analyses, with more looks to the correctly pronounced target from 740 to 1940 ms after stimulus onset (cluster t statistic = 105.34, Monte Carlo p = .002).
The same analyses conducted on the dataset selected with the strict criterion reveal rather comparable results, with perhaps more sensitivity. On the whole duration of the trial, the No Fixation condition reveals a difference cluster between 980 and 1700 ms after the word onset, but Monte Carlo simulations failed to confirm that this difference is significant (cluster t statistic = 12.76, p = .53). For the Fixation condition no cluster was identified (all t-tests reveal ps > .05). For the Fixation Plus condition, we replicate the significant effect of pronunciation, from 980 to 1700 ms after stimulus onset (cluster t statistic = 54.55, p = .03).
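The pre-naming correction described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: one PTL value per 40 ms frame, with frame k starting at k × 40 ms from trial onset; the function name and the `naming_onset_ms` parameter are hypothetical, and the 367-2000 ms baseline window is taken from the text.

```python
FRAME_MS = 40  # one coded frame

def baseline_corrected_ptl(ptl_by_frame, naming_onset_ms, baseline_window_ms=(367, 2000)):
    """Subtract the mean pre-naming PTL from every post-naming data point.

    ptl_by_frame: PTL to the target for each frame of the whole trial.
    naming_onset_ms: time of target word onset (assumed to be supplied by the caller).
    """
    times = [k * FRAME_MS for k in range(len(ptl_by_frame))]
    lo, hi = baseline_window_ms
    # Average PTL over the pre-naming analysis window.
    baseline_values = [v for t, v in zip(times, ptl_by_frame) if lo <= t <= hi]
    baseline = sum(baseline_values) / len(baseline_values)
    # Keep post-naming frames only, expressed relative to the baseline.
    return [v - baseline for t, v in zip(times, ptl_by_frame) if t >= naming_onset_ms]
```

A corrected value of 0 then means that looking to the target is unchanged relative to the pre-naming phase, so that a pre-naming bias towards the distracter does not mask a later pronunciation effect.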
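The three steps above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the study, and it rests on several assumptions: the critical t value `t_crit` is supplied by the caller (the two-tailed .05 critical value for n - 1 degrees of freedom); clusters group contiguous significant samples of the same sign; random reassignment within a pair is implemented as a random sign flip of each participant's difference; and the null distribution is built from the largest summed t-value found over the whole time series in each permutation. All function names are hypothetical.

```python
import math
import random

def paired_t(diffs):
    """t statistic of a paired-sample t-test, computed on the pairwise differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        return 0.0  # degenerate sample, treated as no evidence (simplifying assumption)
    return mean / math.sqrt(var / n)

def find_clusters(t_values, t_crit):
    """Return (first_sample, last_sample, summed_t) for each run of contiguous
    samples with |t| > t_crit and the same sign."""
    found, start = [], None
    for i, t in enumerate(t_values + [0.0]):  # the sentinel closes a trailing run
        significant = abs(t) > t_crit
        if start is not None and (not significant or (t > 0) != (t_values[start] > 0)):
            found.append((start, i - 1, sum(t_values[start:i])))
            start = None
        if significant and start is None:
            start = i
    return found

def cluster_permutation_test(correct, incorrect, t_crit, n_perm=1000, seed=0):
    """correct, incorrect: one list per participant, one PTL value per time sample.
    Returns (first_sample, last_sample, summed_t, monte_carlo_p) per observed cluster."""
    rng = random.Random(seed)
    diffs = [[c - i for c, i in zip(cs, ins)] for cs, ins in zip(correct, incorrect)]
    n_samples = len(diffs[0])

    def t_series(d):
        return [paired_t([row[k] for row in d]) for k in range(n_samples)]

    observed = find_clusters(t_series(diffs), t_crit)
    null = []
    for _ in range(n_perm):
        # Swapping the two labels within a participant flips the sign of the difference.
        signs = [rng.choice((-1, 1)) for _ in diffs]
        shuffled = [[s * v for v in row] for s, row in zip(signs, diffs)]
        masses = [abs(m) for _, _, m in find_clusters(t_series(shuffled), t_crit)]
        null.append(max(masses, default=0.0))
    # Monte Carlo p: proportion of null cluster statistics exceeding the observed one.
    return [(s, e, m, sum(x > abs(m) for x in null) / n_perm) for s, e, m in observed]
```

Sample indices can be converted back to time by multiplying by the frame duration (40 ms here) and adding the time of the first analyzed frame.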

Discussion
The goal of the present study was to evaluate how infants' looking behaviour in the Intermodal Preferential Looking paradigm varies as a function of noise generated by two simple methodological modifications, and how different methods of analysis best account for the resulting behavioural changes. Three groups of 20-month-olds were tested for recognition of correctly and incorrectly pronounced familiar words, in conditions that varied in terms of level of visual noise. In the first, No Fixation condition, no central fixation stimulus was used and images were simply equated on size and judged as good exemplars, with no attempt to control for background and salience. In the second and third conditions (Fixation and Fixation Plus), a central fixation stimulus appeared between trials to attract the child's gaze to the centre before the onset of the next trial (and not at the onset of the post-naming phase as in Portillo et al., 2007, cited in Fernald et al., 2008). In addition, in the third, Fixation Plus condition, children were presented with images without any background and target-distracter pairs were matched for visual salience. To analyze the resulting data we examined the impact of different trial selection criteria, compared two dependent measures (LLK and PTL) and performed a time course analysis using a combination of Monte Carlo estimate and cluster analysis, to identify the time period where mispronunciation affected behaviour.
Following numerous studies using a similar paradigm (e.g., Mani & Plunkett, 2007; Swingley, 2003; Swingley & Aslin, 2000; White & Morgan, 2008), the expected result at 20 months was a naming effect restricted to the correctly pronounced targets, that is, an increase in looks to the target in the post-naming phase as compared to the pre-naming phase, only when the target word is pronounced correctly. We also expected that each methodological modification (addition of a central fixation and increased control of images) would contribute to enhance the quality of the data. The central point of the study was to determine which method of analysis would prove the most robust across methodological variations, and which would be the most sensitive.
Results overall revealed that all three groups showed a main effect of naming, that is, children fixated the target image longer after it was named. However, only the Fixation Plus group behaved as predicted by the literature: they showed a naming effect restricted to words correctly pronounced, just like in Mani and Plunkett (2007) or Swingley and Aslin (2000). Our main interpretation of these results is that the combination of the central fixation point and the selection of better-controlled images contributed to enhance the quality of data, and to reduce experimental noise, in the pre-naming phase, which in turn resulted in less variable post-naming data, as will be discussed below. We suspect that other parameters could act similarly to reduce the level of unwanted noise, such as the use of the same items to act as targets and distracters (as recommended by Fernald et al., 2008, see Mani & Plunkett, 2007; Swingley & Aslin, 2000) or the selection of the most frequent words in a child's vocabulary (e.g., Swingley & Aslin, 2000).
By progressively reducing the noise in the visual stimuli in the Fixation Plus condition, exploration of the visual stimuli during the pre-naming phase was more balanced with a PTL to the target around the expected 50%, despite a slight bias towards the distracter across all conditions. This bias must be due to reduced familiarity with the distracter objects (across all groups, participants were reported knowing the names of the target in 283 trials, and the name of the distracter in 246 trials), which thus worked as initial attracters. However, the pre-naming imbalance was not strictly comparable in the No Fixation (Fig. 1a) and the Fixation (Fig. 1b) conditions: whereas it was observed from the very onset of the pre-naming phase in the No Fixation condition, in the Fixation condition children looked equally long at targets and distracters from the onset of the pre-naming phase, and it is only after about 300 ms that the preference for distracters emerged. Given that the only methodological difference between the No Fixation and the Fixation conditions was the addition of a central fixation stimulus at the onset of each trial, it is quite likely that this central fixation point contributed to reducing the imbalance between target and distracter looks during the pre-naming phase. However, controlling for the quality of images had a cumulative positive effect, as seen in the Fixation Plus condition. Not only were pre-naming looks between targets and distracters more balanced, but the expected naming effect was obtained earlier, and was more robust than in the Fixation condition. What changed between the two conditions was a disappearance of the background and a quantitatively controlled balance in visual salience of the target-distracter pairs. Whether background control contributed more than saliency control to the sharpening of infants' behaviour remains undetermined in this study.
At this point, these results allow us to add to the recommendations of Fernald et al. (2008) and from the literature using the IPL paradigm: a fixation stimulus helps by centring the child's attention before trial onset, and carefully selected images even out the probability of looking at both images in the pre-naming phase, enhancing the sensitivity of the method.
Crucially, our central aim was to compare how these different methodological choices would impact on the robustness of data through the lens of different analytical choices such as the criteria for data selection and the dependent variables.
The literature shows that the criteria used for data inclusion or rejection in the pre-processing phase vary substantially across experimenters. A most reasonable practice, which we did not question, is to include only trials where the target word is known by the participant, as attested by parental questionnaire. More questionable is the practice of rejecting trials during which the child has not looked at both the distracter and the target at some point: does it have to be at some point during the entire trial (lax criterion), or during the pre-naming phase only (strict criterion)? We have shown that the two criteria, which measure the level of attentiveness during each trial, do not lead to fundamentally different results. Unsurprisingly, more trials were rejected due to the application of the strict criterion (2.9% for the lax criterion vs. 11.0% for the strict criterion), resulting in a loss of experimental power. However, a close inspection of results in the PTL section in the Results section and Appendix D.2 shows that the size of the main effect of condition is larger when the strict criterion is applied. In contrast, applying the lax criterion results in larger sizes for all other effects, including pronunciation and naming as well as the crucial naming × pronunciation interaction. This suggests that the overall behavioural adjustments due to methodological changes may be enhanced with the use of the strict criterion, but not the quality of the key effects (naming modulated by pronunciation). With the strict criterion, we ensure that children have seen the target and the distracter during the pre-naming phase. Upon hearing the label they would know that a mispronounced name does not correspond to any picture; they can then use a 'better match' strategy based on phonological overlap and look slightly longer at the target. This translates into a relative decrease in the size of the pronunciation effect as compared to the same data analyzed with the lax criterion. In contrast, with the lax criterion, we also include trials where the child has only checked the target during pre-naming,¹ and who can therefore reasonably assume that the mispronunciation can refer to the unchecked item. This results in a slightly higher number of looks towards the distracter during the post-naming phase. To sum up, the data may suggest that we do not measure the same behaviour or strategy if we apply the strict or the lax criterion: for the former, we may measure a better-fit strategy based on the degree of phonological overlap, whereas in the latter, children may produce a response based on a Mutual Exclusivity-type principle (Halberda, 2003). If this speculative assumption were corroborated by further research, it should be kept in mind when deciding for one criterion over the other, depending on the theoretical goals of the experiment.
We have seen that it is common practice to exclude trials in which children do not know the name for the target object; is it justified to also exclude trials where the child does not know the distracter (as done by, among others, Swingley et al., 1999)? On the one hand, this could have some advantage: the children would be more likely to look away from an incorrectly named target image and attach the mispronounced label to the distracter (White & Morgan, 2008), strengthening the mispronunciation detection effect. On the other hand, unknown distracters could result in children showing a familiarity effect rather than a naming effect (looking at the named target simply because they have a name for it, not because it has been named with its specific label). Quite pragmatically, excluding trials in which the child does not know the distracter would possibly result in a loss of experimental power. This was indeed the case for the No Fixation and Fixation conditions, but not in the more robust condition where the critical interaction was replicated. Therefore it appears that the application of this criterion does not substantially modify the quality of the data, at least not in the current study. It should be kept in mind however that, similarly to what was discussed above for the use of the lax vs. strict criteria, knowing, or not knowing, the distracter label may modify the strategy that the child uses in the procedure. An unknown distracter promotes the use of the Mutual Exclusivity principle whereas a known distracter encourages the use of a best-match strategy based on the degree of phonological overlap.
Regarding the choice of the dependent variable, the literature often presents side by side analyses based on the proportion of looks to the target (PTL) and on the longest look to the target (LLK), as they usually show similar results. The same conclusion can be applied here, although with a caveat. A close inspection of statistics in the Results section shows that in most analyses, effect sizes are larger for the LLK measures than for the PTL ones. This could be due to the fact that in the vast majority of cases, the longest look is also the first look towards the target, and during that period which lasts about 700 ms (see Table 1), the child computes all the information that is needed to correctly identify the target. Possibly all further looks towards the target are either verification or random noise, which is incorporated in the PTL measure but not in the LLK variable, resulting in less variable data in the latter than in the former measure.
Finally, we questioned the importance of adjusting the time window of analysis, and investigated the relevance of a time course analysis. Depending on the age and/or vocabulary size of the participants, it is common practice to adjust the onset of the post-naming phase to account for variation in gaze shift latency (Fernald et al., 2008). In addition, many factors can influence the processing time of the target and distracter pictures, starting with the nature and complexity of the auditory stimulus, the visual properties of the stimuli (e.g., Arias-Trejo & Plunkett, 2010), or the type of distracter (familiar vs. unfamiliar; White & Morgan, 2008). Therefore a time window fixed a priori may not be the most accurate. Of course, selecting for each experiment a time window based on the visual inspection of the data would be unacceptable as it would lead to a strong human bias. One way around this is to generalize the use of the time course analysis as reported here, which allows the identification of time windows where the naming or pronunciation effects are indeed significant. It seems that the statistical analysis of the time course provided an accurate estimate of looking behaviour, since it allowed us to distinguish between a very short-lived pronunciation effect (in the No Fixation condition) and an enduring one (with the Fixation Plus condition). This mirrored the outcome of the classic mean-based looking time analyses, namely a robust pronunciation × naming interaction in the Fixation Plus condition and none in the No Fixation condition. While very promising to estimate the speed of word recognition (like Durrant et al., 2014; Fernald et al., 2006), this approach needs to evolve to establish the minimal temporal window where pronunciation differences are meaningful (re. the very short-lived pronunciation effect in the No Fixation condition).
In summary, observations based on infants' behaviour in a classic IPL task can vary quite substantially depending on the methodological parameters chosen during the pre-processing period or data analysis. Perhaps the vulnerability of the data is best illustrated in Fig. 2c which displays the results of the Fixation Plus condition. Correctly and incorrectly pronounced words produced different looking times for about 700 ms, as revealed by the time course analysis. This is a rather short window of interest as compared to the entire duration of the trials, for example as compared to head-turn procedures which typically generate differences of about 2 s of looking times between conditions (e.g., Mattys, Jusczyk, Luce, & Morgan, 1999; Jusczyk & Aslin, 1995). It can be argued that IPL is a more direct and precise measure of auditory processing than head-turn paradigms as it does not rely on an experimenter's intervention during the session (whereas head-turn set-ups usually do: Floccia et al., 2012; Nazzi, Jusczyk, & Johnson, 2000; Schmale & Seidl, 2009). Yet this augmented precision perhaps makes the IPL tool more prone to vary with methodological noise. To borrow an example from physical instruments, a digital thermometer might be more precise than a mercury one, yet thanks to its inertia the latter is more likely to give robust repeated measures than the former. It is our hope that this methodological study will contribute to sharpening the use of this invaluable paradigm in the quest to understand infants' representation and processing of visual and auditory information.

Appendix D.
Replication of the analyses (descriptive statistics and ANOVAs) with the strict criterion of trial inclusion (whereby only trials with looks at both the target and the distracter in the pre-naming phase are included).
D.1. Mean PTL to the target (in %) and LLK to the target and the distracter (in ms) in the pre- and post-naming phases, for all conditions (standard error in brackets).

Fig. 1. Time course plot (in ms, with SD) of the PTL to the target for the No Fixation (a), the Fixation (b) and the Fixation Plus (c) conditions, with correct (solid line) and incorrect (dashed line) pronunciations. The vertical line represents the onset of the target word and the start of the post-naming phase, and the grey rectangle the time period where the two pronunciation conditions differ. Significance as ascertained by the time course analysis technique (Section 3.4, point iii) is indicated by an asterisk.

Fig. 2. Mean PTL for each condition and pronunciation type, for the 367-2000 ms time window, during pre-naming (light grey) and post-naming (dark grey).The double asterisk shows a difference with p < .0001.The lax criterion was used, and error bars represent 1 SE.
D.2. Statistics (F, p and partial η² values) for the factors condition, pronunciation, naming and their interactions. Dependent variables are the LLK and the PTL to the target. Significant effects are indicated in bold, marginal effects (p < .10) in italics. p values where significance levels have changed compared to the lax criterion are underlined.
Statistics (F, p and partial η² values) for the factors pronunciation, naming and their interaction, broken down by condition (No Fixation, Fixation and Fixation Plus). Dependent variables are the Longest Look and the Proportion of Looks to the Target. Significant effects are indicated in bold, marginal effects (p < .10) in italics. p values where significance levels have changed compared to the lax criterion are underlined.