Keywords: Parent reports; Looking behaviour; Pupillary arousal

Abstract

Developmental research utilizes various methodologies and measures to study the cognitive development of young children; however, the reliability and validity of such measures have been a critical issue across research practices. To address this problem, particularly in the area of research on infants' interests, we examined the convergent validity of previously reported measures of children's interests in natural object categories, as indexed by (1) parents' estimation of their child's interest in the categories, and (2) extrinsic (overt choices in a task), (3) intrinsic (looking time toward objects), and (4) physiological (pupil dilation) responses to objects of different categories. Additionally, we examined the discriminant validity of all the aforementioned measures against the well-established and validated measure of parents' estimations of children's vocabulary knowledge. Children completed two tasks: (a) an eye-tracking task, where they were presented with images from a range of defined categories, which collected indices of looking time and pupillary activity; and (b) a sticker-choice task, where they were asked to choose between two sticker images from two different categories belonging to the range of categories assessed in the previous task. Parents completed two questionnaires to estimate (i) their child's interests and (ii) vocabulary knowledge in the categories presented. We first analyzed the discriminant validity between the two parent measures and found a significant positive association between them. Our successive analyses showed no strong or significant associations between any of our measures, apart from a significant positive association between children's looking time and parents' estimations of children's vocabulary knowledge. From our findings, we conclude that measures of infants' interests thus far may not have sufficient reliability to adequately capture any potential relationship.

Introduction
One of the main challenges of developmental research is that the population under study allows few ways for researchers to "peek into the mind" of the young child to infer cognition: the young infant cannot talk (fluently), point reliably, read, write, fill out a questionnaire, or complete a computerised task independently. How, then, is the researcher interested in early development to make reliable and valid inferences about early cognition and behaviour? Testament to the ingenuity of the field, infancy research has developed a number of improvised research methods, such as infant sucking (Eimas & Miller, 1980; Floccia et al., 1997; Jusczyk, 1985; Kuhl & Miller, 1975), head-turning (Höhle et al., 2004; Jusczyk & Aslin, 1995; Kuijpers et al., 1998), preferential looking (Baillargeon et al., 1985; Golinkoff et al., 1987, 2013; Houston-Price et al., 2007), and standardised normed parent reports (Fenson et al., 2000; Lam et al., 2003; Sivrikova et al., 2020) as potential indices of infant cognitive development. However, while there is some standardisation of the practices in infancy research (Eason et al., 2017), there remain limitations in the reliability (Byers-Heinlein et al., 2021) and validity (Kucker & Chmielewski, 2022) of the methods used. Given such concerns and the growing need for estimates of children's interest in their natural environment (see below), the current paper examines the convergence of methods used to measure children's interest in natural objects. In this Introduction, we first describe the concept of interest in childhood and early childhood research, then the literature on the methods used to determine children's early and developing interests, followed by the literature exploring the reliability and validity of different measures of early cognition, and finally the rationale and design of the current study.

"Interest" in developmental research
Interest has been considered a motivational prerequisite for learning and is said to arise from an individual's interaction with their environment (Hidi et al., 2004). Infants have been shown to persistently keep certain objects in their field of view or arm's reach, thus establishing a sustained relationship with these objects in their surroundings (Fink, 1991; Keller & Boigs, 1987). Toddlers learn and retain novel word-object associations better when they are interested in the category that the object belongs to (Ackermann, Hepach, & Mani, 2020; see Mani & Ackermann, 2018 for a review). Later in development, children produce better ideas when writing about a topic of high interest relative to a topic of low interest (Hidi & McLaren, 1991). The development of an individual's interest in a particular activity can be supported by the context in which the activity is performed (e.g., Hidi & McLaren, 1991; Schraw & Dennison, 1994), the feedback from the environment (Renninger et al., 2004; Renninger & Hidi, 2002), and a person's ability to regulate their motivation to complete the task (Sansone et al., 1992; Sansone & Smith, 2000). Despite suggestions that interest differs theoretically and empirically (Ainley, 2019; Grossnickle, 2016; Shin & Kim, 2019) from other constructs such as curiosity, attention, preference, and knowledge, the terms are often used interchangeably, and scales assessing these constructs, e.g., curiosity and interest, correlate moderately (Tang et al., 2022). Theoretically, full-fledged individual interest is characterised by a predisposition to reengage with certain content (Hidi & Renninger, 2006), while curiosity is typically described as a (potentially one-off) drive towards further, often new, information about a specific object, due either to a knowledge gap (Loewenstein, 1994) or to the complexity or novelty of the situation (see Dubey & Griffiths, 2020 for a rational approach to curiosity that subsumes such differences). Interest has also often been causally related to knowledge (acquisition; Rotgans & Schmidt, 2017), with interest either driving individuals to learn more about a topic or individuals becoming more interested in a topic the more they learn about it. There is, therefore, a need to at least control for knowledge when examining the influence of interest on further knowledge acquisition (c.f. Ackermann, Hepach, & Mani, 2020).
The four-phase model of interest development describes the triggering and maintenance, and subsequently the emergence, of full-fledged individual interests (Hidi & Renninger, 2006). This model links the interest triggered by the situation in which the activity takes place (situational interest) to a person's predisposition to reengage with certain content (individual interest). Thus, interest in an object or activity is triggered by something in the environment or situation, i.e., experience with the object or activity of interest (triggered situational interest), which subsequently leads to more focused attention to the activity or content over an extended period of time (maintained situational interest). The maintenance of such a situational interest develops over time into an individual's independent interest in the task/content, where they begin to independently gather more knowledge about the topic without external motivation (emerging individual interest). This leads to a well-developed interest, which is characterized by extensive knowledge and positive feelings toward the content, and a preference to reengage in the topic of interest, given a choice (well-developed individual interest).
Well-developed individual interests in infants and very young children have been documented by research on extremely intense interests (EIIs), which tend to be long-lasting, noticeable to non-family members, displayed in a range of different contexts, and lead children to acquire a remarkable degree of knowledge about objects of interest (Chi & Koeske, 1983; DeLoache et al., 2007). In observations of children's development of EIIs, children seem to have their interest triggered by one object or item; as reengagement with this one item continues, this interest and reengagement generalises to similar items belonging to the same object category: for example, a child who developed an interest in brooms soon developed an interest in all types of brushes (DeLoache et al., 2007). These developments appear in children's daily interactions and learning settings: children have been shown to influence their learning environments by varying their attentional focus, preferring stimuli that are learnable (Gerken et al., 2011) and appropriately complex (Kidd et al., 2012, 2014). Children also show a propensity to engage with certain stimuli over others and use pointing gestures to elicit labels for objects they are interested in from their caregivers, thereby skewing language input towards these objects (Begus & Southgate, 2012; Franco et al., 2009; Liszkowski et al., 2004; see Mani & Ackermann, 2018 for a review), which subsequently results in improved short- and long-term language learning and demonstrates the tangible effect that behaviour has on knowledge (Begus et al., 2014; Colonnesi et al., 2010; Goldin-Meadow et al., 2007; Lucca & Wilbourn, 2019).
Different stages of interest, as defined by the four-phase model, may overlap with other constructs. Thus, triggered situational interest is characterised by increased attention to a particular object in specific situations, while maintained individual interests are defined as a preference to re-engage with certain kinds of objects. At the same time, the overlap between states characterised as interest and other constructs, especially across different stages of interest, suggests potential problems, in certain contexts, with regards to isolating the construct of interest from related constructs both theoretically and empirically. Taken together, this brief overview highlights the considerable research on the development and maintenance of children's interests in specific objects or people in their environment. Such interests have typically been explored and identified by a variety of different methods, which we describe next.

Interest measures in infancy research

Children's choice behaviour
Examining how children selectively explore their environment and choose to engage with certain items over others (exploratory and choice behaviour) can be used to index children's interest in particular aspects of their environment. Henderson and Moore (1979), for instance, used an 18-item drawer box and indexed children's interest in the items within the box by measuring the time they took to explore the different items, the manner of manipulating the objects, and the number of questions the children asked about the individual toys as an index of their curiosity and explorative behaviour (see Henderson (1988) for an extension and replication). Other studies have utilised natural settings such as grocery stores, using the tactile exploration and choices children make in these settings as an index of their curiosity toward these items (Fortner-Wood & Henderson, 1997). Danovitch et al. (2021) reported that children choose books, but not cards, related to topics they want to learn more about (due to not having enough information about these topics) relative to topics they already know enough about. However, they also note that children's choices were not related to their self-reported interest in the topics. Touchscreen choice tasks with children suggest that children who are allowed to choose the objects they want to hear the label of learn more about these objects and retain the learned information better, relative to children who were passively provided with the labels of these objects (Partridge et al., 2015; but see Ackermann, Lo, et al., 2020 for conflicting results). Taken together, the literature on children's exploratory choice behaviour is mixed, with some studies highlighting the systematicity of such behaviour and others reporting greater randomness in children's exploratory behaviour (c.f. Sumner et al., 2019).

Children's looking behaviour toward objects on screen
One measure that has been particularly commonly used in the infant literature is the looking behaviour of young infants and children when they are presented with a range of audio-visual stimuli. Indeed, LoBue et al. (2020) suggest that over 47% of the studies published in a leading developmental journal considered infants' looking behaviour as an implicit index of early cognition. Looking time can be characterised in many ways, including but not limited to: total looking time, direction of the first look to stimuli, and average duration of fixations to the target. Analyses of infants' looking behaviour have been used variedly to examine children's object recognition (Fantz, 1964), visual acuity (Atkinson et al., 1974), category formation (Behl-Chadha, 1996), language comprehension (Golinkoff et al., 1987), theory of mind (Onishi & Baillargeon, 2005), and violation of expectation (Paulus, 2022).
Researchers have measured the extent to which certain stimuli maintain the visual attention of children by examining how long infants looked toward the stimulus, either in a single trial of a predetermined duration (Kagan & Lewis, 1965) or until the infant herself looked away (Cohen, 1972). Looking behaviour is often interpreted as an index of infants' preference for a particular stimulus, either due to the novelty of the presented stimulus (Damon et al., 2021; Fantz, 1964), the familiarity of the presented stimulus (Bremner et al., 2007; Slater, 1995), the match between the visually and auditorily presented stimuli (Forbes & Plunkett, 2019; Golinkoff et al., 1987), or children's mere preference for certain kinds of images, with studies showing that children dishabituate more slowly from, i.e., take longer to look away from, preferred relative to non-preferred stimuli (Hyman et al., 1975). However, the use of looking time as a measure has also often been called into question: looking behaviour may result merely from 'attentional inertia', i.e., a progressive increase in engagement with a stimulus once a look towards an object is established (Richards & Anderson, 2004; Richards & Cronise, 2000); it may not actually capture the intended construct and can be interpreted in many different ways, as either supporting or opposing the hypothesis under question (Aslin, 2007; Fisher-Thompson, 2017; Paulus, 2022); and similar patterns of looking behaviour may be explained by different underlying processes, further complicating interpretation of looking behaviour in early development (LoBue & DeLoache, 2011; LoBue et al., 2020).

Children's pupillary arousal toward objects on screen
The rapid development of high-precision automated eye-trackers capable of tracking infant eyes has led to a steady increase in the number of studies using pupillary activity as a tool to understand underlying cognitive processes. In particular, such studies examine the extent to which the infant pupil dilates immediately in response to a time-locked critical event (phasic) or shows a slower, gradual response to a neutral stimulus (tonic; Hepach & Westermann, 2016; Sirois & Brisson, 2014) as an index of arousal (Bradley et al., 2008), violation of expectation (Gredebäck & Melinder, 2010; Jackson & Sirois, 2009), surprise (Preuschoff et al., 2011), and prosocial behaviour (Hepach et al., 2013). Studies with adults report that adults' self-reported curiosity about finding out the answer to trivia questions was correlated with pupillary arousal both prior to and following the presentation of the answer to the question. In keeping with the adult literature linking pupillary activity to attention, interest, and arousal (Beatty, 1982; Hess & Polt, 1960), the authors interpret this association as indexing the link between interest and pupillary activity. Extending these findings to developmental research, Ackermann, Hepach, and Mani (2020) found that 30-month-olds' learning of novel word-object associations was associated with their pupillary arousal to objects from the category that the novel object belonged to. These results are comparable to those reported by Kang et al. (2009), suggesting an association between children's pupillary activity and their interest in or arousal about natural object categories. The mechanisms underlying this association may be linked to children's or adults' greater familiarity with topics or objects they are more interested in, the greater effort they invest in processing information of high interest, or more low-level arousal due to the perceptual characteristics of the stimuli from high-interest categories.
Given the different measures thus far used to examine children's interest, arousal, and knowledge in previous developmental studies, the question arises as to the validity of the different measures employed to date. In other words, to what extent do we find any evidence that the different measures tap into the same construct? In what follows, we discuss the literature surrounding the validity of infant measures to date.

Reliability and validity issues in early childhood research
The reliability of infant measures has been sparingly reported in previous literature, and when reported, measurement reliability has been poor (see Byers-Heinlein et al., 2021). Reliability alone is also not sufficient to determine the validity of a measure (Flake et al., 2017). Take the commonly used metaphor of a volley of arrows flying towards a target: if the arrows repeatedly hit the same spot on the target, we are in a zone of high metaphorical reliability. If, on the other hand, an arrow hits the bullseye, that would speak to metaphorical validity. Only if the volley of arrows hits the bullseye multiple times could our metaphorical measure be considered both reliable and valid. In other words, even with high measurement reliability, we cannot be certain that the measure indexes the desired effect of interest; in fact, by trying to maximise the reliability of measures, we may end up undermining the validity of the measurement (Zettersten et al., 2022). The reliability and validity of a measure have a direct impact on the magnitude of measurement error (the difference between true value and measured value), and consequently on the statistical power of an analysis, the sample size required to detect specific effects, and the size of effects detectable (c.f. Byers-Heinlein et al., 2021). In developmental research, the effects of low measurement reliability are even greater, as researchers often cannot compensate for the loss of statistical power with a larger sample size or more data, leading to underpowered studies and smaller effect sizes.
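The attenuating effect of unreliability on observed associations can be illustrated with a short simulation (a sketch with hypothetical numbers, not part of this study's analyses): by Spearman's attenuation formula, two measures with reliability .5 each will correlate at only half the true correlation between the underlying constructs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000          # large n, so attenuation rather than sampling error drives the result
r_true = 0.5        # assumed true correlation between the two constructs (hypothetical)

def observe(true_scores, reliability, rng):
    """Add measurement error so the observed score has the given reliability
    (reliability = var(true) / var(observed); true scores have SD 1)."""
    error_sd = np.sqrt((1 - reliability) / reliability)
    return true_scores + rng.normal(0.0, error_sd, size=true_scores.shape)

# Correlated true scores for two constructs (e.g., "interest" measured two ways)
x_true, y_true = rng.multivariate_normal(
    [0, 0], [[1, r_true], [r_true, 1]], size=n).T

observed_r = {}
for rel in (0.9, 0.5, 0.3):
    x_obs = observe(x_true, rel, rng)
    y_obs = observe(y_true, rel, rng)
    observed_r[rel] = np.corrcoef(x_obs, y_obs)[0, 1]
    # Spearman's attenuation formula: r_obs ~ r_true * sqrt(rel_x * rel_y)
    print(f"reliability {rel:.1f}: observed r = {observed_r[rel]:.2f} "
          f"(attenuation formula predicts {r_true * rel:.2f})")
```

With both measures at reliability .3, a true correlation of .5 shrinks to roughly .15, which would require a far larger sample to detect; this is the mechanism behind the underpowered studies described above.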
Some of the solutions proposed to improve the reliability and validity of infant research are to increase the number of trials, exclude low-quality data prior to analysis, utilise more sophisticated analyses (Byers-Heinlein et al., 2021), develop measurements collaboratively (Reinelt et al., 2022), use a greater variety of exemplars as stimuli (Visser et al., 2022; Zettersten et al., 2022), and use multiple outcome measures to measure the variable of interest (Havron, 2022; LoBue et al., 2020). However, there are issues with many of these solutions: for example, Zettersten et al. (2022) argue that with too many trials, one may actually measure children's willingness to perform the task rather than the actual construct of interest. Therefore, we need to carefully consider and tailor the method of measurement to ensure it consistently captures the intended variable of interest.
Against this background, there have been oft-repeated calls for infancy research to consider measuring the same construct using a range of different methods (Aslin, 2007; Buss, 2011; LoBue & Adolph, 2019; Morris et al., 2006, to name a few). In keeping with such suggestions, a number of studies have taken to including multiple measures in their designs. For example, Berger et al. (2006) interpreted an association between looking time data and brain activity (related to error detection) as indicating that looking time data indexes violation of expectations in infants. Similarly, Gredebäck et al. (2018) used pupillometry and looking time data as an index of young infants' action prediction, while Dunn and Bremner (2017) interpret differences in social looking and looking time measures in a violation-of-expectation task as evidence of the greater validity of social looking in indexing infant cognition. Further endorsing approaches that include multiple measures, Kagan et al. (2002) suggest that such approaches may allow us to more clearly define constructs in personality research that were previously assumed to be heterogeneous. LoBue et al. (2020) detail how including additional behavioural measures triggered a dramatic turn in the threat perception literature: while infants look longer at more threatening animals than at less threatening animals (spiders versus cockroaches), in an interaction task they engage equally with animals regardless of how threatening they are. Thus, the increased attention to threatening animals in looking time tasks may not index sensitivity to threat (where we would expect infants not to engage with these animals), but rather greater interest in or more attention to more threatening animals (LoBue & DeLoache, 2008; LoBue et al., 2020).
The latter study provides an ideal example of how the same measure can be interpreted in different ways and how the inclusion of multiple measures may allow us to draw more precise conclusions regarding the construct under observation. The current study, therefore, examined the extent to which different measures that have previously been assumed to index children's interest converge, in an attempt to provide a more precise characterisation of "interest" in early development. To the extent that we can never get an actual estimate of what a young prelinguistic child is interested in, an association between different measures may be interpreted as providing converging evidence of our understanding of the true construct (or measured construct, where both measures may index a construct different to what we want to measure). In contrast, a lack of association between different measures may highlight either the low reliability of one or both measures, or potentially distinct dimensions of the construct "interest" in early development.

The current study
The current study examines the convergence of four different measures of interest that have been utilised in the literature to date, namely, parental estimates of children's interest in natural object categories, children's explicit choice of objects belonging to these natural categories, and children's looking time towards and pupillary arousal following images of objects belonging to these categories. By examining the association between the different measures, we seek to determine the convergent validity (i.e., the correlation between two measures measuring the same construct; Kucker & Chmielewski, 2022) of the four measures, in an attempt to provide greater understanding regarding their use in future infant research on interest. Across all measures, we were also interested in the discriminant validity of the measures of interest under question against the more established measure of parental reports of children's vocabulary knowledge, in particular with regards to the number of words children knew in each of the categories presented across the different tests. By discriminant validity, we refer to the correlation between two measures assumed to tap into two different constructs, here interest and vocabulary knowledge (c.f. Kucker & Chmielewski, 2022). Although knowledge is distinct from interest, we note that the two constructs may be causally related (Hidi & Renninger, 2006; Rotgans & Schmidt, 2017), and therefore knowledge about a category may correlate to some degree with measures of children's interest in that category. Discriminant validity should thus be reflected in the different measures of interest correlating more strongly with each other than with the measure of category-specific vocabulary knowledge.
In particular, we asked parents to complete two questionnaires, one of which contained a series of questions regarding their perception of their child's interest in and familiarity with the six categories presented, while the other was a standardised normed vocabulary questionnaire used to assess the parent's perception of their child's knowledge of a number of words in each of the six categories. Children then completed an eye-tracking task and a sticker choice task. The eye-tracking task provides us with an index of children's looking behaviour and pupillary activity. In the sticker choice task, children were presented with pairs of stickers depicting exemplars from two different categories and allowed to choose one of the pair of stickers.

Method
We report how we determined our sample size, prerequisites for data exclusions, our inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study, either in our preregistration (https://osf.io/j74hr; more details in section 2.2) or in the data pre-processing section of our paper (section 2.7). All the data required for the replication and substantiation of our analyses, outcomes, and conclusions are published online in an open science repository (Madhavan et al., 2024).

Ethics
Ethics approval was granted by the Psychology Institute's Ethics committee, and all caregivers provided informed written consent prior to participation in the study. Children were offered a book as a token of appreciation.

Preregistration
We preregistered our sample size (n = 81), included predictor and response variables, hypotheses, and planned analyses on the Open Science Framework (https://osf.io/j74hr) prior to data analysis. The datasets and analysis scripts can be found on the OSF page of the project (https://osf.io/frpvj/).

Participants
Participants were children aged between 24 and 36 months (Mage = 28.77 months, SDage = 3.76). All participants were monolingual, had been carried to full term, and had no diagnosed developmental disorders.
A total of 122 children (53 girls, 69 boys) participated in the experiment. Given participant attrition in child research, and since the risk of participant dropout increases with the number of tasks incorporated into the study run time, we tested more participants than our pre-registered sample of 81. As anticipated, not all children provided data for each task, resulting in missing data for some tasks and, therefore, a different final sample size for each of our research questions. For all but one of the research questions, the sample size was higher than our pre-registered sample size (the final sample size for each research question is reported in the Results section).

Stimuli
The stimuli comprised six categories of items: animals, body parts, clothing, food, furniture, and vehicles. These were chosen on the basis of previous research on children's category-based word learning, with the exception of furniture (Ackermann, Hepach, & Mani, 2020; Borovsky et al., 2016). Within each category, we chose six items considered to be familiar to the majority of children by the age of 24 months based on parental reports (Frank et al., 2016). On examination of the number of children who knew all 36 familiar items chosen, 86 of the 112 children for whom we had data knew all the items (76.7%). We ensured that chosen items were suitable examples of their respective categories and that they contained common, overlapping perceptual features.

Visual stimuli
Photorealistic images of the familiar items were used as visual stimuli. Each item was represented by two exemplars, e.g., two different images of a bear. The visual stimuli therefore comprised a total of 72 images, with two images for each of the six items in each of the six categories. The majority of images were obtained from the Bank of Standardized Stimuli (Brodeur et al., 2014) and the Moreno-Martínez and Montoro (2012) databases. The remaining images were sourced from Google Images. Each image was edited to be identical in size: we standardised the size of each image to ensure that the area covered by the image (in pixels) was approximately 10% of a 1920 × 1080 px screen. The resulting scaled images were centred against a grey background.
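The area-based standardisation described above reduces to simple arithmetic: scale both dimensions by the square root of the ratio of target area to current area, which preserves the aspect ratio. A minimal sketch (function name and example dimensions are ours, not taken from the study):

```python
import math

SCREEN_AREA = 1920 * 1080            # px, screen resolution reported in the Method
TARGET_AREA = 0.10 * SCREEN_AREA     # ~207,360 px, i.e., 10% of the screen

def scaled_size(width, height, target_area=TARGET_AREA):
    """Return (width, height) covering target_area while preserving the aspect ratio."""
    factor = math.sqrt(target_area / (width * height))
    return round(width * factor), round(height * factor)

# e.g., a hypothetical 800 x 600 source image
w, h = scaled_size(800, 600)
# w * h is within rounding error of 10% of the screen area
```

The same factor applied to both dimensions guarantees that the scaled image covers the target area regardless of the source image's original size or shape.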
During the experiment, we also displayed scrambled, diffeomorphic transformations of the original images. This transformation renders an image unidentifiable whilst preserving the original image's perceptual characteristics (Stojanoski & Cusack, 2014). The diffeomorphic process used the parameters specified in the original paper. The scrambled image is, therefore, comparable to the original in terms of colour and luminosity, minimising incidental effects on the pupil, which is sensitive to changes in such properties (Hepach & Westermann, 2016). Finally, the study also used an attention getter of a spinning flower paired with an attractive bell tone to maintain children's attention between trials.

Auditory stimuli
A female native speaker of German spoke the individual labels in infant-directed speech, in carrier phrases including the appropriate definite article for each object label, i.e., "der", "die", or "das". All stimuli were normalised to 70 dB and noise-filtered using GoldWave (GoldWave Inc., 2009).

Sticker stimuli
The same photorealistic images used as visual stimuli for the eye-tracking task were also used to create stickers for the choice task. The images were centred on a square grey background, scaled down to fit 5 cm × 5 cm sticker paper, and printed for use as sticker stimuli in the sticker choice task.

Parental questionnaires
Parents completed an adapted version of the FRAKIS vocabulary inventory (Szagun et al., 2009), which included only words from the categories used in the study, to estimate children's knowledge in each category. Parents completed an additional questionnaire to provide an estimate of their perception of their child's interest in each category. Here, parents reported, for each category: (a) how curious their child was about objects from the category, e.g., different kinds of animals or clothing; (b) how much enjoyment their child gets from objects from the category; (c) how many questions their child asks about objects from the category; and (d) how much time their child spends with objects from the category. Answers were reported on a seven-point Likert scale (1 = not at all, 7 = extremely).

Procedure
Children completed the eye-tracking task and the sticker task in one session. Tasks are described in the order in which they were administered; the order of the tasks was counterbalanced across participants in order to avoid familiarity or exposure effects. Parents were given the questionnaires to complete in the week leading up to the appointment and brought them along to the appointment. Gaze and pupil data were recorded using a Tobii X3-120 eye tracker with a sampling rate of 120 Hz. Stimuli were presented using Tobii Studio on a 40-inch screen with a resolution of 1920 × 1080 pixels. The actual images onscreen were centred and scaled to 960 × 720 pixels, resulting in a black border around each image (Fig. 1).

Eye-tracking task: looking time and pupillary activity
The eye-tracking task examined children's looking time towards images from different categories and their pupillary arousal to these images, using a method similar to Ackermann and colleagues (2020a). Participants were sequentially presented with all 72 images and their corresponding labels across 12 blocks, with each block containing one item from each of the six categories. Each block therefore comprised six trials, in which participants saw a single item per trial.
The first six blocks presented children with exemplars of the 36 items (6 items per category). The second six blocks presented different exemplars of the same 36 items, so all items had been presented once before they were repeated. The ordering of items within and between blocks was counterbalanced across participants. Trials had a total duration of 6000 ms and began with a scrambled image of the object, presented for 2000 ms (scrambled phase), followed by the unscrambled image (unscrambled phase), presented for 4000 ms. The object was labelled such that the onset of the object's label was exactly at 4000 ms, preceded by its definite article (in keeping with Ackermann, Hepach, & Mani, 2020). Blocks were interspersed with an attention-grabbing stimulus, whose presentation duration was determined by the experimenter, who continued with the task only when the child was fully attentive. The child's attentiveness was determined through a combination of a live-feed video of the child and the live-feed eye-tracking data. This task took approximately 7.5 min.

Sticker task
Following the eye-tracking task, children completed a sticker task in which they chose one of a pair of stickers as an index of their interest in the category to which the object on the sticker belonged. Children sat at a table facing the experimenter and were presented with 15 trials, in each of which they chose one of two stickers. Within a trial, the two stickers were always from different categories, such that the choices represented a choice between categories. Category pairs were counterbalanced so that each category was presented in competition with every other category equally often; for example, an animal exemplar competed with a vehicle exemplar as many times as with a food exemplar. In addition, the presentation order was randomised across participants, and the side to which stickers were presented was counterbalanced to eliminate any possible side bias.
Each trial started with the experimenter placing two stickers in front of the child, approximately 5–10 cm apart, while simultaneously asking, "Was gefällt dir besser?" ("Which one do you like better?"), to encourage the child to choose a sticker. Choices were typically coded according to whether the child picked up, pointed at, or touched one of the stickers relative to the other. Children were given no time limit to make a choice; the decision to move to the next trial was therefore at the discretion of the experimenter. The task was video-recorded so that coding could be verified offline. The task took approximately 2 min.

2.7. Pre-processing
2.7.1. Eye-tracking task: looking behaviour and pupillary activity
The eye-tracker provided data on where children were looking on the screen at a sampling rate of 120 Hz, i.e., one data point approximately every 8 ms. These data were coded with respect to the pixel coordinates of the objects presented on the screen, in order to calculate the amount of time children spent looking at each individual object. For both looking behaviour and pupillary activity, we only included fixations, as defined by the Tobii eye-tracker's default (>100 ms), and samples for which the eye-tracker reported gaze with high validity (Tobii validity score <2).
To calculate looking time toward each item in each trial, we aggregated gazes toward the object on screen only when the eye movement type was a fixation and the eyes could be tracked reliably (valid eye-tracking data). We then calculated the amount of time (in ms) that children spent looking at the individual objects during the unscrambled phase, which was then divided by the total duration of the unscrambled phase to provide an estimate of children's proportion of looking toward individual objects. We excluded any trials in which children did not look at the object on screen at any point during the unscrambled phase. We then averaged the proportion of looking time across the two different exemplars of each item in order to provide a single score for each item.
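As a rough sketch of this computation (not the authors' actual pipeline), the proportion of looking during the unscrambled phase can be derived from per-sample gaze records as follows; the column names and data layout are illustrative assumptions:

```python
import numpy as np
import pandas as pd

SAMPLE_MS = 1000.0 / 120.0   # ~8.33 ms per sample at 120 Hz
PHASE_MS = 4000.0            # duration of the unscrambled phase

def proportion_looking(samples: pd.DataFrame) -> float:
    """Proportion of the unscrambled phase spent fixating the object.

    `samples` holds one row per eye-tracker sample, with assumed boolean
    columns: `on_object` (gaze within the object's pixel bounds),
    `is_fixation` (Tobii fixation classification), and `valid`
    (Tobii validity score < 2).
    """
    good = samples[samples.on_object & samples.is_fixation & samples.valid]
    return len(good) * SAMPLE_MS / PHASE_MS
```

Per the text, such trial-level proportions would then be averaged across the two exemplars of each item.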
The eye-tracker also provided children's pupil diameter at a sampling rate of 120 Hz, i.e., one data point approximately every 8 ms. As with the gaze data, these data were coded with respect to the pixel coordinates of the objects presented on the screen, in order to calculate pupillary activity while children looked at each individual object. For each bin, we exported the size of the left and right pupils only where there were fixations toward the object on screen. We then filtered the pupil data using a threshold filter (separately for each eye), which calculated the difference in pupil size between two adjacent samples and excluded the top ten percent of differences. This was done to exclude large deviations in pupil size from one sample to the next, which are likely to be artefacts. We then interpolated missing data points with a sample size of 4 (Hepach et al., 2012) and finally calculated the average pupil size of both eyes. We performed a baseline correction on the pupillary data, subtracting the average pupil diameter during the last 500 ms of the scrambled phase from the pupil diameter during the last 2000 ms of the unscrambled phase, i.e., from the onset of the label. We then averaged the baseline-corrected pupillary response across the two different exemplars of each item in order to provide a single score for each item.
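Two of the pre-processing steps described above, the adjacent-sample threshold filter and the baseline correction, can be sketched as follows. This is a minimal illustration under assumed data layouts, not the original analysis code:

```python
import numpy as np

SAMPLE_HZ = 120  # eye-tracker sampling rate

def threshold_filter(pupil: np.ndarray) -> np.ndarray:
    """Mark samples whose absolute change from the previous sample falls
    in the top ten percent of adjacent-sample differences as missing,
    removing artefactual jumps in pupil size."""
    diffs = np.abs(np.diff(pupil))
    cutoff = np.nanpercentile(diffs, 90)
    out = pupil.astype(float).copy()
    out[1:][diffs > cutoff] = np.nan
    return out

def baseline_corrected_response(scrambled: np.ndarray,
                                unscrambled: np.ndarray) -> float:
    """Mean pupil size over the last 2000 ms of the unscrambled phase
    minus the mean over the last 500 ms of the scrambled phase."""
    baseline = np.nanmean(scrambled[-int(0.5 * SAMPLE_HZ):])
    response = np.nanmean(unscrambled[-int(2.0 * SAMPLE_HZ):])
    return response - baseline
```

The interpolation of short gaps (up to four samples) and the averaging across eyes would slot in between these two steps.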

Sticker choice task
We only included those trials in which children were coded as actually making a choice, either by pointing at, touching, or taking a sticker. Children were excluded from the analysis if they showed a persistent side bias, for example by only picking stickers from the left-hand side.

Parent reports
We report the pre-processing of the parental category interest questionnaire data in the next section, because initial examination of the data led to a deviation from the preregistration. With regard to the parental vocabulary questionnaire, we calculated the proportion of words that the child was reported to know in a category (understood and/or produced) relative to the total number of words from that category included in the standardised questionnaire.

Analysis plan
We first modelled the association between the parental interest questionnaire and the standardised, normed parental vocabulary questionnaire, in order to examine the discriminant validity of the two parental reports. Next, we modelled the association between the child's explicit sticker choice and parent reports of their child's interests, based on previous suggestions that parent reports correlate with children's choices (e.g., Mata et al., 2008). We then proceeded to our two other measures, looking time and pupillometry. Given the more prolific use of looking time in developmental research, we first modelled the association between the looking time measure and the explicit sticker choice and parent interest measures. Lastly, we modelled the association between the pupil measure and the other measures to establish the convergent validity of the four measures. Our rationale for modelling these associations one by one was three-fold: firstly, doing so establishes the convergent validity of pairs of measures; secondly, it leads to simpler statistical models that are easier to interpret; and thirdly, it ensures that we do not enter correlated predictors into the same model.

2.9. Model construction for analysis
2.9.1. Hypothesis 1 – discriminant validity of parent reports (model 1)
The four dimensions (familiarity, curiosity, questioning and joy, measured on a scale from 1 to 7) according to which parents estimated their children's interests were only weakly correlated. We therefore refrained from subjecting them to a PCA (as stated in the preregistration). Instead, we summed the interest scores per child across dimensions and divided the sum by 28 (the largest theoretically possible interest score), such that the resulting score ranged from 0 to 1. This was our measure of parents' perception of their child's interest in different natural categories (referred to hereafter as parent estimates of interest in the text, and in Tables 1–4 as parent interest). The response was not overdispersed given the model (dispersion parameter .90).
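The interest score described above is a simple normalisation; a minimal sketch (the function name is illustrative):

```python
def interest_score(ratings):
    """Sum the four 1-7 Likert ratings (curiosity, enjoyment,
    questioning, time spent) and divide by 28, the largest
    theoretically possible score, yielding a value bounded by 1."""
    assert len(ratings) == 4 and all(1 <= r <= 7 for r in ratings)
    return sum(ratings) / 28.0
```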
To estimate the extent to which our dependent variable, parent estimates of interest, was predicted by our independent variable, parents' estimation of the child's category-specific vocabulary knowledge (referred to hereafter as category-specific vocabulary size in the text, and in Tables 1, 3 and 4 as cat. spec. voc; coded as the proportion of words in a category known by the child relative to the total number of words pertaining to that category in the questionnaire, yielding a value between 0 and 1), we fitted a Generalised Linear Mixed Model (GLMM; Baayen, 2008) with a beta error distribution and logit link function (Bolker, 2008; McCullagh & Nelder, 1989). While we originally pre-registered a Gaussian error distribution for this model, we used a beta error distribution instead, since our dependent variable was bounded between zero and one. The beta error distribution has precisely these limits and is thus recommended for such response variables. Besides the fixed-effects predictor of category-specific vocabulary size, the model comprised random intercept effects for participant and category label (the six object categories used in this study). These random intercepts account for the potential non-independence of observations made for the same level of, for instance, participant; their inclusion thus avoids pseudoreplication (Hurlbert, 1980). To avoid an overconfident model and keep Type I error rates at a nominal level of .05, we included random slopes (Barr et al., 2013; Schielzeth & Forstmeier, 2009) of category-specific vocabulary size within participant. Random slopes account for the possibility that, for instance, participants vary not only in their average response but also in how much they are affected by variation in category-specific vocabulary size. Importantly, it has
been shown that neglecting such random slopes can dramatically inflate Type I error rates (Barr et al., 2013; Schielzeth & Forstmeier, 2009). Originally, we also included a random slope of category-specific vocabulary size within category label. However, we excluded this term following convergence issues and because the contribution of the random slope was estimated to be essentially zero. Similarly, we excluded parameters for correlations between random intercepts and slopes, as the respective model did not converge. Such correlation parameters account for the possibility that intercepts and slopes are correlated, which is quite likely, particularly in the presence of potential floor or ceiling effects. For instance, a participant who is very interested (as reported by the parents) will have a relatively high random intercept and, at the same time, their interest might be little affected by category-specific vocabulary size (as reported by the parents), simply because their interest is at or close to ceiling.
The sample analysed with this model comprised a total of 665 parent estimates of interest scores of 111 children for 6 categories.

Hypothesis 2 e convergent validity of parent reports and children's explicit choice (model 2)
To estimate the extent to which our dependent variable, children's explicit choice of stickers from different categories during the sticker choice task (sticker choice), was predicted by our independent variable parent estimates of interest (calculated as described for the first model, then rescaled to a mean of 0 and a standard deviation of 1), we fitted a model with parent estimates of interest as the sole fixed-effects predictor. As random intercept effects we included participant, category label, and object label (the different objects from each category presented to the children) nested within category label. Furthermore, we included random slopes of parent estimates of interest within all three grouping factors. In the initial full model, all absolute correlation parameters between random intercepts and slopes appeared to be essentially 1, which is indicative of them being unidentifiable (Matuschek et al., 2017). In fact, such large absolute correlation parameters frequently arise not from random intercepts and slopes being strongly correlated, but from having too few observations per level of a random effects factor to estimate them reliably. In such a case, the fits (assessed via log-likelihood) of a model with and a model without the correlation parameters will be very similar. Since the log-likelihoods of the models with and without the correlation parameters differed by less than .2, we report the results for the model excluding the correlation parameters (but retaining the random slopes).
To account for the fact that the child could choose only one of the two stickers presented in each trial, the data comprised two rows per trial, one for each of the two objects presented, with the response indicating which of the two options the child chose. Since the data consequently include two values of the response for each trial, we determined significance using a permutation test (Adams & Anthony, 1996; Manly, 2017). Such a test allows for a reliable significance test despite each decision of a child being represented by two rows in the data set, one for the object chosen and one for the object not chosen. We randomised the choice between the two objects in each trial and fitted the model to the randomised dataset. We conducted 1000 such permutations, including the original data as one permutation, and extracted the estimate obtained for the parent estimates of interest score. We then determined the significance of parent estimates of interest as the proportion of permutations revealing an absolute estimate at least as large as that of the original data.
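The logic of this permutation scheme can be illustrated with a toy statistic standing in for the GLMM coefficient; the data layout, statistic, and function name here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def permutation_p(trials, n_perm=1000, seed=0):
    """Permutation p-value for a choice effect.

    Each trial is a (chosen_value, unchosen_value) pair of some
    predictor (e.g. parent interest). Randomising which of the two
    objects counts as "chosen" builds the null distribution; as in
    the paper, the observed data are included as one permutation.
    The test statistic is a toy stand-in: the absolute mean predictor
    difference between chosen and unchosen objects.
    """
    rng = np.random.default_rng(seed)
    arr = np.asarray(trials, dtype=float)

    def stat(a):
        return abs(np.mean(a[:, 0] - a[:, 1]))

    observed = stat(arr)
    hits = 1  # the original data count as one permutation
    for _ in range(n_perm - 1):
        flip = rng.random(len(arr)) < 0.5
        perm = arr.copy()
        perm[flip] = perm[flip, ::-1]  # swap chosen/unchosen per trial
        if stat(perm) >= observed:
            hits += 1
    return hits / n_perm
```

With this construction, the smallest attainable p-value is 1/n_perm, since the observed data always count as one hit.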
The sample analysed with this model comprised a total of 2814 datapoints (i.e., 1407 choices) made by 95 participants, choosing among 36 objects belonging to 6 categories.

Hypothesis 3 e convergent validity of children's looking behaviour, parent reports and children's explicit choice (model 3)
To estimate the extent to which our dependent variable, children's looking time to objects from particular categories, was predicted by our independent variables category-specific vocabulary size (calculated as above, then scaled to a mean of 0 and sd of 1), parent estimates of interest (calculated as above, then scaled to a mean of 0 and sd of 1), and children's choices in the sticker task (first transformed into ranks for each category for each child, then divided by the maximum possible rank, and then scaled to a mean of 0 and sd of 1), we fitted a GLMM with a beta error structure and logit link function, comprising these three predictors as fixed effects. Since children's looking time is a continuous proportion (total time spent looking at the object relative to the duration of the trial), we used this particular mixed model with this error distribution. Category-specific vocabulary size was included since the first model (Hypothesis 1) did not indicate a correlation between category-specific vocabulary size and parent estimates of interest strong enough to exclude one of the predictors from this model. As random intercept effects, we included participant, category label, and object label (nested within category label). We included random slopes of all predictors within all three grouping factors. We omitted parameters for correlations among random intercepts and slopes, as the model including these parameters did not converge. As an overall test of the effect of the three fixed-effects predictors and to avoid cryptic multiple testing (Forstmeier & Schielzeth, 2011), we compared this full model with a null model lacking the three fixed-effects predictors but otherwise identical. We also checked the Variance Inflation Factor (VIF) of the predictors to avoid multicollinearity. The VIFs of all predictors were below our preregistered threshold of 4; therefore, we
proceeded with the model containing all fixed-effects predictors. To rule out values of the dependent variable being exactly 0 or 1 (which cannot be modelled with a beta error distribution), we transformed the response variable as suggested by Smithson and Verkuilen (2006).
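The Smithson and Verkuilen (2006) transformation referred to here is commonly given as y' = (y(n − 1) + 0.5)/n, with n the number of observations; a minimal sketch, assuming this is the variant the authors used:

```python
def squeeze_proportions(y, n):
    """Smithson & Verkuilen (2006): compress proportions from the
    closed interval [0, 1] into the open interval (0, 1) required by
    a beta model, via y' = (y * (n - 1) + 0.5) / n, where n is the
    number of observations."""
    return (y * (n - 1) + 0.5) / n
```

Note that the transformation pulls exact zeros up to 0.5/n and exact ones down to (n − 0.5)/n, while leaving 0.5 unchanged.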
The response was not overdispersed given the model (dispersion parameter .79).The sample analysed with this model comprised a total of 2411 observations of 83 participants' looking times to 36 objects from 6 categories.

Hypothesis 4 e convergent validity of children's pupillary activity, looking behaviour, parent reports and children's explicit choice (model 4)
To estimate the extent to which our dependent variable pupillary activity, as an index of children's interest in the different natural object categories, was predicted by category-specific vocabulary size, parent estimates of interest, children's proportion of looking time, and children's choices in the sticker task, we fitted a Linear Mixed Model (LMM) with these four predictors as fixed effects. To control for repeated observations, we further included random intercept effects of participant, category label, and object label (nested within category label). We included random slopes of all four predictors within all three grouping factors. We omitted parameters for correlations among random intercepts and slopes within category, as all absolute correlation parameters within category appeared to be essentially 1; this led to a reduction in log-likelihood of about .3. We fitted the model with a Gaussian error structure and identity link, using maximum likelihood. The model fitted on all data revealed two absolute residuals much larger than the others (see Supplementary Materials S2 for residual plots). Since such outliers could make the model conservative, we excluded them from the data (see Supplementary Materials S2 for the results including these two outliers). We also checked the Variance Inflation Factor (VIF) of the predictors to uncover potential multicollinearity. The VIFs of all predictors were below our preregistered threshold of 4; therefore, we proceeded with the model containing all fixed-effects predictors.
We evaluated whether the assumptions of normality and homogeneity of the residuals were fulfilled by visual inspection of a Q–Q plot and of residuals plotted against fitted values. Neither indicated severe violations of these assumptions (after exclusion of the two outliers). The dataset analysed with this model comprised a total of 1775 baseline-corrected pupil dilation datapoints from 78 participants as they viewed 36 objects from 6 categories.

Exploratory psychometric analyses
In order to obtain a better picture of the association between our four measures of interest, we followed the Editor's suggestion to construct a latent variable model, with which we modelled the four interest measures as latent variables. We first attempted a Confirmatory Factor Analysis (CFA), which failed to converge. Our further attempt to examine the data by determining the Kaiser-Meyer-Olkin measure of sampling adequacy (KMO) was also unsuccessful. Further details and discussion of this analytic approach can be found in the Supplementary Materials (S5). Finally, we examined a Multi Trait Multi Method Matrix (MTMM), developed by Campbell and Fiske (1959), as a way of assessing the convergent validity of our four measures of interest (Alwin, 1973). We constructed this matrix by correlating the four interest measures and our measure of children's category-specific vocabulary, for each of the six object categories, with one another, creating a 30 × 30 matrix of correlations among the measures of interest and vocabulary for each of the six different object categories.
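Constructing such a matrix amounts to correlating every measure-by-category column with every other column; a minimal sketch with illustrative column names (one row per child, 5 measures × 6 categories = 30 columns in the paper's case):

```python
import numpy as np
import pandas as pd

def mtmm(scores: pd.DataFrame):
    """Return the pairwise correlation matrix among all
    measure-by-category columns, plus the average absolute
    off-diagonal correlation (the summary reported in the paper)."""
    corr = scores.corr()
    off_diag = ~np.eye(len(corr), dtype=bool)
    return corr, float(np.nanmean(np.abs(corr.values[off_diag])))
```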

Results
While we made every effort to follow the analyses outlined in our preregistration as closely as possible, there were some deviations in our actual analyses. We have annotated all changes to our preregistration within our OSF project (https://osf.io/eysn2).

Item-level descriptives
Here, we provide descriptive statistics (median and quartiles), in the form of plots, pooled by the six different categories presented to the children, for all five measures used in this study. Because the sample size varies from model to model, rather than plotting descriptives separately for the predictors and responses of each of the four linear models, we plot descriptives for all available data for each variable; the means and standard deviations for each variable within each model will therefore differ slightly from those provided below. The variable transformations, numerical means and standard deviations, and additionally the median and quartile values for these items can be found in the Supplementary Materials (S6) (see Fig. 2).

Hypothesis 1 – discriminant validity of parent reports
Model 1 suggested that parents' estimates of children's interest increased with category-specific vocabulary size (χ²(1) = 7.65, p = .006; Table 1; Fig. 3), i.e., there was an association between parents' estimation of the categories their child was interested in and parents' estimation of the number of words they believed their child to know in each of these categories.
[Table 1 note: The predictor cat. spec. voc was z-transformed to a mean of zero and a standard deviation (sd) of one; mean and sd of the original predictor were .83 and .21, respectively. Indicated are estimates, together with standard errors, 95% confidence intervals, and the minimum and maximum of model estimates after excluding random effects one at a time.]

Hypothesis 2 -convergent validity of parent reports and children's explicit choice
Overall, we could not detect an association between parent estimates of interest and children's choices in the sticker task, i.e., between parents' estimation of the categories they believed their child to be interested in and the child's choice of stickers from those categories (permutation test: p = .14; Table 2, Fig. 4).

3.4. Hypothesis 3 – convergent validity of children's looking behaviour, parent reports of interest, and children's explicit choice, and discriminant validity of parent reports of children's vocabulary size
Overall, there was no significant difference between the full and null model (χ²(3) = 7.22, p = .06). Category-specific vocabulary size was positively correlated with children's proportion of looking; there was no significant association between children's proportion of looking and any other predictor (Table 3, Figs. 5–7). Thus, we found no convergent validity between children's interests as estimated by their caregivers, children's explicit choice of items from a category, and children's looking time to objects from that category. Furthermore, given that category-specific vocabulary size and children's looking time were more strongly related than looking time and the other measures of interest, we find no evidence for the discriminant validity of the parent reports of category knowledge and our measures of children's interests.
[Figure and table note: Dots show observations, with the area of each dot depicting the number of observations with the exact same rating in both variables (range: 1 to 47). The dashed line and grey polygon depict the fitted model and its 95% confidence limits. For readers wishing to see a different visualisation of this and the following data, plots in which each individual data point is represented by a single dot are provided in the Supplementary Materials (S7). The predictor parent interest was z-transformed to a mean of zero and a standard deviation (sd) of one; mean and sd of the original predictor were .73 and .20, respectively. Indicated are estimates, together with standard errors, 95% confidence intervals, and the minimum and maximum of model estimates after excluding random effects one at a time.]

Hypothesis 4 – convergent validity of children's pupillary activity, looking behaviour, parent reports of children's vocabulary size and interest, and children's explicit choice
Overall, there was no significant difference between the full and null model (χ²(4) = 1.46, p = .74). Inspection of the model results and data revealed that the baseline-corrected measure of children's pupillary dilation was not significantly associated with any of the predictor variables (Table 4; see Supplementary Materials S2 for model figures).

Results for random effects
In all models, some of the random effects were estimated to contribute to the response at least to some extent. Model 1 attributed considerable variation in the response to differences between categories and participants; Model 2 attributed considerable variation to differences between categories; and Model 3 attributed considerable variation to differences between categories, participants and objects. Only in Model 4 were none of the random intercept effects estimated to contribute considerably to the response (Tables in Supplementary Materials S1). The contribution of random slopes was almost invariably estimated to be smaller than that of random intercepts; still, some of them seemed to have contributed to the response.
[Figure and table note: For data visualisation purposes, dots show the proportion of stickers chosen by a child for a category as a function of the parent's estimation of their child's interest in that category, with the area of each dot depicting the number of observations with the exact same rating in both variables (range: 4 to 60). The dashed line and grey polygon depict a fitted model and its 95% confidence limits. All predictors were z-transformed to a mean of zero and a standard deviation (sd) of one; mean (sd) of the original predictors were: parent interest .73 (.20); cat. spec. voc .84 (.20); sticker choice .58 (.27). Indicated are estimates, together with standard errors, 95% confidence intervals, significance tests, and the minimum and maximum of model estimates after excluding random effects one at a time.]

The Multi Trait Multi Method Matrix (MTMM)
In the MTMM, correlating the interest measures per object category and the object categories per interest measure, we found an average absolute correlation of about .23. The largest absolute correlation was about .67; only five absolute correlations were larger than .6, and six were larger than .5. All of these stronger correlations were within-measure correlations, e.g., correlations between looking time to food and looking time to clothing (Table 5).

Discussion
The current study investigated the association between different measures of interest that have been utilised in the literature to date, in order to examine the convergent validity of these measures. The aim of the study was to provide greater understanding of the measures used in developmental research, and thereby to guide methodological choices in future infancy research. We chose four measures of interest that have been commonly used in previous research: parents' estimations of children's interest in different natural categories, children's explicit choice of objects from particular categories over other objects, their looking behaviour toward objects from different categories, and their pupillary activity during their fixations toward these objects. Furthermore, given the robustness of parental vocabulary questionnaires in developmental research, we also examined the discriminant validity of parent reports of children's vocabulary size against the other measures. The analyses showed weak correlations between parent reports of category-specific vocabulary knowledge, parent reports of children's interest, and children's looking behaviour to objects of different categories.
We found no other associations between any of the measures.We discuss each of these validation steps separately, followed by our interpretation of the implication of these results for future research.

Discriminant validity of parent reports
The first analysis examined the relationship between parent reports of children's category-specific vocabulary size and children's interests. Our results showed that parent reports of category knowledge, i.e., the number of words that they believed their child knew in each category, were significantly and positively associated with how interested they believed their child to be in that category. In other words, we find low discriminant validity between parental reports of children's interests and their vocabulary size. In what follows, we discuss potential reasons for the reduced discriminant validity of these measures. Language is acquired knowledge, and parents, as primary caregivers, provide vital language input to their children, thereby shaping their vocabulary growth (Bergelson et al., 2023). Caregivers also shape the environment around children by providing them with toys and playthings, and further input through activities such as visits to museums (DeLoache et al., 2007; Leibham et al., 2005). Caregivers spend extended periods of time with their children during daily activities and play, observing their children's play and exploratory behaviour, which allows them to track their children's development (cf. Kartushina et al., 2022) as well as their selective attention towards, and interest in, certain objects and activities. These shared periods may provide insight into the reported association between caregiver estimates of children's category knowledge and category interests.
[Figure note: The dashed line and grey polygon depict a fitted model and its 95% confidence limits, at the average of the other predictor.]
However, we cannot determine the directionality of the association between category-specific vocabulary and parent estimations of children's interest reported above. On the one hand, parents may use their estimate of the number of words their child knows in a category as an index of their child's interest in that category ("my child knows many animal words, therefore she must be interested in animals"), thereby conflating knowledge with interest. On the other hand, parents may incorrectly assume that their child knows more or fewer words in a category because their child is more or less interested in that category, and so provide inflated or diminished reports of their child's category knowledge. Indeed, there is evidence to suggest that parents may not be able to estimate their children's vocabulary accurately, and may underestimate the number of words known to their child (Houston-Price et al., 2007; Venker et al., 2016).
At the same time, we note that variation in category knowledge may itself be indirectly tapping into children's interests, to the extent that children have been reported to know more about objects and categories that they are more interested in (Chi & Koeske, 1983; DeLoache et al., 2007; Johnson et al., 2004). In other words, even if parents used their estimate of their child's category knowledge to estimate their child's interest in that category, this may provide us with an indirect measure of the construct under examination, i.e., the child's interest in different categories. Ongoing work in our lab is currently examining this possibility with a longitudinal study of child and parent indices of interest (Madhavan & Mani, in prep.). Thus, in an ideal world where both measures provide accurate estimates of the constructs under examination, children may actually know more words in the categories they are more interested in, due to their intrinsic motivation to learn more about these categories and increased attention to objects from these categories.
Our conclusions are necessarily limited by the current lack of evidence for the reliability of parent reports of children's interests. However, we note that upcoming work (Madhavan & Mani, in prep.) suggests that parent estimates of children's interests may have robust reliability, such that parental reports of their children's interests correlate across development. Thus, further examining the potential reasons for the low discriminant validity between parent reports of their child's interests and vocabulary may provide promising insights for the field.

4.2. Parent reports of children's interests and children's explicit choice
The linear mixed effects model revealed no association between children's explicit choice of objects (stickers) from different categories and parent estimations of children's interest in objects from these categories. While Fig. 4, which illustrates the relationship between children's explicit choice of objects from different categories and parent estimations of their child's interest in these categories, suggests a slight positive association between these variables, the linear model suggested that much of the variation in children's explicit choices was accounted for by the random effect of category; i.e., there is significant variation in children's choices across categories, but their choices do not co-vary with parent estimations of children's interests. In simple terms, this would suggest that children overwhelmingly chose certain categories above others, with little opportunity for variation in these choices to be explained by parents' estimation of their child's interest in these categories. Alternatively, parents may be better at estimating their child's true interest in certain categories but not in others, but the correlations in Table 5 do not suggest that this is the case. In other words, we have no robust evidence of an association between children's explicit choice of objects from different categories and parents' estimation of children's interest in these categories.

Table 5 – Multi-Trait Multi-Method matrix with all variables measured. Note: We have separated the rows and columns by method, and visually highlighted the mono-method blocks on the grand diagonal (in blue) and the hetero-method blocks (in yellow). The darker yellow squares index the validity diagonals within each hetero-method block (mono-trait hetero-method). The abbreviation anim refers to the object category of animals, cloth to clothing, bodyp to body parts, furn to furniture, and vehic to vehicles.

Cortex 175 (2024) 124-148
While this finding calls into question the extent to which parent estimates provide a true index of children's developing interests, ongoing work suggests that parent estimates of children's interests may be more robustly associated with children's behaviour in naturalistic social interactions, where children appear to be more engaged with objects from categories that their parents reported their child to be interested in (Madhavan & Mani, in press). Thus, despite the lack of association between parent estimates of children's interests and children's explicit choice of objects from categories of interest, parent reports of children's interests may still be tapping into some dimensions of children's interests that future research may be well advised to examine.

4.3. Children's looking behaviour, parent reports of category knowledge and category interest, and children's explicit choice

We found a significant association between the amount of time children spent looking towards objects from different categories and parents' estimate of their child's vocabulary knowledge in those categories. Thus, we found no evidence of the discriminant validity of these two measures. Furthermore, we found no significant association between the amount of time children spent looking towards objects from different categories and parents' estimation of their child's interest in those categories. Neither did we find a significant association between the amount of time children spent looking towards objects from different categories and children's choice of stickers of those very objects in the explicit choice task. In other words, we found little evidence for the convergent validity of the three measures of interest examined.
First, we consider the relationship between the amount of time children spent looking towards objects from different categories and children's choice of stickers of those very objects in the explicit choice task. We note that there was a critical difference in the design of the two tasks that may impact the responses obtained. In particular, the choice task required children to choose one of two stickers depicting objects from two different categories across 15 trials, while the looking time task measured the amount of time children spent looking at a single image presented in isolation on the screen. We had assumed that counterbalancing the pairs of categories presented together across trials would allow us to create a ranking of categories, since each category was paired with a token from each of the other five categories. However, it remains possible that the individual items presented in category pairs may influence children's choices in a manner that raises doubts as to the generalisability (Yarkoni, 2022) of the ranking of categories obtained in this task (and indeed, the other tasks employed in the current study). The single-object presentation of items in the looking time task may, therefore, provide a more unbiased estimate of children's attention to objects from a given category. However, we note that we are equally limited regarding claims of the generalisability of the findings to other items from the same category.
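The counterbalanced-pairs logic described above can be sketched as follows. The category names and trial choices below are hypothetical, and this is an illustration of how a ranking can be derived from pairwise choices, not the study's actual scoring script.

```python
from itertools import combinations

# Six object categories, as in the study; the sixth label is illustrative.
categories = ["animals", "clothing", "bodyparts", "furniture", "vehicles", "food"]

# With 6 categories, pairing each with every other yields C(6,2) = 15 trials.
pairs = list(combinations(categories, 2))
assert len(pairs) == 15

# Hypothetical choices: which member of each pair the child picked.
# (Toy data: the child always picks the first category listed in the pair.)
choices = [a for (a, b) in pairs]

def rank_categories(pairs, choices):
    """Rank categories by how often they were chosen across pairwise trials."""
    wins = {c: 0 for c in categories}
    for (a, b), chosen in zip(pairs, choices):
        assert chosen in (a, b)
        wins[chosen] += 1
    # Each category appears in exactly 5 trials, so wins range from 0 to 5.
    return sorted(wins, key=wins.get, reverse=True)

ranking = rank_categories(pairs, choices)
```

Because each category meets every other exactly once, win counts are directly comparable; the worry raised in the text is that any one win may reflect the specific tokens shown rather than the categories themselves.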
This explanation may also speak to the lack of convergent validity between parental reports of children's interests and children's looking behaviour. In the looking time task, children were presented with individual objects, with the average looking time to different category members serving as an index of their interest in the category. In contrast, parents rated their child's interest in the category as a whole, which may capture a more global measure of their interest in the category. Thus, while the parental estimate may tap into the more individual interests of the child ("my child is interested in animals"), the looking behaviour may be tapping into more situational interests with regards to the child's interest in a particular animal.
Note that we find a significant positive association between category-specific vocabulary size and children's looking behaviour and, as noted earlier, a positive association between category-specific vocabulary size and parents' estimates of children's interests. Yet we do not find a significant relationship between children's looking behaviour and parent estimates of their child's interests. We explain this pattern of results with recourse to the size of the effects reported, all of which are quite small, and the potentially low reliability of many of the measures examined here. Of the measures examined here, only parental reports of vocabulary knowledge have been demonstrated to have adequate reliability, with ongoing work finding some evidence for the longitudinal reliability of parent measures of children's interests (Madhavan & Mani, in prep.) but not of looking time measures (Schreiner et al., under review; see also DeBolt et al., 2020; Nighbor et al., 2017). Thus, given that parent estimates of vocabulary size are associated with the other two measures and have greater reliability, it is, indeed, likely that we find stronger correlations between this more reliable measure and the other measures than between the different measures of interest. Thus, our conclusions are contingent on further evidence of the reliability of the different measures, which we highlight as a pressing need for future research.
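The argument that low reliability caps observed correlations can be made concrete with Spearman's classic correction for attenuation, under which the expected observed correlation is the true-score correlation scaled by the square root of the product of the two measures' reliabilities. The reliability values below are hypothetical, chosen only to illustrate the size of the effect.

```python
import math

def attenuated_r(r_true, rel_x, rel_y):
    """Expected observed correlation between two noisy measures whose
    underlying (true-score) correlation is r_true, given their reliabilities."""
    return r_true * math.sqrt(rel_x * rel_y)

# Hypothetical scenario: a genuine correlation of .5 between two interest
# measures, one well-validated (reliability .8, e.g., a parental vocabulary
# report) and one poorly reliable (.3, e.g., a single-trial looking measure).
r_obs = attenuated_r(0.5, 0.8, 0.3)  # roughly .24, less than half of r_true
```

On these (illustrative) numbers, even a substantial true association would surface as a small observed correlation, which is consistent with the pattern of weak effects reported here.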
Relatedly, we note potential issues of ceiling effects in our looking behaviour data, i.e., many children look at the objects on screen for the entirety of the trial.2 In our design of the looking time measure, we tried to ensure that individual trials were not too long (our trials lasted 4000 ms). While most research has examined the response of concern within this timeframe, we see that with simple single stimuli, most children look at the object for more than 4 s, and only look away on average 4 s after the object appears on screen (e.g., Cohen, 1972; Kagan & Lewis, 1965). Thus, our trial duration may be too short to capture variability in the extent to which looking time duration reflects interest. However, we note that the dispersion parameters for the model examining the looking time data, together with posterior predictive checks, suggest that the model is only 1-inflated to a minor extent. Nevertheless, future studies may consider increasing the length of the trials to examine the extent to which looking behaviour in longer trials provides a more valid and reliable measure of interest.
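As a minimal sketch of the ceiling diagnostic described above, the snippet below computes the share of trials at the 1.0 ceiling (the child looked for the entire trial) from a set of hypothetical per-trial looking proportions; the values are illustrative, not the study's data.

```python
# Hypothetical per-trial proportions of a 4000 ms trial spent looking on screen.
looking_proportions = [1.0, 1.0, 0.92, 1.0, 0.75, 1.0, 0.88, 1.0, 0.6, 1.0]

# Share of trials pinned at the ceiling: any variability in interest beyond
# the trial's end is unobservable for these trials.
ceiling_rate = sum(p >= 1.0 for p in looking_proportions) / len(looking_proportions)
```

A high ceiling rate in real data would motivate either a one-inflated model for the proportions (as checked in the text via posterior predictive checks) or longer trials.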

4.4. Children's pupillary activity, looking time behaviour, explicit choices and parent estimates of children's interest and category knowledge

Our final analysis examined the convergent validity of pupillary activity as an index of children's interests relative to the other measures examined above. In short, we found no associations between the category-specific pupillary response and any of the four predictors entered into the model. We note that this mirrors earlier research by Ackermann, Hepach, and Mani (2020), who similarly report no association between children's pupillary activity and parental estimates of category knowledge and category interest, although differences in children's pupillary arousal to category members (and parents' estimates of their child's interests in different categories) modulated children's learning of novel category members. As argued above, the lack of an association could be explained with regards to parental estimates tapping into more long-term individual interests of the child, while pupillary activity may be tapping into children's token-induced arousal for specific category members.
What is puzzling, however, is the absence of a relationship between children's looking behaviour and pupillary arousal. Above, we argued that the disconnect between looking behaviour and parental estimates of children's interests could stem from the distinction between situational and individual interests invoked here. This being the case, we would expect there to be an association between looking behaviour and pupillary arousal. The lack of convergent validity here, especially given the lack of discriminant validity between looking behaviour and category knowledge, suggests either that there may be issues with the reliability of the measures under consideration or that the two measures may be tapping into different aspects of children's engagement with the images presented. Given that high reliability likely yields stronger correlations between the measures under investigation, it is noteworthy that there are reports of limited to no reliability of looking time measures, while some recent efforts have demonstrated the reliability of pupillometry as a marker of attention in developmental research (e.g., Calignano et al., 2023; Neagu et al., 2023). Alternatively, the two measures may be tapping into different aspects of children's engagement with the images. For example, the pupillary measure may be tapping into the involuntary arousal following exposure to particular category members, while the looking time behaviour may capture children's disengagement from category members, which may in turn be influenced by their knowledge about the category (as suggested above).
At the very least, the lack of convergent validity across the measures included highlights the need for future research to, firstly, actively take steps to establish the reliability of the measures and, secondly, better understand the different aspects of the construct of interest that the different measures may be tapping into, and to include multiple measures in future studies of the role of interest in learning outcomes. Other possible avenues include longitudinal studies that may allow us to better determine how these measures play out and predict development. Alternatively, future studies could examine the extent to which single-item looking behaviour appropriately captures the cognitive processes and states underlying the behaviour, or the extent to which the pupillary measure varies across the duration of the trial.3

4.5. The Multi-Trait Multi-Method matrix (MTMM)

We tried to obtain a more comprehensive picture of the association between our four measures of interest using additional exploratory psychometric analyses. However, our efforts were unsuccessful for various reasons (see Supplementary Materials S5). Thus, we constructed a Multi-Trait Multi-Method matrix to visualise the correlations within and across methods (our four measures of interest) and 'traits' (here, our six object categories). Critically, the MTMM showed that the correlations in the validity diagonals (i.e., the mono-trait, hetero-method correlations in the dark yellow diagonals within Table 5) were small. Indeed, they were often smaller than the hetero-trait, hetero-method correlations, i.e., the other yellow squares within a single black-bordered box. This supported the results reported above of limited convergent validity across measures, even for the same trait, i.e., the same object category, examined in the matrix. We also found considerable evidence for method variance, given that hetero-trait, mono-method correlations (light blue squares) were often higher than hetero-trait, hetero-method correlations (light yellow squares), showing that the relationship between the object categories (trait factors) within the same interest measure was stronger than the relationship between the interest measures (the method factors). The MTMM further underscored our conclusions from the results of the linear models; i.e., we found limited evidence for convergent validity of the different measures of interest.
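The construction of such a matrix can be sketched as follows. This is a minimal illustration on simulated data (two methods and three traits for brevity, random values in place of real scores); the study itself used four methods and six object categories.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 40 children, each measured by 2 methods on 3 traits.
methods = ["parent", "looking"]
traits = ["anim", "cloth", "vehic"]
n = 40

data = pd.DataFrame(
    {f"{m}_{t}": rng.normal(size=n) for m in methods for t in traits}
)

# The full MTMM is simply the correlation matrix over all method x trait
# variables; its blocks are then read off by position.
corr = data.corr()

# Validity diagonal: the same trait measured by different methods
# (mono-trait, hetero-method), e.g., parent report vs looking for animals.
validity = {t: corr.loc[f"parent_{t}", f"looking_{t}"] for t in traits}

# Mono-method block: different traits within one method (hetero-trait,
# mono-method). If these exceed the validity diagonal, method variance
# dominates, as reported for Table 5.
method_block = corr.loc[[f"parent_{t}" for t in traits],
                        [f"parent_{t}" for t in traits]]
```

Convergent validity requires the validity-diagonal entries to be large; discriminant validity requires them to exceed both the hetero-trait, hetero-method and the hetero-trait, mono-method correlations (Campbell & Fiske's criteria).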

Conclusion
Taken together, our analysis of the convergent and discriminant validity of the different measures that have previously been assumed to tap into the construct of interest in early development suggests that the associations between these measures are oftentimes neither strong nor significant. These conclusions were supported by the results of the MTMM, which showed little evidence for convergent validity across the measures examined. While these measures were included in the current study because they have all been implicated with regards to developmental outcomes in previous studies, the lack of convergent validity speaks to the real need for further research to examine the reliability of the measures. Indeed, without information pertaining to the reliability of any single psychological measure, it becomes more difficult to interpret the convergent validity of different measures assumed to tap into the same construct. Only after this can we better disentangle what the measures may be tapping into, especially with regards to the different aspects of the construct of interest, as well as ensure that future research includes multiple measures to better capture the role of these different aspects of interest in learning. Especially with regards to the measures that showed an association with category-specific vocabulary size (parental estimates of interest and looking behaviour), there is a clear need for further investigation of either the discriminant validity of these measures or the potential causal mechanisms that may explain the low discriminant validity we find. This does not necessarily suggest that the four measures we tested do not index active interest in young children. We merely highlight the possibility that these measures are either too unreliable for us to determine whether they are measuring the construct of interest at all, or, to take this further, that they index different aspects of the construct of infant interest. We note, in particular, the need to disentangle immediate situational interests from more long-term individual interests, and the extent to which they may be differentially related to children's knowledge and play different roles in developmental outcomes.
It is vital for infancy research to use measures and methodologies that actually measure what we intend to measure, whether the research concerns group-level or individual-differences questions, and to ensure our research is replicable (Kucker & Chmielewski, 2022). Our attempt to tackle the considerable issue of validity in infant research measures is only the first in a series of steps we must undertake to improve the quality of our research practices. The current study is not without its limitations, in both design and interpretation; but future and upcoming studies can build upon it to further tackle the validity issue and similar issues in the infant research field.

Fig. 1 – Schematic of the two tasks performed by the children.

Fig. 2 – Median and quartiles of the measures (y-axis) with respect to the six different categories (x-axis) for each of the five variables. The x-axis label body.p refers to the object category of body parts. The horizontal line depicts the median, while error bars depict quartiles.

Fig. 3 – Parents' estimation of children's interest in different categories as a function of parents' estimation of children's category-specific vocabulary size (Model 1). Dots show observations, whereby the area of the dots (i.e., how much space a dot occupies on the graph) depicts the number of observations with the exact same rating in both variables (range: 1 to 47). The dashed line and grey polygon depict the fitted model and its 95% confidence limits. Note: For readers wishing to see a different visualisation of this and the following data, we have provided plots where each individual data point is represented by a single dot, for ease of interpretation, in the Supplementary Materials (S7).

Fig. 4 – Proportion of objects from different categories explicitly chosen by children as a function of parents' estimation of children's interest in those categories (Model 2). For data visualisation purposes, dots show the proportion of stickers chosen by a child for a category as a function of parents' estimation of their child's interest in that category, whereby the area of the dots (i.e., how much space a dot occupies on the graph) depicts the number of observations with the exact same rating in both variables (range: 4 to 60). The dashed line and grey polygon depict the fitted model and its 95% confidence limits.

Fig. 5 – Proportion of looking time toward objects (looking behaviour) from different categories by children as a function of parents' estimation of the number of words known to children in those categories (cat.spec.voc) (Model 3). Dots show the proportion of looking time per cat.spec.voc score, whereby the area of the dots (i.e., how much space a dot occupies on the graph) depicts the number of observations with the exact same rating in both variables (range: 1 to 134). The dashed line and grey polygon depict the fitted model and its 95% confidence limits, at the average of the other predictor.

Fig. 6 – Proportion of looking time toward objects (looking behaviour) from different categories by children as a function of parents' estimation of children's interest (parent interest) in those categories (Model 3). Dots show the proportion of looking time per parent interest score, whereby the area of the dots (i.e., how much space a dot occupies on the graph) depicts the number of observations with the exact same rating in both variables (range: 1 to 35). The dashed line and grey polygon depict the fitted model and its 95% confidence limits, at the average of the other predictor.

Fig. 7 – Proportion of looking time toward objects (looking behaviour) from different categories as a function of children's choice of objects (sticker choice) from those categories during the sticker task (Model 3). Dots show the proportion of looking time per sticker choice ranking, whereby the area of the dots (i.e., how much space a dot occupies on the graph) depicts the number of observations with the exact same rating in both variables (range: 1 to 68). The dashed line and grey polygon depict the fitted model and its 95% confidence limits, at the average of the other predictor.

Table 1 – Results for the GLMM examining parent interest.

Table 2 – Results for the GLMM examining sticker choice.

Table 3 – Results for the GLMM examining looking behaviour.

Table 4 – Results for the GLMM examining pupillary activity. Indicated are estimates, together with standard errors, 95% confidence intervals, and significance tests, as well as the minimum and maximum of model estimates after excluding random effects one at a time.