Across their first years of life, children learn words at an accelerating pace (Frank, Braginsky, Yurovsky, & Marchman, 2021). While many children will only produce their first word at around 1 year of age, most children show signs of understanding many common nouns (e.g., mommy) and phrases (e.g., Let’s go bye-bye!) much earlier in development (Bergelson & Swingley 2012, 2013; Tincoff & Jusczyk, 1999). Although early word understanding is a critical element of first language learning, the processes involved are less directly apparent in children’s behaviors and are less accessible to observation than developments in speech production (Fernald, Zangl, Portillo, & Marchman, 2008; Hirsh-Pasek, Cauley, Golinkoff, & Gordon, 1987). To understand a spoken word, children must process the incoming auditory signal and link that signal to relevant meanings—a process often referred to as word recognition. One of the primary means of measuring word recognition in young infants is using eye-tracking techniques that gauge where children look in response to linguistic stimuli (Fernald, Zangl, Portillo, & Marchman, 2008). The logic of these methods is that if, upon hearing a word, a child preferentially looks at a target stimulus rather than a distractor, the child is able to recognize the word and activate its meaning during real-time language processing. Measuring early word recognition offers insight into children’s early word representations: children’s speed of response (i.e., moving their eyes; turning their heads) to the unfolding speech signal can reveal children’s level of comprehension (Bergelson, 2020; Fernald, Pinto, Swingley, Weinberg, & McRoberts, 1998). Word recognition skills are also thought to build a foundation for children’s subsequent language development. Past research has found that early word recognition efficiency is predictive of later linguistic and general cognitive outcomes (Bleses, Makransky, Dale, Højen, & Ari, 2016; Marchman et al., 2018).

While word recognition is a central part of children’s language development, mapping the trajectory of word recognition skills has remained elusive. Studies investigating children’s word recognition are typically limited in scope to experiments in individual labs involving small samples tested on a handful of items. The limitations of single datasets make it difficult to understand developmental changes in children’s word knowledge at a broad scale.

One way to overcome this challenge is to compile existing datasets into a large-scale database in order to expand the scope of research questions that can be asked about the development of word recognition abilities. This strategy capitalizes on the fact that the looking-while-listening paradigm is widely used, and vast amounts of data have been collected across labs on infants’ word recognition over the past 35 years (Golinkoff, Ma, Song, & Hirsh-Pasek, 2013). Such datasets have largely remained isolated from one another, but once combined, they have the potential to offer general insights into lexical development. There has been a long history of efforts to aggregate data in a unified format in developmental and cognitive psychology, generating projects that have often had a tremendous impact on the field. Prominent examples in language research include the English Lexicon Project, which provides an open repository of psycholinguistic data for over 80,000 English words and non-words in order to support large-scale investigations of lexical processing (Balota et al., 2007); the Child Language Data Exchange System (CHILDES), which has played an instrumental role in the study of early language environments by systematizing and aggregating data from naturalistic child–caregiver language interactions (MacWhinney, 2000); and WordBank, which aggregated data from the MacArthur-Bates Communicative Development Inventory, a parent-report measure of child vocabulary, to deliver new insights into cross-linguistic patterns and variability in vocabulary development (Frank, Braginsky, Yurovsky, & Marchman, 2017b, 2021). In this paper, we introduce Peekbank, an open database of infant and toddler eye-tracking data aimed at facilitating the study of developmental changes in children’s word recognition.

Measuring word recognition: The looking-while-listening paradigm

Word recognition is traditionally studied in the looking-while-listening paradigm (Fernald, Zangl, Portillo, & Marchman, 2008, alternatively referred to as the intermodal preferential looking procedure, Hirsh-Pasek, Cauley, Golinkoff, & Gordon, 1987). In these studies, infants listen to a sentence prompting a specific referent (e.g., Look at the dog!) while viewing two images on the screen (e.g., an image of a dog—the target image—and an image of a bird—the distractor image). Infants’ word recognition is evaluated by how quickly and accurately they fixate on the target image after hearing its label. Past research has used this basic method to study a wide range of questions in language development. For example, the looking-while-listening paradigm has been used to investigate early noun knowledge, phonological representations of words, prediction during language processing, and individual differences in language development (Bergelson & Swingley, 2012; Golinkoff, Ma, Song, & Hirsh-Pasek, 2013; Lew-Williams & Fernald, 2007; Marchman et al., 2018; Swingley & Aslin, 2002).

While this research has been fruitful in advancing understanding of early word knowledge, fundamental questions remain. One central question is how to accurately capture developmental change in the speed and accuracy of word recognition. There is ample evidence demonstrating that infants become faster and more accurate in word recognition over the first few years of life (e.g., Fernald et al., 1998). However, precisely measuring developmental increases in the speed and accuracy of word recognition remains challenging due to the difficulty of distinguishing developmental changes in word-recognition skill from changes in knowledge of specific words. This problem is particularly thorny in studies with young children, since the number of items that can be tested within a single session is limited and items must be selected in an age-appropriate manner (Peter et al., 2019). More broadly, key differences in the design choices (e.g., how distractor items are selected) and analytic decisions (e.g., how the analysis window is defined) between studies can obscure developmental change if not appropriately taken into account.

One approach to addressing these challenges is to conduct meta-analyses aggregating effects across studies while testing for heterogeneity due to researcher choices (Bergmann et al., 2018; Lewis et al., 2016). However, meta-analyses typically lack the granularity to estimate participant-level and item-level variation or to model behavior beyond coarse-grained effect size estimates. An alternative way to approach this challenge is to aggregate trial-level data from smaller studies measuring word recognition with a wide range of items and design choices into a large-scale dataset that can be analyzed using a unified modeling approach. A sufficiently large dataset would allow researchers to estimate developmental change in word recognition speed and accuracy while generalizing across changes related to specific words or the design features of particular studies.

A related open theoretical question is understanding changes in children’s word recognition at the level of individual items. Looking-while-listening studies have been limited in their ability to assess the development of children's processing of specific words. One limitation is that studies typically test only a small number of trials for each item, reducing power to precisely measure the development of word-specific accuracy (DeBolt, Rhemtulla, & Oakes, 2020). A second limitation is that target stimuli are often yoked with a narrow set of distractor stimuli (i.e., a child sees a target with only one or two distractor stimuli over the course of an experiment), leaving ambiguous whether accurate looking to a particular target word can be attributed to children’s recognition of the target word or their knowledge about the distractor. Aggregating across many looking-while-listening studies has the potential to meet these challenges by increasing the number of observations for specific items at different ages and by increasing the size of the inventory of distractor stimuli that co-occur with each target.

Replicability and reproducibility

A core challenge facing psychology in general, and the study of infant development in particular, is the threat to the replicability and reproducibility of core empirical results (Frank et al., 2017a; Nosek et al., 2022). In infant research, many studies are not adequately powered to detect the main effects of interest (Bergmann et al., 2018). This issue is compounded by low reliability in infant measures, often due to limits on the number of trials that can be collected from an individual infant in an experimental session (Byers-Heinlein, Bergmann, & Savalei, 2021). One hurdle to improving power in infant research is that it can be difficult to develop a priori estimates of effect sizes and of how specific design decisions (e.g., the number of test trials) will impact power and reliability. Large-scale databases of infant behavior can aid researchers in their decision-making by allowing them to directly test how different design decisions affect power and reliability. For example, if a researcher is interested in understanding how the number of test trials could impact the power and reliability of their looking-while-listening design, a large-scale infant eye-tracking database would allow them to simulate possible outcomes across a range of test trials, providing the basis for data-driven design decisions.
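As an illustration, the sketch below simulates how the split-half reliability of a per-child accuracy measure might change with the number of test trials. It is a minimal example rather than part of the Peekbank tooling, and it assumes a dataframe trial_acc (one row per participant and trial, with columns subject_id and accuracy) has already been derived from the database.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Split-half reliability of per-child accuracy when only n_trials trials
# are retained per participant (participants with fewer trials are dropped).
split_half_reliability <- function(trial_acc, n_trials) {
  trial_acc |>
    group_by(subject_id) |>
    filter(n() >= n_trials) |>
    slice_sample(n = n_trials) |>            # subsample n_trials per child
    mutate(half = rep_len(c(1, 2), n())) |>  # split each child's trials in two
    group_by(subject_id, half) |>
    summarise(acc = mean(accuracy), .groups = "drop") |>
    pivot_wider(names_from = half, values_from = acc, names_prefix = "half_") |>
    summarise(r = cor(half_1, half_2, use = "complete.obs")) |>
    pull(r)
}

# Compare hypothetical designs with 8, 16, or 24 test trials
map_dbl(c(8, 16, 24), \(k) split_half_reliability(trial_acc, k))
```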

In addition to threats to replicability, the field of infant development also faces concerns about analytic reproducibility—the ability for researchers to arrive at the same analytic conclusion reported in the original research article, given the same dataset. A recent estimate based on studies published in a prominent cognitive science journal suggests that analyses can remain difficult to reproduce, even when data are made available to other research teams (Hardwicke et al., 2018). Aggregating data in centralized databases can aid in improving reproducibility in several ways. First, building a large-scale database requires defining a standardized data specification. Recent examples include the brain imaging data structure (BIDS), an effort to specify a unified data format for neuroimaging experiments (Gorgolewski et al., 2016), and the data formats associated with ChildProject, for managing long-form at-home language recordings (Gautheron, Rochat, & Cristia, 2021). Defining a data standard—in this case, for infant eye-tracking experiments—supports reproducibility by guaranteeing that critical information will be available in openly-shared data and by making it easier for different research teams to understand the data structure. Second, open databases make it easy for researchers to generate open and reproducible analytic pipelines, both for individual studies and for analyses aggregating across datasets. Creating open analytic pipelines across many datasets also serves a pedagogical purpose, providing teaching examples illustrating how to implement analytic techniques used in influential studies and how to conduct reproducible analyses with infant eye-tracking data.

Peekbank: An open database of developmental eye-tracking studies

What all of these open challenges share is that they are difficult to address at the scale of a single research lab or in a single study. To address these challenges, we developed Peekbank, a flexible and reproducible interface to an open database of developmental eye-tracking studies. The Peekbank project (a) collects a large set of eye-tracking datasets on children’s word recognition, (b) introduces a data format and processing tools for standardizing eye-tracking data across heterogeneous data sources, and (c) provides an interface for accessing and analyzing the database. In the current paper, we introduce the key components of the project and give an overview of the existing database. We then provide two worked examples of how researchers can use Peekbank. In the first example, we examine a classic result in the word-recognition literature, and in the second, we aggregate data across studies to investigate developmental trends in the recognition of individual words.

Design and technical approach

Database framework

One of the main challenges in compiling a large-scale eye-tracking database is the lack of a shared data format: both labs and individual experiments can record their results in a wide range of formats. For example, different experiments encode trial-level and participant-level information in many different ways. Therefore, we have developed a common tabular format to support analyses of all studies simultaneously.

As illustrated in Fig. 1, the Peekbank framework consists of four main components: (1) a set of tools to convert eye-tracking datasets into a unified format, (2) a relational database populated with data in this unified format, (3) a set of tools to retrieve data from this database, and (4) a web app (using the Shiny framework) for visualizing the data. These components are supported by three packages. The peekds package (for the R language, R Core Team, 2021) helps researchers convert existing datasets to use the standardized format of the database. The peekbank module (Python) creates a database with the relational schema and populates it with the standardized datasets produced by peekds. The database is served through MySQL, an industry standard relational database server, which may be accessed by a variety of programming languages, and can be hosted on one machine and accessed by many others over the Internet. As is common in relational databases, records of similar types (e.g., participants, trials, experiments, coded looks at each timepoint) are grouped into tables, and records of various types are linked through numeric identifiers. The peekbankr package (R) provides an application programming interface, or API, that offers high-level abstractions for accessing the tabular data stored in Peekbank. Most users will access data through this final package, in which case the details of data formatting, processing, and the specifics of connecting to the database are abstracted away from the user.

Fig. 1 Overview of the Peekbank data ecosystem. Peekbank tools are highlighted in green. * indicates R packages introduced in this work

Database schema

The Peekbank database contains two major types of data: (1) metadata regarding experiments, participants, and trials, and (2) time course looking data, detailing where a child is looking on the screen at a given point in time (Fig. 2).

Fig. 2 The Peekbank schema. Each darker rectangle represents a table in the relational database. Arrows indicate linked records across tables. AOIs are areas of interest in an eye-tracking experiment, in this case information about the position of target and distractor stimuli on the screen

Metadata

Metadata can be separated into four parts: (1) participant-level information (e.g., demographics), (2) experiment-level information (e.g., the type of eye-tracker used to collect the data), (3) session information (e.g., a participant’s age for a specific experimental session), and (4) trial information (e.g., which images or videos were presented onscreen, and paired with which audio).

Participant information

All information about individual participants in Peekbank is completely de-identified under United States law, containing none of the key identifiers listed under the “Safe Harbor” standard for data de-identification. All participant-level linkages are made using anonymous participant identifiers.

Invariant information about individuals who participate in one or more studies (e.g., a participant’s first language) is recorded in the subjects table, while the administrations table contains information about each individual session in a given study (see Session information, below). This division allows Peekbank to gracefully handle longitudinal designs: a single participant can complete multiple sessions and thus be associated with multiple administrations.

The participant-level data include all participants for whom experimental data are available. In general, we include as many participants as possible in the database and leave it to end-users to apply the appropriate exclusion criteria for their analysis.

Experiment information

The datasets table includes information about the lab conducting the study and the relevant publications to cite regarding the data. In most cases, a dataset corresponds to a single study.

Information about the experimental design is split across the trial_types and stimuli tables. The trial_types table encodes information about each trial in the design of the experiment, including the target stimulus and location (left vs. right), the distractor stimulus and location, and the point of disambiguation for that trial. If a dataset used automatic eye-tracking rather than manual coding, each trial type is additionally linked to a set of area of interest (x, y) coordinates, encoded in the aoi_region_sets table. The trial_types table links trial types to the aoi_region_sets table and the trials table. Each trial type record links to two records in the stimuli table, identified by the distractor_id and the target_id fields.

Each record in the stimuli table is a (word, image) pair. In most experiments, there is a one-to-one mapping between images and labels (e.g., each time an image of a dog appears it is referred to as dog). For studies in which there are multiple potential labels per image (e.g., dog and chien are both used to refer to an image of a dog), images can have multiple rows in the stimuli table with unique labels. This structure is useful for studies on synonymy or using multiple languages. It is also possible for an image to be associated with a row with no label, if the image appears solely as a distractor (and thus its label is ambiguous). For studies in which the same label refers to multiple images (e.g., the word dog refers to an image of a dalmatian and a poodle), the same label can have multiple rows in the stimuli table with unique images.

Session information

The administrations table includes information about the participant or experiment that may change between sessions of the same study, even for the same participant. This includes the age of the participant, the coding method (automatic eye-tracking vs. manual coding), and the properties of the monitor that was used. For participant age, we include the fields lab_age and lab_age_units to record how the original lab encoded age, as well as an additional field, age, to encode age in a standardized format across datasets, using standardized months as the common unit of measurement (see the Peekbank codebook for details on how ages are converted into months).

Trial information

The trials table includes information about a specific participant completing a specific instance of a trial type. This table links each record in the time course looking data (described below) to the trial type and specifies the order of the trials seen by a specific participant.

Time course data

Raw looking data is a series of looks to areas of interest (AOIs), such as looks to the left or right of the screen, or to (x, y) coordinates on the experiment screen, linked to points in time. For data generated by eye-trackers, we typically have (x, y) coordinates at each time point, which we encode in the xy_timepoints table. These looks are also recoded into AOIs according to the AOI coordinates in the aoi_region_sets table using the add_aois() function in peekds, and encoded in the aoi_timepoints table. For manually coded data, we typically have a series of AOIs (i.e., looks to the left vs. right of the screen), but lack information about exact gaze positions on-screen; in these cases, the AOIs are recoded into the categories in the Peekbank schema (target, distractor, other, and missing) and encoded in the aoi_timepoints table; however, these datasets do not have any corresponding data in the xy_timepoints table.
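The sketch below illustrates the kind of recoding that add_aois() performs. It is not the package’s implementation, and the coordinate column names (target_x_min and so on) and the join key are simplified placeholders rather than the exact fields of the aoi_region_sets table.

```r
library(dplyr)

# Assign each (x, y) sample to an AOI category given the AOI boxes for its
# trial type; coordinate column names are illustrative placeholders.
assign_aois <- function(xy, aois) {
  xy |>
    left_join(aois, by = "aoi_region_set_id") |>
    mutate(aoi = case_when(
      is.na(x) | is.na(y) ~ "missing",
      x >= target_x_min & x <= target_x_max &
        y >= target_y_min & y <= target_y_max ~ "target",
      x >= distractor_x_min & x <= distractor_x_max &
        y >= distractor_y_min & y <= distractor_y_max ~ "distractor",
      TRUE ~ "other"
    ))
}
```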

Typically, timepoints in the xy_timepoints table and aoi_timepoints table need to be regularized to center each trial’s time around the point of disambiguation—such that 0 is the time of target word onset in the trial (i.e., the beginning of dog in Can you find the dog?). We re-centered timing information to the onset of the target label to facilitate comparison of target label processing across all datasets. If time values run throughout the experiment rather than resetting to zero at the beginning of each trial, rezero_times() is used to reset the time at each trial. After this, each trial’s times are centered around the point of disambiguation using normalize_times(). When these steps are complete, the time course is ready for resampling.
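In outline, these two alignment steps amount to the following; the real rezero_times() and normalize_times() operate on the full Peekbank tables, and the grouping variables and column names used here are simplified assumptions.

```r
library(dplyr)

# Reset each trial's clock to zero at trial start (mirrors rezero_times())
rezero <- function(timepoints) {
  timepoints |>
    group_by(administration_id, trial_id) |>
    mutate(t = t - min(t)) |>
    ungroup()
}

# Center times on the point of disambiguation, so that 0 = target word onset
# (mirrors normalize_times()); pod_lookup maps each trial_id to its
# point_of_disambiguation, obtained from the trials/trial_types tables
recenter <- function(timepoints, pod_lookup) {
  timepoints |>
    left_join(pod_lookup, by = "trial_id") |>
    mutate(t_norm = t - point_of_disambiguation)
}
```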

To facilitate time course analysis and visualization across datasets, time course data must be resampled to a uniform sampling rate (i.e., such that every trial in every dataset has observations at the same time points). All data in the database is resampled to 40 Hz (observations every 25 ms), which represents a compromise between retaining fine-grained timing information from datasets with dense sampling rates (maximum sampling rate among current datasets: 500 Hz) and minimizing the possibility of introducing artifacts via resampling for datasets with lower sampling rates (minimum sampling rate among current datasets: 30 Hz). Further, 25 ms is a mathematically convenient interval for ensuring consistent resampling; we found that using 33.333 ms (30 Hz) as the interval introduced a number of technical complexities. The resampling operation is accomplished using the resample_times() function. During the resampling process, we interpolate using constant interpolation, selecting for each interpolated timepoint the looking location of the most recent earlier observation in the original data, for both aoi_timepoints and xy_timepoints data. Compared to linear interpolation (see e.g., Wass, Smith, & Johnson, 2013), which fills segments of missing or unobserved time points by interpolating between the observed locations at the beginning and end of the interpolated segment, constant interpolation has the advantage of being more conservative, in the sense that it does not introduce new look locations beyond those measured in the original data. One possible application of our new dataset is investigating the consequences of other interpolation functions for data analysis.
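The snippet below shows the constant-interpolation idea on a single trial’s AOI data. It is a standalone illustration rather than the resample_times() implementation, and the column names (t_norm, aoi) follow the simplified naming used in the sketches above.

```r
library(dplyr)

# Resample one trial's AOI time course to a uniform rate (default 40 Hz),
# carrying the most recent observed AOI forward to each new timepoint
resample_constant <- function(trial_data, rate_hz = 40) {
  step <- 1000 / rate_hz                          # 25 ms between samples
  trial_data <- arrange(trial_data, t_norm)
  grid <- seq(min(trial_data$t_norm), max(trial_data$t_norm), by = step)
  idx <- findInterval(grid, trial_data$t_norm)    # last observation <= each grid point
  tibble(t_norm = grid, aoi = trial_data$aoi[idx])
}
```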

Processing, validation, and ingestion

Although Peekbank provides a common data format, the key hurdle to populating the database is converting existing datasets to this format. Each dataset is imported via a custom import script, which documents the process of conversion. Often various decisions must be made in this import process (for example, how to characterize a particular trial type within the options available in the Peekbank schema); these scripts provide a reproducible record of this decision-making process. Our data import repository (available on GitHub at https://github.com/langcog/peekbank-data-import) contains all of these scripts, links to internal documentation on data import, and a set of generic import templates for different formats.

Many of the specific operations involved in importing a dataset can be abstracted across datasets. The peekds package offers a library of these functions. Once the data have been extracted in a tabular form, the package also offers a validation function that checks whether all tables have the required fields and data types expected by the database. To double-check data quality and ensure that no errors were introduced by the import script, we also typically perform a visual check of the import process, creating a time course plot to replicate the results in the paper that first presented each dataset. Once this plot has been created and checked for consistency and all tables pass our validation functions, the processed dataset is ready for ingestion into the database using the peekbank library. This library applies additional data checks and adds the data to the MySQL database using the Django web framework.
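The snippet below is a toy illustration of the kind of check performed at the validation stage; the actual validator in peekds and its full field list are more extensive, and the field names shown here are only examples.

```r
# Example required fields per table (illustrative, not the full specification)
required_fields <- list(
  administrations = c("administration_id", "subject_id", "age"),
  aoi_timepoints  = c("administration_id", "trial_id", "t_norm", "aoi")
)

# Stop with an informative error if a processed table is missing any field
validate_table <- function(df, table_name) {
  missing <- setdiff(required_fields[[table_name]], names(df))
  if (length(missing) > 0) {
    stop("Table '", table_name, "' is missing fields: ",
         paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}
```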

To date, the import process has been carried out by the Peekbank team using data offered by other research teams. Data contributors are also welcome to provide import scripts to facilitate contribution. However, creating these scripts requires familiarity with both R scripting and the specific Peekbank schema, and writing an import script can be somewhat time-consuming in practice. To support future data contributions, import script templates and examples are available for both manually coded datasets and automatic eye-tracking datasets for research teams to adapt to their data. These import templates walk researchers through each step of data processing using example datasets from Peekbank and include explanations of key decision points, examples of how to use various helper functions available in peekds, and further details about the database schema.

Current data sources

The database currently includes 20 looking-while-listening datasets comprising N = 1594 total participants (Table 1). The current data represent a convenience sample of datasets that were (a) collected by or available to Peekbank team members, (b) made available to Peekbank after informal inquiry, or (c) openly available. Most datasets (14 out of 20 total) consist of data from monolingual native English speakers. They span a wide age spectrum, with participants ranging from 9 to 70 months of age, and are balanced in terms of children’s assigned sex (47.30% female; 50.40% male; 2.30% unreported). The datasets vary across a number of design-related dimensions, and include studies using manually coded video recordings and automated eye-tracking methods (e.g., Tobii, EyeLink) to measure gaze behavior. All studies tested familiar items, but the database also includes 5 datasets that tested novel pseudo-words in addition to familiar words. Users interested in a subset of the data (e.g., only trials testing familiar words) can filter out unwanted trials using columns available in the schema (e.g., using the column stimulus_novelty in the stimuli table).

Table 1 Overview of the datasets in the current database

Versioning and reproducibility

The content of Peekbank will change as we add additional datasets and revise previous ones. To facilitate reproducibility of analyses, we use a versioning system by which successive releases are assigned a name reflecting the year and version, e.g., 2022.1. By default, users will interact with the most recent version of the database available, though the peekbankr API allows researchers to run analyses against any previous version of the database. For users with intensive use-cases, each version of the database may be downloaded as a compressed .sql file and installed on a local MySQL server.

Peekbank allows for fully reproducible analyses using our source data, but the goal is not to reproduce precisely the analyses—or even the datasets—in the publications whose data we archive. Because of our emphasis on a standardized data importing and formatting pipeline, there may be minor discrepancies in the time course data that we archive compared with those reported in original publications. Further, we archive all of the data that are provided to us—including participants that might have been excluded in the original studies, if these data are available—rather than attempting to reproduce specific exclusion criteria. We hope that Peekbank can be used as a basis for comparing different exclusion and filtering criteria—as such, an inclusive policy regarding importing all available data helps us provide a broad base of data for investigating these decisions.

Interfacing with Peekbank

Peekbankr

The peekbankr API offers a way for users to access data from the database and flexibly analyze it in R. The majority of API calls simply allow users to download tables (or subsets of tables) from the database. In particular, the package offers the following functions:

  • connect_to_peekbank() opens a connection with the Peekbank database to allow tables to be downloaded with the following functions

  • get_datasets() gives each dataset name and its citation information

  • get_subjects() gives information about persistent participant identifiers (e.g., native languages, sex)

  • get_administrations() gives information about specific experimental administrations (e.g., participant age, monitor size, gaze coding method)

  • get_stimuli() gives information about word–image pairings that appeared in experiments

  • get_trial_types() gives information about pairings of stimuli that appeared in the experiment (e.g., point of disambiguation, target and distractor stimuli, condition, language)

  • get_trials() gives the trial orderings for each administration, linking trial types to the trial IDs used in time course data

  • get_aoi_region_sets() gives coordinate regions for each area of interest (AOI) linked to trial type IDs

  • get_xy_timepoints() gives time course data for each participant’s looking behavior in each trial, as (x, y) coordinates on the experiment monitor

  • get_aoi_timepoints() gives time course data for each participant’s looking behavior in each trial, coded into areas of interest

Once users have downloaded tables, they can be merged using join commands via their linked IDs. A set of standard merges are shown below in the “Peekbank in action” section; these allow the common use-case of examining time course data and metadata jointly.

Because of the size of the XY and AOI data tables, downloading data across multiple studies can be time-consuming. Many of the most common analyses of the Peekbank data require downloading the aoi_timepoints table, so we have put substantial work into optimizing transfer times. In particular, connect_to_peekbank offers a data compression option, and get_aoi_timepoints by default downloads time courses via a compressed (run-length encoded) representation, which is then uncompressed on the client side. More information about these options (including how to modify them) can be found in the package documentation.
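A minimal access example is sketched below: it opens a default connection, downloads two metadata tables, and summarizes them. The accessor names are those documented above; the connection argument name and the specific columns referenced (subject_id, age) are assumptions based on the schema description and should be checked against the package documentation.

```r
library(peekbankr)
library(dplyr)

con <- connect_to_peekbank()                   # default connection settings
datasets <- get_datasets(connection = con)     # argument name assumed; see docs
administrations <- get_administrations(connection = con)

nrow(datasets)                                 # how many datasets are available
n_distinct(administrations$subject_id)         # how many unique children
summary(administrations$age)                   # age distribution (in months)
```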

Shiny app

One goal of the Peekbank project is to allow a wide range of users to easily explore and learn from the database. We therefore have created an interactive web application—peekbank-shiny—that allows users to quickly and easily create informative visualizations of individual datasets and aggregated data (https://peekbank-shiny.com/). peekbank-shiny is built using Shiny, a software package for creating web apps for data exploration with R, as well as the peekbankr package. All code for the Shiny app is publicly available (https://github.com/langcog/peekbank-shiny). The Shiny app allows users to create commonly used visualizations of looking-while-listening data, based on data from the Peekbank database. Specifically, users can visualize:

  1. the time course of looking data in a profile plot depicting infant target looking across trial time

  2. overall accuracy, defined as the proportion target looking within a specified analysis window

  3. reaction times in response to a target label, defined as how quickly participants shift fixation to the target image on trials in which they were fixating on the distractor image at the onset of the target label

  4. an onset-contingent plot, which shows the time course of participant looking as a function of their look location at the onset of the target label

Users are given various customization options for each of these visualizations, e.g., choosing which datasets to include in the plots, controlling the age range of participants, splitting the visualizations by age bins, and controlling the analysis window for time course analyses. Plots are then updated in real time to reflect users’ customization choices. A screenshot of the app is shown in Fig. 3. The Shiny app thus allows users to quickly inspect basic properties of Peekbank’s datasets and create reproducible visualizations without incurring any of the technical overhead required to access the database through R.

Fig. 3 Screenshot of the Peekbank Shiny app, which shows a variety of standard analysis plots as a function of user-selected datasets, words, age ranges, and analysis windows. Shown here are mean reaction time and proportion target looking over time by age group for two selected datasets

OSF site

In addition to the Peekbank database proper, all data is openly available on the Peekbank OSF webpage (https://osf.io/pr6wu/). The OSF site also includes the original raw data (both time series data and metadata, such as trial lists and participant logs) that was obtained for each study and subsequently processed into the standardized Peekbank format. Where available, the OSF page also includes additional information about the stimuli used in each dataset, including in some instances the original stimulus sets (e.g., image and audio files).

Peekbank in action

In the following section, we provide examples of how users can access and analyze the data in Peekbank. First, we provide an overview of some general properties of the datasets in the database. We then demonstrate two potential use-cases for Peekbank data. In each case, we provide sample code to demonstrate the ease of doing simple analyses using the database. Our first example shows how we can investigate the findings of a classic study. This type of investigation can be a very useful exercise for teaching students about best practices for data analysis (e.g., Hardwicke et al., 2018) and also provides an easy way to explore looking-while-listening time course data in a standardized format. Our second example shows an exploration of developmental changes in the recognition of particular words. Besides its theoretical interest (which we will explore more fully in subsequent work), this type of analysis could in principle be used for optimizing the stimuli for new experiments, especially as the Peekbank dataset grows and gains coverage over a greater number of items. All analyses are conducted using R [Version 4.1.1; R Core Team (2021)].

General descriptives

One of the values of the uniform data format we use in Peekbank is the ease of providing cross-dataset descriptions that can give an overview of some of the general patterns found in our data. A first broad question is about the degree of accuracy in word recognition found across studies. In general, participants demonstrated robust, above-chance word recognition in each dataset (chance = 0.5 due to the two-alternative forced-choice design of looking-while-listening trials). Table 2 shows the average proportion of target looking within a standard critical window of 367–2000 ms after the onset of the label for each dataset (Swingley and Aslin, 2002). Proportion target looking was generally higher for familiar words (M = 0.66, 95% CI = [0.65, 0.67], n = 1543) than for novel words learned during the experiment (M = 0.59, 95% CI = [0.58, 0.61], n = 822).

Table 2 Average proportion target looking in each dataset
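For readers who want to reproduce this kind of summary, the sketch below computes window accuracy from a joined dataframe of AOI timepoints plus trial and dataset metadata; column names such as dataset_name and t_norm follow the simplified naming used elsewhere in this paper and may differ slightly from the exact database fields.

```r
library(dplyr)

# Proportion target looking in the 367-2000 ms window, computed per trial and
# then averaged within each dataset (aoi_data is the joined dataframe)
window_acc <- aoi_data |>
  filter(t_norm >= 367, t_norm <= 2000,
         aoi %in% c("target", "distractor")) |>
  group_by(dataset_name, administration_id, trial_id) |>
  summarise(prop_target = mean(aoi == "target"), .groups = "drop") |>
  group_by(dataset_name) |>
  summarise(mean_accuracy = mean(prop_target),
            ci_95 = 1.96 * sd(prop_target) / sqrt(n()),
            .groups = "drop")
```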

A second question of interest is about the variability across items (i.e., target labels) within specific studies. Some studies use a smaller set of items (e.g., eight nouns, Adams et al., 2018) while others use dozens of different items (e.g., Garrison, Baudet, Breitfeld, Aberman, & Bergelson, 2020). Figure 4 gives an overview of the variability in proportion looking to the target item for individual words in each dataset. Although all datasets show a gradual rise in average proportion target looking over chance performance, the number of unique target labels and their associated accuracy vary widely across datasets.

Fig. 4 Item-level variability in proportion target looking within each dataset (chance = 0.5). Time is centered on the onset of the target label (vertical line). Colored lines represent specific target labels. Black lines represent smoothed average fits based on a generalized additive model using cubic splines

Investigating prior findings: Swingley and Aslin (2002)

Swingley and Aslin (2002) investigated the specificity of 14–16-month-olds’ word representations using the looking-while-listening paradigm, asking whether recognition would be slower and less accurate for mispronunciations, e.g., opal (mispronunciation) instead of apple (correct pronunciation). In this short vignette, we show how easily the data in Peekbank can be used to visualize this result. Our goal here is not to provide a precise analytical reproduction of the analyses reported in the original paper, but rather to demonstrate the use of the Peekbank framework to analyze datasets of this type. In particular, because Peekbank uses a uniform data import standard, it is likely that there will be minor numerical discrepancies between analyses on Peekbank data and analyses that use another processing pipeline.

We begin by retrieving the relevant tables from the database, aoi_timepoints, administrations, trial_types, and trials. As discussed above, each of these can be downloaded using a simple API call through peekbankr, which returns dataframes that include ID fields. These ID fields allow for easy joining of the data into a single dataframe containing all of the information necessary for the analysis.
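A sketch of this retrieval-and-join step is shown below. The accessor functions are those documented above (with get_datasets() also pulled in to obtain a dataset name for filtering); the joins rely on the shared ID columns, and the dataset identifier used in the filter is a placeholder whose exact value should be checked against get_datasets().

```r
library(peekbankr)
library(dplyr)

aoi_timepoints  <- get_aoi_timepoints()
administrations <- get_administrations()
trials          <- get_trials()
trial_types     <- get_trial_types()
datasets        <- get_datasets()

# join on shared ID columns and keep only the Swingley & Aslin (2002) data
sa_data <- aoi_timepoints |>
  left_join(administrations) |>
  left_join(trials) |>
  left_join(trial_types) |>
  left_join(datasets) |>
  filter(dataset_name == "swingley_aslin_2002")  # placeholder identifier
```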

As the code above shows, once the data are joined, condition information for each timepoint is present and so we can easily filter out filler trials and set up the conditions for further analysis.
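One way to do this filtering is sketched below; the condition labels are placeholders, and the actual values stored for this dataset can be inspected with distinct(sa_data, condition).

```r
# Drop filler trials and label the two pronunciation conditions
# (condition values shown here are illustrative placeholders)
sa_conditions <- sa_data |>
  filter(condition != "filler") |>
  mutate(pronunciation = if_else(condition == "mispronunciation",
                                 "mispronounced", "correct"))
```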

The final step in our analysis is to create a summary dataframe using dplyr commands. We first group the data by timestep, participant, and condition and compute the proportion looking at the correct image. We then summarize again, averaging across participants, computing both means and 95% confidence intervals (via the approximation of 1.96 times the standard error of the mean). The resulting dataframe can be used for visualization of the time course of looking.
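The sketch below follows those two summarizing steps and then draws the profile plot; it is meant to convey the shape of the analysis rather than to reproduce the published figure exactly.

```r
library(ggplot2)

sa_summary <- sa_conditions |>
  filter(aoi %in% c("target", "distractor")) |>
  # step 1: proportion target looking per timestep, participant, and condition
  group_by(t_norm, administration_id, pronunciation) |>
  summarise(prop_target = mean(aoi == "target"), .groups = "drop") |>
  # step 2: average across participants, with 1.96 * SEM confidence bounds
  group_by(t_norm, pronunciation) |>
  summarise(mean_prop = mean(prop_target),
            ci = 1.96 * sd(prop_target) / sqrt(n()),
            .groups = "drop")

ggplot(sa_summary, aes(x = t_norm, y = mean_prop, colour = pronunciation)) +
  geom_pointrange(aes(ymin = mean_prop - ci, ymax = mean_prop + ci)) +
  geom_vline(xintercept = 0, linetype = "dotted") +    # point of disambiguation
  geom_hline(yintercept = 0.5, linetype = "dashed") +  # chance performance
  labs(x = "Time from point of disambiguation (ms)",
       y = "Proportion looking to target")
```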

Figure 5 shows the average time course of looking for the two conditions, as produced by the code above. Looking to the target after the correctly pronounced noun was both faster (deviating from chance earlier) and more accurate (reaching a higher asymptote) than after the mispronunciation. Overall, this example demonstrates the ability to produce this visualization in just a few lines of code.

Fig. 5 Proportion looking at the correct referent by time from the point of disambiguation (the onset of the target noun) based on data from Swingley and Aslin (2002) imported into the Peekbank database. Colors show the two pronunciation conditions; points give means and ranges show 95% confidence intervals. The dotted line shows the point of disambiguation and the dashed line shows chance performance

Item analyses

A second use-case for Peekbank is to examine item-level variation in word recognition. Individual datasets rarely have enough statistical power to show reliable developmental differences within items. To illustrate the power of aggregating data across multiple datasets, we select the four words with the most data available across studies and ages (apple, book, dog, and frog) and show average recognition trajectories.

Our first step is to collect and join the data from the relevant tables including timepoint data, trial and stimulus data, and administration data (for participant ages). We join these into a single dataframe for easy manipulation; this dataframe is a common starting point for analyses of item-level data.
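As in the previous example, the join can be sketched as follows; the mapping of target_id onto the stimuli table’s ID column is written out explicitly here and should be treated as an assumption about the exact field name.

```r
library(peekbankr)
library(dplyr)

aoi_timepoints  <- get_aoi_timepoints()
administrations <- get_administrations()
trials          <- get_trials()
trial_types     <- get_trial_types()
stimuli         <- get_stimuli()

item_data <- aoi_timepoints |>
  left_join(administrations) |>
  left_join(trials) |>
  left_join(trial_types) |>
  # attach the target stimulus for each trial type (field name assumed)
  left_join(stimuli, by = c("target_id" = "stimulus_id"))
```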

Next, we select a set of four target words (chosen based on having more than 100 children contributing data for each word across several 1-year age groups). We create age groups, aggregate, and compute timepoint-by-timepoint confidence intervals using the z approximation.
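A sketch of this step is shown below. The label column (here english_stimulus_label) and the exact bin boundaries are illustrative choices rather than the precise settings used for Fig. 6.

```r
target_words <- c("apple", "book", "dog", "frog")

item_summary <- item_data |>
  filter(english_stimulus_label %in% target_words,
         aoi %in% c("target", "distractor")) |>
  # coarse age bins (boundaries in months; illustrative)
  mutate(age_group = cut(age, breaks = c(12, 24, 36, 48),
                         labels = c("12-24 mo", "24-36 mo", "36-48 mo"))) |>
  filter(!is.na(age_group)) |>
  # per-participant proportion target looking at each timepoint
  group_by(english_stimulus_label, age_group, t_norm, administration_id) |>
  summarise(prop_target = mean(aoi == "target"), .groups = "drop") |>
  # average across participants, with 1.96 * SEM confidence bounds
  group_by(english_stimulus_label, age_group, t_norm) |>
  summarise(mean_prop = mean(prop_target),
            ci = 1.96 * sd(prop_target) / sqrt(n()),
            .groups = "drop")
```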

Finally, we plot the data as time courses split by age. Our plotting code is shown below (with styling commands removed for clarity). Figure 6 shows the resulting plot, with time courses for each of three (rather coarse) age bins. Although some baseline effects are visible across items, we still see clear and consistent increases in looking to the target, with the increase appearing earlier and in many cases asymptoting at a higher level for older children.
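The plotting step can be written as follows (styling removed, as noted above); the facet and ribbon choices are one reasonable way to render the summary rather than the exact code behind Fig. 6.

```r
library(ggplot2)

ggplot(item_summary,
       aes(x = t_norm, y = mean_prop, colour = age_group, fill = age_group)) +
  geom_line() +
  geom_ribbon(aes(ymin = mean_prop - ci, ymax = mean_prop + ci),
              alpha = 0.2, colour = NA) +
  geom_vline(xintercept = 0, linetype = "dashed") +    # target word onset
  geom_hline(yintercept = 0.5, linetype = "dashed") +  # chance
  facet_wrap(~ english_stimulus_label) +
  labs(x = "Time from target word onset (ms)",
       y = "Proportion target looking")
```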

Fig. 6 Time course plot for four well-represented target items in the Peekbank dataset, split by three age groups. Each line represents children’s average looking to the target image after the onset of the target label (dashed vertical line). Error bars represent 95% CIs

This simple averaging approach is a proof-of-concept to demonstrate some of the potential of the Peekbank dataset. An eye-movement trajectory on an individual trial reflects myriad factors, including the age and ability of the child, the target and distractor stimuli on that trial, the position of the trial within the experiment, and the general parameters of the experiment (for example, stimulus timing, eye-tracker type and calibration, etc.). Although we often neglect these statistically in the analysis of individual experiments—for example, averaging across items and trial orders—they may lead to imprecision when we average across multiple studies in Peekbank. For example, studies with older children may use more difficult items or faster trial timing, leading to the impression that children’s abilities overall increase more slowly than they in fact do. Even in our example in Fig. 6, we see hints of this confounding—for example, the low baseline looks to apple may be an artifact of an attractive distractor being paired with this item in one or two studies. In future work, we hope to introduce model-based analytic methods that use mixed effects regression to factor out study-level and individual-level variance in order to recover developmental effects more appropriately (see e.g., Zettersten et al., (2021) for a prototype of such an analysis).

Discussion

Theoretical progress in understanding child development requires rich datasets, but collecting child data is expensive, difficult, and time-intensive. Recent years have seen a growing effort to build open source tools and pool research efforts to meet the challenge of building a cumulative developmental science (Bergmann et al., 2018; Frank, Braginsky, Yurovsky, & Marchman, 2017b; Sanchez et al., 2019; The ManyBabies Consortium 2020). The Peekbank project expands on these efforts by building an infrastructure for aggregating eye-tracking data across studies, with a specific focus on the looking-while-listening paradigm. This paper presents an overview of the structure of the database, shows how users can access the database, and demonstrates how it can be used both to investigate prior experiments and to synthesize data across studies.

The current database has a number of limitations, particularly in the number and diversity of datasets it contains. With 20 datasets currently available in the database, idiosyncrasies of particular designs and condition manipulations still have a substantial influence on the results of particular analyses, as discussed above in our item analysis example. Expanding the set of distinct datasets will allow us to increase the number of datasets that contain specific items, leading to more robust generalizations across the many sources of variation that are confounded within studies (e.g., item set, participant age range, and specific experimental parameters). A critical next step will be the development of analytic models that take this structure into account in making generalizations across datasets.

A second limitation stems from the fact that the database represents a convenience sample of data readily available to the Peekbank team, which leads the database to be relatively homogeneous in a number of key respects. First, the datasets primarily come from labs that share similar theoretical perspectives and implement the looking-while-listening method in similar ways. The current database is also limited by the relatively homogeneous background of its participants, both with respect to language (almost entirely monolingual native English speakers) and cultural background (Henrich, Heine, & Norenzayan, 2010; Muthukrishna et al., 2020). Increasing the diversity of lab sources, participant backgrounds, and languages will expand the scope of the generalizations we can form about child word recognition, while also providing new opportunities for describing cross-lab, cross-cultural, and cross-linguistic variation.

Towards the goal of expanding our database, we invite researchers to contribute their data. On the Peekbank website, we provide technical documentation for potential contributors. Although we anticipate being involved in most new data imports, as discussed above, our import process is transparently documented and the repository contains examples for most commonly used eye-trackers. Contributing data to an open repository can also raise questions about participant privacy. Potential contributors should consult with their local institutional review boards for guidance on any challenges, but we do not foresee obstacles because of the de-identified nature of the data. Under United States regulation, all data contributed to Peekbank are considered de-identified and therefore do not constitute “human subjects data”; hence, institutional review boards should not regulate this contribution process. Under the European Union’s General Data Protection Regulation (GDPR), labs may need to take special care to provide a separate set of participant identifiers that can never be re-linked to their own internal records.

While the current database is focused on studies of word recognition, the tools and infrastructure developed in the project can in principle be used to accommodate any eye-tracking paradigm, opening up new avenues for insights into cognitive development. Gaze behavior has been at the core of many key advances in our understanding of infant cognition (Aslin 2007; Baillargeon, Spelke, & Wasserman, 1985; Bergelson and Swingley 2012; Fantz 1963; Liu, Ullman, Tenenbaum, & Spelke, 2017; Quinn, Eimas, & Rosenkrantz, 1993). Aggregating large datasets of infant looking behavior in a single, openly accessible format promises to bring a fuller picture of infant cognitive development into view.