A Developmental Surveillance Score for Quantitative Monitoring of Early Childhood Milestone Attainment: Algorithm Development and Validation

Background: Developmental surveillance, conducted routinely worldwide, is fundamental for timely identification of children at risk of developmental delays. It is typically executed by assessing age-appropriate milestone attainment and applying clinical judgment during health supervision visits. Unlike developmental screening and evaluation tools, surveillance typically lacks standardized quantitative measures, and consequently, its interpretation is often qualitative and subjective. Objective: Herein, we suggested a novel method for aggregating developmental surveillance assessments into a single score that coherently depicts and monitors child development. We described the procedure for calculating the score and demonstrated its ability to effectively capture known population-level associations. Additionally, we showed that the score can be used to describe longitudinal patterns of development that may facilitate tracking and classifying developmental trajectories of children. Methods: We described the Developmental Surveillance Score (DSS), a simple-to-use tool that quantifies the age-dependent severity level of a failure at attaining developmental milestones based on the recently introduced Israeli developmental surveillance program. We evaluated the DSS using a nationwide cohort of >1 million Israeli children from birth to 36 months of age, assessed between July 1,


Introduction
With growing awareness of the high prevalence of developmental, behavioral, or social delay among young children, and of the importance of early intervention to mitigate this risk [1-5], many international organizations have recommended routine developmental surveillance for all children [2,6,7]. This process is typically conducted by evaluating the children's ability to attain a battery of age-appropriate milestones at routine clinic visits during the first few years of their life [2]. Interpreting the results of such evaluations is not straightforward. For a specific milestone, one can establish the population's age-dependent norms of attaining the milestone and use them to assess the level of concern in case a child fails to attain it, similar to the way physical growth measures are monitored [8-10]. However, unlike physical growth norms, which are continuous and whose trajectories over time are readily understood, success or failure at attaining a developmental milestone is a binary measure, and it is not obvious how to integrate the results of multiple different milestones across several developmental domains to quantitatively monitor and assess a child's development over time.
The assessment of child development can be done at varying levels of detail using 3 different types of tools: surveillance (or monitoring), screening, and evaluation. Developmental surveillance is based on milestone attainment checklists and is used worldwide by pediatricians and health care providers at routine encounters, as well as by educators and parents. Screening requires a more formal and elaborate assessment, typically done by caregivers or health care professionals at specific ages. Finally, developmental evaluation is an in-depth examination, typically done by a trained specialist, which aims to provide a formal diagnosis of the child. Importantly, surveillance is based on developmental norms, whereas screening tools are validated against a "gold standard" obtained from evaluation.
A commonly used screening tool is the Denver II Screening Tool [8,11], where the outcome is either "normal" or "suspicious," based on how many milestones were failed and the general rate of failure for them. A common alternative is the Ages & Stages Questionnaires (ASQ-3) [12] screening tool, where caregivers select 1 of 3 answers for an array of questions, and the total score identifies the child's development as being "on schedule," requiring "learning activities and monitor," or needing "further assessment." Both of these screening tools take about 20 minutes to administer, depending on the age of the child and the experience of the person administering them. A widely used developmental evaluation tool is the Bayley Scales of Infant and Toddler Development [13], which typically takes 30-70 minutes to complete and yields a numerical score for each developmental category, as well as an estimate of a child's developmental age, that is, the age at which neurotypical children exhibit a similar level of milestone attainment.
Previous work [14] has attempted to combine and standardize the results of 12 commonly used screening and evaluation tools into a single metric. However, doing so for surveillance tools is more challenging. There is a lack of standardization at this level of assessment, and the quantification of developmental surveillance assessments has not been previously suggested. At best, surveillance tools are calibrated using real-world data to determine the rate of milestone attainment at different ages [9] and then administered accordingly.
In this work, we suggested a relatively simple new methodology for translating a milestone-based developmental surveillance scale into a single score, denoted as the Developmental Surveillance Score (DSS), that conveys a child's developmental status during a specified time period. Based on data from a national developmental surveillance program in Israel, we demonstrated that this score consistently captures known associations between the development and characteristics of the mother and child. Moreover, the score can be used to reveal and explore new associations, which may further improve our understanding of the factors that impact developmental delay. Finally, the score can be used to track individual children longitudinally, by describing the trajectory of their development over time. We showed that by clustering these trajectories, we can identify several typical patterns of development.
The focus of this work was on defining a straightforward surveillance score (in the sense that computing it as part of the surveillance workflow adds essentially no overhead over the current practice) and establishing its coherence and potential usefulness. Further work is required to refine this score, validate it using various data sets internationally, and derive from it explicit protocols.

Developmental Surveillance in Israel
Developmental surveillance (from birth to 6 years of age) in Israel is performed routinely (and free of charge) according to national standards by trained public health nurses in approximately 1000 maternal child health clinics (MCHCs). The collected data of approximately 70% of the Israeli population of this age group are documented in a single common database managed by the Israeli Ministry of Health. The developmental assessments include 59 milestones across 4 domains: personal-social, language, fine motor, and gross motor [9].
Parents are instructed to visit the MCHC after hospital discharge and then at ages of 1, 2, 4, 6, 9, 12, 18, 24, 36, 48, and 60 months. At each visit, a predefined group of age-related milestones is evaluated, according to the expected development at that age (denoted "age step"). Children may also be evaluated on milestones of a previous age step, in cases of a missed visit or a failure to attain milestones at the preceding visit.
The child's ability to attain each milestone is reported as observed in the clinic, although in cases of difficult attainments, it may be documented according to a parent's report. If the evaluated milestone was attained by neither observation nor parental report, it is documented as unattained.

Study Cohort
This study included all children born between July 1, 2014, and September 1, 2021, who were followed at the MCHCs and had at least one developmental evaluation recorded during the study period. In most of the analyses, we excluded children born preterm (gestational age of <37 weeks); the one exception is the analysis of gestational age. Additionally, children with missing gestational age were excluded, as well as visits without developmental data or without the child's age. The final cohort included 1,130,005 children in total, with 1,052,905 of them born on-term.

DSS Definition
Sudry et al [9] have recently introduced the Tipat Halav Israel Surveillance (THIS) developmental scale, a data-driven developmental scale comprising curves of attainment rate by age for each of the 59 milestones evaluated in the Israeli developmental surveillance program (the scale can be downloaded from [15]). Broadly, when a child fails to attain a milestone, the THIS developmental scale categorizes the severity of this failure into 1 of 4 categories, depending on how often children of the same age fail to attain this milestone. Accordingly, in this study, we defined the Discrete Milestone Attainment Score (DMAS) for a failed milestone as the numerical order of the failure severity: a score of 1, 2, 3, or 4 is assigned for a failure occurring when <75%, 75% to 90%, 90% to 95%, or >95% of the children at the same age attain this milestone, respectively. For an attained milestone, the DMAS value is 0. If a milestone is attempted multiple times, it is scored separately each time it is attempted. The total score for a set of milestones is the average DMAS over all milestones of all developmental domains.
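More formally, for each milestone, the age thresholds for attainment by 75%, 90%, and 95% of the children were calculated [9]; we denoted these age thresholds for milestone t by $t_{75}$, $t_{90}$, and $t_{95}$, respectively, and considered the 4 consecutive age brackets they define:
\[ b_1 = [t_0, t_{75}), \quad b_2 = [t_{75}, t_{90}), \quad b_3 = [t_{90}, t_{95}), \quad b_4 = [t_{95}, t_{100}] \]
where $t_0$ and $t_{100}$ are the minimal and maximal ages at which the milestone t is assessed, respectively.
For a milestone t evaluated at age a, we defined i such that a is in the bracket $b_i$ (i indicates the severity of failure):
\[ \mathrm{DMAS}(t, a) = \begin{cases} 0 & \text{if milestone } t \text{ was attained} \\ i & \text{otherwise} \end{cases} \]
To avoid discontinuities at the bracket boundaries, we extended the above definition into a Linearized Milestone Attainment Score (LMAS), in which the score of a failed milestone varies linearly with the age a within the bracket $b_i$, between $a_{\min}$ and $a_{\max}$, the low and high ends of $b_i$, respectively.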
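The bracket lookup and the DMAS can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the half-open bracket boundaries are an assumption, and `lmas_sketch` shows only one possible linearization (severity index minus 1, plus the fractional position of the age inside its bracket), which is assumed here for illustration and may differ from the LMAS used in the study.

```python
from bisect import bisect_right


def bracket_index(age, t75, t90, t95):
    """Return i in 1..4 such that `age` lies in bracket b_i.

    Assumed convention: ages below t75 fall in b_1, [t75, t90) in b_2,
    [t90, t95) in b_3, and ages at or above t95 in b_4.
    """
    return bisect_right([t75, t90, t95], age) + 1


def dmas(attained, age, t75, t90, t95):
    """Discrete Milestone Attainment Score: 0 if attained, else severity 1..4."""
    return 0 if attained else bracket_index(age, t75, t90, t95)


def lmas_sketch(attained, age, t0, t75, t90, t95, t100):
    """ASSUMED linearization (illustration only): i - 1 + position in bracket."""
    if attained:
        return 0.0
    edges = [t0, t75, t90, t95, t100]
    i = bracket_index(age, t75, t90, t95)
    a_min, a_max = edges[i - 1], edges[i]
    # Fractional position of the age inside b_i, clamped to [0, 1].
    frac = min(max((age - a_min) / (a_max - a_min), 0.0), 1.0)
    return i - 1 + frac
```

For example, with hypothetical thresholds of 6, 9, and 12 months for t75, t90, and t95, a failure at 10 months falls in the third bracket and receives a DMAS of 3, whereas an attained milestone scores 0 at any age.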
The definitions of DMAS and LMAS are graphically illustrated in Figure 1. In the remainder of this paper, we used the LMAS version of the score, unless otherwise noted. In practice, deciding which of the 2 to use depends on the use case. DMAS is straightforward to compute from the THIS scale, whereas LMAS offers finer resolution. For a set of milestones T, we defined the developmental surveillance score DSS(T) as the average of the individual milestone attainment scores:
\[ \mathrm{DSS}(T) = \frac{1}{|T|} \sum_{t \in T} \mathrm{LMAS}(t, a_t) \]
where $a_t$ is the age at which milestone t was assessed. See Multimedia Appendix 1 for a concrete example of computing the score.
The set of milestones used for calculating the DSS can be defined by the evaluation period and by the types of developmental domains. For example, when computing the fine-motor score for a child during the first year of life, we computed the score for each fine-motor milestone attempted by the child during this period and then the average of the scores. In particular, if a milestone was attempted multiple times during this period, all attempts were used for the calculation of the score. Determining the evaluation period is a delicate point, which depends on the DSS application. Herein, we considered a broad period of 1 year in the subpopulation analysis and visits during each MCHC-determined age bracket (typically, a single visit) when analyzing developmental trajectories.
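The selection and averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (keys `domain`, `age_months`, and `score`), the milestone labels `m1` to `m4`, and the function name `dss` are all hypothetical, and each `score` is taken to be an already computed DMAS or LMAS value. Repeated attempts at the same milestone simply appear as separate records, so all of them enter the average.

```python
from statistics import mean


def dss(assessments, domains=None, min_age=0.0, max_age=float("inf")):
    """Average per-milestone score over the selected domains and age window.

    Returns None when no assessment falls inside the selection.
    """
    selected = [
        a["score"]
        for a in assessments
        if (domains is None or a["domain"] in domains)
        and min_age <= a["age_months"] < max_age
    ]
    return mean(selected) if selected else None


# Hypothetical records for one child; m1 was attempted twice.
records = [
    {"milestone": "m1", "domain": "fine motor", "age_months": 4, "score": 2},
    {"milestone": "m1", "domain": "fine motor", "age_months": 6, "score": 0},
    {"milestone": "m2", "domain": "fine motor", "age_months": 9, "score": 1},
    {"milestone": "m3", "domain": "gross motor", "age_months": 9, "score": 3},
    {"milestone": "m4", "domain": "fine motor", "age_months": 14, "score": 4},
]

first_year_fine_motor = dss(records, {"fine motor"}, 0, 12)  # (2 + 0 + 1) / 3
first_year_all = dss(records, None, 0, 12)                   # (2 + 0 + 1 + 3) / 4
```

Narrowing `min_age` and `max_age` to a single age step gives the per-visit score used for trajectories, whereas a 12-month window gives the yearly score used in the subpopulation analysis.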
In this study, we aggregated personal-social milestones with language milestones, denoting them as "language-social" milestones. This was motivated by the relatively small number of milestones in the social domain and the interdependence of development in these 2 domains.
In Multimedia Appendix 1, we described an alternative score definition, the q-score, which is motivated by the notion of developmental quotient and is based on a more formal statistical approach. As described there, these 2 approaches lead to a similar ranking of children according to the quantified developmental delay, that is, when asking for which of 2 given children there is a greater concern for developmental delay, the 2 approaches tend to give the same answer.

Associations Between Mother and Child Characteristics and the Developmental Score
We examined the relations between the DSS and the characteristics of the mother and child. The children's characteristics included sex, gestational age at birth, birth weight, birth order, and a record of existing developmental tracking.
When analyzing gestational age, we partitioned preterm births into extremely preterm (less than 27 weeks), very preterm (27-31 weeks), and late preterm (32-36 weeks) [16]. This was the only analysis that included preterm children.
Characteristics of mothers included age at delivery; level of education; and the result of postpartum depression (PPD) evaluation, using the Edinburgh Postnatal Depression Scale (EPDS). For the purpose of the analysis, mothers were considered as having symptoms of PPD if their EPDS score was ≥10 or if their score in question number 10 (self-harm) was other than 0 [17].
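The PPD classification rule stated above can be written as a short predicate. This is a sketch only: the function name and the input layout (a list of the 10 EPDS item scores, each from 0 to 3, in questionnaire order) are assumptions made for illustration.

```python
def has_ppd_symptoms(item_scores):
    """Apply the rule: total EPDS score >= 10, or item 10 (self-harm) not 0."""
    if len(item_scores) != 10 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("expected 10 EPDS item scores, each between 0 and 3")
    total = sum(item_scores)
    return total >= 10 or item_scores[9] != 0
```

Note that a mother with a low total score is still flagged when item 10 is nonzero, so the 2 conditions are not redundant.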
To test whether differences between score averages were significant, we used the Mann-Whitney U test [18].
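As an illustration of what this test measures, the U statistic can be computed by direct pair counting, as in the sketch below. The function name is hypothetical, the quadratic loop is only practical for small samples, and no P value is derived here; in practice, a statistics library routine (for example, `scipy.stats.mannwhitneyu`) would typically be used, and this sketch does not claim to reproduce the exact procedure of the study.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples of scores.

    U_x counts the pairs (x_i, y_j) with x_i > y_j; ties count as one half.
    The two statistics always sum to len(x) * len(y).
    """
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5
    return u_x, len(x) * len(y) - u_x
```

Because most children receive a score of 0, the samples contain many ties, which is why the half-count for tied pairs matters in this setting.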

Developmental Trajectory Vectors
We described the developmental trajectory of a child by the series of its DSS values at each age step from birth to 36 months of age. Each age step s has an associated set of milestones T(s). We further partitioned the milestones by their developmental domains, denoting by T(s, d) the subset of T(s) from domain d (where d can be either "language-social" or "motor," an aggregation of fine-motor and gross-motor milestones). This allowed us to describe the trajectory per domain as the Developmental Trajectory Vectors (DTVs):
\[ \mathrm{DTV}_d = \big( \mathrm{DSS}(T(s_1, d)), \mathrm{DSS}(T(s_2, d)), \ldots, \mathrm{DSS}(T(s_7, d)) \big) \]
where $s_i$ goes over the steps of 1-3 months, 3-6 months, 6-9 months, 9-12 months, 12-18 months, 18-24 months, and 24-36 months.
This representation yielded DTVs of length 7 for each child that was assessed at all age steps. For this analysis, we excluded children whose data was missing for 1 or more age steps, analyzing the remaining groups of 294,624 and 294,066 children in the motor and language-social domains, respectively.
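Building the vectors and applying the completeness filter can be sketched as follows. The step labels, the per-child dictionary layout, and the function names are assumptions made for illustration; each value is taken to be the DSS already computed for that child, domain, and age step.

```python
# The 7 age steps, in months, in trajectory order.
AGE_STEPS = ["1-3", "3-6", "6-9", "9-12", "12-18", "18-24", "24-36"]


def build_dtv(step_scores):
    """Return the length-7 DTV, or None if any age step is missing."""
    if any(step_scores.get(step) is None for step in AGE_STEPS):
        return None
    return [step_scores[step] for step in AGE_STEPS]


def build_cohort_dtvs(children):
    """Keep only children who have a DSS at every age step."""
    dtvs = {}
    for child_id, step_scores in children.items():
        vector = build_dtv(step_scores)
        if vector is not None:
            dtvs[child_id] = vector
    return dtvs
```

Running this once per domain yields the 2 analysis groups, which can differ slightly in size because a child may have complete data in one domain but not in the other.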

DTVs Clustering
We used k-means clustering [19] to identify distinct patterns of DTVs. In addition, for sensitivity analysis of the clustering method (see Multimedia Appendix 1), we examined an alternative clustering method using a Gaussian Mixture Model [20]. Cluster validity was assessed using the Calinski-Harabasz score [21] (see Multimedia Appendix 1).
The clustering was done using only 6 of the 7 DTV entries. This is because for each domain, there is one step that included only a single milestone (for motor milestones, the 12-18 months step; for language-social milestones, the 6-9 months step), which may reduce the reliability of the results. Nonetheless, when computing cluster centroids, all entries were taken into account.
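The 2-stage procedure described above (assign clusters from 6 entries, then average all 7 entries per cluster) can be sketched as below. To keep the example self-contained, it uses a plain Lloyd-style k-means written with the standard library and a deterministic initialization from the first k points; this is a stand-in for illustration, not the scikit-learn configuration used in the study, and all function names are hypothetical.

```python
def kmeans_labels(points, k, n_iter=50):
    """Plain Lloyd iterations; initial centers are the first k points."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_dtvs(dtvs, k, skip_index):
    """Cluster on all entries except `skip_index`; centroids use all entries."""
    reduced = [[v for j, v in enumerate(d) if j != skip_index] for d in dtvs]
    labels = kmeans_labels(reduced, k)
    centroids = {}
    for c in set(labels):
        members = [d for d, lab in zip(dtvs, labels) if lab == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With the 7 age steps indexed from 0, the excluded entry would be index 4 (12-18 months) for motor DTVs and index 2 (6-9 months) for language-social DTVs. Two children who differ only in the excluded entry are always assigned to the same cluster, yet that entry still contributes to the reported centroid.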
Analyses were done using Python (version 3.6.7; Python Software Foundation) with the scikit-learn package (version 0.23.2).

Ethics Approval
The study protocol was approved by the Soroka University Medical Center institutional ethical committee (MHC-0014-19) and was conducted in accordance with the principles of the Declaration of Helsinki. The need for informed consent was waived owing to the use of deidentified data.

Results
To assess the relations between the DSS and characteristics of the children or their mothers, we compared, for each domain, the average DSS of several subgroups during the first, second, and third years of life. Figure 2 shows that the average DSS was higher (worse) for children who were under designated developmental tracking, compared to the complementary group (Figure 2A). Higher DSS was evident in the following subgroups: male children (Figure 2B), children whose mothers reported symptoms of PPD (Figure 2C), and children of older mothers (Figure 2D).
Figure 3A demonstrates the relation between the DSS and birth weight: children with birth weight of <2.5 kg or >4.5 kg had higher average DSS than children with normative birth weight (2.5-4.5 kg). Figure 3B shows that the DSS was negatively correlated with the gestational age at birth (eg, in the first year of life, Pearson r=-0.2 for gross motor milestones, -0.25 for fine motor milestones, and -0.18 for language-social milestones; P<.001). There were marked differences between preterm and on-term children, as well as between subgroups of extremely preterm, very preterm, late preterm, early term, and full-term children.
Figure 4A shows the association between the DSS and the mothers' level of education. The DSS tended to be higher among children of mothers with less formal education. In addition, the score appeared to be positively correlated with the child's birth order during the first year of life (Figure 4B; Pearson r=0.02 for gross motor milestones, 0.03 for fine motor milestones, and 0.08 for language-social milestones; P<.001), with firstborn children having the lowest average score.
This trend was maintained for the gross motor and language-social scores during the second year of life (Pearson r=0.03 for gross motor milestones, 0.01 for fine motor milestones, and 0.07 for language-social milestones; P<.001). Conversely, this correlation was evident during the third year of life only for fine-motor tasks (r=0.02; P<.001). Importantly, these correlations should be considered as affirmation for the trends suggested by the graphs; their relatively low values on these large cohorts certainly do not imply that the DSS "explains" in any way the measured characteristics.
Note that all these graphs depict average values. For the most part, children attained the assessed milestones and received a score of 0. See Table S1 in Multimedia Appendix 1 for the median and IQR values of the DSS and Figures S1-S3 in Multimedia Appendix 1 for the same analysis using DMAS instead of LMAS.
Figure 5 depicts the centroids derived from clustering of all children's DTVs into 4 clusters. Both motor DTVs and language-social DTVs exhibited similar patterns. There was a single cluster of children with near-zero DSS at all age steps. This cluster included the majority of children ("adequate"; motor DTVs: 199,078/294,624, 67.6%; language-social DTVs: 224,423/294,066, 76.3%). There was a single cluster of children who were "catching up": their DSS was initially high but tended to decrease over time. There were clusters of "worsening" children whose scores tended to increase over time (2 clusters for language-social milestones and 1 for motor milestones). For motor milestones, there was also a cluster of children whose DSS increased at an early age but then decreased back to normal values and, so, did not conform to any of these 3 patterns. In the clusters depicting an increasing trajectory, there was an overrepresentation of male children relative to the "adequate" cluster.
Specifically, in the motor domain, 50.7% (100,901/199,078) of the children in the "adequate" cluster were male, compared to 57.8% (5466/9456) in the "worsening" cluster. In the language-social domain, male children were 48.1% (107,970/224,423) of those in the "adequate" cluster, compared to 71.3% (7093/9952) and 62% (27,566/44,495) in the rapidly "worsening" and moderately "worsening" clusters, respectively. In addition, the "worsening" clusters had larger proportions of children who were born by cesarean section, had low birth weight, or were under developmental tracking. In Multimedia Appendix 1, we demonstrate that qualitatively, these results were consistent across different numbers of clusters, as well as when using an alternative clustering method.

Discussion
The goal of this study was to construct a DSS that can be used for comparative tracking of children's development, quantifying milestone attainment in a concise and straightforward way. We presented a simple methodology for calculating the DSS, a quantitative developmental surveillance score that aggregates age-dependent milestone results over a chosen time frame and domain into a single score. To demonstrate its coherence, we explored 2 main use cases for this score: comparing its value among subpopulations and using it to depict the developmental trajectory of individuals. We demonstrated that the DSS reflects known associations between developmental status and characteristics of the child and mother, as well as its potential for suggesting possible new associations and insights, which may be a stepping stone for further research.
For some of these variables, the DSS suggests a possible association with developmental delays, depicting different score distributions among subgroups stratified by the variable, even within the normal range. For example, it is well established that low birth weight is associated with developmental delays [27,29-32], yet the results herein suggest that this may also be true for birth weight within the lower normal range (2.5-3 kg) and for birth weight above the normal range (more than 4.5 kg). Similarly, although the scores of preterm children are higher than those of full-term children, there is a gradual decrease in the average score by the level of prematurity (extremely preterm, very preterm, and late preterm children), as well as a difference between early term and full-term children.
At the same time, some characteristics show a more complex behavior; for example, the DSS tends to be positively correlated with the child's birth order, yet for language-social tasks evaluated at 24-36 months of age, the correlation becomes negative. Indeed, although previous work generally associates primipara with lower risk for developmental delay [32,35,36], Oshima-Takane et al [37], who focused on language development at 21 and 24 months of age, observed higher language skills among second-born children.
Cluster analysis consistently identified 3 types of developmental trajectories: 1 cluster of children who succeed in attaining nearly all milestones, containing most of the children; 1 cluster of children who tend to fail early-age milestones but show improvement over time and succeed in attaining later milestones; and 1 or more clusters of children whose performance grows worse over time, with different clusters depicting different severities of failures. These clusters correspond to common types of developmental patterns observed in clinical practice, although, importantly, not all clusters can be categorized as 1 of these 3 types. Future work may use these clusters as class labels, in an attempt to predict the developmental trajectory type of a child at an early age and, accordingly, consider timely intervention when needed.
This work has several limitations. Importantly, the main goal was to present the DSS and show that it is consistent with current knowledge on risk factors for developmental delay such as low birth weight, preterm birth, older maternal age, symptoms of PPD, or lower level of maternal education, as well as to suggest interesting new observations. It is not proposed as a screening tool, and although we demonstrated its rationale and coherence, we lacked a "ground truth" of developmental delay against which to validate the score. Future work should aim to assess the score's potential contribution to the clinical workflow of developmental assessment, for example, by comparing it to established screening and evaluation tools such as the Denver [8,11] and Bayley [13] scales, as well as to developmental outcomes beyond those in the current data set, such as a diagnosis of autism.
Such a comparison is also needed for the calibration of the method with respect to milestones and age windows used to derive the score. For example, deriving the score by averaging milestone attainment during a full year implicitly assumes that a single number can represent the developmental delay over this entire period. Conversely, calculating a new score per visit does not take into account valuable information from past evaluations.
Another limitation stems from the use of slightly different cohorts for each age group. As depicted in Table 1, the cohorts differ in size and some of the characteristics, which may introduce some bias to the comparisons of age groups. However, as most of the presented results compare stratified population groups, the existence of similar differences in each age group strengthens the derived observations.
The results described herein pertain to the milestones used in Israeli MCHCs and the age thresholds computed in the THIS developmental scale [9]. Generalizing these results to other settings can be done by adopting the same methodology but would require having, or constructing, a developmental scale that is suitable for that setting. With such a scale at hand, one can compute a DSS from milestone attainment data by comparing them to the age thresholds and defining the score accordingly.
Taken together, our results suggest the potential usefulness of incorporating the DSS into the developmental surveillance workflow. We envision it as being computed automatically once a child's electronic health record is updated with new milestone attainment results and compared to the child's trajectory of past achievements, as well as to the population's norm. In cases where the score deviates significantly on either count, the system would notify the nurse, possibly leading to a more thorough evaluation. When calibrated correctly, such a system could identify developmental delays in a timely manner and foster interventions for improving the prospective outcomes.