Validation strategies for subtypes in psychiatry: A systematic review of research on autism spectrum disorder

Heterogeneity within autism spectrum disorder (ASD) is recognized as a challenge to both biological and psychological research, as well as clinical practice. To reduce unexplained heterogeneity, subtyping techniques are often used to establish more homogeneous subtypes based on metrics of similarity and dissimilarity between people. We review the ASD literature to create a systematic overview of the subtyping procedures and subtype validation techniques that are used in this field. We conducted a systematic review of 156 articles (2001-June 2020) that subtyped participants (range N of studies = 17 – 20,658), some or all of whom had an ASD diagnosis. We found a large diversity in (parametric and non-parametric) methods and (biological, psychological, demographic) variables used to establish subtypes. The majority of studies validated their subtype results using variables that were measured concurrently, but were not included in the subtyping procedure. Other investigations into subtypes' validity were rarer. In order to advance clinical research and the theoretical and clinical usefulness of identified subtypes, we propose a structured approach and present the SUbtyping VAlidation Checklist (SUVAC), a checklist for validating subtyping results.


Introduction
The characteristics of people diagnosed with Autism Spectrum Disorder (ASD) vary greatly, even though ASD is characterized by challenges in social interactions and communication, restrictive repetitive behaviors, and sensory sensitivities (American Psychiatric Association, 2013). Originally, a narrow category of children received an ASD diagnosis, when they were severely impaired in their social and communication skills, and could hardly bear changes in their environment (Kanner, 1943). Over the years, diagnostic criteria have changed, and now include a much wider spectrum (Wing & Potter, 2002). The prevalence has increased, from 4 cases in 10,000, to around one case in 100 (Elsabbagh et al., 2012; Fombonne, 2018), which is most likely due to widening of the criteria (Mottron & Bzdok, 2020) and an increase in recognition, rather than an increase in actual incidence, as ASD prevalence is stable across different ages (Brugha et al., 2011). For example, while ASD used to be diagnosed primarily in boys of European descent, there are many developments in increased diagnosis of ASD in girls (Lai, Lombardo, Auyeung, Chakrabarti, & Baron-Cohen, 2015), adults and elderly (Piven & Rabins, 2011), and non-Caucasian populations (Becerra et al., 2014). When more people qualify to receive an ASD diagnosis, the group of people with an ASD diagnosis will become more heterogeneous (Mottron & Bzdok, 2020). Increasing heterogeneity of the ASD population will make it even more difficult to formulate straightforward clinical advice within support programmes. Heterogeneity between people with an ASD diagnosis is already causing difficulties in finding causes and interventions for this population (Happé, Ronald, & Plomin, 2006). In the current article we review the literature on subtyping people with a diagnosis of ASD, and specifically focus on what validation strategies researchers use to make sure that their subtypes are useful, reliable, and valid.
In the scientific (ASD) literature, the term heterogeneity is used in various ways. Some use the word heterogeneity to describe random variability between individuals (e.g., Georgiades, Szatmari, & Boyle, 2013). People vary in psychological traits (e.g., in personality or ability) and in biological characteristics (e.g., gene expression, brain morphology). For example, on two questionnaires, 50 people with an ASD diagnosis may obtain 50 different combinations of scores. Such random variability also complicates the search for causes, as a cause would be more readily identified if all people with an ASD diagnosis were identical. However, random variability is not the kind of heterogeneity we refer to here.
We define heterogeneity as the existence of subtypes that are qualitatively different. This can be in the psychological or biological domain, or a combination of both. For example, 50 people with an ASD diagnosis who fill in two questionnaires may form two subtypes, with 30 people obtaining a high score on the first questionnaire and a low score on the second, and 20 people obtaining a low score on the first questionnaire and a high score on the second. For the first subtype, genetic causes may be responsible for their difficulties, while environmental causes may be responsible for the second subtype's difficulties. These causes become much harder to identify without knowing that subtypes exist, and without knowing to which subtype people belong. This illustrates the importance of identifying valid subtypes, as a possible prerequisite for identifying causes.
In the ASD research realm, various attempts have been undertaken to tackle heterogeneity by establishing more homogeneous subtypes, to meet specific needs of specific subtypes. There are many reasons that subtyping analyses are desirable (Grzadzinski, Huerta, & Lord, 2013; Georgiades et al., 2013). First, if we can assign people with a high degree of certainty to subtypes, we can study what the prognosis is for people in different subtypes, and provide better information to people on what to expect later in life (Bohane, Maguire, & Richardson, 2017). Second, if a subtype can be identified that is homogeneous in the constellation of behaviors that people show, this could aid the search for biomarkers for these behaviors. If such biomarkers exist, these can be used in early diagnosis, and therefore early interventions, potentially leading to better outcomes. Differences in biomarkers between subtypes could also be the cause of subtype membership. Third, if we can assign people to subtypes, we can find out what kind of intervention works best for which subtype, and which intervention may even be disadvantageous for a particular subtype.
Prognosis, predictors of subtype membership and heterogeneity of intervention effects all relate to outcomes that are external to the subtypes themselves. This focus on the predictive value of a subtyping result is in line with recent recommendations on tying subtyping methods to predictive methods. Such methods can ensure that subtyping results will also have practical implications (Feczko et al., 2019), given that there are theoretical or clinically motivated reasons that there should be a relationship between the outcome to be predicted, and the constellation of variables used to form subtypes.
There are also ontological reasons to study subtypes, which are of intrinsic value regardless of external outcomes. First, a subtyping analysis can examine whether individual differences reflect subtypes or a dimension (Bernstein et al., 2010). A dimension on which people differ randomly may cause similar problems as undetected subtypes, and may cause researchers to presuppose the existence of subtypes (Widiger, 1992). A subtyping analysis would be required to discover the absence of subtypes. Second, subtyping analyses can examine established delineations between disorders. For example, delineations between ASD and conditions like schizophrenia and ADHD (Eack et al., 2013) can be studied, to examine whether the current delineations are optimal in assigning people to the best possible intervention, or whether an alternative delineation may better represent individual differences in the associated psychopathology. If a single diagnosis is found to be a combination of multiple subtypes, this could then lead to an evidence-based split of categories in diagnostic manuals (Brewin et al., 2017).

Subtyping in ASD
In this article, we review the literature on empirical subtyping of people with ASD. To our knowledge, there have been five past reviews of subtyping in ASD (Beglinger & Smith, 2001; DeBoth & Reynolds, 2017; Marquand, Wolfers, Mennes, Buitelaar, & Beckmann, 2016; Syriopoulou-Delli & Papaefstathiou, 2020; Wolfers et al., 2019). Beglinger and Smith (2001) provide an overview of the literature up to 2001, and include 17 different studies on subtypes of ASD. Although there are some discrepant results, their review of the evidence suggests that there are around four different subtypes that can be discerned in every study, with differences in results depending on what variables are included in the subtyping method. Generally, most results indicate a severity gradient, with subtypes that are ordered in the sense that one subtype is least affected, and one is most affected across different variables. DeBoth and Reynolds (2017) provided an overview of the literature on subtyping of people with an ASD diagnosis on the basis of sensory-based measures, and included eight articles. Their review of the evidence indicates that generally, there are three to five subtypes, depending on whether measures of both hyporeactivity and hyperreactivity were included. Marquand et al. (2016) provided a review of the literature on subtyping in psychiatry in a broader sense. For ASD, they discuss six recent articles, and highlight the diversity in used variables and resulting subtypes. Recently, Wolfers et al. (2019) provided a review of the literature on subtyping in ASD. In comparison to the present article, they included fewer articles that are relevant to this discussion (19, vs. 156 articles included in the present article). This difference in article inclusion is most probably due to the authors' decision to include a shorter time frame, use less comprehensive search terms, and restrict the search to a single database. Furthermore, their review recorded two validation strategies, while the present review distinguishes seven. Similarly, the Syriopoulou-Delli and Papaefstathiou (2020) review included an even smaller number of articles (10 articles).
One aspect of subtyping analyses that is understudied in each of these reviews is the way in which results are substantiated, which we call validation strategies. If an analysis finds four different subtypes, there is little information to either corroborate or contest the existence of those four subtypes, and it remains an open question whether the four subtypes are a chance finding. There is some implicit information in the methodological rigor of the study design, and the number of data points that were collected. However, a subtyping result in itself does not provide information on whether the results are generalizable to the broader population, replicable in other research, or useful to other researchers and clinicians in their thinking.
In the present article, the current state of validation in the literature on empirical subtyping studies in ASD is therefore reviewed. Since the review of Beglinger and Smith (2001), many articles have been published beyond the 17 that they included, using a wide variety of samples, variables, methods, and ways of validating the results. Also, we focus on empirical subtyping methods, excluding articles that use preset cutoffs to form subtypes. In contrast to the review of DeBoth and Reynolds (2017) that focused on sensory variables, our review does not focus on a single domain, but considers all types of variables that have been used to find subtypes in ASD. We systematically review the literature between 2001 and June 2020, including every study that uses an empirical method to find subtypes within a sample of autistic people. In contrast to Wolfers et al. (2019), we look at a more representative sample of articles that are relevant to this discussion, and focus on a wider range of validation measures. The main question we aim to answer is: Does the growing body of subtyping literature provide sufficient corroborating evidence to suggest valid and reliable subtypes? And if we can answer this question affirmatively, what are the subtypes that are most consistently supported by the literature?

Validation strategies for subtypes
In the rest of the review, we aim to identify applications of seven validation strategies in the literature, defined as follows. The first two validation strategies, "cross-method replication" and "subtype separation", can be applied with a single data set, with a single set of measures. These strategies are depicted in Fig. 1. In "cross-method replication", subtypes are formed with two or more statistical methods, and the results are compared. For example, hierarchical clustering and k-means clustering techniques can be applied to a single dataset, to see whether each technique results in the same number and score profile of subtypes. In one study, four different methods were applied to a single dataset to establish whether the number of subtypes was stable across methods (Hu & Steinberg, 2009). The reasoning is that subtypes are clearly distinguishable when they can be detected with disparate statistical methods (Taylor, Asmundson, & Carleton, 2006).
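One way to quantify agreement between two subtyping methods is sketched below. It is an illustrative assumption, not a procedure taken from the reviewed studies: it computes the adjusted Rand index between two sets of subtype labels for the same participants. The function name and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same people."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same participants")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two participants are needed")
    # Contingency counts: people in subtype i (method A) and subtype j (method B).
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one subtype in both methods
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical subtype assignments for ten participants from two methods.
hierarchical = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
kmeans = [1, 1, 1, 0, 0, 0, 2, 2, 2, 0]
print(round(adjusted_rand_index(hierarchical, kmeans), 3))
```

The index does not depend on how each method numbers its subtypes, which matters because label numbering is arbitrary across clustering methods. A value of 1 indicates identical partitions, and values near 0 indicate agreement at chance level.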
One metric for the validity of subtypes is the certainty with which participants are assigned to different subtypes. We refer to this as "subtype separation", as it measures whether the subtypes are clearly separated and distinct, or whether there is overlap between subtypes. For these purposes, statisticians have developed indices like the "mean posterior probability of class membership", which quantifies the degree of certainty with which people are assigned to specific subtypes. The reasoning is that subtypes are more valid if people are consistently a member of one, and only one, subtype, rather than being a possible member of multiple subtypes.
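As a minimal sketch of how such an index could be computed, assume a model-based method has returned one row of posterior membership probabilities per participant. The function names and the example matrix below are hypothetical.

```python
def mean_max_posterior(posteriors):
    """Average, over participants, of the highest membership probability."""
    return sum(max(row) for row in posteriors) / len(posteriors)


def mean_posterior_by_subtype(posteriors):
    """Average assignment probability among those assigned to each subtype."""
    grouped = {}
    for row in posteriors:
        assigned = row.index(max(row))  # modal (most likely) subtype
        grouped.setdefault(assigned, []).append(row[assigned])
    return {k: sum(v) / len(v) for k, v in grouped.items()}


# Hypothetical posterior probabilities for four participants, two subtypes.
posteriors = [
    [0.95, 0.05],
    [0.90, 0.10],
    [0.55, 0.45],  # ambiguous participant: lowers separation
    [0.10, 0.90],
]
print(mean_max_posterior(posteriors))
print(mean_posterior_by_subtype(posteriors))
```

Values close to 1 suggest that participants are assigned with high certainty. Any cutoff for what counts as adequate separation is a convention that should be justified in the study at hand.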
The third and fourth validation strategies, "independent replication" and "temporal stability", require extra data collection using the same measures, either testing new participants or testing participants a second time. These strategies are depicted in Fig. 2. In "independent replication", subtypes are constructed based on two different samples, using the same measures. Because the two samples are independent, the second sample can be used to test whether the initial subtyping result replicates. For example, a sample can be split into two, validating the results of the analysis of the first half on the second half. In one study with an intellectually disabled sample, the eight-subtype solution that was found for the first sample was validated in a second sample (Brown, Aman, & Lecavalier, 2004). The reasoning is that if subtypes exist in the population, analysis of any representative sample from this population should recover the same number and type of subtypes.
In "temporal stability", subtypes are formed at one measurement occasion, and established again at a later measurement occasion, using the same measures. We want to know whether subtype membership is stable over time, or whether participants switch between subtypes. To establish stability, participants can be retested after a number of years, and subtypes can be constructed once more with this data, comparing the results of the two subtyping analyses. For example, in one study, children with an ASD diagnosis that were subtyped at the time of diagnosis were retested and re-analyzed at age 6 with the same subtyping technique, to find that the children in the three subtypes at baseline were now divided over two subtypes, which did not correspond one-to-one with one of the subtypes at baseline (Georgiades et al., 2014). The reasoning is that if subtypes are valid to the point where we can find causal biomarkers for them, the number of subtypes should remain the same, and subtype membership should not vary too much over time. In a review of subtypes in the eating disorder literature, a consistent three subtypes were found across studies, but the complete absence of investigations into temporal stability was identified as an important limitation throughout (Wildes & Marcus, 2013).
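A simple way to inspect temporal stability is a transition table between the two occasions. The sketch below assumes that subtype labels at both occasions have already been matched by score profile, which is itself a nontrivial step. The function names and data are hypothetical.

```python
from collections import Counter


def transition_table(time1, time2):
    """Proportion of each baseline subtype that ends up in each later subtype."""
    counts = Counter(zip(time1, time2))
    origins = Counter(time1)
    return {(a, b): c / origins[a] for (a, b), c in counts.items()}


def proportion_stable(time1, time2):
    """Share of participants with the same subtype label at both occasions."""
    return sum(a == b for a, b in zip(time1, time2)) / len(time1)


# Hypothetical memberships: three subtypes at baseline, two at follow-up.
baseline = ["A", "A", "A", "B", "B", "C"]
follow_up = ["A", "A", "B", "B", "B", "B"]
print(transition_table(baseline, follow_up))
print(proportion_stable(baseline, follow_up))
```

In this toy example subtype C disappears at follow-up. That loosely mirrors the situation described above, in which the number of subtypes changed between occasions.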
The final three validation strategies, "external validation", "parallel validation", and "predictive validation", require that data on more variables are collected, outside of the variables that are used in the subtyping procedure. These strategies are depicted in Fig. 3. In "external validation", subtypes are compared on variables that were not used in the construction of the subtypes, and that are theorized to be related to interindividual differences. For example, subtypes can be compared on demographic variables or other variables that should theoretically be different between subtypes, but were not used in the construction of the subtypes. This was done with subtypes constructed using age, cognitive abilities, and adaptive functioning, after which the subtypes were compared on the scores they obtained on a checklist of ASD behaviors (Bitsika, Sharpley, & Orapeleng, 2008). The reasoning is that differences between subtypes should not be limited to variables used to construct the subtypes.
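To illustrate external validation, the sketch below computes eta squared, the share of variance in an external variable that is explained by subtype membership. The choice of effect size is an assumption made for illustration; the reviewed studies used a variety of tests. The function name and scores are hypothetical.

```python
def eta_squared(groups):
    """Between-subtype sum of squares divided by the total sum of squares."""
    values = [v for scores in groups.values() for v in scores]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("external variable has no variance")
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total


# Hypothetical scores on a variable NOT used to form the subtypes.
external_scores = {
    "subtype_1": [10, 12, 14],
    "subtype_2": [20, 22, 24],
}
print(round(eta_squared(external_scores), 3))
```

The same comparison applied to a variable measured at a later occasion would correspond to predictive validation rather than external validation.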
In "parallel validation", subtypes are constructed with the same sample at the same measurement occasion, with different variables that are theoretically equivalent to the variables that are used in the subtyping. For example, latent trajectory subtypes were found to be the same in a longitudinal study of children with an ASD diagnosis, regardless of which measure of daily living skills was used (Bal, Kim, Cheong, & Lord, 2015). The reasoning is that such a result would indicate that the subtypes are determined by the constructs underlying the measurement instruments, rather than by the particular instruments that were chosen.
In "predictive validation", subtype membership is used to predict variables on a later measurement occasion. This is similar to both "external validation" and "temporal stability", as information is used on other variables than are used to form subtypes, and data is used from a later measurement occasion. If subtypes are found to differ on variables at a later measurement occasion, this is evidence that the subtypes are not only distinct, but also have prognostic value for the individual. The reasoning is that if subtypes are found to provide reliable predictions for future outcomes, this means that they are not only valid in the sense of describing real differences between subtypes, but are also clinically relevant. As noted in a review of OCD subtypes, using subtypes to predict treatment response is done in relatively few studies, even though also in OCD, results suggest that treatments need to be adjusted to the specific subtype (McKay et al., 2004).
The application of these validation strategies is far from identical across studies. Different studies use different indices to establish "subtype separation". Also, studies do not necessarily use the term "external validation" to describe comparisons between subtypes on additional outcome measures. However, all validation strategies that are used in the literature to corroborate the existence of subtypes can be classified as belonging to one of these seven.

Search strategy
The literature search strategy combined keywords related to ASD diagnoses (variations of autism, Asperger's and Pervasive Developmental Disorder) with keywords related to the different types of subtyping methods (exact search syntax in Appendix): parametric methods (variations of latent class analysis, mixture models, etc.), non-parametric methods (variations of k-means, hierarchical clustering, etc.) and community detection methods (variations of community detection, cliques). Both PsycINFO and MEDLINE (on which PubMed is based) databases were searched, because these cover different portions of the literature (Wu, Aylward, Roberts, & Evans, 2012). In all articles, references to other subtyping analyses were inspected to make sure they were included if they were not found by the initial search.

Inclusion and exclusion criteria
There were seven inclusion criteria, relating to publication date, samples, measures, and analyses. The first inclusion criterion was publication after 2000. Two searches were conducted. The first was conducted in February 2018, and included papers published between January 2001 and February 2018. The second was an update in June 2020, and included papers published between February 2018 and June 2020. The second inclusion criterion was that living humans were studied as test subjects. The third inclusion criterion was that at least part of the sample had an ASD diagnosis.
The fourth inclusion criterion was that measurements were taken that related to the person with the ASD diagnosis. This for example excluded studies that measured the behavior of mothers of children with an ASD diagnosis. However, proxy ratings were included, i.e., ratings that mothers provided of the behavior of their children with an ASD diagnosis. The fifth inclusion criterion was that the subtyping method was used to assign people to subtypes. The sixth inclusion criterion was that an empirical statistical method of subtyping was used to assign people. This excluded studies that used predefined subtype descriptions to assign people, which might have been established on theoretical grounds or on earlier empirical work. Articles that featured only taxometric analyses (Bernstein et al., 2007; Meehl, 1995), which aim to identify whether there are two subtypes or no subtypes, were also not included. The seventh inclusion criterion was that an unsupervised method was used, i.e., a method that finds subtypes rather than a method to predict a particular outcome. This excluded, for example, support vector machines, and other classifiers. Studies that used novel approaches to adapt existing supervised learning methods to the unsupervised case were included.

Data recording
For the first main search, all data were recorded by one of two authors (JAR, MKD), with each checking the other's coding. Data for the update were recorded by the first author (JAR). Furthermore, a number of checks were performed, correcting any possible errors (e.g., 80% being coded as 0.8). Aside from article characteristics, like authors and publication date, we recorded data from each article on four levels: sample characteristics, variable characteristics, analysis characteristics, and validation characteristics. The choice of which data to record was based on earlier reviews of subtyping analyses (van Rooden et al., 2010; Beglinger & Smith, 2001; Marquand et al., 2016; DeBoth & Reynolds, 2017).

Sample characteristics
First, we recorded aspects of the sample that was used in the subtyping analysis. If the initial sample was larger than the sample that was analyzed in the subtyping analysis, we recorded the characteristics for the analyzed sample. We recorded the sample size, the percentage of males, the mean age and age range, the mean IQ and IQ range, the percentage of participants with an ASD diagnosis and the diagnostic manual the ASD diagnosis was based on. We recorded sample sizes because they can be influential in how many subtypes are found, and how precise the delineation of different subtypes is. How large samples need to be to detect subtypes is understudied (Dziak, Lanza, & Tan, 2014). We recorded the mean age and age range of the participants because of possible differences in subtyping between infants, children, adolescents, adults, and older adults. Some studies included a broader age range, spanning multiple developmental categories. In such studies, it is of interest to see whether the subtypes that were found do not simply reflect heterogeneity in developmental stage.
Lastly, we logged the percentage of the sample with an ASD diagnosis. Some studies might have included both a typically developing group, and an ASD group. Other studies might have included an ASD group, and a group with a different diagnosis, such as schizophrenia or ADHD. Other studies only included an ASD group. If there were multiple groups in the study, but only the ASD group was used in the subtyping analysis, we only recorded the ASD group.

Variable characteristics
We recorded the number of variables that were included in the subtyping analysis, which might be the number of questionnaires, the number of subscales, or the number of items. We also documented the type of variables. There are many different kinds of variables one can use to make subtypes that can be broadly categorized as demographic, psychological, and biological. Demographic variables may for example be age, sex, and level of education. Psychological variables may for example be questionnaires, cognitive tests (McCrimmon, Schwean, Saklofske, Montgomery, & Brady, 2012), or symptom checklists (Klopper, Testa, Pantelis, & Skafidas, 2017). Biological variables may for example be gene expression measurements (Kong et al., 2013), facial features (Obafemi-Ajayi et al., 2015), or EEG measures (Hasenstab, Sugar, Telesca, Jeste, & Şentürk, 2016).
Fig. 3. Illustration of the external validation, parallel validation, and predictive validation strategies. For external validation, emotional and brain outcomes would not be included in the formation of the subtypes. For predictive validation, outcomes would not be included in the formation of subtypes, and would be measured at a later measurement occasion.

Analysis characteristics
We recorded the type of statistical subtyping procedure, the number of subtypes that were obtained, and the relative sizes of the different subtypes in percentages of the total sample, sorted from largest to smallest.

Validation characteristics
We recorded whether the seven validation procedures described in the introduction were followed. To determine whether "cross-method replication" was assessed, we logged whether multiple statistical subtyping methods were used to arrive at subtypes. To determine whether "subtype separation" was assessed, we recorded whether standardized metrics were computed that quantified how distinct subtypes were, or whether the posterior probabilities of subtype membership for the participants were computed. Standardized metrics are for example the Silhouette, Dunn, and Calinski-Harabasz indices. These metrics indicate whether the variation between subtypes is large in comparison to the variation within subtypes, which reflects how separable or differentiable subtypes are. Posterior probabilities of subtype membership also reflect how separable subtypes are: If every participant can be assigned to a particular subtype with a high probability, then subtypes are more distinct than when participants can possibly belong to two or more subtypes (Nagin, 2005). Posterior membership probabilities are not available for the traditional non-parametric subtyping methods.
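As an illustration of one of the standardized metrics named above, the following sketch computes the mean silhouette width using Euclidean distance. The function name, the choice of distance, and the toy data are assumptions made for this example.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette width; values near 1 indicate well-separated subtypes."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two subtypes are required")
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # common convention for singleton subtypes
            continue
        # Mean distance to the other members of the point's own subtype
        # (the point's zero distance to itself does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other subtype.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical two-variable scores for four participants.
scores = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_score(scores, [0, 0, 1, 1])
poorly_separated = silhouette_score(scores, [0, 1, 0, 1])
print(well_separated, poorly_separated)
```

The first labelling groups nearby participants and yields a value close to 1. The second labelling mixes distant participants and yields a negative value, which signals overlapping subtypes.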
To determine whether an "independent replication" was undertaken, we recorded whether the subtyping result was evaluated on a sample different from the one used to establish the subtype result. This could also have been done in a cross-validation setup, where the fitting sample (the "training set" in machine learning terms) and the evaluation sample (the "test set" in machine learning terms) switch roles. To determine whether "temporal stability" was assessed, we documented whether the subtyping analysis was performed at multiple measurement occasions, for all articles that had data on multiple measurement occasions. Latent transition analysis falls within this category, as subtypes are formed at two occasions and transitions between subtypes are modeled. We did not record latent growth curve analysis as assessing temporal stability, as it uses data from multiple measurement occasions once to form subtypes, which does not convey information on stability of subtype membership over time.
To determine whether "external validation" was assessed, we logged whether subtypes were subsequently compared on variables that were not used to define the subtypes. To determine whether "parallel validation" was assessed, we recorded whether a subtype analysis was run twice in the same article, with different variables. If the variables were not clearly in different domains as considered by the authors, we recorded this as an assessment of parallel validation. To determine whether "predictive validation" was assessed, we recorded whether subtypes were compared on variables that were not used to define the subtypes, like in external validation, but that were also measured at a later measurement occasion.
For each of the validation methods, we did not record to which degree the results were valid: This is a subjective decision, and depends on the context. Therefore, our goal was to record whether these steps towards validation of the results were taken, without judging whether they were successful.

Results
The PRISMA diagram is provided in Fig. 4 (final n = 156). The records that were initially not found in our search were identified in reference lists of other articles.
The samples were not always completely independent between articles, as some articles extended a sample that was collected before, some articles performed a different analysis on the same sample, and some articles added an aspect or variable to an earlier subtyping analysis to answer new research questions. Therefore, the 156 articles reported on here do not correspond to 156 unique datasets. We excluded five articles that described a subtyping analysis that had already been performed with the same sample in a different article.
The majority of the articles that we included were recent, as half of the articles were published after 2016. The number of articles that meet our criteria has been steadily increasing (Fig. 5).

Results sample characteristics
There are large differences between studies in sample size, demographics and inclusion criteria. A brief summary of each of these aspects is given below (see Table in the Appendix for study details).

Sample size varies from tens to tens of thousands
The median sample size was 190. Sample size ranged from N = 17 adults for a pilot study of language skills in adults with ASD (Lewis, Woodyatt, & Murdoch, 2008) to N = 20,658 for an analysis of electronic health records (Lingren et al., 2016). Of the samples, 32% were smaller than N = 100, 30% were between N = 100 and N = 300, and 38% were larger than N = 300. Of the studies, 15% were smaller than N = 50 and 16% were larger than N = 1000. Sample sizes increased somewhat over the time frame included in this review, although studies with fewer than 100 participants remain common (see Fig. 6).

The majority of the participants were male
The median percentage male was 80%, with half of the studies having a percentage of males between 73% and 87%. This indicates that the inclusion rate of women into studies of subtyping was in line with current estimates of the proportion of women with an ASD diagnosis (Lai et al., 2015; note that also population studies with a minority with an ASD diagnosis were included, see below). One study included only women (Pohl, Cassidy, Auyeung, & Baron-Cohen, 2014).

The majority of subtyping analyses were in an all-ASD sample
For 63% of the analyses, all of the participants were diagnosed with ASD. Here, it should be noted that in studies with a mixed sample, inclusion into the subtyping analysis was leading. So, the articles coded as having a sample of whom 100% were diagnosed with ASD might have included comparison participants that were not included in the subtyping analysis. In 18% of the studies, less than half of the participants were diagnosed with ASD. Outliers were the studies where only 3-5% of the participants were diagnosed with ASD (McChesney & Toseeb, 2018; Nishimura, Takei, & Tsuchiya, 2019; Painter et al., 2018; Berlin, Lobato, Pinkos, Cerezo, & LeLeiko, 2011; Dyck, Piek, & Patrick, 2011). One of these (Berlin et al., 2011) reported ASD diagnosis status only for the subtype named "ASD". For 4% of the studies, percentage ASD diagnosis was missing. Most studies studied either only an ASD-diagnosed group, or an ASD-diagnosed group together with a typical comparison group. However, there were studies that had looked at diagnostic boundaries between ADHD and ASD (Dajani, Llabre, Nebel, Mostofsky, & Uddin, 2016; Rommelse, van der Meer, Hartman, & Buitelaar, 2016; van der Meer et al., 2012). Another study looked at children with Down syndrome, of which some had a comorbid ASD diagnosis (Ji, Capone, & Kaufmann, 2011). A number of articles studied a diagnostically diverse sample (Castro & Pinto, 2015; Lecavalier, 2006; Little, Dean, Tomchek, & Dunn, 2017). Also, in the literature from the period where there were still divisions in the DSM between disorders that fall under ASD, subtyping analysis was used to assess whether these divisions were valid (Verté et al., 2006).

Fewer than 20 variables are commonly used to construct subtypes
The median number of variables that were included in the subtyping analysis was eight. 80% of studies included fewer than 20 variables. Larger numbers occurred sporadically, with 1350 as an absolute outlier (counts of ICD-codes over time; Doshi-Velez, Ge, & Kohane, 2014). Exceptions were a number of articles that performed latent class growth curve analyses, which typically focused on the progression on a single variable over time. These studies made up the majority of the 7% of the studies that only examined a single variable.
Figure note: the y-axis is on a log10 scale. Some random noise is added to publication year to prevent overlapping points.

ASD characteristics are most frequently used to construct subtypes
The most frequently used instruments for subtyping analyses were the Autism Diagnostic Interview-Revised (ADI-R) and the Autism Diagnostic Observation Schedule (ADOS). The ADI-R was used in 20 studies, although studies differed in whether subscale scores or individual items were used. The ADOS was used in 20 studies, but studies differed in whether multiple variables were used, or only a Calibrated Severity Score was entered into the analysis. Because the studies were predominantly studies of children, the Vineland Adaptive Behavior Scales (VABS; 18 studies), Mullen Scales of Early Learning (10 studies) and Child Behavior Checklist (CBCL; 9 studies) were other popular choices. In total, 14 studies used variables related to sensory input (most already well-described in the specialized review mentioned in the introduction, DeBoth & Reynolds, 2017). There was generally a large diversity of variables that were used, both biological and psychological, with almost all studies having a unique set of variables included in the subtyping analysis.

Latent class analysis and hierarchical clustering are most popular
Hierarchical clustering was the most popular non-parametric method, used in 34% of the papers. k-means clustering was used in 17% of the papers. Latent Class Analysis was the most popular among the parametric methods, used in 36% of the papers. Note that under Latent Class Analysis, we subsume Latent Profile Analysis, which is applied in case of continuous measures.
A number of methods were hybrids of earlier developed methods or were otherwise too novel to fit in with the standard methods. For example, an ensemble of three methods was used in one article (Shen, Lee, Holden, & Shatkay, 2007). These five studies were coded as "Other". Three studies made use of Two-Step Clustering, a method that is specific to the SPSS software package. Factor mixture models were used in eight studies to answer the question whether individual differences could best be described by a number of subtypes, a dimension, or a number of subtypes within which individual differences could best be described by a dimension. Latent transition analysis, a method that is especially suited for stability analyses, was performed in three studies. Latent growth curve models were used in ten studies, particularly in young children. Multivariate latent growth curve models form a theoretically strong model, and were used in three studies.

Two to four subtypes are recovered in the vast majority of articles
The median number of subtypes was three, and 82% of all results indicated between two and four subtypes. 11% found five, 3% found six. The largest number of subtypes was 16 (Stevens et al., 2019); these 16 subtypes were again analyzed to recover five higher-order subtypes. The lowest number of subtypes was one, which occurred in two articles (Kamp-Becker et al., 2010; Beauchamp, Rezzonico, & MacLeod, 2020), but was only one result among many analysis results in both cases. See Fig. 7 for the full distribution.

Substantive conclusions across articles are complicated by study heterogeneity
Because of the many differences between measures, participants' ages, sample compositions and diagnostic processes we discovered in the sample of studies, it seems premature to combine findings from studies. Subsets of studies that are more similar in terms of measures and samples become too small to draw strong conclusions. However, with the studies that have used the most popular measures (ADI-R, VABS, ADOS), we can get some impression of the stability of subtypes across more homogeneous sets of studies.
Seven different studies used variables from the ADI-R in child samples where 100% had a diagnosis of ASD (Bureau, Labbe, Croteau, & Mérette, 2008; Cholemkery, Medda, Lempp, & Freitag, 2016; Georgiades et al., 2014; Hu & Steinberg, 2009; Pichitpunpong et al., 2019; Shen et al., 2007; Verté et al., 2006). Across these studies, between two and five subtypes are retrieved, mirroring the results of the entire sample of studies. It is difficult to understand where the differences in number of subtypes come from (the number of subtypes seems unrelated to publication year, statistical method, and number of variables included), and the number of studies becomes too low to stratify these studies further. Across three of the studies (Cholemkery et al., 2016; Georgiades et al., 2014; Verté et al., 2006), the authors note that subtypes are primarily distinguished in terms of severity of symptoms, rather than by qualitative differences between subtypes. This is in contrast with studies that find four subtypes (Hu & Steinberg, 2009; Pichitpunpong et al., 2019), for which there is at least one qualitatively different subtype.
There seems to be some pattern when we look at studies that have applied latent growth curve models: Studies that have used single variables from the Vineland Adaptive Behavior Scales tend to find fewer subtypes (2; Farmer et al., 2018, Bal et al., 2015, Tomaszewski, Smith DaWalt, & Odom, 2019), than studies that have used single variables from the ADOS (4-5, Gotham, Pickles, & Lord, 2012, Venker et al., 2014, Visser et al., 2017).Although the number of subtypes is the same for the VABS studies, the interpretation is different across studies.The initially lower scoring subtypes either increase in score, decrease, or remain stable.For the ADOS studies, the interpretation is more consistent with each study identifying a "severe stable", a "moderate stable" and a "moderate improving" subtype.The other one to two subtypes differed across studies.We should be careful not to overinterpret these results considering the limited number of studies within each subset, but there seems to be potential in replications by different research groups, as this does give more insight into the robustness of subtyping results.

Results validation strategies
The prevalence of the various validation strategies is displayed on the left of Fig. 8.

Cross-method replication consists of familiar (parametric or nonparametric) pairs of methods
13% of articles made use of multiple subtyping methods. When this was the case, either multiple non-parametric methods were used, i.e., k-means clustering and hierarchical clustering, or multiple parametric methods were used. The studies that were recorded as performing cross-method replication with parametric methods were mostly of one of two kinds. The first kind were studies looking at whether interindividual differences are best described with a latent categorical or dimensional structure, for which latent class analyses are compared to factor mixture models and factor models (e.g., Kim et al., 2019; Uljarević et al., 2020). The second kind were studies in which subtypes at one measurement occasion, established with a latent class analysis, are compared to subtypes at a second measurement occasion with a latent transition analysis. Arguably, this second kind is an example of a temporal stability validation strategy, rather than a cross-method replication. There was only one study that compared results from clustering methods from different traditions, namely Latent Class Analysis and k-means clustering (Uljarević, Frazier, et al., 2020). Apart from this, there were some technical studies that proposed novel methods and compared them to a default method (e.g., Zhao et al., 2018).

Subtype separation is variable, as different methods come with different metrics and indices
Subtype separation was investigated in 38% of the articles. The first way of establishing subtype separation we recorded was by computing an index of the difference between subtypes. Used indices included the Calinski-Harabasz index, Dunn index, Davies-Bouldin index, Silhouette index, Gap statistic (Cohen et al., 2017), pseudo-F, pseudo-T², and cubic clustering criterion (Ben-Sasson et al., 2008; Lecavalier, 2006). Only some studies go into detail on the meaning of these indices for the validity of the subtypes (e.g., Asif et al., 2020). The second way of establishing subtype separation that was recorded assigned probabilities to people's subtype membership (Ausderau et al., 2014; Voorspoels, Rutten, Bartlema, Tuerlinckx, & Vanpaemel, 2018).
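As an illustration of the first approach, the following minimal sketch computes three of the indices named above for a range of candidate numbers of subtypes. It is not taken from any of the reviewed studies: the data are synthetic, the choice of k-means as the subtyping method is an assumption made for brevity, and the helper name `separation_indices` is hypothetical. The index functions themselves are those provided by scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for subtyping variables (an assumption, not real data):
# two well-separated groups of 100 hypothetical participants, three variables.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=6.0, scale=1.0, size=(100, 3)),
])


def separation_indices(X, k, seed=0):
    """Fit k-means with k subtypes and return three separation indices."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),                # higher = better separated
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher = better separated
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower = better separated
    }


# Compare candidate solutions with two to five subtypes.
results = {k: separation_indices(X, k) for k in range(2, 6)}
for k, indices in results.items():
    print(k, {name: round(value, 2) for name, value in indices.items()})
```

Because the indices point in different directions (higher is better for some, lower for others) and need not agree on the best solution, reporting which index was used and how it was interpreted is part of making the separation argument explicit.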

Independent replication is most commonly observed in samples big enough to split into two
Independent replication within a single article occurred in 9% of the articles. Some studies had a clear replication design. For example, in a study that made use of data from two schools, the subtyping results were independently replicated by running the subtyping analysis on each school separately (Cohen et al., 2017). In both schools, two subtypes were identified that could be interpreted in the same way, i.e., stable sleepers vs. unstable sleepers. The participants in each school were also classified using the subtyping solution from the other school. Other studies featured less direct replications. For example, subtyping results in a child sample were replicated in an adult sample (Lewis, Murdoch, & Woodyatt, 2007b). In this case, if the number of subtypes had not replicated, this could have suggested many things other than that the subtypes were not valid. A number of studies had a sample that was large enough to split into two, establishing an excellent form of independent replication, as the selection of replication participants is made randomly (Lombardo et al., 2016; Uljarević, Frazier, et al., 2020).
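The split-sample design with cross-classification can be sketched as follows. This is an illustrative outline under stated assumptions rather than the procedure of any reviewed study: the data are simulated, k-means with two subtypes stands in for whichever subtyping method is used, and agreement is summarized with the adjusted Rand index, which is one of several possible choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)

# Simulated data (an assumption): 300 participants, 4 variables, 2 latent subtypes.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 4)),
    rng.normal(5.0, 1.0, size=(150, 4)),
])

# Randomly split the sample into two halves.
order = rng.permutation(len(X))
half_a, half_b = X[order[:150]], X[order[150:]]

# Run the subtyping analysis independently in each half.
model_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(half_a)
model_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(half_b)

# Cross-classify: label each half with the solution from the other half and
# compare with the labels it received from its own solution. The adjusted
# Rand index ignores arbitrary label numbering.
ari_a = adjusted_rand_score(model_a.labels_, model_b.predict(half_a))
ari_b = adjusted_rand_score(model_b.labels_, model_a.predict(half_b))
print(round(ari_a, 2), round(ari_b, 2))
```

In this sketch the number of subtypes is fixed at two in both halves; in an actual replication the number of subtypes would be selected separately in each half, so that a mismatch in the number itself can count as evidence against the solution.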

Assessment of temporal stability is rare, although it is recognized as a goal
Only 3% of studies performed an analysis of temporal stability. As mentioned above in the section on cross-method replication, these were primarily studies that used latent transition analysis to examine stability over time. One study that was particularly explicit in its goals of examining longitudinal stability looked at reading profiles at two measurement occasions, measured 30 months apart (Solari, Grimm, McIntyre, Zajic, & Mundy, 2019). A number of studies explicitly mention investigations of stability as one of the most urgent priorities.

External validation is common, but authors are rarely explicit about validity implications
The majority, 88%, of articles describe comparisons between subtypes on variables that were not used to construct the subtypes. Age was by far the variable most often used to compare subtypes. For example, in research in infants, four subtypes that were defined using behaviors scored from a video were found to differ in the age of participants (Malvy et al., 2004). Sex was also frequently used to compare groups. For example, using various self-report measures and tasks measuring empathy to subtype participants, three classes were recovered, one of which was found to be primarily female (Grove, Baillie, Allison, Baron-Cohen, & Hoekstra, 2015). Diagnostic group is a third variable often used to compare subtypes, for example to validate subtypes that are constructed using biological variables (e.g., El-Ansary, Hassan, Daghestani, Al-Ayadhi, & Ben Bacha, 2020). Most articles do not discuss what it means for the validity of the subtypes whether or not the subtypes differ on these additional variables (notable exceptions in Painter et al., 2018; Vaidya et al., 2020).

Parallel validation is primarily found in studies with multiple growth curves for multiple variables
6% of studies performed separate subtyping analyses with similar variables. Two articles used both the Social Responsiveness Scale (SRS) and the Social Communication Questionnaire (SCQ) to form subtypes (same data; Frazier et al., 2010; Frazier et al., 2012). One performed latent growth curve analyses for different measures of daily living skills, with the same subtypes appearing across measures (Bal et al., 2015). One study was unique in that latent transition analysis was not used to model different measurement occasions, but different variables (Spikol, McAteer, & Murphy, 2019); this was coded as parallel validation. The clearest form of parallel validation was found in a study where separate latent growth curves were fitted for four different measures of the same construct (symptom onset), three of which were parent-rated and one of which was examiner-rated (Ozonoff et al., 2018). Interestingly, the analysis indicated different numbers of subtypes for the two types of raters. In one study primarily concerning ADHD symptoms, separate community detection analyses were run for Attention and Hyperactivity measures (Cordova et al., 2020), which was coded as parallel validation, even though these constructs are somewhat different.

Predictive validation is uncommon
3% of studies used subtype assignments to make predictions over time. Two studies predicted diagnostic status at age four, using subtypes that were established at age two (Brennan, Barton, Chen, Green, & Fein, 2015; Wiggins, Robins, Adamson, Bakeman, & Henrich, 2012). One study modeled latent growth curves over multiple measurement occasions in early infancy, with which diagnostic status at 36 months was predicted (Nishimura et al., 2019). One study was arguably not predicting but was coded as such, as ASD diagnosis at the last measurement occasion of a latent growth curve model was predicted from subtype membership (Henry et al., 2018).

Most articles use one or two validation strategies
On the right of Fig. 8, we display how the frequency of validation strategies is distributed among articles. By far most articles used one or two validation strategies. Use of zero validation strategies mostly occurred in articles that were not trying to make a scientific contribution for ASD per se. For example, some articles use ASD data as an illustration for demonstrating a model-fitting procedure (e.g., Zheng, Hume, Able, Bishop, & Boyd, 2020). There are eight articles that have used four validation strategies (Ausderau et al., 2014; Chen et al., 2019; Cohen et al., 2017; Duffy & Als, 2019; Obafemi-Ajayi et al., 2015; Solari et al., 2019; Spikol et al., 2019; Uljarević, Frazier, et al., 2020). Ausderau et al. (2014) stands out, as this article is very explicit in the application and reasoning behind using different validators and different validating strategies.

Discussion
Much research has been done to establish whether there are subtypes within ASD, and to establish whether ASD can be distinguished from other conditions and typical functioning. This research is highly relevant, as different subtypes may require different interventions, different kinds of care, and may be influenced by different environmental and biological factors. The question is: Is there actually sufficient evidence for the existence of subtypes within ASD? Given the current status of subtyping research we believe that, for many results, there is too little evidence that the observed subtypes are valid and reliable. In general, few of the seven different validation strategies we discussed are applied in the ASD literature. So far, not one single study has been found to apply all seven strategies for validation. This is the case even though our coding of validation strategies was lenient, in the sense that many borderline cases were coded as providing validation. To make the search for subtypes, biomarkers, and tailored interventions truly valuable, it is crucial that researchers a) systematically gather additional variables, independent samples, and follow-up data to validate subtypes, b) preregister hypotheses on what outcomes they expect from these validation strategies, and c) explicitly report what results falsify or confirm the validity of a particular subtyping solution. We are well aware this is not an easy endeavor.
A similar conclusion was reached when Wolfers et al. (2019) focused on a smaller sample of studies, but with the inclusion of pattern classification methods. They particularly stress that more effort should be put towards identifying the biological basis of subtypes. Such a biological basis to subtypes would be one possible approach to link subtypes from multiple domains to each other, if they are found to share the same biological foundation. As described in this article, this need not be the only route to establishing clearly distinct subtypes that can be compared across domains. Also, biological differences do not need to underlie all differences between psychiatric subtypes. The substrate on which differences arise depends on the level and domain at which subtyping variables are measured, and on the goal. For example, if the goal is to predict epilepsy, biological factors will be crucial. If the goal is to predict happiness, biological factors will most likely not be sufficient.
Although more inclusion of independent replications within studies would be a great strength (Feczko et al., 2019), it is understandable that many samples within a single study are not large enough to split into two, without sacrificing the statistical power to detect different subtypes. However, between studies, replication of one's own or others' results is possible. In the current sample of studies, there are a number of studies that had such a setup, for example in the sensory studies (Ausderau et al., 2014; Ben-Sasson et al., 2008, 2016). This provides the field with the possibility of assessing how replicable the number and composition of subtypes are. By performing a replication using the same measurement instruments and procedure as an existing subtyping study, one may add more to the subtyping literature on ASD than by providing yet another categorization using a novel combination of instruments.
One difficulty for the current state of the validity of subtypes is that whether a particular result is seen as validation or invalidation is context-dependent. In some studies, correspondence of subtypes with diagnostic categories is seen as a validation of the subtyping result. For example, using gene expression as the subtyping variables, two subtypes were recovered in two studies. The two subtypes were found to correspond with the division into affected and unaffected siblings (Kong et al., 2013) or with the division into the ASD group and control sample (Oh, Kim, Kim, & Ahn, 2017). In contrast, for some other articles, a lack of correspondence between subtype and diagnostic group is seen as an invalidation of diagnostic labels. For example, using subscales of a communication checklist, three subtypes were recovered from a sample of children, which did not correspond one-to-one with the various DSM-IV diagnoses (Autism, Asperger's Syndrome, Pervasive Developmental Disorder - Not Otherwise Specified) that were assigned before the study (Verté et al., 2006). Similarly, using various cognitive measures to make subtypes, four subtypes were recovered that did not correspond one-to-one with the diagnostic labels of ADHD and ASD (Rommelse et al., 2016). These articles use a lack of systematic differences in diagnostic group between subtypes to make a case against the diagnostic labels that are used. It is evident that one can argue both ways. Therefore, it is important that researchers clarify beforehand what result they expect. Preregistration of one's hypotheses and data-analysis plan, through platforms such as AsPredicted.org or the Open Science Framework, is a promising way forward in increasing such transparency (Nosek et al., 2015).
In the discussion of validation, it is beneficial to separate confirmation and falsification of subtyping results. For almost all validation strategies, these require different study parameters or variables. For example, to confirm temporal stability, the researcher may measure the same participants after 10 years on the same variables. If subtypes are identical in type and membership, this provides strong confirming evidence. However, if subtypes are different, this does little for falsifying the earlier found subtypes. The initial result may have overfitted the data, but subtypes may also have changed over time due to developmental processes. To falsify temporal stability, the researcher may measure the same participants twice within a short time frame (weeks to a few months). If subtypes are different, the subtypes are probably too unstable to be useful, which can be counted as a falsification. However, if subtypes remain the same, this provides only weak confirming evidence for their temporal stability. For each strategy, the optimal study design depends on whether the goal is confirmation of subtype validity or subjecting it to possible falsification.
Independent replication can potentially provide the strongest falsification and confirmation, which is why we consider it the most valuable of the validation strategies; it is stronger than, for example, cross-method replication, which provides little opportunity for falsification. Falsification does require that the independent replication is quite direct, as any difference in sampling or diagnostic practice can cause differences in the population from which the researcher is sampling, which in turn can result in true differences in subtypes. The cross-study results on the ADI-R and Vineland should be considered in this context. The convergence of results across Vineland studies provides interesting confirmatory evidence, but the lack of convergence between ADI-R studies does not falsify the subtypes found in any of the studies, as their samples may be representative of (perhaps subtly) different populations.
Some articles are explicit that they do not consider the external variables to provide validation, for example, stating that "classes are descriptively characterized using other phenotypic data" (Farmer et al., 2018). Other researchers make explicit that they consider statistically significant differences on other variables as evidence that the identified subtypes are valid, for example stating that "... comparisons involved an attempt to examine the validity of the clusters" (Brown et al., 2004), or "[t]he validity of the cluster solutions was appraised with data external to the cluster analysis." (Lecavalier, 2006). In the vast majority of the articles, no such reasoning is provided. As mentioned in the results section, most cases of "external validation" were related to descriptions of the subtypes in terms of sex and age.
Although we labelled any comparisons between subtypes on external variables as an attempt at "external validation", for the majority of these articles, we are unsure of the researchers' view on the theoretical implications. Arguably, differences in sex and age neither confirm nor falsify subtypes, even though differences there might suggest subtypes artificially created by the sampling process (e.g., when there are accidental differences between the populations from which older and younger participants are sampled). We would suggest that in the future, researchers are explicit about the theoretical role that external variables play in their analysis, for which a preregistered protocol would again be preferable. One question is whether external validation of subtypes teaches us anything beyond the correlation between subtyping variables and external variables. In other words, are subtypes more than the sum of their parts? We believe so: If we construct subtypes with variables A and B, and external variable C is correlated with neither A nor B, C can still differ between subtypes. Even though this is a theoretical possibility, it would be wise to select candidate external variables that are intermediately correlated with the subtyping variables. External variables that are uncorrelated with the subtyping variables are more likely to be irrelevant, and when external variables are too highly correlated with the subtyping variables, external validation becomes tautological. However, the selection of external variables should be based on a) clinical relevance, b) theoretical plausibility, and c) informativeness for the validity of subtypes, rather than on correlations alone.
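The theoretical possibility mentioned above, that an external variable C can differ between subtypes while being uncorrelated with each subtyping variable, can be made concrete with a small constructed example. The sketch below is an idealized, noise-free illustration of our own making, not an analysis from any reviewed study: four hypothetical subtypes occupy the four quadrants of (A, B), and C is defined as the product of A and B.

```python
import numpy as np

# Four equally sized hypothetical subtypes, one per quadrant of (A, B).
A = np.repeat([-1.0, -1.0, 1.0, 1.0], 50)
B = np.repeat([-1.0, 1.0, -1.0, 1.0], 50)

# External variable, constructed (by assumption) as the product of A and B.
C = A * B

# Subtype labels 0..3 derived from the signs of A and B.
subtype = (A > 0).astype(int) * 2 + (B > 0).astype(int)

# C is linearly uncorrelated with each subtyping variable taken alone ...
corr_AC = np.corrcoef(A, C)[0, 1]
corr_BC = np.corrcoef(B, C)[0, 1]

# ... yet its mean differs systematically between the four subtypes.
means_C = [float(C[subtype == s].mean()) for s in range(4)]

print(corr_AC, corr_BC, means_C)
```

Both correlations are zero by construction, while the subtype means of C alternate between 1 and -1. The example also shows why such a variable is easy to overlook when candidate external variables are screened on pairwise correlations alone.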
Temporal stability is important to research because it matters whether mobility between subtypes is possible, whether there is development, or whether people will always stay in the same subtype: If there is possible mobility between subtypes or development, being part of a subtype with a negative outcome may be a malleable factor. How to calculate temporal stability is difficult to prescribe, even with just two measurement occasions T1 and T2. One could form subtypes and assign participants separately at T1 and T2, assign participants at T2 using the subtype specification from T1, explicitly model transitions between T1 and T2, or jointly analyze the data from T1 and T2. All these options require researchers to be explicit about their expectations.
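To make one of these options concrete, the sketch below assigns participants at T2 using the subtype specification from T1 and then tabulates transitions between occasions. It is a minimal illustration under strong assumptions: the subtype specification is reduced to T1 subtype centroids with nearest-centroid assignment, the T1 labels are taken as given, every T1 subtype is assumed to be non-empty, and the six-participant data set and both helper functions are hypothetical.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Assign each row of X to the closest centroid (Euclidean distance)."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def transition_matrix(labels_t1, labels_t2, k):
    """P[i, j] = proportion of T1 subtype i that is in subtype j at T2."""
    counts = np.zeros((k, k))
    for i, j in zip(labels_t1, labels_t2):
        counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)


# Hypothetical toy data: six participants, two variables, two occasions.
X_t1 = np.array([[0.0, 0.2], [0.1, -0.1], [-0.2, 0.0],
                 [4.0, 4.1], [3.9, 4.2], [4.2, 3.8]])
labels_t1 = np.array([0, 0, 0, 1, 1, 1])   # subtypes found at T1 (assumed given)

X_t2 = X_t1.copy()
X_t2[2] = [3.8, 4.0]                        # one participant changes markedly

# Subtype specification from T1: the centroid of each T1 subtype.
centroids_t1 = np.array([X_t1[labels_t1 == s].mean(axis=0) for s in range(2)])

labels_t2 = nearest_centroid(X_t2, centroids_t1)
P = transition_matrix(labels_t1, labels_t2, k=2)
stability = float((labels_t1 == labels_t2).mean())  # share staying in the same subtype

print(labels_t2, P, stability)
```

Which pattern in the transition matrix counts as stability, and which as meaningful development, is exactly the kind of expectation that would need to be stated beforehand.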
There seems to be a latent and potentially false assumption that the subtypes that are found in some studies will map onto subtypes that are found in other studies. For some part, this may be true, as the most severely affected subtype in one study may well correspond to the most severely affected subtype in another study. However, due to the variety of measures that are used, this is not necessarily the case, and one subtype that is formed on the basis of sensory sensitivities may well be scattered over four different subtypes had the subtyping procedures been based on measures of language abilities. To clarify whether we are referring to the same subtype every time, more studies need to be done that administer multiple types of measures and perform the subtyping analysis for every domain. Then, we can establish whether subtypes are stable across domains, or whether different subtyping solutions are required for different domains. Relatedly, we need to know whether subtypes are stable within a single domain, or whether subtypes are specific to particular measures. This makes evident the need for what we call parallel validation.
Parallel validation is one of the least used forms of validation. This is perhaps because, although psychological and psychiatric theory is formulated on the level of constructs, the bottom-up approach focuses the researcher on subtype differences in the manifest measurement variables. Finding multiple variables that purport to measure the same exact construct is difficult. One strategy could be, if one has sufficient measurements that come from a unidimensional measurement instrument, to randomly select half of the variables, and perform the subtyping analysis on both halves. To our knowledge, such an approach has not been used, but would be valuable to lift discussions up from the level of measurement to the level of theory.
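The split-half strategy proposed above could look roughly as follows. This is a sketch under stated assumptions, not an established procedure: the items are simulated from a single latent group difference, k-means with two subtypes is an arbitrary stand-in for the subtyping method, and agreement between the two half-based solutions is summarized with the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
n, n_items = 200, 10

# Simulated unidimensional instrument (an assumption): every item reflects
# the same latent group difference plus independent noise.
latent = np.repeat([0.0, 4.0], n // 2)
items = latent[:, None] + rng.normal(0.0, 1.0, size=(n, n_items))

# Randomly divide the items (not the participants) into two halves.
cols = rng.permutation(n_items)
half_1, half_2 = cols[: n_items // 2], cols[n_items // 2:]

# Run the same subtyping analysis on each half of the items.
labels_1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(items[:, half_1])
labels_2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(items[:, half_2])

# Agreement in subtype membership between the two item halves.
agreement = adjusted_rand_score(labels_1, labels_2)
print(round(agreement, 2))
```

High agreement would suggest that the subtypes are a property of the construct rather than of particular items; repeating the random item split several times would show how much the result depends on one particular division.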
Subtype separation is, after external validation, one of the most used validation methods, but it is still only used in 38% of the studies. However, cluster indices for internal validation of subtypes are becoming increasingly accessible. Researchers that use Mplus (Muthén & Muthén, 2017) or mclust in R (Scrucca, Fop, Murphy, & Raftery, 2016) are increasingly adding these indices, as they are part of the default output of these software packages. Also, R packages such as NbClust (Charrad, Ghazzali, Boiteau, Niknafs, & Charrad, 2014) offer a variety of indices to be computed and are freely available. Therefore, it seems that the lack of subtype separation analyses may naturally disappear in the future.
When establishing whether a result is validated, it is important to distinguish the different types of similarities in subtyping results that can be achieved. Which one is most important depends on the theoretical background. Ideally, the number of subtypes is the same, subtype sizes are the same, and subtype variable means are the same, regardless of whether one looks at sample 1 or sample 2 in an independent replication, or measurement occasions 1 and 2 in longitudinal stability. There are, however, many nuances. If mean reaction times for the subtypes differ between measurement occasions, as all participants become older and slower, this is not necessarily an invalidation of longitudinal stability of the subtypes. Also, the relative sizes of subtypes may differ between populations. Therefore, validity is not straightforward, and differences and similarities between subtyping solutions should be interpreted in the light of other evidence.
In this review, we included any study that applied subtyping methods to at least some participants with a diagnosis of ASD. The way ASD is currently conceptualized, the distinctions are not clear-cut between, for example, ASD and ADHD, and between ASD and some specific personality disorders. Also, people with an ASD diagnosis may on many dimensions have overlapping scores with a non-ASD comparison population. Therefore, to fully capture what the role of the ASD diagnosis is within the hierarchical taxonomy of individual differences, and to discover what is specific to ASD and what is not, one would ideally include studies with other samples as well. This was our reason for also including samples that included other groups, and we included a study where as few as 3% of participants had an ASD diagnosis.
As mentioned in the introduction, developments in the definition of autism could affect the heterogeneity within the population diagnosed with ASD (Mottron & Bzdok, 2020), and by extension, the number of subtypes that would be found. An earlier meta-analysis has shown that, over time, the effects on several domains between groups diagnosed with ASD and comparison groups have been decreasing in size (Rødgaard, Jensen, Vergnes, Soulières, & Mottron, 2019). This could be due to the ASD diagnosis including more and more people who would not fit the prototypical definitions of autism as used in earlier versions of diagnostic manuals. Although our sample of studies spanned major shifts in diagnostic manuals (the shift from DSM-IV to DSM-5 being the most dramatic), the effect of these shifts on subtypes was not visible in our sample. This is most likely because the other differences between studies already made studies incomparable in this respect. A study with a large population sample to which criteria from multiple versions of diagnostic manuals are applied might be more appropriate to investigate these effects for the subtyping case.
Most important is the practical use of subtypes, which lies in the potential for specific prognoses, estimates of intervention efficiency, and biomarkers. However, predictive validation was among the least used validation strategies. We have only focused on unsupervised learning, i.e., methods that make empirical subtypes from variables, and have excluded supervised approaches, i.e., methods that make predictions from variables. As recently argued, in order to focus subtyping results on having predictive value, unsupervised methods may need to be combined with supervised methods (Feczko et al., 2019). In fact, two of the articles that we have included in this review used random forests, a supervised approach, to establish the similarities between participants, which were then used as input for an unsupervised analysis (Cordova et al., 2020; Feczko et al., 2018). Such combinations of unsupervised and supervised methods potentially form a valuable addition to the subtyping methods currently available, to increase the validity and practical usefulness of subtyping results.
In conclusion, we expect to have clarified where potential improvements lie in the validation of subtyping results when focusing on ASD. However, the same reasoning is also relevant for subtyping in other (clinical) groups. To move the field forward, we need guidelines and recommendations on how to validate subtyping results. Below, we provide a subtyping validation checklist. The primary goal of this checklist is to improve the theoretical quality of subtyping results, which also means being clear about what constitutes a validation and an invalidation of a subtyping result. With a systematic approach, we can establish clinically meaningful subtypes that are distinct regardless of statistical method or choice of measurement instruments, replicable, stable over time, and predictive of later difficulties.

SUbtyping VAlidation Checklist (SUVAC)
To provide guidance for future researchers, we propose a checklist called the SUVAC (for SUbtyping VAlidation Checklist), in Table 1. The SUVAC serves several purposes. The first benefit is that researchers can use the SUVAC in designing their studies, so they can plan for additional variables for parallel validation or external validation, or extra measurement occasions for longitudinal stability. The second benefit is that future systematic reviewers and meta-analysts of subtyping analyses can also use the SUVAC to record different types of validation strategies that have been applied in other fields. The third benefit of the SUVAC is a common nomenclature. Although most studies used some form of "external validation", only a minority of studies called it that explicitly. A lack of common understanding of these terms makes it difficult to evaluate what theoretical conclusions researchers draw from their comparisons. When every study uses the same terminology for "longitudinal stability", the field will be more transparent about which subtyping results are stable over time and which are not.
Not all steps are required in every context, and usefulness of subtypes cannot be ensured by following a simple series of steps. This is because validity may be context-dependent. The SUVAC should not be thought of as a checklist of quality, which one can pass or fail, but as a checklist of considerations when planning a subtyping study or evaluating a body of subtyping research. Each of these steps provides a source of evidence for or against the validity and practical usefulness of a subtyping solution. These considerations can provide a foothold to researchers who want to take on the complex task of validating subtypes.

Additional testing of participants
Independent replication: Is the subtyping analysis repeated in a second sample of participants?
Temporal stability: Is the subtyping analysis repeated with the same participants at a second measurement occasion?

Additional variables
External validation: Are participants from different subtypes compared on variables that were not used in the subtyping analysis?
Parallel validation: Is the subtyping analysis repeated with a second set of variables that are purported to measure the same constructs as the variables used in the first subtyping analysis?

Additional testing of participants + Additional variables
Predictive validation: Are participants from different subtypes compared on variables that were not used in the subtyping analysis, and that were measured at a second measurement occasion?

General
All strategies: Are predictions on the validation steps formulated before the analysis/preregistered?
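
As a planning aid, the checklist rows above can be written down as a small lookup from each validation strategy to the design features it requires. The following is only a minimal sketch under stated assumptions: the feature labels (second_sample, second_occasion, extra_variables, parallel_variables) and the function names are hypothetical and were chosen for illustration; they are not part of the SUVAC itself. The sketch also simplifies predictive validation to "extra variables plus a second occasion", without checking that the extra variables were measured at that second occasion.

```python
# Illustrative encoding of the checklist rows above.
# The feature labels are hypothetical names chosen for this sketch.
SUVAC_REQUIREMENTS = {
    "independent replication": {"second_sample"},
    "temporal stability": {"second_occasion"},
    "external validation": {"extra_variables"},
    "parallel validation": {"parallel_variables"},
    # Simplification: assumes the extra variables are measured at the second occasion.
    "predictive validation": {"extra_variables", "second_occasion"},
}


def feasible_strategies(design_features):
    """Return the strategies whose required design features are all planned."""
    planned = set(design_features)
    return [
        strategy
        for strategy, required in SUVAC_REQUIREMENTS.items()
        if required <= planned  # subset test: every requirement is planned
    ]


def missing_features(strategy, design_features):
    """Return the design features still needed before a strategy can be applied."""
    return SUVAC_REQUIREMENTS[strategy] - set(design_features)


if __name__ == "__main__":
    # A cross-sectional design that only collects extra, non-subtyping variables.
    design = {"extra_variables"}
    print(feasible_strategies(design))
    print(missing_features("predictive validation", design))
```

Used this way, a design that only adds concurrent variables supports external validation alone, and the lookup shows that a second measurement occasion would be needed before predictive validation could be considered.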

Fig. 1. Illustration of the cross-method replication and subtype separation validation strategies.

Fig. 2. Illustration of the independent replication and temporal stability subtype validation strategies.

Fig. 4. PRISMA diagram. Search 1 was conducted in February 2018, and included papers published between January 2001 and February 2018. Search 2 was an update in June 2020, and included papers published between February 2018 and June 2020.

Fig. 5. Number of articles included per publication year.In orange, the number of articles published until June 2020 is plotted, so the data for 2020 is incomplete.

Fig. 6. Sample size by publication year. Note that the y-axis is on a log10 scale. Some random noise is added to publication year to prevent overlapping points.

Fig. 7. Percentage of articles by number of subtypes recovered in the subtyping analysis.

Fig. 8. Percentage of articles that have used the seven validation strategies (left) and the number of validation strategies that were used (right). Note: Crss = Cross-method replication, Sep = Subtype separation, Ind = Independent replication, Stab = Temporal stability, Ext = External validation, Par = Parallel replication, Pred = Predictive validation.