Does the choice of Allostatic Load scoring algorithm matter for predicting age-related health outcomes?

Allostatic Load (AL) is posited to provide a measure of cumulative physiological dysregulation across multiple biological systems and demonstrates promise as a sub-clinical marker of overall health. Despite the large heterogeneity of measures employed in the literature to represent AL, few studies have investigated the impact of different AL scoring systems in predicting health. This study uses data for 4477 participants aged 50+ years participating in the Irish Longitudinal Study on Ageing (TILDA) to compare the utility of 14 different scoring algorithms that have been used to operationalise AL (i.e. count-based high-risk quartiles, deciles, two-tailed cut-points, z-scores, system-weighted indices, clinical cut-points, sex-specific scores, and incorporating medication usage). Model fit was assessed using R², Bayesian Information Criterion (BIC), and the area under the Receiver Operating Characteristic curve (AUC). The measure incorporating medications predicted walking speed and self-rated health (SRH) marginally better than others. In general, AL was not predictive of grip strength. Overall, the results suggest that the choice of AL scoring algorithm exerts a relatively modest influence in predicting a number of important health outcomes.


Introduction
Allostatic Load (AL) is posited to represent a sub-clinical measure of physiological wear and tear resulting from chronic exposure to life course stressors (McEwen and Stellar, 1993). In recent decades, the framework has contributed to an enhanced interdisciplinary understanding of how social, environmental, and psychological factors impact physiological functioning and shape health disparities (Beckie, 2012; Merkin et al., 2009; Upchurch et al., 2015). Despite this apparent utility, AL is beset by a number of methodological and conceptual difficulties that have hampered its potential clinical utility. These include: (1) the ongoing failure to agree a core set of biomarkers that define the construct, and (2) a plethora of different AL scoring algorithms, limiting our ability to compare results across studies. The former issue, concerning the heterogeneity of biomarkers used across studies, has been discussed at length in a number of recent reviews (Johnson et al., 2017; Juster et al., 2010), but the latter issue has received rather less attention in the literature, and is arguably just as important.
1.1. Different scoring systems for AL
Seeman et al. (1997) provided the first operational definition of AL using a high-functioning sample of adults aged 60 years and older. They employed 10 biomarkers across the neuroendocrine, immune, metabolic and cardiovascular systems, and summed the number of parameters for which an individual had values in the highest risk quartile, based on the sample distribution (Seeman et al., 1997). This measure predicted physical and cognitive decline within the MacArthur Studies of Successful Ageing in the US, and greater incidence of cardiovascular disease over a three-year follow-up. The authors concluded that AL might perform even better in general population studies. The original scoring algorithm for AL remains a popular method today, with the aforementioned systematic review (Johnson et al., 2017) reporting that 73 % of included studies utilised this approach, despite the potential loss of information from the full risk spectrum.
Lack of consensus around an agreed biomarker panel (Johnson et al., 2017; Juster et al., 2010; Szanton et al., 2005) and scoring system (Beckie, 2012; Seplaki et al., 2005) has led to a proliferation of different means for characterising AL. Considerable heterogeneity in AL calculation exists across studies utilising the MacArthur cohort alone, including: the original count of high-risk quartiles (Seeman et al., 1997; Weinstein et al., 2003), recursive partitioning (Gruenewald et al., 2006), and use of second-order terms. Alternate AL formulations from other international studies include: a sum of z-scores, which may be considered more informative as it retains the continuous properties of the data (Hawkley et al., 2011; Levine and Crimmins, 2014), or the use of decile cut-points which consider only the top end of the risk distribution. Two-tailed approaches (i.e. top and bottom quartiles or deciles) for some (Hwang et al., 2014; LeBron et al., 2019) or for all biomarkers (Seplaki et al., 2006) have also been employed to account for the fact that biological risk may be non-linear.
https://doi.org/10.1016/j.psyneuen.2020.104789 Received 6 April 2020; Received in revised form 24 June 2020; Accepted 25 June 2020
Canonical correlation, a technique that measures the associations between sets of inter-connected variables (Karlamangla et al., 2002), and recursive partitioning, a decision-tree technique (Gruenewald et al., 2006; Singer et al., 2004), have been criticised for incorporating information on health outcomes in their derivation, which can result in the over-fitting of models and limits replication across other datasets (Seplaki et al., 2005). Clinical cut-points have also been used (Allen et al., 2019; Borrell et al., 2010; Crimmins et al., 2009; Glei et al., 2013; Petrovic et al., 2016; Vasunilashorn et al., 2013); however, the lack of universally agreed values (e.g. cut-points for parasympathetic or immunological markers) potentially limits their application. In addition, such definitions of risk remove the potential utility of AL as a subclinical disease measure for identifying risk prior to the emergence of the clinical phenotype. More sophisticated attempts to develop AL measures include multivariate methods, such as grade of membership (GOM), which assesses how well an individual's scores on the set of biomarkers used to define AL correspond to a set of predefined archetypal profiles (Li et al., 2019; Seplaki et al., 2006). Interestingly, the predictive utility of AL measures calculated using GOM and count-based methods is not dissimilar (Li et al., 2019). This similarity, together with the impracticality of such complex algorithms in clinical settings, may explain why more elaborate measures have not been adopted more widely.

Does the choice of AL scoring method matter for predicting health?
Methodological disparities in the scoring of the AL index across studies not only make comparison of results challenging, but also hamper progress towards the use of AL as an early diagnostic screener and/or therapeutic target. This is an important theoretical and empirical matter that has not been subjected to the type of systematic investigation one might expect. Our review of the literature identified only two studies to date (Li et al., 2019; Seplaki et al., 2005) that have explicitly examined this issue. Seplaki et al. (2005) compared the predictive utility of 9 different AL scoring algorithms (including count of high-risk quartiles/deciles, GOM, z-scores, and two-tailed quartiles/deciles, derived from two biomarker panels) in identifying poor health outcomes. Results were remarkably similar irrespective of the method used to calculate AL. They did, however, suggest that two-tailed formulations, or measures that retain the continuous properties of the biological variables, might be preferred going forward. A separate investigation by Li et al. (2019) involving the NHANES cohort compared the utility of 5 scoring algorithms (i.e. count of high-risk quartiles, z-scores, logistic regression, factor analysis, GOM) in predicting self-reported health, diabetes and hypertension. The best predictive performance was provided by a measure utilising the standardised coefficients from multivariable logistic regressions to weight individual biomarkers. However, similar to canonical correlation, health outcomes were used in the development of the scoring algorithm, which hampers replication across other populations. Li et al. (2019) recommended the original method of high-risk quartiles as a good alternative.
1.3. Should we incorporate sex differences in the calculation of the AL index?
To date, despite evidence of sex differences within the stress response (Juster et al., 2019; Verma et al., 2011), and in the downstream biological parameters associated with the stress response (Freire et al., 2020; Goldman et al., 2004; Santos-Lozada and Howard, 2018; Stoney et al., 1988; Yang and Kozloski, 2011), there is no consensus regarding the incorporation of sex-specific risk values in the calculation of AL measures. Seplaki et al. (2006) were the first to suggest deriving an AL score using sex-specific risk definitions, with a number of others adopting this approach since (Castagné et al., 2018; Christensen et al., 2018, 2019; Duru et al., 2012; Gustafsson et al., 2012; Robertson et al., 2015). Notwithstanding the potential import of biological sex differences, a recent review of NHANES revealed that only 1/21 AL studies incorporated sex-specific risk definitions (Duong et al., 2017), and to date, no study has compared the predictive utility of AL measures including and excluding sex-specific risk definitions with alternate calculations.

Should we incorporate medications in the calculation of the AL Index?
Whether and how doctor-prescribed medication should be incorporated into the development of the AL index is another bone of contention (Lipowicz et al., 2014). Medications can act on the biological systems to reduce observed values, reducing risk of disease. On the other hand, wear and tear may already have occurred if there is a need to medicate, and this excess risk should be accounted for when calculating the score. Varied approaches to the resolution of this problem exist, including: scoring an individual at high risk if taking a medication which deflates values on the biomarker (Seeman et al., 2014); adding 'medication use' as a covariate in regression models (Piazza et al., 2018); or increasing individual biomarker values in an attempt at increasing prediction accuracy (e.g. systolic and diastolic blood pressure adjusted by adding 10 mmHg and 5 mmHg, respectively, if taking anti-hypertensives) (Robertson and Watts, 2016). To date, no study has compared the predictive utility of AL measures that include medication use with alternate formulations.

The present study
The present study builds upon the work of Seplaki et al. (2005) and Li et al. (2019) in several important ways. Firstly, we use a sample from a nationally representative cohort of community-dwelling older persons, whereas Seplaki et al. (2005) used a militaristic sample with a heavy male bias, and Li et al. (2019) used an all-female sample. Secondly, we use a number of objective measures of physical health functioning (i.e. walking speed, grip strength) as the criterion variables in the analysis. Thirdly, we incorporate sex-specific and clinical risk definitions. Finally, we examine whether including medication usage improves the predictive accuracy of AL.

Sample
This analysis used data from the first wave of The Irish Longitudinal Study on Ageing (TILDA), a nationally representative prospective study of over 8175 persons aged 50 years and over living in the Republic of Ireland. Sampling involved a three-stage selection process with the Irish Geodirectory as the sampling frame, where residential addresses were divided into geographical clusters, of which 640 were selected based on area level socio-economic status and location. Additional information on the sampling frame and study design is available elsewhere (Whelan and Savva, 2013). Data were collected through a computer assisted personal interview (CAPI) carried out by a trained interviewer, a leave-behind self-completion questionnaire (SCQ), and a comprehensive clinic-based, nurse-administered health assessment. These assessments included the collection of blood samples for biomarker assessment and a battery of cognitive and physical tests. Ethical approval for the study was provided by the Faculty of Health Science Research Ethics board in Trinity College Dublin, and informed consent was obtained from all respondents during data collection. In total, 5894 people completed either the centre or home-based health assessment. Respondents with missing data on any of the biomarkers were excluded from this study, leaving a total analytical case base of 4477. A flow chart summarising the inclusion criteria for analysis is provided in Supplementary Fig. 1.

Physiological parameters
A total of 12 biomarkers across the cardiovascular, metabolic, renal and immune systems were used to construct the various AL indices examined in this study. Biomarkers were selected to reflect the most commonly chosen systems in the AL literature (Juster et al., 2010). Two of the biomarkers, pulse wave velocity (PWV) and cystatin C (Cysc), are relatively novel with respect to AL formulation but their inclusion was theoretically motivated. For example, there is evidence to suggest that acute and chronic stress may induce arterial dysfunction as indexed using PWV (Logan et al., 2012; Matsumura et al., 2019). Similarly, Cysc may serve as a preferred/complementary biomarker of kidney functioning compared with creatinine in older cohorts as it is less affected by declining muscle mass (Canney et al., 2018). The component biomarkers and a description of their primary function are provided in Supplementary Table 1.
In terms of the cardiovascular system, seated Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP) and Resting Heart Rate (RHR) were each taken as the average of two measurements, separated by a 1-minute interval, obtained using an automatic digital oscillometric blood pressure monitor (OMRON™, M10-IT). PWV was taken as the average of two measurements between the carotid and femoral arteries obtained using a Vicorder system (Skidmore Medical Ltd, Bristol, UK), a non-invasive gold standard method (Whelan and Savva, 2013). Metabolic system biomarkers included waist-hip ratio (WHR), body mass index (BMI), glycosylated haemoglobin (HbA1c), total cholesterol (TChol), and high-density lipoprotein (HDL). Weight was measured using a SECA electronic floor scale, and height was measured using a SECA 240 wall-mounted measuring rod. BMI was calculated by dividing weight in kg by height in metres squared. SECA measuring tapes were used to record WHR, with measurements taken to the nearest millimetre. TChol was obtained from non-fasting blood samples, with venepuncture performed using a butterfly and a green 21-gauge Vacutainer needle, and samples stored in EDTA tubes. A direct determination of HDL was performed using PEG-modified enzymes and dextran sulphate. HbA1c reflects blood sugar levels and metabolic functioning in the 8-12 weeks prior to blood sampling and was analysed by reverse-phase cation exchange chromatography using an ADAMS A1c HA-8180V analyser.
C-Reactive Protein (CRP) serves as our sole marker of immunological dysregulation as it was the only inflammatory marker available in the full TILDA cohort at baseline. It was measured in non-fasting blood serum on a Roche Cobas c 701 analyser using an immunoturbidimetric assay for the in vitro quantitative determination of CRP. Detection limits ranged from 0.3 to 350 mg/L. The two renal markers were Creatinine (Crea) and Cysc, which were measured simultaneously from frozen plasma. Crea was measured using an enzymatic method traceable to isotope-dilution mass spectrometry (Roche Creatinine plus ver.2, Roche Diagnostics, Basel, Switzerland), whilst Cysc was measured using a second-generation particle-enhanced immunoturbidimetric assay (Roche Tina-quant™) on a Roche Cobas 701 analyser. This assay has a measuring range of 0.40-6.80 mg/L and is traceable to the European reference standard material (ERM-DA471/IFCC). Log transformations were applied to CRP and PWV to account for right-skewed distributions.

Calculation of summary measures
High-risk quartiles (Classic Method): To re-create the original and most commonly employed scoring algorithm across studies (Seeman et al., 1997), dichotomous indicator variables were created for each biomarker. HDL was reverse coded such that higher values represented increased risk. Empirically defined high-risk thresholds were distinguished based on the distribution of that biomarker in the sample; "1" was assigned to values falling above the 75th percentile of the distribution for each marker, and "0" was assigned to values below this threshold. These biomarkers were summed to create an index, with higher scores reflecting higher AL burden.
Two-tailed high-risk quartiles: Seplaki et al. (2005) suggested that including risk factors at both high and low ends of the risk continuum may be more informative than simply using high-risk quartiles. To re-create this score, a value of "1" was assigned to values above the 75th percentile and below the 25th percentile of the distribution, and a value of "0" was assigned for values that fell intermediate of these thresholds. A sum of these biomarkers represented two-tailed physiological dysregulation.
Following the above methodologies, two further algorithms were created employing decile cut-points, to measure risk at more extreme levels of the distribution. High-risk deciles: "1" was assigned for values above the 90th percentile, "0" was assigned to all values below this threshold, and all biomarkers were then summed. Two-tailed high-risk deciles: "1" was assigned for values above the 90th percentile and below the 10th percentile for the measure incorporating both ends of the distribution, and "0" assigned to all values that lay intermediate between these thresholds, and scores on the individual biomarkers were then summed.
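The four count-based measures described above differ only in the percentile used and in whether one or both tails count as high risk, so a single function can express all of them. The following is a minimal Python sketch under stated assumptions, not the authors' Stata code: the function names and toy data are hypothetical, and the percentile interpolation rule (Python's `statistics.quantiles` with `method="inclusive"`) is an illustrative choice that may give slightly different cut-points from Stata's percentile definition.

```python
from statistics import quantiles


def percentile_cut(values, pct):
    """Return the pct-th percentile (1-99) of values, linearly interpolated."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def count_score(biomarkers, upper_pct=75, two_tailed=False, reverse=()):
    """Count-based AL score (hypothetical helper).

    biomarkers: dict mapping marker name -> list of values, one per participant.
    upper_pct:  75 for quartile-based scores, 90 for decile-based scores.
    two_tailed: also flag values below the (100 - upper_pct)-th percentile.
    reverse:    markers for which low values signal risk (e.g. HDL).
    """
    n = len(next(iter(biomarkers.values())))
    scores = [0] * n
    for name, values in biomarkers.items():
        # Reverse coding: flip the sign so that higher always means riskier.
        vals = [-v for v in values] if name in reverse else list(values)
        hi = percentile_cut(vals, upper_pct)
        lo = percentile_cut(vals, 100 - upper_pct)
        for i, v in enumerate(vals):
            if v > hi or (two_tailed and v < lo):
                scores[i] += 1
    return scores


# Toy data for five participants (illustrative values only).
toy = {"sbp": [110, 120, 130, 140, 150], "hdl": [40, 50, 60, 70, 80]}
classic = count_score(toy, reverse=("hdl",))                      # [1, 0, 0, 0, 1]
two_tailed = count_score(toy, two_tailed=True, reverse=("hdl",))  # [2, 0, 0, 0, 2]
deciles = count_score(toy, upper_pct=90, reverse=("hdl",))
```

In the toy data, the first participant is flagged on HDL (lowest value) and the last on SBP (highest value); the two-tailed variant additionally flags each of them on the opposite tail of the other marker.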
Z-scores: To create a score that retained the continuous properties of each physiological variable, each biomarker was standardised to a mean of 0 and a standard deviation of 1, and then summed to generate an overall AL score expressed in standard deviation units (Daly et al., 2019; Hawkley et al., 2011; Levine and Crimmins, 2014).
System Weighted: It is common in the AL literature for some biological systems to be more heavily represented than others, of which metabolic system biomarkers tend to be most populous (Juster et al., 2010). A system-weighted score was therefore created to reduce this bias. Firstly, biomarkers were dichotomised following the classic methodology (Seeman et al., 1997). Secondly, each dysregulated biomarker was weighted according to the total number of biomarkers per system. System risk indices were computed as the proportion of individual biomarkers for each system for which participant values fell into high-risk quartiles (Gruenewald et al., 2012). For example, in the present study, the cardiovascular system comprised four biomarkers (SBP, DBP, RHR, PWV). Therefore, the sum of dysregulated biomarkers within this system was divided by 4 and expressed as a proportion ranging from 0.0 to 1.0. As CRP is the only immunological biomarker, no scaling occurred, hence "0" indicated no risk, whilst "1" indicated high risk. Finally, an AL score was computed as the sum of these weighted systems. This methodology has been used previously and ensures that each system is equally represented in the overall AL index (Gruenewald et al., 2012; Piazza et al., 2018; Read and Grundy, 2014).
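The z-score sum and the system-weighted score can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is assumed for standardisation, and applying the HDL sign flip to the z-score sum is an assumption carried over from the classic method.

```python
from statistics import mean, stdev


def z_score_sum(biomarkers, reverse=()):
    """Sum of standardised biomarker values per participant (hypothetical helper)."""
    n = len(next(iter(biomarkers.values())))
    total = [0.0] * n
    for name, values in biomarkers.items():
        mu, sd = mean(values), stdev(values)  # assumption: sample SD
        sign = -1.0 if name in reverse else 1.0
        for i, v in enumerate(values):
            total[i] += sign * (v - mu) / sd
    return total


def system_weighted(flags, systems):
    """Sum over systems of the proportion of that system's markers flagged high risk.

    flags:   dict marker name -> list of 0/1 high-risk indicators per participant.
    systems: dict system name -> list of marker names belonging to that system.
    """
    n = len(next(iter(flags.values())))
    score = [0.0] * n
    for markers in systems.values():
        for i in range(n):
            score[i] += sum(flags[m][i] for m in markers) / len(markers)
    return score


# Two participants; the cardiovascular system has four markers, the immune system one.
flags = {"sbp": [1, 0], "dbp": [1, 0], "rhr": [0, 0], "pwv": [0, 1], "crp": [1, 0]}
systems = {"cardiovascular": ["sbp", "dbp", "rhr", "pwv"], "immune": ["crp"]}
weighted = system_weighted(flags, systems)  # [1.5, 0.25]
```

The first participant scores 2/4 for the cardiovascular system plus 1 for the immune system, so a single flagged CRP value carries as much weight as all four cardiovascular markers combined.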
Sex-specific cut-points: To account for variability in biomarker values according to sex, each of these 6 scoring algorithms was recreated using sex-specific cut-points (Table 2). Therefore, to create the sex-specific count of high-risk quartiles measure, individuals falling above the 75th percentile were identified using the sample distributions for men and women separately, and scores were then pooled.
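The difference between pooled and sex-specific cut-points can be made concrete with a small sketch. The code below is a hypothetical illustration in Python, not the study's Stata code; the function name, the group labels and the toy values are assumptions, and the percentile rule is Python's `statistics.quantiles` with `method="inclusive"`.

```python
from statistics import quantiles


def sex_specific_flags(values, sex, upper_pct=75):
    """Flag values above the upper_pct-th percentile of the participant's own group."""
    cuts = {}
    for group in set(sex):
        group_vals = [v for v, s in zip(values, sex) if s == group]
        cuts[group] = quantiles(group_vals, n=100, method="inclusive")[upper_pct - 1]
    return [int(v > cuts[s]) for v, s in zip(values, sex)]


# Toy marker on which men have systematically higher values than women.
values = [30, 32, 34, 36, 38, 20, 22, 24, 26, 28]
sex = ["m"] * 5 + ["f"] * 5

by_sex = sex_specific_flags(values, sex)                    # [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
pooled = sex_specific_flags(values, ["all"] * len(values))  # [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```

With a pooled threshold every flagged participant in this toy example is male, whereas sex-specific thresholds flag the highest value within each sex.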
High-risk quartiles (incorporating medications): To allow for the possibility that medication use was masking high values, prescription drugs were incorporated into an additional algorithm based on the classic methodology, with individuals taking medication automatically reclassified as high-risk across affected biomarkers. In TILDA, medication usage was ascertained as part of the household interview, where respondents were asked to retrieve the medicinal packaging of any regularly taken medications and interviewers transcribed the brand/generic name of the prescription medication. The medications data were then coded in-house by pharmacologists according to Anatomical Therapeutic Chemical (ATC) classification codes (Whelan and Savva, 2013). HbA1c was coded as high risk if individuals were currently prescribed insulin or analogues (A10). SBP was classified as high risk if individuals were taking anti-hypertensive medication (C02, C03, C09). RHR was recoded as high risk if individuals were taking beta-blockers (C07) or calcium blockers (C08). Finally, HDL was recoded as high risk if individuals were prescribed lipid modifying agents (C10).
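The medication rule amounts to overriding a biomarker's 0/1 indicator whenever a relevant ATC code is present. A minimal Python sketch follows; the ATC prefixes are those listed above, while the function name, the data layout and the example codes are hypothetical choices made for illustration only.

```python
# ATC prefixes that trigger reclassification, as listed in the text above.
MED_RULES = {
    "hba1c": ("A10",),
    "sbp": ("C02", "C03", "C09"),
    "rhr": ("C07", "C08"),
    "hdl": ("C10",),
}


def apply_medication_rules(flags, atc_codes):
    """Return a copy of one participant's 0/1 flags with medicated markers set to 1.

    flags:     dict marker name -> 0/1 indicator from the classic quartile method.
    atc_codes: iterable of ATC code strings recorded for the participant.
    """
    updated = dict(flags)
    for marker, prefixes in MED_RULES.items():
        # str.startswith accepts a tuple, so one call checks all prefixes.
        if marker in updated and any(code.startswith(prefixes) for code in atc_codes):
            updated[marker] = 1
    return updated


# Illustrative participant: only RHR is in the high-risk quartile before adjustment.
before = {"hba1c": 0, "sbp": 0, "rhr": 1, "hdl": 0, "crp": 0}
after = apply_medication_rules(before, ["C09AA02", "C10AA05"])
al_before, al_after = sum(before.values()), sum(after.values())  # 1 and 3
```

Markers without a rule (CRP here) are left untouched, and a marker already flagged stays flagged, so the adjusted score can only equal or exceed the classic score.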
Clinical: Finally, we derived a measure of AL using recognised risk definitions according to clinical guidelines (Allen et al., 2019; Borrell et al., 2010; Gruenewald et al., 2012; Prag and Richards, 2018). Clinical cut-points for each biomarker are provided in Supplementary Table 2.

Health outcomes
We utilised two objectively measured health outcomes, walking speed and grip strength (described below), and, to enable comparison with Seplaki et al. (2005), also included a general self-reported health measure.

Walking speed
In older adults, walking speed has been shown to be a good indicator of overall physical function (Viccaro et al., 2011), and a predictor for hospitalisation (Studenski et al., 2003) and higher morbidity (Guralnik et al., 2000). In TILDA, gait measurements were taken using a 4.88-metre computerised walkway with embedded pressure sensors (GAITRite, CIR Systems Inc., New York, NY). Participants completed two walks at their normal walking speed. The average of the two readings represents the overall walking speed expressed in centimetres travelled per second (cm/s).

Grip strength
Low grip strength has been robustly associated with functional decline and mortality in older adults (Bohannon, 2008) and serves as a proxy for the overall strength of the musculoskeletal system. Two measures were taken from the dominant hand using a Baseline® Hydraulic Hand Dynamometer, and the higher of the two readings in kilograms (kgs) per square inch represented the measure of grip strength used in the analysis.

Self-rated health
Self-rated health (SRH) is a commonly assessed item in health surveys (Manor et al., 2000), and can provide insight into the perceptions that an individual has of their own health status relative to their peers. SRH has previously been associated with decreased physical performance (Perez-Zepeda et al., 2016) and increased mortality risk (Jylha, 2009). Previous studies have reported that lower SRH predicts higher AL (Vie et al., 2014; von Thiele et al., 2006). SRH was obtained during the CAPI with the item "In general, compared to other people your age, would you say your health is…" rated on a five-point scale of excellent, very good, good, fair, and poor.

Statistical analysis
All analyses were conducted in Stata (v.15; StataCorp, 2017), using TILDA dataset v.1.8.0. The outcome measures were regressed separately on each of the 14 AL scoring algorithms, adjusting for age and sex, using ordinary least squares (OLS) regression with respect to walking speed and grip strength, and ordinal logistic regression for SRH. Overall model fit was assessed using the proportion of variance explained (R² for OLS, pseudo R² for ordinal logistic regression) and the Bayesian Information Criterion (BIC). An obvious advantage of using information measures is that one can compare the goodness of fit of non-nested models (Williams, 2019), as the different AL scoring algorithms are essentially just reformulations of the same underlying set of 12 component biomarkers. The extent to which one model is preferred over another depends on the magnitude of the difference between the information measures. Higher values of R² and lower values of BIC indicate better model fit. BIC values were obtained using the FITSTAT package in Stata (Scott Long and Freese, 2014). Following guidelines proposed by Raftery (1995), we took a BIC difference > 10 to indicate a better fitting model than the standard scoring algorithm. To reduce the large number of potential contrasts between the different AL scoring algorithms, we take as our reference the overall (i.e. non-sex-specific) quartile-based risk score (hereinafter referred to as the classic method), as it is the one that has been predominantly used in the AL literature.
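To show how R² and BIC relate in the OLS models, the sketch below computes a likelihood-based BIC from the residual sum of squares and the BIC difference implied by two R² values. It is an illustration under assumptions, not the FITSTAT computation: the function names are hypothetical, the formula BIC = -2 ln L + k ln n with a Gaussian maximum-likelihood log-likelihood is one common form among several, and the example takes n = 4477 although the number of cases in each outcome model is not stated here. When two models share the same outcome, sample and number of parameters (as with one AL score plus age and sex), the BIC difference reduces to n ln[(1 - R²_alt) / (1 - R²_ref)].

```python
import math


def ols_bic(rss, n, k):
    """BIC = -2 lnL + k ln(n), using the Gaussian ML log-likelihood of an OLS fit."""
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + k * math.log(n)


def bic_diff_from_r2(n, r2_ref, r2_alt):
    """BIC(alt) - BIC(ref) for two OLS models with the same outcome, n and k."""
    return n * math.log((1 - r2_alt) / (1 - r2_ref))


def substantially_better(bic_diff, threshold=10):
    """Apply the rule used in the text: a BIC decrease of more than 10."""
    return bic_diff < -threshold


# Illustration: R² of 0.203 (reference) versus 0.212 (alternative), assuming n = 4477.
diff = bic_diff_from_r2(4477, 0.203, 0.212)
print(round(diff, 1), substantially_better(diff))
```

Under these assumptions a gain of under one percentage point in R² corresponds to a BIC decrease of roughly 50, which suggests why a model can clear the BIC threshold comfortably while the gain in explained variance remains modest.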
In order to provide a comparison with Seplaki et al. (2005), binary indicator variables were created for each of the health outcomes using cut-points guided by previous literature. Grip strength was dichotomised at < 37 kg for men, and < 21 kg for women (Sallinen et al., 2010). Walking speed was dichotomised at < 120 cm/s. SRH was recoded as "1" if respondents rated their health as fair/poor, "0" otherwise (Seplaki et al., 2005). Each of the binary health outcomes was treated as the dependent variable in a separate logistic regression model, controlling for age and sex. The area under the Receiver Operating Characteristic curve (AUC) was calculated for each health outcome for each of the 14 different AL scoring algorithms using the ROCTAB procedure (Pepe et al., 2009). Higher AUC values indicate greater classification accuracy. The statistical significance of the difference in AUC between the classic AL algorithm and each alternative algorithm was tested using the ROCCOMP procedure (Cleves, 2002). Finally, as age is the strongest predictor of health, we also examined which of the AL algorithms was most strongly correlated with chronological age.
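The AUC has a simple interpretation that a short sketch can make explicit: it is the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half. The Python function below is a hypothetical illustration of that definition, not the ROCTAB procedure; `scores` stands for whatever predictor is being evaluated (for example, fitted probabilities from a logistic model), and the toy values are invented.

```python
def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    scores:   predictor values, one per participant (higher = more likely a case).
    outcomes: 1 for cases (e.g. fair/poor SRH), 0 for non-cases.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    # Each case/non-case pair contributes 1 if the case scores higher, 0.5 if tied.
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


perfect = auc([1, 2, 3, 4], [0, 0, 1, 1])    # 1.0
with_tie = auc([1, 2, 2, 3], [0, 1, 0, 1])   # 0.875
```

This pairwise form is quadratic in the sample size, which is acceptable for illustration; rank-based formulas give the same value more efficiently on large samples.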

Results
Table 1 reports sample characteristics. The mean age of the sample was 61.8 years and female respondents accounted for 53.7 % of the sample. Mean walking speed was 136.1 cm/s (SD = 20.4), and mean grip strength was 27.5 kgs (SD = 9.88). 89.0 % of the sample reported that they were in excellent/very good/good health. In total, 45.4 % of the sample were taking prescribed medications that potentially impacted values of biomarkers included in this study. Respondents included in the study were younger and more educated than those who were missing biomarker data and had, on average, a lower count of chronic diseases. However, no sampling weights were applied, as the primary focus of this study was to compare the predictive utility of various AL scoring algorithms and not to infer the relationship of AL and the health outcomes to the population.
The relationship between age and each of the AL scoring algorithms was assessed using Spearman's correlations, reported in Supplementary Table 3. The measure incorporating medications was most strongly correlated with age (r = 0.34), whilst the two-tailed measures revealed weak correlations with age (e.g. overall quartile-based (r = 0.05), overall decile-based (r = 0.06)). Univariate statistics and percentile cut-points for each of the 12 component biomarkers defining the AL construct are presented in Table 2. Two-sample t-tests revealed marked differences in biomarker values according to sex. Specifically, male respondents were characterised by significantly higher mean values for 9/12 biomarkers, whilst female respondents had significantly higher mean values for the remaining 3 (RHR, CRP, Total Cholesterol). Supplementary Table 4 presents univariate statistics in respect of each of the 14 AL scoring algorithms used in the present study.
Tables 3-5 report the model fit statistics for each AL scoring algorithm and the difference in model fit relative to the classic scoring algorithm for walking speed, grip strength and SRH, respectively. In this context, higher values for R² and lower values for BIC indicate a better fitting model compared with the standard model. In general, the results indicate that the choice of scoring algorithm has relatively modest effects on the proportion of variance explained in the health outcomes.

Walking speed
Looking first at the results for walking speed, Table 3 shows that the proportion of variance explained by the classic AL scoring algorithm (including age and sex) was 20.3 %, compared with 17.8 % for the worst (overall two-tailed quartiles and overall two-tailed deciles) and 21.2 % for the best (high-risk quartiles incorporating medications) fitting models. The sex-specific high-risk quartiles, high-risk decile measures, weighted measures, sex-specific z-scores, and the algorithm incorporating medications led to a strong improvement in model fit relative to the classic AL scoring algorithm, according to the BIC guidelines proposed by Raftery (1995). However, it should be acknowledged that the gain in R² was typically very modest and amounted to, at most, +0.9 %. The measures calibrated according to sex performed marginally better when predicting walking speed than the overall measures. The two-tailed measures by contrast performed poorly, with lower values of R² and large increases in BIC compared with the classic algorithm.
Table note: Adjusted for age and sex. R² = proportion of variance explained. R² Diff = gain in R² compared with the classic AL scoring algorithm. BIC = Bayesian Information Criterion. BIC Diff = Difference in BIC compared with the classic AL scoring algorithm. *p < 0.05, **p < 0.01, ***p < 0.001. a Substantial decrease in BIC compared with the classic AL scoring algorithm.
S. McLoughlin, et al. Psychoneuroendocrinology 120 (2020) 104789

Grip strength
For grip strength, Table 4 shows that the proportion of variance explained by the classic AL scoring algorithm (including age and sex) was 60.8 %, and this value did not vary substantially across the other measures; the gain was +0.06 % for the best fitting model (overall two-tailed deciles) compared with the classic model. Two-tailed decile measures, clinical and z-scores were the only AL scoring algorithms that exhibited statistically significant associations with grip strength independently of age and sex. These relationships were weak, however, and differed in direction; negative associations were noted for overall (β = −0.16, 95 % CI: −0.27, −0.06) and sex-specific (β = −0.11, 95 % CI: −0.02, −0.01) two-tailed deciles, whilst positive associations were noted for the clinical measure (β = 0.13, 95 % CI: 0.02, 0.23), as well as the overall (β = 0.05, 95 % CI: 0.01, 0.09) and sex-specific (β = 0.05, 95 % CI: 0.01, 0.09) z-score measures.

Self-rated health
Table 5 shows that the classic AL measure accounted for 1.1 % of the variance in SRH, compared with 0.02 % for the worst fitting models (overall high-risk deciles, two-tailed high-risk decile measures) and 1.9 % for the best fitting model (high-risk quartiles incorporating medications). The sex-specific high-risk quartile measure, overall weighted measure, the measure incorporating medications, and the clinical measure led to strong improvements in BIC values compared to the classic AL index, whilst the two-tailed measures performed substantially worse.

Area under the receiver operating characteristic curve (AUC)
Supplementary Table 5 reports the AUC estimates for each of the AL scoring algorithms separately for each of the three binary health outcomes. Results were consistent with what was observed when using the continuous measures. Notably, the two-tailed measures (quartiles and deciles) performed worse than the classic algorithm when predicting walking speed and SRH, whilst all measures predicted poor grip strength to the same degree of accuracy as the classic score (Fig. 1). Although a minority of AL algorithms were significantly associated with grip strength, no substantial differences in AUC were noted between these and the classic algorithm. In addition, the count of high-risk quartiles measure incorporating medications predicted fair/poor SRH better than the classic measure.

Discussion
The present study assessed the explanatory utility of a multitude of AL scoring algorithms for predicting a number of objective and subjective age-related health outcomes in an older adult population residing in Ireland. In accordance with the findings from a previous study involving a Taiwanese sample (Seplaki et al., 2005), we found that the choice of AL scoring algorithm has a relatively modest impact in terms of variance explanation and classification accuracy in the prediction of a number of important health outcomes. Although the differences between the 14 scoring algorithms were not pronounced in this study, there were, nevertheless, some subtle nuances observed.
AL was strongly associated with walking speed across all 14 scoring algorithms independently of age and sex, supporting prior findings regarding the predictive utility of AL for functional decline (Karlamangla et al., 2002; Read and Grundy, 2014; Singer et al., 2004; Szanton et al., 2005). Although the differences were small when compared to the classic scoring algorithm, the count of high-risk quartiles incorporating medications performed marginally better. In contrast, few significant associations were observed between measures of AL and grip strength, and no algorithm fit the data significantly better than the classic method. Although few have investigated this relationship directly, grip strength shows strong age-related decline and is an important component of many frailty indices that have been found to be negatively associated with AL (Gruenewald et al., 2012; Szanton et al., 2005). The null associations for the majority of AL scoring algorithms, and the varied direction of the relationship for those algorithms that did significantly predict grip strength, are surprising but not without precedent. Freire et al. (2020), for example, arrived at the rather counter-intuitive conclusion that higher AL burden (classic measure) was associated with higher grip strength in a small sample of older community-dwelling adults in Brazil (n = 256), and hypothesised that this contradictory result was due to survival bias. Perhaps the predominantly null results arise because there are other potential confounders not controlled for in this analysis that may help preserve muscle strength at older ages (e.g. height, occupational class etc.).
In respect of SRH, results were more variable across scoring algorithms, although differences were small in absolute terms. As with walking speed, the algorithm incorporating medication use was the best predictor of SRH in terms of model fit. This pattern of results raises the obvious question as to whether the better predictive performance of a measure incorporating medication when using an SRH measure is simply a reflection of the individual being more aware of their 'health state' by virtue of consciously taking prescribed medications on a regular basis.

Use of one vs two-tailed risk
Notably, the AL scoring algorithms incorporating risk in both tails of the distribution of each biomarker underperformed relative to the classic one-tailed measure in this study. These findings align with results from the Hawaii Personality and Health Cohort (n = 470), where two-tailed measures of AL were less effective than one-tailed measures in predicting SRH and depressive symptoms (Hampson et al., 2009). In contrast, Seplaki et al. (2005) found that two-tailed measures of risk accounted for more variance across self-reported health measures than the classic AL index. This discrepant pattern of results across studies could be explained by differences in the biomarker panels employed to represent AL. Seplaki et al.'s (2005) study included primary mediators of the stress response such as cortisol, in which both high and low values can characterize certain syndromes and diseases (Fries et al., 2005), whereas, similar to the present study, Hampson et al. (2009) did not include any markers of neuroendocrine dysregulation. Moreover, in the context of the current study, it is difficult to see how low scores for HbA1c or cystatin C would increase risk of disease, which reinforces the need to think critically about whether risk on particular biomarkers is best conceptualised as linear or curvilinear when developing AL indices.
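The distinction between one- and two-tailed scoring can be made concrete with a short sketch. It assumes quartile cut-points computed from the sample itself (using the default method of Python's statistics.quantiles) and a single hypothetical biomarker; the function names and values are illustrative only and do not reproduce the study's exact algorithms.

```python
from statistics import quantiles

def one_tailed_flags(values, high_is_risky=True):
    """Classic rule: flag only the single high-risk quartile of a biomarker."""
    q1, _, q3 = quantiles(values, n=4)
    if high_is_risky:
        return [int(v > q3) for v in values]
    return [int(v < q1) for v in values]   # e.g. a marker where low is risky

def two_tailed_flags(values):
    """Two-tailed rule: flag values in either the lowest or highest quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return [int(v < q1 or v > q3) for v in values]

def al_score(flag_columns):
    """Sum the per-biomarker flags to obtain one AL count per participant."""
    return [sum(flags) for flags in zip(*flag_columns)]

# Hypothetical HbA1c values (mmol/mol) for eight participants
hba1c = [30, 34, 36, 38, 40, 42, 46, 60]
print(one_tailed_flags(hba1c))   # only the highest values are flagged
print(two_tailed_flags(hba1c))   # the lowest values are flagged as well
```

Under the two-tailed rule the participants with the lowest HbA1c values receive the same risk point as those with the highest, which illustrates why this approach may add noise for biomarkers whose risk runs in one direction only.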

The importance of considering sex differences
Beyond the objective of comparing the predictive utility of various AL algorithms, this study found significant differences across mean biomarker values between male and female individuals. Furthermore, the AL algorithms which were calibrated relative to sex generally performed marginally better in terms of model fit than those derived from the pooled sample with respect to walking speed and SRH. Acknowledging that men and women may react differently to stress, both psychologically and biologically (Stroud et al., 2002), we recommend that future investigations continue to examine the use of sex-specific cut-points, particularly where the distributions on the component biomarkers differ according to sex, or indeed, where there might be large sex differences in the outcome measure (e.g. depression) (Juster et al., 2016).
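A minimal sketch of the difference between pooled and sex-specific cut-points follows. It assumes a top-quartile high-risk rule and a hypothetical biomarker on which men have systematically higher values than women; all names and numbers are illustrative assumptions rather than study data.

```python
from statistics import quantiles

def pooled_flags(values):
    """Flag values above the 75th percentile of the whole sample."""
    q3 = quantiles(values, n=4)[2]
    return [int(v > q3) for v in values]

def sex_specific_flags(values, sexes):
    """Flag values above the 75th percentile of the participant's own sex."""
    cutoffs = {}
    for sex in sorted(set(sexes)):
        group = [v for v, s in zip(values, sexes) if s == sex]
        cutoffs[sex] = quantiles(group, n=4)[2]
    return [int(v > cutoffs[s]) for v, s in zip(values, sexes)]

# Hypothetical waist circumference values (cm): eight men, then eight women
waist = [90, 94, 98, 102, 106, 110, 114, 118,
         74, 78, 82, 86, 90, 94, 98, 102]
sex = ["M"] * 8 + ["F"] * 8

pooled = pooled_flags(waist)
by_sex = sex_specific_flags(waist, sex)
print(sum(pooled[:8]), sum(pooled[8:]))   # high-risk men, women (pooled)
print(sum(by_sex[:8]), sum(by_sex[8:]))   # high-risk men, women (sex-specific)
```

In this toy example the pooled cut-point assigns every risk point to men, whereas the sex-specific cut-points identify the upper quarter within each sex, which is the situation the recommendation above is intended to address.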

The importance of considering medications
The vexed question of whether to include medications in the calculation of the AL measure is a recurring feature in many critiques of the AL literature (Duong et al., 2017; Howard and Sparks, 2016; Rodriquez et al., 2019). This issue has not always been afforded the attention it arguably deserves in empirical studies, presumably because many studies do not capture this information, rely on self-report, or face operational difficulties in how to treat the data when using continuous measures (e.g. z-scores). We found that including medications led to small improvements in performance relative to the classic measure with respect to walking speed and SRH. It was also the algorithm which was most strongly correlated with age. Nevertheless, we applied only one of a number of potential techniques for including medications in the development of the AL index, and future studies may wish to consider alternative approaches such as including medications as a covariate (Piazza et al., 2018) or adjusting biomarker values upward to compensate for the deflationary effect of medication on the biomarker of interest (Robertson and Watts, 2016).
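Two of the ways in which medication use might enter an AL index can be sketched as follows. Both rules are assumptions made for illustration: the first treats a participant as high-risk on a biomarker if either the measured value exceeds the cut-point or a relevant medication is being taken, and the second adds a constant to the measured value before scoring. The function names, the cut-point, and the offset are hypothetical and are not taken from this study or from the cited papers.

```python
def high_risk_with_medication(value, cutoff, on_medication):
    """Flag as high-risk if the value exceeds the cut-point OR the
    participant takes a medication targeting this biomarker."""
    return int(value > cutoff or on_medication)

def upward_adjusted(value, on_medication, offset):
    """Add a fixed (hypothetical) offset to a treated participant's value
    to approximate the untreated level before applying any cut-point."""
    return value + offset if on_medication else value

# Hypothetical systolic blood pressure example (mmHg)
cutoff = 150          # assumed high-risk cut-point
offset = 10           # assumed treatment effect, for illustration only

treated_value = 138   # below the cut-point, but on antihypertensives
print(high_risk_with_medication(treated_value, cutoff, on_medication=True))
print(int(upward_adjusted(treated_value, True, offset) > cutoff))
```

The two rules need not agree: in this example the treated participant is counted as high-risk under the first rule but not under the second, so the choice between them is itself a scoring decision that would need to be justified.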

Strengths and limitations
This study is among the first to examine and compare the effects of various AL scoring algorithms in a community-dwelling cohort of older persons. It benefits from a large sample drawn from a nationally representative cohort, and from gold standard measures of objective physiological functioning collected by trained nurses according to standard operating protocols. It advances the current knowledge base through the inclusion of these objective markers of physical fitness, by considering sex as an important variable in the determination of AL scores, and by accounting for medication use as a potential indicator of physiological wear and tear. Despite these strengths, this study does not have any neuroendocrine markers at baseline, which is an important limitation as they are hypothesised to play a central role in the stress response. Furthermore, whilst TILDA is a nationally representative cohort study, the findings from the sample employed in this paper may not generalise to other populations. Similarly, the results may not generalise to studies employing biomarkers different to those used here. It should also be acknowledged that although significant associations were found between measures of AL and objective health outcomes in this study, the estimated models control only for the effects of age and sex, so residual confounding cannot be ruled out. Rather than establishing causation, this study provides evidence for the construction of a summary score from individual biological components to predict poor health outcomes, without assuming a causal or directional relationship.

Conclusion
This study was motivated by the lack of empirical insight into the effect that the choice of AL scoring algorithm has on the prediction of poor health outcomes. Seeman et al. (2010) claimed that their original method did not represent a gold standard, yet the results of this study suggest that this classic method performs well. We therefore echo the conclusion of others (Berger et al., 2018; Li et al., 2019) that, given that this is the technique used in the vast majority of papers, the notion of harmonising international work around this scoring algorithm should perhaps be considered. Nonetheless, the findings of this paper can help allay researchers' concerns that the choice of scoring algorithm makes a large difference to the results; whether the composition of biomarkers employed to reflect physiological dysregulation does so remains to be elucidated.