Machine Learning for Extraction of Image Features Associated with Progression of Geographic Atrophy

Background: Several studies have investigated various features and models in order to understand the growth and progression of the ocular disease geographic atrophy (GA). Commonly assessed features include age, sex, smoking, alcohol consumption, sedentary lifestyle, hypertension, and diabetes. There have been inconsistencies regarding which features correlate with GA progression. Chief amongst these is whether the investigated features are readily available for analysis across various ophthalmic institutions. Methods: In this study, we focused on the association between fundus autofluorescence (FAF) imaging features and GA progression. Our method included feature extraction using radiomic processes and feature ranking by machine learning using the XGBoost algorithm to determine the best-ranked features. This led to the development of an image-based linear mixed-effects model designed to account for slope change based on within-subject variability and inter-eye correlation. Metrics used to assess the linear mixed-effects model included marginal and conditional R², Pearson's correlation coefficient (r), root mean square error (RMSE), mean error (ME), mean absolute error (MAE), mean absolute deviation (MAD), the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and log likelihood. Results: We developed a linear mixed-effects model with 15 image-based features. The model results were as follows: R² = 0.96, r = 0.981, RMSE = 1.32, ME = −7.3 × 10⁻¹⁵, MAE = 0.94, MAD = 0.999, AIC = 2084.93, BIC = 2169.97, and log likelihood = −1022.46. Conclusions: The advantage of our method is that it relies on the inherent properties of the image itself, rather than on the availability of clinical or demographic data. Thus, the image features discovered in this study are universally and readily available across the board.


Introduction
Several studies to date have investigated various features and models to understand the growth and progression of geographic atrophy (GA) in age-related macular degeneration. Some commonly assessed clinical and demographic features have included age, sex, smoking, alcohol consumption, and sedentary lifestyle, with outcomes suggesting mixed results with respect to the association of these variables with GA. Other features have been similarly explored, including hypertension, diabetes, hyperlipidaemia, and body mass index, although these explorations have not generally been fruitful [1][2][3][4]. These features have been used in several different models to explain both the progression of the disease and the variability in progression between patients.
Pfau et al. (2019) used shape-descriptive features in a linear mixed-effects model with a two-level random effect of eye and patient [5]. Liefers et al. (2020) also included many shape-based feature associations in their linear regression model [6]. For GA spatial prediction, Niu et al. (2016) used the random forest and 19 quantitative image features which characterised the shape and size of GA, as well as its axial location (e.g., area, maximum radius, eccentricity, etc.) [7]. Pfau et al. (2020) also assessed spatial prediction using a mixed-effects logistic model that included features such as retinal pigment epithelium (RPE)-based diagnostic groups, pixel-wise locations, and follow-up time. Schmidt-Erfurth et al. (2020) used a linear regression model to determine an association between GA progression and HRF [7][8][9]. For prediction of GA visual function, Künzel et al. (2020) used the linear regression model along with more clinical features, such as best-corrected VA, while Pfau et al. (2020) used the random forest approach along with image features and patient reliability indices to estimate retinal sensitivity in GA progression [10,11]. These publications used a combination of metrics in post-estimation evaluations, including the mean absolute error (MAE), the mean square error (MSE), and the coefficient of determination (R²), in evaluating the overall fit of their respective models. These papers used various modelling and machine learning (ML) techniques to estimate GA progression; however, none of these techniques was justified based on clinical assumptions, nor was the fundamental reasoning behind the choice of modelling demonstrated.
For this study, the linear mixed-effects model was used for several reasons. Firstly, in our previous publications, the linear progression model was shown to be most representative of GA progression using real clinical presentations. In our previous work, we demonstrated that the sigmoidal growth curve may best represent GA progression from start to finish. However, that research was based on a limited number of case studies of patients with the highest number of clinical presentations, where the aim was to more accurately estimate the lower and upper tail ends of the sigmoidal growth curve [12,13]. Future iterations of this work will involve the collection of additional, real-time clinical data, including nascent GA, to explain those lower and upper tail ends. The focus of this study, however, was on creating a prediction model based on real clinical presentations, and it therefore concentrated on the peak growth rates of GA progression (i.e., the linear portion of the sigmoidal growth curve). Given that peak growth occurs in this linear portion, there were also practical reasons for focusing on it alone.
The reason for the choice of the mixed-effects model was two-fold: firstly, patient-level variation needs to be accounted for, given that there is high variability in GA progression from patient to patient; secondly, as discussed by Arslan et al. (2021), inter-eye correlation may affect the results if treated as a single probability distribution [13]. Ying et al. (2017) outlined potential avenues in which inter-eye correlation could be tackled while highlighting their advantages and disadvantages [14]. The most promising of these methods is the use of the mixed-effects model, which ensures that the regression coefficients remain unbiased, that the variance estimators are correct, and thus that the model average is not affected by correlations between the left and right eye.
An additional motive for the selection of the mixed-effects model was that it is suitable for time-series and hierarchical datasets, given that it is a multilevel model. Biological and medical data are complex, and care needs to be taken to avoid spurious or inflated associations. There are several possible causes of confounding, including population structure (the existence of major subgroups in a population), cryptic relatedness (the existence of small groups of related individuals), and environmental factors (environmental differences between subpopulations or geographic locations) [15][16][17][18]. Several methods have been suggested to control for these confounders, one of which is mixed-effects modelling, where a set of random effects is fitted for each individual [18]. Mixed-effects models are well suited for the analysis of biological/medical data [19] and are flexible and powerful statistical models for controlling stratification, relatedness, and confounding factors [20][21][22].
We have seen in the literature some models utilising image-based features, with a particular focus on shape features, as highlighted in the preceding paragraph. The current study also utilised image features, specifically FAF-based features, in modelling GA progression. The analysis applied a process referred to as radiomics, a quantitative feature extraction approach specifically designed for the extraction of data and metrics from medical and clinical images [23]. However, the radiomic process can lead to the extraction of large numbers of features, which makes univariate feature selection extremely laborious when traditional statistical techniques are used. Furthermore, while we tried to identify the most suitable features for the model, we also wished to rank the features from highest to lowest importance to determine those most relevant to the disease of interest. To do this using traditional statistical techniques would involve the assignment of weights to each variable using the Relative Importance of Variables (RIV) method [24].
The RIV method ranks predictor variables by weights: predictors with larger weights (i.e., close to 1) are considered the most important features, while those with the smallest weights (i.e., close to 0) are considered the least important [25]. The null hypothesis for an RIV analysis is that the parameter estimates for all variables are identical and have the same level of importance. To estimate the RIV of variable x_j, the Akaike weights (derived from the AIC) are summed across all models in the set in which x_j occurs; this sum, denoted w+(j), reflects the importance of the variable and takes a value between 0 and 1. The larger w+(j) (i.e., the closer it is to 1), the more important the variable is relative to the other variables tested. Using w+(j), all the variables can be ranked in order of their importance. To obtain these weights, the effect size of each variable needs to be based on model-averaged estimates. In other words, to ensure an accurate reflection of the importance of one variable versus another, a combination of models is required which contains all prospective variables in equal proportions across all models, allowing each variable to be tested on an equal footing. Otherwise, if one variable were found more frequently across the test models than the others, this may inadvertently give the more frequently occurring variable an advantage and cause it to rank higher than the others. The process of looking at all potential combinations of variables, creating a new model for each combination, and then assessing each model iteratively while modifying the rank of each variable is laborious and time consuming. For example, from past experience, the process of looking at a combination of 19 variables using the RIV method could take up to several days.
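The Akaike-weight arithmetic behind an RIV ranking can be sketched in a few lines. The model set, AIC values, and variable names below are hypothetical; this is a minimal illustration of the w+(j) computation, not the procedure used in this study:

```python
import math

# Hypothetical AIC values for four candidate models and the variables each contains.
models = {
    "m1": {"aic": 102.3, "vars": {"age", "area"}},
    "m2": {"aic": 100.1, "vars": {"age", "perimeter"}},
    "m3": {"aic": 105.7, "vars": {"area", "perimeter"}},
    "m4": {"aic": 101.0, "vars": {"age", "area", "perimeter"}},
}

# Akaike weight of model i: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
# where delta_i = AIC_i - min(AIC).
min_aic = min(m["aic"] for m in models.values())
raw = {k: math.exp(-(m["aic"] - min_aic) / 2) for k, m in models.items()}
total = sum(raw.values())
weights = {k: v / total for k, v in raw.items()}

# RIV of variable j: w+(j) = sum of Akaike weights over all models containing j.
variables = {"age", "area", "perimeter"}
riv = {v: sum(weights[k] for k, m in models.items() if v in m["vars"])
       for v in variables}

ranked = sorted(riv, key=riv.get, reverse=True)
print(ranked)
```

Even on this toy set the ranking falls out directly from the summed weights; the combinatorial cost comes from having to fit and score every model in the set first.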
In this paper, over 100 variables were considered, and combinatorial assessment with the RIV method would mean that a machine would need to run for several days without a break in order to rank the variables in order of importance. This time-intensive process can now be replaced with efficient and expedient ML processes. Thus, in this paper, feature extraction was conducted using radiomics, while feature selection and weighting were conducted using the ML approach XGBoost. The XGBoost algorithm is a gradient-boosting method which uses decision trees to rank features in terms of their importance and relevance to the outcome. This method is computationally faster and has several advantages, including the removal of redundant features. This removal of redundant features is an additional bonus, as decision trees are typically immune to multicollinearity, meaning that no further analyses need to be performed to manually determine collinear features during model development. When a decision tree is faced with, say, three variables that are highly correlated and collinear, it will split between the three and identify the one with the greatest performance, thus eliminating the other correlated variables.
In the remainder of this paper, we discuss in greater detail feature extraction, feature selection, linear mixed-effects models, and the most suitable model based on model diagnostics, such as Q-Q plotting, which assesses the assumption of normality. While this paper presents the most suitable model, it is important to note that this suitability is based on the currently available dataset, which predominantly includes image-based features. Image-based features are but one component of a larger picture of a complex disease, and thus the model presented in this paper should be considered a preliminary fit until additional data become available. This FAF-based linear mixed-effects model could be improved upon in the future with new data, which could include features extracted from other imaging modalities, such as SD-OCT, along with clinical information, such as genetics. The latter would create a more personalised prediction model.

Study Design
A retrospective analysis was conducted as a case study using anonymised data from patients who attended the retina clinics at the Royal Victorian Eye and Ear Hospital (RVEEH). The study was conducted in accordance with the International Conference on Harmonisation Guidelines for Good Clinical Practice and the tenets of the Declaration of Helsinki. Ethical approval was provided by the Human Research Ethics Committee (HREC project no. 95/283H/15) of the RVEEH. Written informed consent was obtained from all participants.

Feature Extraction
The objective of the radiomic process is to obtain large amounts of data extracted and quantified from medical images. Radiomic features were extracted using pyradiomics (v3.0: https://pyradiomics.readthedocs.io/en/latest/, accessed on 19 March 2024), which is a Python package supported by the US National Cancer Institute [26]. There is a total of 107 relevant radiomic features for FAF images (all 2D features; Table 1) [26]. Radiomic features were extracted for all GA regions of interest, including lesions and early-stage, intermediate-stage, and late-stage hyperfluorescence areas. To extract these features, the binarised masks created using the techniques described by Arslan et al. (2021; 2023) were utilised [27,28]. These binarised masks were coupled with their original FAF images. During extraction, the original image was inputted first, and the binarised mask then highlighted the areas within the original image from which information needed to be extracted. See Figure 1 for an illustration of this process.
In addition to the radiomic features, clustering results from Arslan et al. (2023) were also included as additional variables [27]. Specifically, the variables named "early_cluster" (early-stage hyperfluorescence clustering), "late_cluster" (late-stage hyperfluorescence clustering), and "lesion_cluster" (lesion clustering) were extracted and added to the database. Intermediate-stage clustering was not included as an additional variable, given that distinguishable clustering groups were not determined for this GA feature. This created a total of 431 features to evaluate (i.e., 4 GA areas × 107 radiomic features + 3 cluster variables).

Figure 1. The original image was coupled with the binarised output. This mask indicated the regions within the original image from which features should be extracted. In this example, features were being extracted for the late-stage hyperfluorescence regions of GA. GA: geographic atrophy.
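The mask-coupling idea can be illustrated with a minimal, pure-Python sketch (the study itself used pyradiomics). The toy image, mask, and the handful of first-order statistics computed here are illustrative only: the mask restricts computation to the region of interest, exactly as a radiomics extractor does.

```python
# Toy 4x4 greyscale "FAF image" and a binary mask marking a hypothetical
# late-stage hyperfluorescence region (1 = inside the region of interest).
image = [
    [10, 12, 50, 52],
    [11, 13, 55, 60],
    [ 9, 14, 58, 61],
    [10, 12, 54, 59],
]
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Collect only the pixels the mask highlights, mirroring how feature
# extraction is restricted to the binarised region of interest.
roi = [image[r][c] for r in range(4) for c in range(4) if mask[r][c] == 1]

# A few simple first-order statistics over the masked region.
mean = sum(roi) / len(roi)
variance = sum((v - mean) ** 2 for v in roi) / len(roi)
energy = sum(v * v for v in roi)

print(len(roi), mean, variance)
```

A real extractor computes many more feature classes (first-order, texture, shape) over the same masked pixels, which is how a single image-mask pair yields over a hundred features.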

Feature Selection
Feature selection is the process of ranking the variables in terms of importance relative to the outcome of interest (i.e., the total area of GA in mm²/year). Feature selection was conducted using an ML approach and Python (v3.6), namely, XGBoost (v1.4.0: https://xgboost.readthedocs.io/en/latest/index.html, accessed on 19 March 2024). XGBoost (eXtreme Gradient Boosting) is a gradient-boosting method which uses decision trees to rank features in terms of their importance and relevance to an outcome. XGBoost works by identifying the most appropriate "split points". It calculates the importance of these split points relative to the outcome measure and assesses overall how well these split points perform across different iterations. In some ways, this process is similar to the RIV method highlighted in the Introduction, although XGBoost has several advantages. Its greatest advantage is that it can combine many weak learners into strong learners. A weak learner is a single decision tree which, used by itself, would be only marginally better than randomly guessing an outcome. For example, if we were to look at which factors contribute to the successful graduation of high-school students and, amongst the many potential factors involved, decided to focus solely on familial factors (e.g., the age and assistance of parents), then we would have a subpar model that does not comprehensively cover all the potential features that could contribute to the success of students. Any model based on such a weak learner would be little better than guessing how well a student is going to perform. However, if we were to combine many weak learners (e.g., the hours a student spends studying, the age and assistance of their parents, the financial stability of the family, tutoring that the student might receive, the child's work ethic, extracurricular or volunteer work outside of school hours, etc.), then we would have more information that we could assess collectively and use to create better and more comprehensive suggestions as to which of these features is best suited to the outcome of interest. The plethora of information provided by a strong learner leads to better success rates than with individual weak learners [29].
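The idea of combining weak learners can be sketched with a toy boosting loop: each round fits a depth-1 "stump" (a weak learner) to the residuals of the current ensemble, and the ensemble is the sum of shrunken stumps. This is an illustrative simplification on synthetic 1-D data, not the XGBoost implementation:

```python
# Synthetic 1-D regression data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 1.1, 3.9, 4.2, 4.0, 7.8, 8.1]

def fit_stump(xs, res):
    """Pick the split point that minimises squared error of a two-leaf fit."""
    best = None
    for t in xs[1:]:
        left = [r for x, r in zip(xs, res) if x < t]
        right = [r for x, r in zip(xs, res) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x < t else rm

pred = [0.0] * len(xs)
lr = 0.5                      # shrinkage applied to each weak learner
for _ in range(50):           # combine many weak learners
    res = [y - p for y, p in zip(ys, pred)]
    stump = fit_stump(xs, res)
    pred = [p + lr * stump(x) for p, x in zip(pred, xs)]

mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
print(round(mse, 4))
```

Any single stump leaves a large error on these data, but the boosted sum of fifty stumps fits them closely, which is the weak-to-strong-learner point made above.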
In addition to the combination of weak learners to build strong learners, XGBoost is also capable of building decision trees in parallel rather than sequentially, as other decision-tree methods typically do. This parallel computing format leads to the efficiency and speed for which XGBoost is known and has been popularised. Furthermore, XGBoost can leverage residual patterns, which helps to produce prediction errors up to ten times lower than those of other tree-based methods, such as the random forest. This leverage of residuals, along with its ability to control model complexity and reduce overfitting via built-in regularisation, contributes to its predictive performance. An added benefit of using XGBoost is its ability to remove redundant features automatically, so no multicollinearity assessment needs to be conducted later in the modelling process. XGBoost is also capable of handling missing values with ease [29].
Despite these advantages, XGBoost is still prone to variability and uncertainty, as highlighted in [12,13], because it is a data-dependent method. Therefore, for completeness, the XGBoost method was run five times under five-fold cross-validation. The five-fold cross-validation was used here to ensure that the top features were consistently selected. Additionally, to ensure that only the most important and relevant features were the focus, XGBoost was programmed to select and rank the top 30 of the 431 image features. The objective of selecting the top 30 was to ensure that the selected model was a parsimonious model.
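The consistency check across the five runs can be sketched as follows. The feature names and importance scores below are hypothetical (and the top-k cut-off is shrunk from 30 to 3 to keep the example small); the snippet simply keeps the features that appear in the top k of every fold:

```python
# Hypothetical importance scores for six features across five
# cross-validation runs (e.g., per-fold gain values).
fold_scores = [
    {"area": 0.90, "perimeter": 0.70, "eccentricity": 0.40, "entropy": 0.30, "contrast": 0.20, "energy": 0.10},
    {"area": 0.80, "perimeter": 0.60, "eccentricity": 0.50, "entropy": 0.20, "contrast": 0.30, "energy": 0.10},
    {"area": 0.85, "perimeter": 0.65, "eccentricity": 0.20, "entropy": 0.40, "contrast": 0.10, "energy": 0.05},
    {"area": 0.90, "perimeter": 0.50, "eccentricity": 0.45, "entropy": 0.30, "contrast": 0.20, "energy": 0.10},
    {"area": 0.88, "perimeter": 0.72, "eccentricity": 0.41, "entropy": 0.20, "contrast": 0.30, "energy": 0.10},
]

TOP_K = 3  # the study used the top 30 of 431 features
tops = [set(sorted(scores, key=scores.get, reverse=True)[:TOP_K])
        for scores in fold_scores]

# Features ranked in the top K in every fold are "consistently selected".
consistent = set.intersection(*tops)
print(sorted(consistent))
```

Features that only surface in one or two folds (here, "eccentricity" and "entropy" trade places) drop out, leaving the stable core of the ranking.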

Mixed-Effects Model
The development of the linear mixed-effects model and all subsequent statistical analyses were performed using the statistical software R version 4.0.0 [30]. Mixed-effects models are used to describe relationships between response and predictor variables in data that are grouped based on one or more classifications [31]. Mixed-effects models explicitly specify the mean and covariance structure, incorporating two types of parameters: fixed and random effects [32,33]. Fixed effects refer to predictors that affect a response variable.
Random effects, however, refer to effects on a response variable generated by variation within and among levels of a predictor variable [32]. Population structure is the fixed effect in a mixed-effects model, while relatedness among individuals is incorporated as a variance-covariance structure of the random effect [34]. Mixed-effects models have gained considerable popularity and are considered useful in the analysis of longitudinal data, the modelling of complex clustered data, penalised log likelihood, etc. [19,35].
There are advantages to using mixed models in medical applications. Medical data are often clustered (e.g., a study may be recorded at several centres, hospitals, etc.). The design of medical studies can be described as hierarchical, and wider inferences can be made by fitting the clustering effect as random. Repeated measurements are also common in medicine-based studies, and it is not uncommon for several observations to be missing. The advantage of using mixed-effects models in such studies is that allowances can be made for missing data and the hierarchical clustering [36].
The data for this study were longitudinal and consisted of repeated observations of individual subjects over time. The interest lies in the effects that are common to and different among all individuals [37]. The mixed-effects model allows the capture of among-subject variation. In this study, the linear mixed-effects model was used because (1) it was confirmed that the clinical presentations can be represented by the linear regression model and (2) it is a useful model that can account for inter-eye correlation, as highlighted in [13]. Linear mixed-effects models are an extension of regular linear models. Traditional linear models use only a single random term, the residual error. Linear mixed-effects models allow the specification of more than one random term, a useful feature, as it is often more accurate to think of an effect as coming from a specific normal distribution rather than being a fixed value [38].
With N independent sampling units (i.e., patients), the linear mixed-effects model for the ith person may be written as follows:

Y_i = X_i β + Z_i u_i + ε_i,

where Y_i represents the response variable for the ith person, X_i is an n_i × p design matrix for the p-vector of fixed effects β, and Z_i is an n_i × q design matrix associated with the q-vector of random effects u_i that represents subject-specific regression coefficients. The error term, ε_i, is assumed to be normally distributed with mean zero and to be independent of the random effects [39]. The use of linear mixed-effects models counters the multiple drawbacks that are normally associated with traditional random-effects modelling, such as the following [40]:
- deficiencies in statistical power with the use of repeated observations;
- a lack of adaptability due to missing data;
- disparate methods for treating continuous and categorical responses;
- dubious methods for modelling heteroscedasticity and non-spherical error variance.
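The model can be made concrete by simulating data from it. The sketch below assumes a random intercept and a random slope per subject, with hypothetical parameter values; each observation is generated as Y_it = (β0 + u_i0) + (β1 + u_i1)·t + ε_it, a scalar special case of the matrix form above:

```python
import random

random.seed(0)

beta = (2.0, 0.8)      # fixed effects: population intercept and yearly slope
sd_u = (0.5, 0.2)      # SDs of the random intercept and random slope
sd_e = 0.1             # residual SD
times = [0, 1, 2, 3]   # hypothetical years of follow-up

data = []
for subject in range(5):
    # Subject-specific deviations u_i ~ N(0, sd_u^2), shared by all
    # of that subject's visits (this is what induces within-subject
    # correlation in the simulated data).
    u0 = random.gauss(0, sd_u[0])
    u1 = random.gauss(0, sd_u[1])
    for t in times:
        y = (beta[0] + u0) + (beta[1] + u1) * t + random.gauss(0, sd_e)
        data.append((subject, t, y))

print(len(data))  # 5 subjects x 4 visits = 20 rows
```

Fitting a linear mixed-effects model to such data recovers β as the fixed effects and the u_i spread as the random-effect variances; the study performed this fit in R with lme4.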
There were multiple measurements for each subject (i.e., patient) in the data; thus, we needed to incorporate random effects into the model to account for the variation in outcomes. To account for within-subject dependencies, a subject-specific latent variable (i.e., a random effect) had to be included in the model. Typically, an additional random effect is included for each regression coefficient that is expected to vary among subjects. For example, in dose-response settings, one may account for baseline heterogeneity through a random intercept and for heterogeneity in susceptibility through a random slope, with these two factors being potentially correlated [41]. To account for this heterogeneity, the random effects used across all our tested models included subjects (i.e., an ID variable) and eyes (i.e., the left and right eye). These are represented in the analysis as (subject|eye). The random effect eye accounts for the random intercept, and the random effect subject accounts for the random slope. The lme4 (v1.1-27) package, created by Bates and Maechler and maintained by Ben Bolker, was used in the creation and analysis of the linear mixed-effects models [42]. Specifically, the lmer() function was used for the linear mixed-effects model.

Measure of Outcome
Naturally, the main focus is on the total area of growth in mm²/year (i.e., Y_i). In [13], the popular square-root transformation was also tested, and it was confirmed that the transformation did not impact the residuals of the linear model. However, for completeness, the transformation was again assessed in the presence of image features to see whether it had an impact on the predictive power of the linear mixed-effects model and/or its model diagnostics. In addition to the square-root transformation, another, rarely assessed transformation was also investigated: the log transformation. The log transformation has been claimed to be better suited for use in GA modelling. Brader et al. (2015) investigated the progression rate and pattern of incident GA. Using subjects who developed GA in untreated eyes in the Complications of Age-Related Macular Degeneration Prevention Trial (CAPT; ClinicalTrials.gov (accessed 10 June 2021) identifier: NCT00000167), their findings contradicted previous studies that used the square-root transformation by suggesting that the log transformation is more useful in the assessment of GA progression and that the square-root transformation failed to eliminate the dependence on initial size. Their log-transformed growth model, which they suggest best fits their data (not available), yielded an enlargement rate of 56% per year [43].
While the log transformation appears to be a suitable surrogate for the commonly used square-root transformation, the standard log(Y_i) transformation would be insufficient in the context of GA. With small GA lesion areas (below 1 mm²), the log transformation produces a negative rather than a positive output, which is not suitable given that the model is supposed to predict GA growth over time. Perhaps this may explain why there are not many studies assessing the log(Y_i) transformation. In place of the more commonly used log transformation, in this study, the log(Y_i + 1) transformation was used instead. This transformation is suitable for right-skewed data, such as GA total areas, and the inclusion of the additive constant "1" ensures that the outcome will always be positive and never negative. In this study, three linear mixed-effects models with three outcome measures were tested: the original scale, the square-root-transformed scale, and the log(Y_i + 1) scale.
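The behaviour of the three outcome scales can be checked directly; the lesion areas below are hypothetical values in mm²:

```python
import math

areas = [0.25, 0.5, 2.0, 8.0]   # hypothetical GA areas in mm^2

plain_log = [math.log(a) for a in areas]        # log(Y)
log_plus1 = [math.log(a + 1) for a in areas]    # log(Y + 1)
sqrt_t    = [math.sqrt(a) for a in areas]       # square-root scale

# log() of a sub-1 mm^2 lesion is negative; log(Y + 1) never is.
print(round(plain_log[0], 3), round(log_plus1[0], 3))
```

For the 0.25 mm² lesion, log(Y) is negative while log(Y + 1) and sqrt(Y) remain nonnegative, which is exactly the property motivating the additive constant.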

Model Diagnostics and Selection
Information-based criteria, such as the AIC and the Schwarz or Bayesian Information Criterion (BIC), were used in the model selection process. Although a plethora of information criteria are available for model comparisons, all other criteria are modifications or generalisations of the AIC/BIC [44]. The AIC and BIC criteria are defined as follows [45]:

AIC = −2ℓ(θ̂) + 2p,
BIC = −2ℓ(θ̂) + p log(n),

where ℓ(θ̂) is the maximised log likelihood of a model with p parameters fitted to n observations. When two nested models are compared, 2[ℓ(θ̂₂) − ℓ(θ̂₁)] is the likelihood ratio test statistic, which is asymptotically distributed as χ² with p₂ − p₁ degrees of freedom. Along with the AIC and the BIC, the log likelihood was included (along with the degrees of freedom) in the results as well, as the model with the maximum likelihood is considered to fit the data best. AIC and BIC theories have the same objective: to find the best model via comparison. However, each theory has a different motivation. While the AIC compares models using a measure of similarity of the expected predictive performance, the BIC compares the probabilities that each of the models tested is the true model [44]. The main idea behind the criteria is to compare models based on their maximised log-likelihood values while penalising them based on the number of parameters. The model with the smallest AIC/BIC value is deemed the best [46]. Additionally, beyond having the smallest AIC and BIC values, the chosen model needs to fit the data. For this reason, we again measured R², the coefficient of determination [47]. R² is the proportionate reduction in the total variation associated with the use of a predictor variable, X. The larger the R² value, the more the total variation in Y is reduced by introducing the predictor variable, X.
R² values occur in the range 0 ≤ R² ≤ 1, with values closer to 1 indicating a better fit. For mixed-effects models, R² can be categorised into two types: marginal R² and conditional R². Marginal R² accounts for the variance explained by the fixed factors:

R²_marginal = σ²_f / (σ²_f + σ²_α + σ²_ε),

and conditional R² is concerned with the variance explained by both the fixed and random factors [15]:

R²_conditional = (σ²_f + σ²_α) / (σ²_f + σ²_α + σ²_ε),

where σ²_f is the variance of the fixed-effect components, σ²_α is the summed variance of the random effects, and σ²_ε is the residual variance. Once a suitable model was identified and fitted, we then proceeded to check whether the key assumptions of the model were satisfied. These assumptions included (i) linearity, (ii) the homoscedasticity or constancy of the error variance, and (iii) the normality of the errors. Discrepancies between the assumed model and the data could be identified by studying the residuals (also known as the error components). The residuals represent the differences between observed and predicted values per the assumed model. Visual aids, such as residual plots, help identify whether the assumptions of the model have been met. Typically, a good residual plot is one with an even distribution or symmetry; plots that contain distinguishable patterns, such as clustering to one side, usually indicate a violation of the model's assumptions and warrant a further review of the model (e.g., the transformation of the outcome variable or the exploration of additional variables) [48]. Normal probability plots additionally allowed us to determine the fit of our model.
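Given estimated variance components from a fitted mixed-effects model, marginal and conditional R² reduce to simple ratios. The variance values below are hypothetical:

```python
# Hypothetical variance components from a fitted linear mixed-effects model.
var_fixed = 4.0      # variance of the fixed-effect predictions
var_random = 1.5     # summed variance of the random effects
var_resid = 0.5      # residual variance

total = var_fixed + var_random + var_resid
r2_marginal = var_fixed / total                    # fixed effects only
r2_conditional = (var_fixed + var_random) / total  # fixed + random effects

print(round(r2_marginal, 3), round(r2_conditional, 3))
```

Because the random-effect variance is added to the numerator, conditional R² is always at least as large as marginal R²; a large gap between the two indicates that much of the explained variation comes from subject-level effects rather than the fixed predictors.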
Using both residual plots and normal probability plots, we could identify any unusual or outlying observations based on large deviations of the observed Y values from the fitted line. Inferences from the model can be influenced by only a few cases in the data, in which case the fitted model may reflect the unusual characteristics of those cases rather than the overall relationship between the dependent and independent variables [49].

Prediction Accuracy Using Forecasting Errors
Predictions of future events are called forecasts, and the act of making predictions is called forecasting. In attempting to forecast future events, a forecaster must rely on and analyse past data. Past data allow the identification of a pattern which can be extrapolated or extended into the future in order to prepare a forecast. Forecasting techniques rely on the assumption that the patterns identified in the past will continue in the future; good predictions cannot be expected unless this assumption is valid. That said, all forecasting situations involve a certain degree of uncertainty. There may be an irregular component which may be substantial and cause fluctuations in the data. Hence, we reviewed forecasting errors to ascertain whether an irregular component could be so large as to completely invalidate a forecasting technique, or whether a technique might not be capable of accurately predicting the trend, seasonal, or cyclical components of the data, thus rendering it inappropriate [50].
Several forecast errors can identify whether a forecasting technique does or does not match the pattern of the data. To measure forecasting errors, we used the following notation [51]:
- Ŷ_t: the forecast generated for the ith patient at time t (i.e., the fitted/predicted value);
- Y_t: the observed value for the ith patient at time t;
- Ȳ_t: the mean (or centred) value of the outcome measure;
- e_t = Y_t − Ŷ_t: the forecast error for the ith patient at a particular time t.
The first metric used to determine forecast quality was the mean error (ME), which is simply the average of past errors:

ME = (1/n) Σ e_t

This metric captures whether the forecasting process, on average, tends to under-forecast (i.e., ME is positive) or over-forecast (i.e., ME is negative); it is, in fact, a metric of bias, and an ideal ME is as close to zero as possible. We therefore needed other metrics of forecasting accuracy that could capture the proximity between the predictions produced using our model and the actual observed values.
The second metric used was the MAE, which averages the absolute differences between forecasted and observed values:

MAE = (1/n) Σ |e_t|

While the ME can capture the tendency to over- or under-forecast, the MAE tells us how large (or small) a forecasting error we can expect on average. The metric ranges over [0, ∞); the smaller the value, the better. The MAE makes errors easier to interpret and minimises the weight given to outliers that could skew results. The third metric used in determining forecasting accuracy was the mean absolute deviation (MAD), computed about the mean Ȳ:

MAD = (1/n) Σ |Y_t − Ȳ|

MAD uses absolute values so that negative and positive deviations do not cancel; it conceptualises the amount of error and is robust against outliers. Similar to the preceding metrics, the range of values is [0, ∞), and the smaller the value, the better the result.
The fourth metric used to assess forecasting accuracy was the root mean square error (RMSE), which squares the errors so that positive and negative errors accumulate rather than cancel, and which serves as an estimate of the standard deviation:

RMSE = √((1/n) Σ e_t²)

The values for RMSE occur in the range [0, ∞); the smaller the RMSE, the better. RMSE measurements are in the same unit as the forecast quantity: if, for example, we were using kg as the unit, then the RMSE would be measured in kg as well. In our case, the units of measure are mm²/year (original scale), mm/year (square-root-transformed scale), and % enlargement rate (log(Y_i + 1) transformation scale).
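As a concrete illustration, the four forecasting-error metrics can be computed as follows. This is a minimal sketch (the function and variable names are ours, not from the study), and it assumes the MAD convention of deviations about the mean Ȳ, consistent with the notation defined above:

```python
import numpy as np

def forecast_metrics(y_obs, y_hat):
    """Compute ME, MAE, MAD, and RMSE for a set of forecasts.

    Illustrative helper (not the study's code); MAD is taken here as the
    mean absolute deviation of the observations about their mean, which is
    one common convention and the reason Ybar appears in the notation."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    e = y_obs - y_hat                                 # forecast errors e_t = Y_t - Yhat_t
    return {
        "ME":   e.mean(),                             # bias: positive => under-forecasting
        "MAE":  np.abs(e).mean(),                     # average magnitude of the errors
        "MAD":  np.abs(y_obs - y_obs.mean()).mean(),  # spread of observations about Ybar
        "RMSE": np.sqrt((e ** 2).mean()),             # same units as the outcome
    }
```

A perfect forecast gives ME = MAE = RMSE = 0, while MAD still reflects the spread of the observations themselves.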

Feature Extraction and Selection
All 431 features were extracted across all patient data. These extracted features were added to the existing data collected using RegionFinder (i.e., the total area of GA in mm²/year) and exported into a standard comma-separated values (CSV) file. Using a five-fold cross-validation method, the top 30 features were extracted (Figure 2); a close-up of the feature ranking plot can be found in Figure 3. In the context of feature extraction, this meant that the top 30 most highly ranked features were extracted across all five folds. The objective of the five-fold cross-validation process was to determine whether the top-ranked features were consistently the same across all five cross-validations, which would confirm the importance of these features relative to the output of GA total area (mm²/year). For completeness, if there were features that appeared in one or two cross-validations but not all, those features were still listed as potential candidates for consideration in the modelling process, to ensure that all potential top variables were observed in the model. The final complete list of features tested for modelling can be found in Table 2. The feature "lesion_elongation" was always the top-ranked feature across all five cross-validations, with "lesion_elongation" representing the largest and second-largest principal component axes of lesions. While the clustering variables "early_cluster", "late_cluster", and "lesion_cluster" did not appear in the top-ranked variables across any of the cross-validations, they were still considered and manually tested given our interest in determining whether there is an association between the new cluster groups and GA progression. The cluster variables were assessed in the traditional univariate linear mixed-effects model style; in this univariate stage, all three clustering variables showed statistical significance at a threshold of p < 0.05. There were, however, slight discrepancies, with some features appearing only once or twice in the five-fold tests; for completeness, all variables which appeared across the cross-validations were included in the model assessment process.
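The fold-consistency check described above can be sketched as follows. The paper's exact XGBoost configuration is not given, so this illustration substitutes scikit-learn's GradientBoostingRegressor as a stand-in booster with the same `feature_importances_` interface; the function name and parameters are ours:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost
from sklearn.model_selection import KFold

def rank_features_cv(X, y, names, top_k=30, n_splits=5, seed=0):
    """Count, for each named feature, in how many of the cross-validation
    folds it appears among the top_k most important features; features with
    a count equal to n_splits are the consistently top-ranked ones."""
    counts = {name: 0 for name in names}
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, _ in kf.split(X):
        model = GradientBoostingRegressor(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        top = np.argsort(model.feature_importances_)[::-1][:top_k]
        for j in top:
            counts[names[j]] += 1
    return counts
```

Features appearing in only one or two folds would, as in the study, still be flagged as candidates rather than discarded outright.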
These top features were then manually tested in linear mixed-effects models on the original scale (total area mm²/year) with the random effects of (eye|subject). This originally involved the addition of all the variables of interest identified in Table 2 in a single model, with the significance of the variables assessed using the p < 0.05 threshold. Then, one by one, variables that were not significant against this threshold were eliminated and the model was rerun. This process continued until all the features tested in the original model were statistically significant. Once these base statistically significant variables had been identified, they were tested in various combinations, as shown in Table 3. These combinations were observed to see which combined sets of variables produced the most promising modelling results, and were tested against our various modelling and forecasting metrics: marginal and conditional R², Pearson's r, RMSE, ME, MAE, MAD, AIC, BIC, and log likelihood (Table 4). Furthermore, normal Q-Q plots (Figure 4) and predicted vs. output plots (Figure 5) show that all these combinations met the assumptions of normality and had good correlations between expected and observed values. However, the most suitable combination belonged to a model named model 1.2, which contained the variables "lesion_elongation", "lesion_minoraxislength", "lesion_meshvolume", "lesion_contrast", "e_hyp_busyness", "lesion_10percentile", "lesion_sphericity", "lesion_cluster", "e_hyp_sumsquares", "lesion_correlation", "late_stage_clustershade", "lesion_complexity", "late_stage_minoraxislength", "late_stage_dependencevariance", and "late_stage_smalldependencelowgray". Therefore, in the succeeding sections, these variables were further tested using (1) the original, square-root-transformed, and log(Y_i + 1)-transformed outcomes, and (2) all of these model outcomes were tested using another five-fold cross-validation method similar to the one applied to the linear mixed-effects model. The transformations were tested to see if they improved the fit of the variables to the model.
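The backward-elimination loop described above can be sketched with statsmodels. This is a simplified illustration, not the study's code: it uses a random intercept per subject only (the study's (eye|subject) structure would need an additional random-effects formula), and the helper name is ours:

```python
import statsmodels.formula.api as smf

def backward_eliminate(df, outcome, candidates, group="subject", alpha=0.05):
    """Fit a linear mixed-effects model with all candidate fixed effects,
    repeatedly drop the least significant term with p >= alpha, and refit
    until every remaining term is significant (or none remain)."""
    kept = list(candidates)
    while kept:
        formula = f"{outcome} ~ " + " + ".join(kept)
        fit = smf.mixedlm(formula, df, groups=df[group]).fit()
        pvals = fit.pvalues[kept]      # p-values of the candidate terms only
        worst = pvals.idxmax()
        if pvals[worst] < alpha:       # everything remaining is significant: stop
            break
        kept.remove(worst)             # drop the weakest term and refit
    return kept
```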

Model Selection
The models described above were used to test which of the top featured variables would be most suitable when put into a linear mixed-effects model. These top features were determined using the original scale (total area mm²/year) as the outcome of interest with the random effects of (eye|subject). Initially, assessment of the most suitable variables was performed using the entirety of the dataset. However, to minimise any overfitting, the features identified in Section 3.1 were then evaluated on the original scale using the cross-validation approach. The process was then repeated for the two additional outcomes of square-root transformation and log(Y_i + 1) transformation. The features identified in model 1.2 were tested against the new, transformed outcomes to see if they maintained statistical significance under transformation. Most features maintained their statistical significance in the square-root-transformed model (i.e., the model named 1.2sq), with only "late_stage_clustershade" and "lesion_complexity" losing statistical significance with the square-root-transformed outcome. However, with the log(Y_i + 1) transformation (i.e., the model named 1.2log), the variable list reduced somewhat, with only "lesion_minoraxislength", "lesion_meshvolume", "lesion_contrast", "e_hyp_busyness", "lesion_10percentile", "lesion_sphericity", "lesion_cluster", and "late_stage_smalldependencelowgray" maintaining their statistical significance at the threshold of p < 0.05. A complete list of the final variables used across the three different linear mixed-effects models can be found in Table 5. The models 1.2sq and 1.2log also underwent five-fold cross-validation for a complete comparison (Table 6).

At first glance, the performance of the square-root-transformation model quantitatively appears to supersede those of the original scaled and log-transformed models. Additionally, the model produces the tightest-fitting normal Q-Q plots relative to the other two models (Figure 7) and has a good residual plot (Figure 13). However, upon closer inspection of Table 6, it will be noticed that, save for the forecasting metrics of RMSE, ME, MAE, and MAD, all the other metrics have identical results across all five folds of cross-validation. This suggests that the square-root-transformed model is most likely an overfitted model [13]. In our earlier publication, we demonstrated the effect of the square-root transformation on GA growth areas and progression modelling in the absence of features: the transformation had a flattening effect and a tendency to "tighten" the data points. The overfitting may be the result of this effect, and thus a square-root-transformed model is most likely not suitable. Furthermore, we can see in the correlation plot of predicted and observed outputs (Figure 10) that there is a curvature at the lower tail end of the plot, suggesting that a perfectly linear relationship between predicted and output values is absent. Interestingly, while the marginal and conditional R², the correlation coefficient (r), and the AIC/BIC/log-likelihood metrics were consistently the same across all five folds, the varying metrics of RMSE, ME, MAE, and MAD were considerably larger relative to the same values of the original scaled model, 1.2. Thus, given the identical results produced across the five-fold cross-validation, along with the predicted vs. output plot not behaving in the expected manner, the square-root transformation was deemed inappropriate in this study.

For the log-transformed variables, the marginal and conditional R² values were as promising as for the square-root transformation. Additionally, the values did differ between cross-validations, suggesting consistency in outcomes without overfitting. The AIC/BIC and log-likelihood values were also very promising, as they were the smallest of all the models, which usually would suggest that the log-transformed model is the most promising. The Q-Q plot (Figure 8) and the residual plot (Figure 14) also suggested that the residuals were normally distributed. However, the forecasting errors for the log transformation were the largest of the three modelling types. Furthermore, the correlation plot for the log-transformed model (Figure 11) produced patterns between predicted and observed values very similar to those of the square-root-transformed model.
As a result, the original scaled model, 1.2, was deemed to be the most appropriate, represented in Equation (10), where β_1, β_2, . . ., β_15 are the fixed coefficients for the variables X_1, X_2, . . ., X_15 highlighted in model 1.2 and Z_1γ_1 + Z_2γ_2 represents the random effects:

Y = β_0 + β_1X_1 + β_2X_2 + . . . + β_15X_15 + Z_1γ_1 + Z_2γ_2 + ε    (10)

Generally, the quantitative values were very good, with high values for the conditional and marginal R² as well as the correlation coefficient, and the forecasting errors were smallest for the original scaled model as compared to the transformed models. The average marginal R² for the original scaled model was 0.84 ± 0.01; the average conditional R² was 0.95 ± 0.002; the average r was 0.97 ± 0.006; the RMSE was 1.44 ± 0.06; the ME was 0.09 ± 0.15; the MAE was 1.04 ± 0.06; the MAD was 3.52 ± 0.67; the AIC was 1718.41 ± 11.08; the BIC was 1798.99 ± 11.20; and the log likelihood was −839.21 ± 5.54. While the AIC/BIC and log likelihood were highest for the original scaled model, this was overshadowed by the consistency of the visualisation results. The normal Q-Q plot (Figure 6) may not have been as tight fitting as those for the square-root- and log-transformed models, but it was still in line with the assumptions of normality, and, coupled with the residual plots (Figure 12), this further validated that the assumptions of normality were met. Furthermore, the original scaled model was the only one that produced visually correct correlation plots (Figure 9), with the predicted and observed values having a strong linear relationship.
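The marginal and conditional R² reported for the model are, for mixed-effects models, commonly computed following Nakagawa and Schielzeth; the paper does not state its exact procedure, so the sketch below is illustrative, taking the variance of the fixed-effect predictions as the fixed-effect variance component:

```python
import numpy as np

def nakagawa_r2(fixed_pred, random_var, resid_var):
    """Marginal R^2 = fixed-effect variance over total variance;
    conditional R^2 additionally credits the random-effect variance.

    fixed_pred: fitted values from the fixed effects only;
    random_var: summed variance of the random-effect components;
    resid_var:  residual variance."""
    var_f = np.var(np.asarray(fixed_pred, dtype=float))
    total = var_f + random_var + resid_var
    return var_f / total, (var_f + random_var) / total
```

A large gap between conditional and marginal R² (here, averages of 0.95 vs. 0.84 across the folds) indicates that the (eye|subject) random effects absorb a substantial share of the variance.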

Overall Findings
This paper describes the evaluation of the linear mixed-effects model using FAF-based image features as a prediction model for GA growth. The choice of the linear mixed-effects model stemmed from several factors. Firstly, in our previous studies, it was confirmed that the linear model, representing the high-growth part of the progression of GA, offers the best description of trends relative to other regression models, such as the quadratic model without the linear term and the logarithmic model. Furthermore, the mixed-effects model is a hierarchical model that is suitable for time-series data, such as the measurements used for GA progression analysis.
The mixed-effects model can also account for changes in slopes (i.e., variability in patient progression) by incorporating random effects. In the models presented, subject and eye were considered to be the most appropriate random effects to account for changes relative to each individual while also accounting for inter-eye correlation. The sigmoidal function was hypothesised and investigated in [13] with the limited data available and remains a subject for future research. In this study, the linear mixed-effects model was used to describe the peak growth portion (the middle part) of the sigmoidal function, because currently available clinical datasets are best described by the linear part of the growth curve. However, future extension of this research will require new data (i.e., nascent GA estimates to explain the lower tail end and additional follow-up to explain the upper tail end). The sigmoidal hypothesis was tested using several case studies with the most data points; with an expanded real-time dataset, future explorations could include the sigmoidal mixed-effects model.
The features used for the linear mixed-effects model were image features extracted from FAF images using radiomics and the binarised masks created. The binarised masks included lesions, as well as the various stages of hyperfluorescence development, including early-stage, intermediate-stage, and late-stage hyperfluorescence [27]. A total of 107 radiomic features were extracted per mask, which meant a total of 428 radiomic feature extractions. These were coupled with cluster-based categories determined in our prior work on cluster analysis [27]. The variable "lesion_cluster" had three categories: preliminary lesion formation, intermediary lesion formation, and complete lesion formation. The variable "early_cluster" also had three categories: early-stage hyperfluorescence complete coverage, early-stage hyperfluorescence partial coverage, and early-stage hyperfluorescence proximal coverage. The variable "late_cluster" had its own three categories: late-stage hyperfluorescence droplet scatter, late-stage hyperfluorescence halo scatter, and late-stage hyperfluorescence halo. Intermediate-stage clustering groups were not included, as the cluster groups were not distinguishable. With these three additional variables, the total number of assessable features was 431. First, XGBoost was utilised to identify the most highly ranked features. These features were then evaluated in the linear mixed-effects model (with the outcome as total area mm²/year) in a variety of combinations. The combination of variables that fared well against the multitude of metrics was "lesion_elongation", "lesion_minoraxislength", "lesion_meshvolume", "lesion_contrast", "e_hyp_busyness", "lesion_10percentile", "lesion_sphericity", "lesion_cluster", "e_hyp_sumsquares", "lesion_correlation", "late_stage_clustershade", "lesion_complexity", "late_stage_minoraxislength", "late_stage_dependencevariance", and "late_stage_smalldependencelowgray".
Here, a quick explanation of the aforementioned variables is presented. Lesion elongation refers to the largest and second-largest principal component axes of lesions. Lesion minor axis length is the second-largest axis length of lesions. Lesion mesh volume refers to the total volume of lesions calculated based on a triangle mesh. Lesion contrast refers to the local intensity variation of lesions. Early-stage hyperfluorescence busyness measures the degree of change between early hyperfluorescence pixels and neighbouring pixels. Lesion 10th percentile refers to the 10th percentile of voxels in lesions. Lesion sphericity refers to the measure of roundness of lesions. Lesion cluster is the lesion clustering category [27]. Early-stage hyperfluorescence sum of squares refers to the intensity of neighbouring pixels around early hyperfluorescence. Lesion correlation measures the coarseness of lesion textures. Late-stage cluster shade measures uniformity or skewness in gray-level occurrences in late hyperfluorescence. Lesion complexity determines whether there are any drastic changes in gray-level intensities within lesions. Late-stage hyperfluorescence minor axis length is the second-largest axis length of late hyperfluorescence. Late-stage hyperfluorescence dependence variance is the measure of variances within late hyperfluorescence regions. Finally, the late-stage hyperfluorescence small dependence low gray-level emphasis variable measures lower gray-level values [26].
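To make one of these descriptors concrete, lesion elongation relates the two largest principal component axes of the lesion. The sketch below is an illustrative 2D computation from a binary mask (not necessarily the implementation used in the study): it takes the eigenvalues of the lesion-pixel coordinate covariance and returns sqrt(λ_minor / λ_major), so a value of 1 indicates a round lesion and values toward 0 an increasingly elongated one:

```python
import numpy as np

def elongation(mask):
    """Illustrative elongation-style descriptor for a 2D binary lesion mask:
    the square root of the ratio of the smaller to the larger eigenvalue of
    the lesion-pixel coordinate covariance matrix."""
    coords = np.argwhere(mask)                  # (row, col) of lesion pixels
    cov = np.cov(coords, rowvar=False)          # 2x2 coordinate covariance
    lam = np.sort(np.linalg.eigvalsh(cov))      # eigenvalues, ascending
    return float(np.sqrt(lam[0] / lam[1]))
```

A square mask yields a value of 1.0, while a thin strip yields a value close to 0, matching the intuition behind the ranking of "lesion_elongation".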
The majority of features found suitable for GA progression modelling were lesion-based parameters. Perhaps this is not unusual given that the model presented focuses on the linear portion of the sigmoidal function rather than the whole sigmoidal function itself: with the disease at its peak progression stage, lesion-based parameters would intuitively be the strongest predictors of the direction and progression of the disease.

Testing Models with Different Transformations
The variables were tested in a five-fold cross-validation approach in three different types of linear mixed-effects models: (1) the original scaled total GA area of mm²/year, (2) the square-root-transformed area of mm/year, and (3) the log(Y_i + 1)-transformed area of enlargement rate/year. Using a combination of quantitative and visualisation techniques, the original scaled linear mixed-effects model proved to be the most suitable for the current data.
In this work, we investigated the popular square-root transformation (1) when assessing the most suitable growth rate in the absence of features and (2) when assessing the growth rate with image features. In [13], it was confirmed that the square-root transformation did not have an impact on the assumptions of normality, as the linear model with and without the square-root transformation had residuals that were normally distributed. In this study, the normal Q-Q plot for the square-root transformation at first appeared to be a tighter and better fit than that of the original scale. The residual plots also showed a relatively uniform distribution of residuals, which is indicative of a good fit. However, visual assessments aside, the square-root-transformation model's quantitative metrics were questionable. Despite running the model using the five-fold cross-validation process, most of the model diagnostic metrics remained identical, including the marginal and conditional R², the Pearson's correlation coefficient r, the AIC/BIC, and the log likelihood. This suggests that the square-root transformation might actually create an overfitted model that is rigid and inflexible in its application outside of the currently available dataset. Furthermore, the square-root transformation produced poorer outcomes in terms of forecasting errors, having much greater values for RMSE, ME, MAE, and MAD as compared to the original scale. The analysis of the square-root transformation is a good illustration of why a combination of techniques needs to be utilised when assessing any model; a greater range of assessment methods improves confidence in the choice of the model most suitable for the available data.
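The overfitting diagnostic used here, identical goodness-of-fit values across all five folds, can be expressed as a simple check (an illustrative helper of our own, not part of the study's pipeline):

```python
import numpy as np

def suspiciously_identical(fold_metrics, tol=1e-9):
    """Return True when a metric is numerically identical across folds;
    a genuinely re-estimated model should show at least small fold-to-fold
    variation, so an all-identical metric is a red flag for overfitting."""
    return bool(np.ptp(np.asarray(fold_metrics, dtype=float)) < tol)  # ptp = max - min
```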
This study also investigated the log(Y_i + 1) transformation, an alternative that appears not to have been explored previously in the literature. The use of the log transformation with an additive constant is more suitable in GA progression analysis because of the small sizes of some lesions, which, if subjected to the traditional log transformation, would be converted to negative values. The inclusion of the constant "1" prevents this, ensuring that all transformed values remain positive.
The log(Y_i + 1) transformation at first appeared to be a promising alternative to the more popular square-root transformation. As with the square-root model, it produced normal Q-Q plots with a tight fit around the 45-degree line, validating that the residuals were normally distributed, and it had an evenly distributed residual plot, further validating that the assumptions of normality for the model had been met. Advantageously, and unlike the square-root transformation, the log(Y_i + 1) transformation did not produce identical model fit values across its five-fold cross-validations; there were slight differences between each iteration, but these were consistent enough for confidence in the results. Furthermore, amongst all the models, the log(Y_i + 1) model had the smallest AIC, BIC, and log-likelihood values. If these values were used in isolation from other metrics of interest, the log(Y_i + 1)-transformed model would have been considered the top model in this study. However, the log(Y_i + 1) transformation had two primary downfalls: (1) it had a predicted vs. observed plot nearly identical to that of the square-root transformation, which did not support confidence in its predictive abilities, and (2) it had the largest forecasting errors of the three modelling types. Thus, while the log(Y_i + 1) transformation may be promising and could be explored with models that include additional features beyond FAF image features, in the context of this study it was not appropriate.
If the log(Y_i + 1) transformation were to produce improved results in future studies, there would be the question of how to interpret its meaning. In this study, the log(Y_i + 1)-transformed outcome was referenced as the enlargement rate per year; however, this may still create confusion in a clinical setting. A way to gain clarity in future studies would be to (1) create a model with the log(Y_i + 1) transformation and then (2), once a prediction had been made by the model, reverse-transform the prediction using the formula exp(Y_i) − 1. The final output would then be the familiar total area in mm²/year, and the transformation could be used safely to develop a highly predictive model. If, in the current study, the log(Y_i + 1) model had been the most capable of the three modelling types, this approach of reverse transformation would have been used to predict GA as total area mm²/year.
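The forward and reverse transformations described here can be written out directly; a minimal sketch (helper names are ours):

```python
import numpy as np

def to_log_scale(area):
    """Forward transform log(Y + 1): the +1 keeps small lesion areas
    (Y < 1 mm^2/year) from mapping to negative transformed values."""
    return np.log1p(np.asarray(area, dtype=float))

def from_log_scale(pred):
    """Reverse transform exp(Y) - 1, returning predictions to mm^2/year."""
    return np.expm1(np.asarray(pred, dtype=float))
```

Predictions made on the transformed scale can thus be reported back in the familiar mm²/year units.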

The Most Suitable Model
The original scaled linear mixed-effects model was the most suitable model in this study. It was the model that ticked all the boxes, including having good quantitative model fit metrics, small forecasting errors, and meeting the assumptions of normality. Although the residual plots for the original scaled model were near perfect, some minor clustering/clumping to the left-hand side of the plots was notable, which might suggest that further modifications are required to improve the performance of the model. Transformations are typically one approach to improving the spread of residuals; however, this study confirmed that transformations did not yield the benefits that would normally be expected. Another possible consideration for model improvement could be the augmentation of the currently available data. For example, in this case, we were limited to the use of predominantly FAF image features, but GA is a complex disease, and image features alone do not cover every facet of this progressive ocular condition. Thus, the current model and its residual plots could be improved with the addition of new image or clinical variables. Alternatively, there may be unexplored interaction terms that could also help improve the model, although it should be noted that interaction terms can introduce complexities and are not always readily interpretable. Essentially, a small clumping of residuals suggests that something is missing and needs to be included in the model for completeness, and, given the current single-modality nature of the available dataset, some necessary improvements are to be expected.
The cluster variables "early_cluster", "late_cluster", and "lesion_cluster" were explored in the analyses, even though they did not appear among the top highly ranked features when using XGBoost. Their inclusion was still assessed to see whether these newly defined cluster variables could be used to predict GA progression. Given that the variables did not appear in the top-ranked features, they were manually and individually assessed in a traditional univariate mixed-effects model fashion to see if, in isolation from other features, there was any statistical significance for any of the cluster variables.
The early-stage and late-stage hyperfluorescence and lesion cluster variables all showed statistical significance in a univariate mixed-effects model analysis. However, when placed in a model with other covariates, only lesion clustering maintained its significance. This could suggest that the hyperfluorescence clusters can, in part, explain GA manifestation and progression, but that these clusters may be more suitable for explaining earlier stages of GA. Here, the lesion cluster variable was included in the model, suggesting that lesion clustering is the most useful of the clustering variables for explaining GA progression once GA lesions have appeared. It is hypothesised that the early-stage and late-stage hyperfluorescence clustering variables could instead be used to model when a GA lesion will appear, and thus these clusters may be more useful for the lower tail end of the sigmoidal curve or nascent GA stages. The clustering variables could therefore be reconsidered when we assess the sigmoidal mixed-effects model in future extensions of the work reported in this study.

Comparisons with the Literature
The closest comparable model is one based on the paper by Pfau et al. (2019) [5]. In their paper, they assessed shape descriptors in 296 eyes from 201 patients using FAF and IR images and a linear mixed-effects model. These shape descriptors included lesion area, perimeter, and circularity, all of which were normalised using the square-root transformation. Pfau and colleagues used the coefficient of determination, R², as their outcome measure, and also used Pearson's r correlation coefficient. They assessed the model in two clinically relevant scenarios: (1) where the patient was previously unknown and (2) where the prediction was made using previous observations from the patient. They also conducted their analyses by stratifying lesion sizes into quartiles. In their first scenario, they found that the square root of the lesion area, the square root of the perimeter, the focality index, the hyperfluorescence phenotype, the square root of the circularity, and the minimum and maximum Feret diameters created a model with an R² value of 0.244. For their Scenario 2 model, which included the same variables but with prior knowledge of the patient, the R² value increased to 0.391. Künzel et al. (2020) focused on the association of vision-related quality of life (VRQoL) with visual function and structural biomarkers in GA in 87 patients with FAF, IR, and SD-OCT imaging [10]. Similar to Pfau et al. (2019), they too square-root-transformed their GA size. They fitted a linear regression model to the complete dataset at baseline using LASSO regression and compared it to the linear mixed-effects model, in which the patient was the random effect. They produced an R² of 0.32.
In comparison, our linear mixed-effects model produced the following results across the five-fold cross-validations: the average marginal R² for the original scaled model was 0.84 ± 0.01; the average conditional R² was 0.95 ± 0.002; the average r was 0.97 ± 0.006; the RMSE was 1.44 ± 0.06; the ME was 0.09 ± 0.15; the MAE was 1.04 ± 0.06; the MAD was 3.52 ± 0.67; the AIC was 1718.41 ± 11.08; the BIC was 1798.99 ± 11.20; and the log likelihood was −839.21 ± 5.54. The metrics used in this study were more comprehensive in scope compared with previously published papers. The transformations were also more extensive than in other studies, with the focus not solely on the square-root transformation: the log transformation was considered in depth, and it was formulated with small GA areas in mind, which was not an objective in the previous studies surveyed by the authors. Both the marginal and conditional R² values in this study are higher than those presented in the literature for similar linear mixed-effects models. Furthermore, the model presented here, in its original and untransformed scale, produced smaller forecasting errors than the transformed scales. Therefore, while recommendations are made below to improve the dynamics and coverage of the linear mixed-effects model presented here, it has still proven to be a much more accurate model when compared with others in the literature.

Limitations and Future Work
In this paper, the linear mixed-effects model was tested extensively using radiomics for feature extraction. The information elicited from the data could also be used to eventually develop a sound and consistent biophysical model. In [12], we briefly touched on biophysical models and the advantage of using such models in medicine, and this study demonstrated the value of considering biophysical models alongside statistical models. We showed how purely statistical models can introduce uncertainty in performance, with variations in the covariates and parameters used. We have also seen how overfitting can easily occur, even with models that initially seem to tick all the boxes for model diagnostics. These variabilities and uncertainties are due to the reliance of statistical and ML models on sample data for training, rather than on the physical properties of the disease itself and their relationships. The data-dependent parameters for training a statistical prediction model will change when the training dataset changes, and we assume that the training sample is representative of the test dataset. The cross-validation process is designed with the aim of minimising overfitting and lowering uncertainty.
Problems relating to training and testing are generally not issues for biophysical models (apart from model calibration, in some cases, if required), but biophysical models also have limitations. One disadvantage is that their development can take a considerable amount of time, based on observation and hypothesis testing, whereas more timely predictions are possible with standard statistical models. A correctly formulated biophysical model is based on fundamental theories and assumptions and can be applied to any dataset, large or small, without requiring training data each time. Most biophysical models have only a small number of covariates, as the derivation of non-linear models is not a trivial exercise. Statistical models, in contrast, may have a very large number of covariates and parameters that are beyond the scope of many biophysical models. While the derivation of biophysical models was not within the scope of this study, future work could involve the development of biophysical GA progression models to complement or enhance machine learning models.
One of the first objectives for future work is to address the extraction of new image features. The features in this study were extracted using XGBoost against the original scaled total area (in units of mm 2 /year). While these features were modified accordingly, based on model diagnostics, when used in the square-root- and log(Y i + 1)-transformed equivalent models, they were not specifically extracted with these outcomes in mind; other features outside the top-ranked set might therefore have been more relevant to the transformed outcomes. However, this limitation is a further demonstration of how future models could be augmented by a greater contribution from biophysical modelling, which also helps with the "explainability" of model outputs. The authors have found that the use of convolution filtering in machine learning has, in the past, enabled more physically plausible feature extraction based on local shape or texture in an image, and this would complement time-based mixed-effects models based on first- and second-order statistics [28,52].
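The convolution-filtering idea mentioned above can be sketched as follows: a Laplacian filter highlights local texture, and first-order statistics of the filter response are then computed within a binary lesion mask. This is an illustrative toy in plain numpy, not the pipeline used in this study:

```python
import numpy as np

# 3x3 Laplacian kernel: responds to local intensity changes (texture/edges).
LAPLACIAN = np.array([[0.0,  1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0,  1.0, 0.0]])

def conv2d(img, kernel):
    """Minimal 'valid' 2-D convolution (no padding) for the sketch."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def first_order_stats(img, mask):
    """First-order statistics of the filter response within a region mask."""
    resp = conv2d(img, LAPLACIAN)
    vals = resp[mask[1:-1, 1:-1]]   # crop mask to the 'valid' output grid
    return {"mean": float(vals.mean()),
            "var": float(vals.var()),
            "energy": float(np.sum(vals ** 2))}
```

Because the Laplacian kernel is symmetric, convolution and correlation coincide here; a flat region yields a zero response, so the statistics isolate genuine local texture.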
There are also other complexities within statistical datasets that are often ignored. An interesting one is the suppression effect. Using the definition by Conger (1974) [53], a suppressor variable is one which increases the validity of another variable simply by its inclusion in a multivariate model. This effect may be due to dataset complexities, such as a small and unbalanced sample size, or the presence of interactions which have not yet been explored. This was briefly seen in the analyses here when the cluster variables, in isolation, showed statistical significance but lost or changed their significance in the presence of other variables.
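The suppression effect is easy to demonstrate on synthetic data. In the hypothetical construction below, x2 is essentially uncorrelated with the outcome y, yet adding it to the model sharply improves the fit, because x2 soaks up the measurement noise contaminating x1:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
t  = rng.normal(size=n)            # true signal driving the outcome
e1 = rng.normal(size=n)            # measurement noise in the predictor
y  = t + 0.3 * rng.normal(size=n)  # outcome depends only on t
x1 = t + e1                        # noisy observed predictor
x2 = e1                            # suppressor: ~uncorrelated with y

def r2(cols, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_alone = r2([x1], y)        # x1 by itself
r2_both  = r2([x1, x2], y)    # adding the suppressor boosts fit markedly
corr_x2y = np.corrcoef(x2, y)[0, 1]   # yet x2 barely correlates with y
```

Here y = (x1 − x2) + noise exactly, so the two-variable model can strip the noise out of x1 even though x2 carries almost no marginal information about y — the same mechanism by which the cluster variables changed significance in combination.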
A strength of this study was the identification of potentially invariant image features associated with GA progression, which allowed the prediction model to be decoupled from dependence on new statistical samples for training data. For example, the use of an image feature such as lesion elongation as a metric for GA progression is an attribute associated with a physical model rather than a statistical model. The mixed-effects model also introduced time dependence, which is not often associated with machine learning analysis. In some future situations, it may be valuable to incorporate clinical data explicitly into the model for a more comprehensive analysis. Clinical data could be used in combination with image-based features to improve our understanding of GA progression. Because potential confounders may not be readily identified by imaging alone, the inclusion of clinical variables (relating to age, gender, or comorbidities) may reveal valuable additional information.

Conclusions
Radiomic features were extracted from FAF images describing the shape, size, and intensities of GA features, including lesions and early-stage, intermediate-stage, and late-stage hyperfluorescence. These features were then ranked using XGBoost, a machine learning algorithm, and the top 30 most important features were selected for the modelling process. There were a number of reasons for choosing XGBoost over more traditional statistical techniques for ranking variables. Firstly, the parallel computing process of XGBoost means that many analyses are performed simultaneously, which increases computational speed and efficiency. Secondly, XGBoost can take weak learners and convert them into strong learners, thus producing more accurate results. Thirdly, as XGBoost is a gradient-boosted decision tree algorithm, it sifts through and "splits" the data and automatically removes redundant variables (i.e., those which are highly correlated), thus preventing the need to check for multicollinearity at later stages of analysis. These properties are also reasons why XGBoost has become a popular ML algorithm.
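The ranking step can be sketched as follows. Since XGBoost itself may not be installed everywhere, scikit-learn's GradientBoostingRegressor stands in for it below; the principle (gain-based importances from a boosted tree ensemble) is the same. The feature names and the simulated data are hypothetical stand-ins for the radiomic features, with one feature made deliberately informative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 400
# Toy stand-ins for radiomic features; only the first two drive the outcome.
feature_names = ["lesion_elongation", "e_hyp_energy", "int_hyp_variance",
                 "late_hyp_entropy", "lesion_area"]
X = rng.normal(size=(n, len(feature_names)))
y = 3.0 * X[:, 0] + 0.5 * X[:, 4] + rng.normal(scale=0.5, size=n)

# Boosted trees converted from weak to strong learners; importances reflect
# how much each feature's splits reduce the loss across the ensemble.
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, gbm.feature_importances_),
                key=lambda p: p[1], reverse=True)
```

With real data, the equivalent XGBoost call (`xgboost.XGBRegressor(...).fit(X, y)` followed by `feature_importances_`) produces the same kind of ranking, from which the top 30 features can be taken.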
Following the ranking of the image features, the features were tested in three different linear mixed-effects models: (1) the original scaled outcome model, (2) the square-root-transformed outcome model, and (3) the log-transformed [i.e., log(Y i + 1)] outcome model. While, in [13], we showed that the popular square-root transformation had residual outcomes similar to those of the original scaled linear model, the transformation was tested here again to determine whether it improved the model fitting in the presence of image features. The square-root transformation at first appeared to be a better fit when combined with the image features. However, with large forecasting errors, discrepancies between predicted and observed values, and model fit values that did not change at all across the iterations of the cross-validation, it was concluded that the square-root transformation was not suitable and was most likely producing an overfitted model.
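A minimal sketch of fitting such a linear mixed-effects model with statsmodels, using simulated longitudinal data with random intercepts and slopes per subject (the variable names, effect sizes, and simulation are illustrative, not the study's data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_subj, n_visits = 30, 5
subj = np.repeat(np.arange(n_subj), n_visits)
time = np.tile(np.arange(n_visits, dtype=float), n_subj)

# Subject-specific deviations: random intercept b0 and random slope b1,
# mirroring within-subject variability in baseline area and growth rate.
b0 = rng.normal(0.0, 1.0, n_subj)[subj]
b1 = rng.normal(0.0, 0.2, n_subj)[subj]
area = 2.0 + b0 + (0.8 + b1) * time + rng.normal(0.0, 0.3, n_subj * n_visits)
df = pd.DataFrame({"area": area, "time": time, "subject": subj})

# Fixed effect of time plus per-subject random intercept and slope.
fit = smf.mixedlm("area ~ time", df, groups=df["subject"],
                  re_formula="~time").fit()
slope = fit.fe_params["time"]   # population-average progression rate
```

The transformed variants simply replace the outcome column (e.g., `np.sqrt(df["area"])` or `np.log1p(df["area"])`) before fitting; the random-effects structure stays the same.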
The log transformation appeared to be a positive alternative at first. It was stipulated that the lack of success or minimal use of this transformation in the past could have been because the log(Y i ) transformation was being used instead of the log transformation with the additive constant of "1". The former can produce negative values for small GA areas; with the inclusion of 1, this problem is avoided. Unlike its square-root-transformed counterpart, values differed slightly between iterations of the cross-validation, and model fit values, such as the AIC, BIC, and log likelihood, were promising. However, the log transformation also suffered from forecasting errors and had larger variability in its predictions compared with the original scaled linear mixed-effects model. Nonetheless, the new log(Y i + 1) transformation should be considered in future studies and compared with the more commonly used square-root transformation. The main disadvantage of this new log transformation is its interpretability. Previous publications have used the enlargement rate per year for the standard log transformation, and this publication followed suit. However, with the new log transformation, results can be made more interpretable by applying a reverse transformation after the model prediction: one first fits and runs the model with the log(Y i + 1) transformation, and once a prediction has been produced, applies the reverse transform exp(Y i ) − 1.
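The log(Y i + 1) transformation and its reverse transform exp(Y i ) − 1 map directly onto numpy's log1p/expm1 pair, which also avoids floating-point precision loss for small areas:

```python
import numpy as np

# GA areas in mm^2, including small lesions where log(Y) would be negative.
area = np.array([0.0, 0.4, 2.5, 11.8])

y_log = np.log1p(area)   # log(Y + 1): defined and non-negative for all Y >= 0
back  = np.expm1(y_log)  # reverse transform exp(Y) - 1 for interpretability

assert np.allclose(back, area)  # the round trip recovers the original scale
```

Predictions from the transformed model are therefore reported on the original mm 2 scale simply by passing them through `np.expm1` after fitting.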

Figure 1.
Figure 1. Process of radiomic feature extraction. The original image was coupled with the binarised output. This mask indicated the regions within the original image from which features should be extracted. In the above example, features were being extracted for the late-stage hyperfluorescence regions of GA. GA: Geographic atrophy.

R 2 M = σ 2 f / (σ 2 f + Σ u l=1 σ 2 l + σ 2 e + σ 2 d) and R 2 C = (σ 2 f + Σ u l=1 σ 2 l) / (σ 2 f + Σ u l=1 σ 2 l + σ 2 e + σ 2 d), where σ 2 f = the variance calculated from the fixed-effects component; u = the number of random factors in the model; σ 2 l = the variance component of the lth random factor; and σ 2 e + σ 2 d = the sum of the additive dispersion component and the distribution-specific variance.
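These definitions correspond to the standard variance-components form of the marginal and conditional R 2 , which can be evaluated directly once the components are known:

```python
def nakagawa_r2(var_fixed, var_random, var_resid):
    """Marginal and conditional R^2 from variance components:
    var_fixed  = sigma_f^2 (fixed-effects variance),
    var_random = list of the u random-factor variances sigma_l^2,
    var_resid  = sigma_e^2 + sigma_d^2 (dispersion + distribution-specific).
    """
    total = var_fixed + sum(var_random) + var_resid
    r2_marginal = var_fixed / total
    r2_conditional = (var_fixed + sum(var_random)) / total
    return r2_marginal, r2_conditional
```

The marginal R 2 credits only the fixed effects, while the conditional R 2 also credits the random factors, which is why R 2 C ≥ R 2 M throughout the results.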

BioMedInformatics 2024, 4
Figure 2.
Figure 2. Five-fold cross-validation (A-E) of top 30 ranked fundus autofluorescence radiomic image features. Most top-ranked features were consistently found across all five cross-validation sets. However, there were slight discrepancies, with some features appearing once or twice in the five-fold tests. For completeness, all variables which appeared across all cross-validations were included in the model assessment process.


Figure 3.
Figure 3. Close-up of the cross-validation 1 feature ranking process. Features beginning with "lesion_" represent lesion-related features, while "e_hyp_" refers to early-stage hyperfluorescence, "int_hyp_" to intermediate-stage hyperfluorescence, and "late_hyp_" to late-stage hyperfluorescence features. The feature "lesion_elongation" was consistently the top-ranked feature across all five cross-validations.

Figure 4.
Figure 4. Normal Q-Q plots for linear mixed-effects models used to test various combinations of features of interest. All plots have very similar outputs and generally follow the line well, except at the tail ends. These results supported the assumption of normality, irrespective of the combination of variables.

Figure 5.
Figure 5. Observed vs. predicted plots for linear mixed-effects models used to test various combinations of features of interest. All plots have very similar outputs and show high correlations between the outputs and predictions. This combination of variables had one of the highest conditional and marginal R 2 values (R 2 M = 0.83, R 2 C = 0.96), a high correlation coefficient (r = 0.981; p < 0.001), some of the smallest forecasting errors (RMSE = 1.32, ME = −7.3 × 10 −15 , MAE = 0.94, and MAD = 0.999), and the lowest criterion values (AIC = 2084.93, BIC = 2169.97, and log likelihood = −1022.46). Therefore, in the succeeding sections, these variables were further tested using (1) the original, square-root-transformed, and log(Y i + 1)-transformed outcomes, and (2) another five-fold cross-validation method similar to the one applied to the linear mixed-effects model. The transformations were tested to see if they improved the fit of the variables to the model.

Figure 6.
Figure 6. Normal Q-Q plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the original scale of mm 2 /year. The plots show a general adherence to the assumption of normality.

Figure 7.
Figure 7. Normal Q-Q plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the square-root scale of mm/year. The plot shows a much tighter fit of residuals around the 45-degree line.

Figure 8.
Figure 8. Normal Q-Q plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the log(Y i + 1) scale of enlargement rate/year. Generally, there was a good adherence to assumptions of normality.

Figure 9.
Figure 9. Predicted vs. observed plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the original scale of mm 2 /year. There were minimal differences between the predicted and observed values, with a strong linear relationship.

Figure 10.
Figure 10. Predicted vs. observed plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the square-root scale of mm/year. There is a curvature at the lower tail end, and the relationship is not as strong as that of the original scaled model.

Figure 11.
Figure 11. Predicted vs. observed plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the log(Y i + 1) scale of enlargement rate/year. There is a curvature at the lower tail end, and the relationship is not as strong as that of the original scaled model.

Figure 12.
Figure 12. Residual plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the original scale of mm 2 /year. Generally, there is an even distribution of residuals, although a slight cluster appears on the left-hand side.

Figure 13.
Figure 13. Residual plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the square-root scale of mm/year. Generally, there is an even distribution of residuals, and the slight cluster on the left-hand side is spread out.

Figure 14.
Figure 14. Residual plots for five-fold cross-validation (A-E) of the linear mixed-effects model with the log(Y i + 1) scale of enlargement rate/year. Generally, there is an even distribution of residuals, and the slight cluster on the left-hand side is spread out.

Table 1 .
Radiomic and clustering features extracted from FAF images. More information regarding each radiomic feature can be found at https://pyradiomics.readthedocs.io/en/latest/features.html (accessed on 12 August 2021). Clustering features were the result of our previous work [27]. Gray-Level Non-Uniformity, Dependence Entropy, Small Dependence Low Gray-Level Emphasis, Gray-Level Variance, Dependence Non-Uniformity Normalised, Large Dependence High Gray-Level Emphasis, Large Dependence Emphasis, Large Dependence Low Gray-Level Emphasis, Small Dependence High Gray-Level Emphasis, Dependence Variance, High Gray-Level Emphasis, Dependence Non-Uniformity, Low Gray-Level Emphasis, Small Dependence Emphasis

Table 3 .
Top combinations of features tested for eight linear mixed-effects models for the original scale (GA total area, mm 2 /year).

Table 4 .
Results from the eight linear mixed-effects models from Table 3, ranked in order of most suitable to least suitable. RMSE: Root mean square error; ME: Mean error; MAE: Mean absolute error; MAD: Mean absolute deviation; AIC: Akaike Information Criterion; BIC: Bayesian Information Criterion; logLik: Log likelihood; DF: Degrees of freedom.

Table 5 .
The different features used in the original scale (model 1.2), along with its square-root- (model 1.2sq) and log(Y i + 1)- (model 1.2log) transformed forms.
R 2 M : Marginal R 2 ; R 2 C : Conditional R 2 ; r: Pearson's correlation coefficient with the accompanying p-value; RMSE: Root mean square error; ME: Mean error; MAE: Mean absolute error; MAD: Mean absolute deviation; AIC: Akaike Information Criterion; BIC: Bayesian Information Criterion; logLik: Log likelihood; DF: Degrees of freedom.