Small increases in agent-based model complexity can result in large increases in required calibration data

Agent-based models (ABMs) are widely used to model coupled natural-human systems. Descriptive models require careful calibration with observed data. However, ABMs are often not calibrated in a statistical sense. Here we examine the impact of data record structure on the calibration of an ABM for housing abandonment in the presence of flood risk. Using a perfect model experiment, we examine the impact of data record structures on (i) model calibration and (ii) the ability to distinguish a model with inter-agent interactions from one without. We show how limited data sets may not constrain a model with just four parameters. This indicates that many ABMs may require informative prior distributions to be descriptive. We also illustrate how spatially-aggregated data can be insufficient to identify the correct model structure. This emphasizes the need for utilizing independent lines of evidence to select sound and informative priors.

Models can be designed to address different questions about the modeled system, including prediction, explanation, and demonstration [Epstein, 2008]. Marks [2011] proposes a classification of simulation models as demonstrative or descriptive based on the model's purpose. Demonstrative ABMs are used to illustrate that patterns of interest can be produced through local-level rules and interactions. Descriptive ABMs are intended to reproduce observed phenomena for the purpose of explanation, prediction, or both. The descriptive model category includes both simpler, "strategic" models and more complex, "tactical" models [Holling, 1966]. Early ABMs, such as the pioneering work on segregation [Schelling, 1971], were primarily demonstrative [Janssen and Ostrom, 2006, Marks, 2011]. Over time, there has been an increase in descriptive models [Janssen and Ostrom, 2006], the most famous of which is arguably the Artificial Anasazi Model [Dean et al., 2000].
Both demonstrative and descriptive models require tests to ensure that the model works as intended. Descriptive models also benefit from a comparison of model output against observations [Marks, 2011]. Simulations of hindcasts demonstrate that the model reproduces historical data, though this is not the same thing as demonstrating that the model is a reproduction of system dynamics [Oreskes et al., 1994], as all models are approximations of real processes [Box, 1976].
This observation that models are only capable of approximating, rather than reproducing, real system dynamics shows the importance of model calibration, the process of selecting model structures and parameter values, for descriptive modeling. One common approach to calibration is to tune model parameters until model outputs are close to the empirical data [Kelley and Evans, 2011, Schwarz and Ernst, 2009, van der Vaart et al., 2015], but these procedures can lead to overfitting the model to the calibration data due to neglecting the conditional and stochastic aspects of data-generation and observation [Brown et al., 2005].
ABM calibration can be complicated by path-dependence and nonlinearities resulting from feedbacks, which increase the conditional nature of observations. To avoid overfitting and account for the stochastic elements of a model, another approach is to choose a model structure and parameter values which are most probable given the observations and prior information about system dynamics [Jaynes, 2003, Robert, 2007]. Another complication of ABM development is the risk of over-parameterization. Over-parameterized models, which include variables and dynamics beyond what is supported by the evidence, may overfit the observations and generalize less well for the purposes of prediction, while adding minimal theoretical benefit. Whether descriptive ABMs are intended to be used for explanation or prediction, these features suggest a need to quantify model and parametric uncertainty, as observed patterns may be contingent on stochastic forcings or particular initial conditions. In particular, probabilistic calibration takes into account the stochastic nature of the data by modeling the likelihood of the observations given the external forcings and internal interactions and feedbacks.
This discussion leads to a simple overarching question: How much data is required to probabilistically calibrate agent-based models? We address this question using a Bayesian approach to uncertainty quantification, based on the Bayesian interpretation of parameter values as random variables. We focus on three questions. First, how do varying dataset sizes affect the statistical calibration of an ABM? Increasing complexity of agent decision rules (in the sense of the number of parameters) and including agent-agent and agent-environment interactions and feedbacks (in the sense of emergence) can reduce the ability to constrain model parameters or test hypotheses, particularly when data may be relatively scarce. Second, can we distinguish between models with varying levels of complexity, either in terms of high-dimensional decision rules or the types of agent interactions with each other and the environment? Third, how are calibration and hypothesis-testing affected by the use of spatially-aggregated data (as opposed to observations of individual agents), which may be all that are available due to data-collecting limitations or considerations of anonymity?
For a concrete example, we focus on the particular problem of modeling housing abandonment under flood risks, following the excellent work of Tonn and Guikema [2017]. Housing abandonment poses potentially severe economic problems for settlements along rivers and coastlines [Fowler et al., 2017]. Residents who have not experienced flooding themselves may abandon their homes if their neighbors do, due to depreciating property values or anticipation of future flooding. An associated ABM, and two nested submodels with fewer interactions and feedbacks, are illustrated by the influence diagrams in Figure 1.

Models
We analyze two ABMs (represented by the influence diagram in Figure 1). In both models, agents decide to vacate their homes using a probabilistic decision process (logistic regression), as opposed to maximizing utility or using heuristics (which are more common in ABMs in certain application areas, such as land use [Groeneveld et al., 2017]). Once a house is abandoned, there is a chance that it is occupied by a new agent in a subsequent year.

In the first ("no-interactions") model, the probability of housing abandonment is determined only by the frequency of experienced floods over the previous ten years, and the model has three parameters: the logistic regression intercept $\beta_0$, the logistic regression coefficient for flood frequency $\beta_1$, and the probability that vacant houses are filled by a new agent, $\alpha$. Denote by $s^i_t$ the state of parcel $i$, where $s^i_t = 1$ if the parcel is occupied and $s^i_t = 0$ if the parcel is vacant. The probability that this parcel switches from occupied to abandoned is

$$P^{10}_t = \mathrm{logit}^{-1}(\beta_0 + \beta_1 r_t), \qquad (1)$$

where $r_t$ is the frequency of flooding events over the previous ten years. The probability that this parcel switches from abandoned to occupied is constant,

$$P^{01} = \alpha. \qquad (2)$$

In the second ("spatial-interactions") model, we include an additional logistic regression covariate, $v^i_t$, the fraction of neighboring plots which are vacant at the start of time $t$. As a result, this model has four parameters, including the coefficient $\beta_2$ for this neighboring-vacancy covariate. These state transition equations give both models a Markovian structure.
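The per-year agent decision process described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function and variable names are our own, and setting the neighboring-vacancy coefficient to zero recovers the no-interactions model.

```python
import numpy as np

def logistic(x):
    """Inverse-logit link used in the agents' abandonment decision."""
    return 1.0 / (1.0 + np.exp(-x))

def step(state, flood_freq, beta0, beta1, beta2, alpha, neighbors, rng):
    """Advance all parcels one year (sketch).

    state: boolean array, True = occupied.
    flood_freq: per-parcel flood frequency over the previous ten years.
    neighbors: list of index arrays giving each parcel's neighbors.
    Setting beta2 = 0 recovers the no-interactions model.
    """
    new_state = state.copy()
    # Fraction of neighboring parcels vacant at the start of the year.
    vac_frac = np.array([np.mean(~state[n]) if len(n) else 0.0
                         for n in neighbors])
    p_abandon = logistic(beta0 + beta1 * flood_freq + beta2 * vac_frac)
    u = rng.random(state.size)
    new_state[state] = u[state] >= p_abandon[state]   # occupied -> vacant
    new_state[~state] = u[~state] < alpha             # vacant -> re-occupied
    return new_state
```

Iterating `step` over the simulation horizon, with `flood_freq` recomputed from the flood record each year, produces one stochastic realization of the settlement's occupancy history.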
Of course, these models could be further expanded, for example by including housing market dynamics (see Figure 1). In this study, we focus for clarity just on the two models, as we are interested in understanding the process and results of statistical calibration for simple ABMs.

Data
We use the models described in Section 2.1 in a perfect model experiment (see, for example, Olson et al. [2013] or Reed and Kollat [2012]), so that the data-generating process and model parameters are known. We generated the pseudo-observations for the perfect model experiment using the spatial-interactions model, to test whether we could successfully identify this effect. The pseudo-observations were generated for an artificial riparian settlement, with realizations of annual flood height maxima drawn from a generalized extreme value distribution. The parcel return periods and river heights are shown in Figure 2. The additional dynamic mechanism resulting from spatial interactions leads to increased probabilities of parcel abandonment for all return periods across realizations of the stochastic process, even for parcels that are far from floods (Figure 3).

Figure 1: Influence diagrams of three nested ABMs for housing abandonment decisions under evolving flood pressure. The black components form a basic model without interactions (the "no-interactions" model), in which abandonment decisions are based only on floods experienced by each agent. The blue and black components form a model with spatial interactions (the "spatial-interactions" model), in which agents move due to experienced floods and the proportion of neighboring lots which are vacant. For context, we also show a more complex model (with the red additions) with spatial and economic interactions (the "economic-interactions" model), in which housing market dynamics are affected by abandonments and floods and the abandonment decision includes housing values.
Parcel residency was initialized by assuming that each parcel had a 99% probability of having a resident in year 0. We used varying combinations of observed years and parcels (see Figure 2). The combinations were 10, 25, and 50 years, and 25, 50, and 100 parcels.
Annual maxima river heights were simulated from a generalized extreme value distribution with location parameter 865, scale parameter 11, and shape parameter 0.02. Data-generating parameter values were -6 for the logistic intercept, 20 for the local-flood coefficient, 4 for the neighboring-vacancy coefficient, and 0.01 for the vacancy-fill probability.
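The stochastic forcing and the flood-frequency covariate can be generated along these lines. This is a sketch: scipy's `genextreme` parameterizes the shape as the negative of the common GEV convention, and the per-parcel flood threshold height used below is purely illustrative.

```python
import numpy as np
from scipy.stats import genextreme

# GEV parameters for annual maxima river heights, as stated in the text.
LOC, SCALE, SHAPE = 865.0, 11.0, 0.02   # location, scale, shape (xi)

rng = np.random.default_rng(42)
# scipy's genextreme uses c = -xi relative to the common GEV convention.
annual_maxima = genextreme.rvs(c=-SHAPE, loc=LOC, scale=SCALE,
                               size=50, random_state=rng)

def flood_frequency(heights, h, window=10):
    """Trailing fraction of the last `window` years exceeding threshold h."""
    exceed = heights > h
    return np.array([exceed[max(0, t - window):t].mean() if t else 0.0
                     for t in range(len(heights))])

# Illustrative threshold height for one parcel (not a value from the study).
freq = flood_frequency(annual_maxima, h=880.0)
```

In the experiment itself, each parcel's threshold follows from its return period (Figure 2), so parcels closer to the river experience higher flood frequencies.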
As data may not be available in individualized form, we examine the constraining power of two data formats. In the individual case, the data set contains observations of each observed parcel at each time. In the aggregate case, we observe only the total number of abandoned parcels at each time.

Calibration
We use a Bayesian framework for model calibration, based on Bayes' Theorem [Bayes, 1763]:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},$$

where $p(\theta \mid y)$ is the posterior density, $p(y \mid \theta)$ is the data likelihood, $p(\theta)$ is the prior, and $p(y)$ is the marginal likelihood.
The priors used in this study are given in Table 1. We sample from the posterior distribution using Markov chain Monte Carlo (MCMC). MCMC is an extremely general method for sampling from posterior distributions and has previously been used for calibrating ABMs [Keith and Spring, 2013]. We use 150,000 Metropolis-Hastings [Hastings, 1970] iterations after a preliminary adaptive run [Vihola, 2012] of 30,000 iterations, which is used to estimate the starting value of the production run as well as the MCMC proposal distribution. The preliminary run is initialized at the maximum-likelihood estimate.
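A minimal random-walk Metropolis-Hastings sampler of this kind can be sketched as follows. This is illustrative only; the production runs described above use an adaptive preliminary phase [Vihola, 2012] to tune the proposal covariance, which we simply take as given here.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter, prop_cov, rng):
    """Random-walk Metropolis-Hastings sampler (minimal sketch).

    log_post: function returning the unnormalized log-posterior.
    prop_cov: covariance of the Gaussian proposal; in practice this
    would come from a preliminary adaptive run.
    """
    d = len(theta0)
    chain = np.empty((n_iter, d))
    theta, lp = np.asarray(theta0, dtype=float), log_post(theta0)
    chol = np.linalg.cholesky(prop_cov)
    for i in range(n_iter):
        prop = theta + chol @ rng.standard_normal(d)
        lp_prop = log_post(prop)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain
```

Run on a standard bivariate normal target, the chain recovers the target's mean and spread after discarding a burn-in period.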
While our chosen method of Metropolis-Hastings MCMC can produce high-fidelity approximations to the full joint probability distribution of the model parameters [Robert, 2014], it may be computationally intractable for complex models featuring long runtimes or high-dimensional parameter spaces. An additional complication is the need to specify a statistical likelihood function, which may be difficult for particular applications.
In general, there is a trade-off between computational speed and accuracy of the resulting parameter distributions. Some alternative approaches to statistical calibration of ABMs, which are aimed at reducing computational requirements or avoiding likelihood specification, include statistical emulation [Lamperti et al., 2018, Oyebamiji et al., 2017], particle filtering [Kattwinkel and Reichert, 2017], and approximate Bayesian computation [Fabretti, 2018, Sirén et al., 2018, van der Vaart et al., 2015]. While these methods reduce the computational burden, they come at the cost of potentially severe statistical approximations that can influence the parameter estimates [Frazier et al., 2018, Künsch, 2013, Robert et al., 2011, Singh et al., 2018].

Model Selection
We estimate the marginal likelihood for each model using bridge sampling [Meng and Wong, 1996]. The importance density is a truncated multivariate normal with mean and covariance derived from the MCMC output. The truncation occurs along the vacancy-fill-probability dimension, to ensure that this parameter only takes values between zero and one. We use 5,000 posterior and importance samples in the bridge sampling estimator, which results in standard errors [Frühwirth-Schnatter, 2004] for the log-marginal likelihoods that are orders of magnitude smaller than $10^{-3}$.
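The iterative bridge sampling estimator can be sketched in one dimension as follows. This is a simplified illustration of the Meng and Wong fixed-point update: `log_f` is a user-supplied unnormalized log-density, the importance density is an untruncated scipy distribution, and the fixed internal seed is for reproducibility only.

```python
import numpy as np
from scipy.stats import norm

def bridge_sampling_logZ(log_f, post_samples, q_dist, n_iter=100):
    """Iterative bridge sampling estimate of log Z, where f = Z * posterior.

    log_f: unnormalized log-density.
    post_samples: draws from the normalized posterior.
    q_dist: frozen scipy distribution used as the importance density.
    """
    n1 = len(post_samples)
    imp_samples = q_dist.rvs(size=n1, random_state=np.random.default_rng(0))
    n2 = len(imp_samples)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    # Log ratios of f to the importance density at both sample sets.
    l1 = log_f(post_samples) - q_dist.logpdf(post_samples)
    l2 = log_f(imp_samples) - q_dist.logpdf(imp_samples)
    logZ = 0.0
    for _ in range(n_iter):
        # One fixed-point update of the optimal bridge estimator.
        num = np.exp(l2 - logZ) / (s1 * np.exp(l2 - logZ) + s2)
        den = 1.0 / (s1 * np.exp(l1 - logZ) + s2)
        logZ = logZ + np.log(num.mean()) - np.log(den.mean())
    return logZ
```

For the unnormalized Gaussian $f(x) = e^{-x^2/2}$, whose true normalizing constant is $\sqrt{2\pi}$, the estimator recovers $\log Z \approx 0.919$.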

Calibration
The structure of the data (individual-parcel versus spatially-aggregated) strongly influences the final shape of the posterior distribution, both due to the number of data points and the different likelihood function specifications. Figure 4 shows the result of updating the prior distributions (specified in Table 1) with 50 years of pseudo-observations of 100 parcels. To validate the calibrated model, we analyze the hindcasting ability of the posterior predictive distribution (shown in Figure 6). While the three-parameter no-interactions model is well constrained by smaller data sets, the poor fit of the posterior predictive distribution compared to the pseudo-observations for increased amounts of data reveals the missing abandonment mechanism. Without spatial interactions, the no-interactions model calibration results in a higher sensitivity to experienced flooding to account for the data, which results in an overestimate of the number of abandoned parcels in later years. Meanwhile, the spatial-interactions model, which has one additional parameter, requires more data to constrain (25 observed parcels is insufficient even with 50 years of data), but, once constrained, fits the pseudo-observations better than the no-interactions model. In general, a larger spatial domain (more observed agents) facilitates calibration more than a longer data record.
While Figure 6 might appear to show that calibration with aggregate data results in a better fit to the observations, this is likely an artifact of two different components of the modeling process. First, the likelihood function used for the aggregate-data calibration was different than in the individual-data case (Poisson vs. binomial), which changes the shape of the posterior distribution. Second, in the aggregate case, we calibrated the model directly against the aggregated counts shown in Figure 6, while in the individual-data case, the expected state of each parcel was used. These two differences make it difficult to compare the quality of the hindcasts, as they result from structurally different calibration procedures.
We would expect that in the latter case, there is greater uncertainty about the total count of abandoned parcels. We do not view this as a flaw with the individual-data procedure; it likely better reflects the true underlying uncertainty, given the complex dynamics in the model.

Model Selection
More complex ABMs can be thought of as being constructed by adding new interactions and feedbacks to simpler ABMs, as illustrated in Figure 1. This allows us to view this type of model selection as hypothesis testing for the presence of additional feedback mechanisms [Cottineau et al., 2015]. One standard method of comparing the fit of Bayesian models to data is by computing Bayes factors [Kass and Raftery, 1995]. The Bayes factor is the ratio of marginal likelihoods of two models (the integral of the data likelihood over the posterior).
Posterior model structural probabilities can be calculated by combining prior beliefs about the relative probability of the competing models with the Bayes factor.
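Given estimated log-marginal likelihoods, the log-Bayes factor and posterior model probabilities follow by simple arithmetic. The values below are purely hypothetical and serve only to illustrate the computation.

```python
import numpy as np

# Hypothetical log-marginal likelihoods for two competing models
# (illustrative numbers, not results from this study).
log_ml_spatial, log_ml_simple = -512.3, -520.1

log_bf = log_ml_spatial - log_ml_simple      # log Bayes factor

# Posterior model probabilities under equal prior model odds, computed
# stably by subtracting the maximum before exponentiating.
log_mls = np.array([log_ml_spatial, log_ml_simple])
post_prob = np.exp(log_mls - log_mls.max())
post_prob /= post_prob.sum()
```

On the Kass and Raftery [1995] scale, a log-Bayes factor between roughly 3 and 5 constitutes strong evidence, and values above 5 very strong evidence, for the first model.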
One important consideration when using Bayes factors is the role of the prior in the computation [Robert, 2007], particularly when they are used for point-null hypothesis testing.
Here, we use the same priors for corresponding parameters to reduce this effect. For our perfect model experiment, we would expect additional (in terms of the number of observations) and spatially explicit (rather than aggregated) data to improve the ability to distinguish between the data-generating spatial-interactions model and the simpler no-interactions model. In Figure 7, we show the log-Bayes factors (along with thresholds for evidence levels proposed by Kass and Raftery [1995]) to summarize the evidence for the spatial-interactions model versus the no-interactions model. We neglect the case with 25 observed parcels due to unreasonably high estimates, likely due to the ill-constrained spatial-interactions model (however, the hindcasts in Figure 6 show the qualitatively better fit of the no-interactions model for this data set, particularly in the individual-data scenario). For individual-parcel data with more than 25 observed parcels, there is at least strong evidence [Kass and Raftery, 1995] for the spatial-interactions model no matter how long the parcels were observed, which confirms the qualitative assessment (on the summary statistic of total abandoned parcels) obtained by comparing the hindcasts in Figures 6c and 6d.
On the other hand, when aggregated data is used for calibration, there is essentially no quantitative evidence for the spatial-interactions model. This is the case whether we compare the models using Bayes factors or a predictive information criterion such as the Watanabe-Akaike information criterion (WAIC) [Watanabe, 2010, Vehtari et al., 2017], estimated (along with standard errors of the differences) using 10,000 posterior samples. Predictive model comparison methods avoid the direct influence of the prior on the comparison and allow for an intuitive comparison between models which have different parameterizations [Gelfand and Ghosh, 1998]. The one-standard-error range of the difference in WAIC between the spatial-interactions and the no-interactions model is between -2 and 2, which can be interpreted as no difference in support between the two models [Burnham and Anderson, 2004]. However, a qualitative assessment obtained by comparing Figures 6a and 6b might lead a modeler to conclude that the spatial-interactions model fits the observations better than the no-interactions model. This suggests that hindcasting can serve an important supporting role to quantitative model selection.
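The WAIC computation itself is straightforward given a matrix of pointwise log-likelihood values evaluated at posterior samples; a sketch following the standard formulas in Vehtari et al. [2017], reported here on the deviance scale (multiplied by -2, one common convention):

```python
import numpy as np

def waic(loglik):
    """WAIC on the deviance scale from an (S draws x N observations)
    matrix of pointwise log-likelihoods at posterior samples."""
    S, N = loglik.shape
    # Log pointwise predictive density, computed stably via logsumexp.
    lppd = np.sum(np.logaddexp.reduce(loglik, axis=0) - np.log(S))
    # Effective number of parameters: pointwise posterior variances.
    p_waic = np.sum(loglik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```

When all posterior draws give identical pointwise likelihoods, the penalty term vanishes and WAIC reduces to minus twice the log-likelihood of the data.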

DISCUSSION
Probabilistic calibration is an important component of the descriptive agent-based modeling process due to the influence of stochastic noise via path-dependence and feedback loops (as illustrated in Figure 5). However, as our results illustrate, each additional parameter can considerably increase the calibration data requirements. Trying to include every hypothesized feedback mechanism in the final model choice, without supporting evidence, can pose problems from statistical as well as decision-theoretic points of view [Box, 1976, Jaynes, 2003, Robert, 2007]. Starting with a simple model and adding complexity when supported by the data can produce more skillful hindcasts and projections, as well as more powerful insights [Box, 1979, Holling, 1966].
An additional concern is the specification of prior distributions. When less data is available (particularly in summarized or aggregated form), that data will have less power to update the prior distributions. This suggests that priors should be as informative as possible (with a strong warning that priors ought not to be more informative than can be supported).
While we did not take prior correlations between parameters into account for this experiment, good priors for real-world problems will include prior information about correlations between parameters.
One approach to creating informed priors which include information about the relationships between parameters is probabilistic inversion [Kraan and Cooke, 2000, Fuller et al., 2017], in which expert assessments (or, alternatively, the results of judgement and decision-making or economic experiments) can be used to update more generic priors in a way which is consistent with those assessments or experimental results. This allows the survey or experimental participants to provide information directly about outcomes rather than about model parameters, and allows for a separation of the data involved in the prior construction and Bayesian updating processes.

AUTHOR CONTRIBUTIONS STATEMENT
V.S. and K.K. conceptualized the research. V.S. wrote the model and analysis codes. V.S. and K.K. designed the figures and wrote the paper.

ADDITIONAL INFORMATION
Competing interests: The authors are not aware of competing interests.
Supplemental Material for "Small increases in agent-based model complexity can result in large increases in required calibration data"

Combining the conditional probabilities in Equations (1) and (2) yields a recursive updating equation,

$$\begin{pmatrix} p(s^i_t = 0) \\ p(s^i_t = 1) \end{pmatrix} = \begin{pmatrix} 1 - P^{01} & P^{10}_t \\ P^{01} & 1 - P^{10}_t \end{pmatrix} \begin{pmatrix} p(s^i_{t-1} = 0) \\ p(s^i_{t-1} = 1) \end{pmatrix}.$$

Denoting $P_t = \begin{pmatrix} 1 - P^{01} & P^{10}_t \\ P^{01} & 1 - P^{10}_t \end{pmatrix}$, we can write the unconditional probabilities for $s^i_t$ (unconditional as we view the initial occupancy probability as fixed, $p(s^i_0 = 1) = 0.99$) as

$$\begin{pmatrix} p(s^i_t = 0) \\ p(s^i_t = 1) \end{pmatrix} = \left( \prod_{\tau=1}^{t} P_\tau \right) \begin{pmatrix} p(s^i_0 = 0) \\ p(s^i_0 = 1) \end{pmatrix}. \qquad (3)$$

The probability distribution which defines the likelihood function depends on whether we assume we have individual-parcel or aggregated data. In the individual-parcel case, we know the occupancy state of each parcel at each time. We use a binomial distribution (which models the probability of a Boolean outcome) at each time, using the unconditional probabilities defined by Equation (3) for each parcel, and assume that the likelihoods for each parcel at each time are independent, as the dependence on neighboring parcels (which only exists in the spatial-interactions model) is captured in Equation (3) via $v^i_t$. For the aggregated case, we only know the total number of vacant parcels at each time, and we compute the expected number of vacant parcels by summing the vacancy probability over each parcel, $\lambda_t = \sum_i p(s^i_t = 0)$. We then model this count using a Poisson distribution with parameter $\lambda_t$.
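The recursion in Equation (3) and the two likelihood variants can be sketched as follows. This is a minimal illustration with synthetic inputs; the function names and the randomly generated per-parcel transition probabilities are our own.

```python
import numpy as np
from scipy.stats import binom, poisson

def occupancy_probs(P01, P10_t):
    """Unconditional occupancy probabilities p(s_t = 1) for one parcel.

    P01: constant vacant -> occupied probability (alpha).
    P10_t: array of occupied -> vacant probabilities for t = 1..T.
    Implements the recursive updating of the occupancy distribution,
    starting from p(s_0 = 1) = 0.99 as in the text.
    """
    p = np.array([0.01, 0.99])               # (p(s=0), p(s=1)) at t = 0
    out = np.empty(len(P10_t))
    for t, p10 in enumerate(P10_t):
        P = np.array([[1.0 - P01, p10],
                      [P01, 1.0 - p10]])      # columns sum to one
        p = P @ p
        out[t] = p[1]
    return out

# Synthetic example: 100 parcels observed for 50 years, with randomly
# generated (purely illustrative) per-parcel abandonment probabilities.
rng = np.random.default_rng(0)
probs = np.vstack([occupancy_probs(0.01, rng.uniform(0.0, 0.2, 50))
                   for _ in range(100)])      # p(s^i_t = 1)
obs = (rng.random(probs.shape) < probs).astype(int)

# Individual-parcel likelihood: independent Bernoulli terms per parcel-year.
ll_individual = binom.logpmf(obs, 1, probs).sum()

# Aggregated likelihood: Poisson on the yearly count of vacant parcels,
# with rate lambda_t equal to the summed vacancy probabilities.
lam = (1.0 - probs).sum(axis=0)
ll_aggregate = poisson.logpmf((1 - obs).sum(axis=0), lam).sum()
```

The spatial-interactions model enters only through the occupied-to-vacant probabilities passed to `occupancy_probs`, which is why the parcel-wise likelihood terms can be treated as independent.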